
google-refine - issue #604
The common transform “Trim leading and trailing whitespace” doesn’t trim non-breaking spaces
What steps will reproduce the problem?
1. Have a cell with “ x” as its value (the space being a non-breaking space).
2. Apply the common transform “Trim leading and trailing whitespace” to its column.
What is the expected output? What do you see instead? The cell’s value should become “x”. Instead, it does not change.
What version of Google Refine are you using? 2.5
What operating system and browser are you using? Mac OS X 10.7.4 Chrome Version 21.0.1180.75
Is this problem specific to the type of browser you're using, or does it happen in all the browsers you tried? The same behavior appears in Safari 6.0 (7536.25).
Please provide any additional information below. Perhaps this is the intended behavior. But it does not reflect the function’s name (“Trim leading and trailing whitespace”).
Comment #1
Posted on Aug 14, 2012 by Swift Bird
This function implements an "old-school" definition of whitespace as described here: http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#trim()
There are lots of Unicode whitespace characters (em-space, en-space, thin space, nbsp, etc) which aren't included.
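To illustrate the gap described above, here is a minimal sketch using only the JDK. The class `UnicodeTrim` and its helper `isTrimmable` are hypothetical names made up for this example and are not part of Refine; the sketch assumes that accepting either `Character.isWhitespace` or `Character.isSpaceChar` is a broad enough definition of whitespace for illustration.

```java
final class UnicodeTrim {
    private UnicodeTrim() {}

    // Hypothetical helper: a char is trimmable if either JDK predicate accepts it.
    // Character.isWhitespace excludes the non-breaking spaces, while
    // Character.isSpaceChar includes every Unicode space separator.
    static boolean isTrimmable(char c) {
        return Character.isWhitespace(c) || Character.isSpaceChar(c);
    }

    // Strips trimmable characters from both ends of the string.
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isTrimmable(s.charAt(start))) {
            start++;
        }
        while (end > start && isTrimmable(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String value = "\u00A0x\u00A0"; // "x" wrapped in non-breaking spaces
        // String.trim() only strips characters up to U+0020, so both NBSPs stay.
        System.out.println(value.trim().length());            // prints 3
        // The sketch above strips them.
        System.out.println(UnicodeTrim.trim(value).length()); // prints 1
    }
}
```

This is a sketch of the problem, not the fix that went into Refine.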
Comment #2
Posted on Aug 14, 2012 by Swift Bird
Updated to use the Guava method CharMatcher.WHITESPACE.trimFrom(s) in r2528. Still needs tests.
Comment #3
Posted on Aug 14, 2012 by Swift Bird
I've added some basic tests. The two characters that the Guava method doesn't appear to handle are the zero-width no-break space "\uFEFF" and the word joiner "\u2060". I'm just going to rely on Guava for now rather than trying to work around this.
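For reference, a workaround for those two characters could look roughly like the sketch below. It uses only the JDK rather than Guava, and the class `ExtraTrim`, the constant `EXTRA`, and the helper `isTrimmable` are hypothetical names for this example; this is not what was committed to Refine. It assumes that U+FEFF and U+2060 should simply be treated as trimmable alongside ordinary Unicode spaces.

```java
final class ExtraTrim {
    private ExtraTrim() {}

    // Assumption: the two characters named above, which the JDK classifies as
    // format characters rather than whitespace.
    private static final String EXTRA = "\uFEFF\u2060";

    // Hypothetical helper: JDK whitespace, JDK space separators, or one of EXTRA.
    static boolean isTrimmable(char c) {
        return Character.isWhitespace(c)
                || Character.isSpaceChar(c)
                || EXTRA.indexOf(c) >= 0;
    }

    // Strips trimmable characters from both ends of the string.
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && isTrimmable(s.charAt(start))) {
            start++;
        }
        while (end > start && isTrimmable(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Neither JDK predicate treats these two characters as whitespace.
        System.out.println(Character.isWhitespace('\uFEFF')); // prints false
        System.out.println(Character.isSpaceChar('\u2060'));  // prints false
        // With the explicit extra set, they are trimmed like other spaces.
        System.out.println(trim("\uFEFF\u00A0x\u2060").equals("x")); // prints true
    }
}
```

Listing the extra characters explicitly, rather than trimming every format character, keeps the sketch from stripping characters such as the zero-width joiner that may be meaningful at the edge of a value.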
Comment #4
Posted on Sep 18, 2012 by Swift Bird
(No comment was entered for this change.)
Status: Fixed
Labels:
Type-Defect
Priority-Medium
Milestone-2.6