Clojure

clojure.string/trim uses different defn of whitespace as triml, trimr

Details

  • Type: Defect Defect
  • Status: Closed Closed
  • Priority: Minor Minor
  • Resolution: Completed
  • Affects Version/s: Release 1.6
  • Fix Version/s: Release 1.6
  • Component/s: None
  • Labels:
  • Patch:
    Code and Test
  • Approval:
    Ok

Description

clojure.string/triml and trimr use Character/isWhitespace to determine whether a character is whitespace, but trim uses some other definition of white space character. For example:

user=> (use 'clojure.string)
nil
user=> (def s "  \u2002  foo")
#'user/s
user=> (trim s)
"?  foo"
user=> (triml s)
"foo"

Cause: triml and trimr use Character/isWhitespace. trim uses String/trim which seems to define whitespace as any character less than or equal '\u0020'. The isWhitespace() definition is slightly different and includes other Unicode space characters.

Approach: The attached patch changes trim to use Character/isWhitespace. The isWhitespace version seems generally newer and more Unicode considerate so this was chosen over changing triml and trimr to match trim.

A few alternative implementations were considered with respect to longs, ints, etc. The patch opts to use the simplest possible code, eschewing any extreme performance measures. See the comments for more info if desired.

The patch also changes triml to only call .length on s once.

Patch: clj935-3.patch

Screened by: Stuart Sierra

  1. clj935-2.patch
    30/Aug/13 3:47 PM
    3 kB
    Alex Miller
  2. clj935-3.patch
    02/Dec/13 11:07 PM
    3 kB
    Alex Miller
  3. fix-trim-fns-different-whitespace-patch.txt
    21/Feb/12 1:29 PM
    3 kB
    Andy Fingerhut

Activity

Hide
Rich Hickey added a comment -

Is there a gist or anything of the perf tests?

Show
Rich Hickey added a comment - Is there a gist or anything of the perf tests?
Hide
Alex Miller added a comment -

Testing perf of ltrim on a 1M character blank string (tested on criterium w/ good JVM settings):

clojure.string/ltrim = 2.785597 ms
clj935-2.patch ltrim = 2.880528 ms
same as clj935-2 with unchecked-inc = 2.225098 ms
same as clj935-2 with unchecked-inc-int = 2.198700 ms

I tested the last 3 with unchecked-math both true and false and saw no difference between them.

Based on this, I will update the patch to use unchecked-inc as that seems to give even better performance than the prior impl.

Show
Alex Miller added a comment - Testing perf of ltrim on a 1M character blank string (tested on criterium w/ good JVM settings): clojure.string/ltrim = 2.785597 ms clj935-2.patch ltrim = 2.880528 ms same as clj935-2 with unchecked-inc = 2.225098 ms same as clj935-2 with unchecked-inc-int = 2.198700 ms I tested the last 3 with unchecked-math both true and false and saw no difference between them. Based on this, I will update the patch to use unchecked-inc as that seems to give even better performance than the prior impl.
Hide
Rich Hickey added a comment - - edited

did you try

(set! *unchecked-math* true)
?

Show
Rich Hickey added a comment - - edited did you try
(set! *unchecked-math* true)
?
Hide
Alex Miller added a comment -

In response to the questions regarding the "int" calls in the prior patch, I did some analysis and looked at alternative impls.

Prior patch: the calls to (int) were indeed causing primitive ints to be used in some cases (like letting the length). However, the loop index/rindex is always a prim long at best (and there is no way to type hint or alter that afaict). Because of this, casts get inserted to upcast the length to a long for comparison anyways. Using criterium and calling ltrim on a string of 1M spaces, I got a mean of 2.9 ms.

I tried a dirty trick of using a mutable int array for the index and that (while ugly) was 2.25 ms, so definitely faster. It did not seem fast enough to justify the dirtiness.

I also benchmarked a pure Java int version:

static public CharSequence triml(CharSequence s) {
  int len = s.length();
    for(int index=0; index < len; index++) {
      if(! Character.isWhitespace(s.charAt(index)))
        return s.subSequence(index, len).toString();
    }
  return "";
}

This code is actually smaller and faster - 0.92 ms.

I pondered this and in the end decided to leave it as the most straightforward Clojure impl. I think for the vast majority of trim calls, the difference will be negligible. For users needing higher trim performance, they should probably write an optimized version and tune their whitespace set. My second choice would be adding the Java implementations.

Marking screened.

Show
Alex Miller added a comment - In response to the questions regarding the "int" calls in the prior patch, I did some analysis and looked at alternative impls. Prior patch: the calls to (int) were indeed causing primitive ints to be used in some cases (like letting the length). However, the loop index/rindex is always a prim long at best (and there is no way to type hint or alter that afaict). Because of this, casts get inserted to upcast the length to a long for comparison anyways. Using criterium and calling ltrim on a string of 1M spaces, I got a mean of 2.9 ms. I tried a dirty trick of using a mutable int array for the index and that (while ugly) was 2.25 ms, so definitely faster. It did not seem fast enough to justify the dirtiness. I also benchmarked a pure Java int version:
static public CharSequence triml(CharSequence s) {
  int len = s.length();
    for(int index=0; index < len; index++) {
      if(! Character.isWhitespace(s.charAt(index)))
        return s.subSequence(index, len).toString();
    }
  return "";
}
This code is actually smaller and faster - 0.92 ms. I pondered this and in the end decided to leave it as the most straightforward Clojure impl. I think for the vast majority of trim calls, the difference will be negligible. For users needing higher trim performance, they should probably write an optimized version and tune their whitespace set. My second choice would be adding the Java implementations. Marking screened.
Hide
Aaron Bedra added a comment -

Bump on the discussion. This ticket seems blocked until we make a decision.

Show
Aaron Bedra added a comment - Bump on the discussion. This ticket seems blocked until we make a decision.
Hide
Andy Fingerhut added a comment -

Answer for Stuart: Looks like I was preserving the (int ...) form that was in triml before, perhaps somewhat mindlessly. Depending on Rich's answer, I can update the patch if desired.

Show
Andy Fingerhut added a comment - Answer for Stuart: Looks like I was preserving the (int ...) form that was in triml before, perhaps somewhat mindlessly. Depending on Rich's answer, I can update the patch if desired.
Hide
Stuart Halloway added a comment -

Question for Andy: why the (int ...) forms, when Clojure only works with primitive longs?

Show
Stuart Halloway added a comment - Question for Andy: why the (int ...) forms, when Clojure only works with primitive longs?
Hide
Stuart Halloway added a comment -

Question for Rich: do we want to be consistent across the different trim fns (per this patch) or consistent with Java, which uses one definition of whitespace in Character/isWhitespace, and a different definition in String/trim?

Show
Stuart Halloway added a comment - Question for Rich: do we want to be consistent across the different trim fns (per this patch) or consistent with Java, which uses one definition of whitespace in Character/isWhitespace, and a different definition in String/trim?

People

Vote (0)
Watch (2)

Dates

  • Created:
    Updated:
    Resolved: