<< Back to previous view

[CLJS-814] clojure.string/reverse breaks surrogate pairs Created: 12/Jun/14  Updated: 12/Jun/14

Status: Open
Project: ClojureScript
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Francis Avila Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: string

r2227, NOT under Rhino.

Attachments: Text File cljs-814.patch    
Patch: Code and Test


clojure.string/reverse will reverse a string by code units instead of code points which causes surrogate pairs to become reversed and unmatched (i.e., invalid utf-16).

For example, in clojurescript (clojure.core/reverse "a\uD834\uDD1Ec") will produce the invalid "c\uDD1E\uD834a" instead of "c\uD834\uDD1Ea". Clojure produces the correct result because the underlying Java String.reverse() keeps surrogate pairs together.

Note that clojurescript running under Rhino will produce the same (correct) result as clojure, probably because Rhino is using String.reverse() internally.

Attached patch gives clojure.string/reverse the exact same behavior in clj and cljs (including non-commutativity for strings with unmatched surrogates).

(Also, neither clojure nor clojurescript nor java reverse combining characters correctly--the combining character will "move" over a different letter. I suspect this is a WONTFIX for both clojure and clojurescript, but it is fixable with another regex replacement.)

Comment by Francis Avila [ 12/Jun/14 1:27 AM ]

Forget what I said about Rhino, it's just as broken there. I think I got my repls confused or something.

[CLJS-801] str macro emits unoptimizable js code Created: 27/Apr/14  Updated: 11/May/14  Resolved: 11/May/14

Status: Closed
Project: ClojureScript
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Francis Avila Assignee: Unassigned
Resolution: Completed Votes: 0
Labels: string


Attachments: Text File cljs-801.patch     Text File cljs-801-v2.patch     Text File cljs-801-v3.patch     Text File cljs-801-v4.patch    
Patch: Code and Test


Clojurescript's str macro emits javascript code which is inefficient and which the closure compiler cannot optimize.

Currently it emits code like {{[cljs.core.str(arg1),cljs.core.str(arg2)].join('')}}. The problems with this:

  1. The emitted function is the arity-dispatch str wrapper instead of the 1-arg implementation of str. The closure compiler cannot eliminate the dispatch.
  2. An intermediate array is always created; the closure compiler cannot optimize it out.
  3. The closure compiler can evaluate constant string expressions (e.g. 'a'+1 to 'a1'), but cannot in this case because it cannot eliminate the str call.

The attached patch rewrites the str macro to generate js code that looks like this:

(str arg1 "constant" \space true nil 123 456.78)

(""+cljs.core.str.cljs$core$IFn$_invoke$arity$1(arg1)+"constant "+true+123+456.78)

This has a number of benefits:

  1. No short-lived array or Array.join operation.
  2. No arity dispatch is invoked. I have also observed that it can (but won't necessarily) inline the function body.
  3. The compiler can perform constant evaluation. For example, in advanced mode the compiler will emit the above example as (""+w.c(a)+"constant true123456.78") where w.c is the munged cljs.core.str.cljs$core$IFn$_invoke$arity$1

Comment by Francis Avila [ 28/Apr/14 12:17 AM ]

Updated patch adds booleans to the str test case, and does not eagerly stringify bools anymore. (It emits as literals and lets the closure compiler decide to stringify at compile time if it wants.)

Comment by David Nolen [ 10/May/14 2:23 PM ]

Need the patch rebased to master.

Comment by Francis Avila [ 11/May/14 1:38 AM ]

Updated patch.

Comment by David Nolen [ 11/May/14 10:52 AM ]

Sorry because of the last commit to master this ticket will not apply, please rebase again. I'll refrain from adding tests until this one gets merged in

Comment by Francis Avila [ 11/May/14 11:22 AM ]

No problem, updated patch.

Comment by David Nolen [ 11/May/14 11:27 AM ]

fixed https://github.com/clojure/clojurescript/commit/c63db9544738fc094cc04680bbee1534cf436204

[CLJ-1449] Add starts-with? and ends-with? to clojure.string Created: 19/Jun/14  Updated: 20/Jun/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Bozhidar Batsov Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: string

Approval: Triaged


Add clojure.string/starts-with? and ends-with? similar to java.lang.String's startsWith/endsWith. In addition to making these easier to find and use, this provides a place to add a portable ClojureScript variant.

Comment by Alex Miller [ 19/Jun/14 12:53 PM ]

Re substring, there is a clojure.core/subs for this (predates the string ns I believe).

([s start] [s start end])
Returns the substring of s beginning at start inclusive, and ending
at end (defaults to length of string), exclusive.

Comment by Jozef Wagner [ 20/Jun/14 3:21 AM ]

As strings are collection of characters, you can use Clojure's sequence facilities to achieve such functionality:

user=> (= (first "asdf") \a)
user=> (= (last "asdf") \a)
Comment by Alex Miller [ 20/Jun/14 8:33 AM ]

Jozef, String.startsWith() checks for a prefix string, not just a prefix char.

Comment by Bozhidar Batsov [ 20/Jun/14 9:42 AM ]

Re substring, I know about subs, but it seems very odd that it's not in the string ns. After all most people will likely look for string-related functionality in clojure.string. I think it'd be best if `subs` was added to clojure.string and clojure.core/subs was deprecated.

[CLJ-1393] clojure.string/trim doesn't trim null character Created: 28/Mar/14  Updated: 28/Mar/14  Resolved: 28/Mar/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Ryan Fowler Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


CLJ-935 changed clojure.string/trim to not trim all the characters less than or equal to \u0020 as Java does.

I noticed this because base64 uses null characters to pad the end of encoding blocks.

Clojure 1.6.0's trim leaves the null character in:
user=> (.length (clojure.string/trim "\u0000"))

java.lang.String's trim takes it out:
user=> (.length (.trim "\u0000"))

Here are the first 21 unicode characters and what Character/isWhitespace says about them.

(dotimes [n 0x20] (printf "
u%04x - %b\n" n (Character/isWhitespace n)))
\u0000 - false
\u0001 - false
\u0002 - false
\u0003 - false
\u0004 - false
\u0005 - false
\u0006 - false
\u0007 - false
\u0008 - false
\u0009 - true
\u000a - true
\u000b - true
\u000c - true
\u000d - true
\u000e - false
\u000f - false
\u0010 - false
\u0011 - false
\u0012 - false
\u0013 - false
\u0014 - false
\u0015 - false
\u0016 - false
\u0017 - false
\u0018 - false
\u0019 - false
\u001a - false
\u001b - false
\u001c - true
\u001d - true
\u001e - true
\u001f - true

Comment by Alex Miller [ 28/Mar/14 12:27 PM ]

The choice was made in CLJ-935 to consistently define whitespace as Character.isWhitespace() across trim, triml, and trimr. There are many possible ways to define "space" (at least two as we see here). If your trimming needs differ from the standard library, then you'll probably need to define your own functions to trim your data. You can still use Java interop to call String.trim() directly if that happens to match your needs.

Comment by Ryan Fowler [ 28/Mar/14 1:03 PM ]

Indeed, it's an easy workaround to use Java interop once you figure out what your problem is.

It's just unintuitive that the character generally used for string termination isn't trimmed by clojure.string/trim.

[CLJ-1312] clojure.string/split on empty string includes empty string in results Created: 21/Dec/13  Updated: 21/Dec/13  Resolved: 21/Dec/13

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Russell Dunphy Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


Splitting a string using clojure.string/split with an empty regex includes the empty string in the results - is this expected behaviour?


Unable to find source-code formatter for language: clojure. Available languages are: javascript, sql, xhtml, actionscript, none, html, xml, java
user=> (clojure.string/split "abc" #"")
["" "a" "b" "c"]

Comment by Alex Miller [ 21/Dec/13 8:05 AM ]

Yes, I think so. This is a case where Clojure defers to the host (Java) for behavior. I think the way to interpret this is that the empty pattern matches all strings. Split checks left to right whether there is a next chunk of string that matches the pattern. The empty pattern matches at the beginning to a string of length 0. Something like that.

[CLJ-935] clojure.string/trim uses different defn of whitespace as triml, trimr Created: 21/Feb/12  Updated: 31/Jan/14  Resolved: 31/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: Release 1.6

Type: Defect Priority: Minor
Reporter: Andy Fingerhut Assignee: Unassigned
Resolution: Completed Votes: 0
Labels: string

Attachments: Text File clj935-2.patch     Text File clj935-3.patch     Text File fix-trim-fns-different-whitespace-patch.txt    
Patch: Code and Test
Approval: Ok


clojure.string/triml and trimr use Character/isWhitespace to determine whether a character is whitespace, but trim uses some other definition of white space character. For example:

user=> (use 'clojure.string)
user=> (def s "  \u2002  foo")
user=> (trim s)
"?  foo"
user=> (triml s)

Cause: triml and trimr use Character/isWhitespace. trim uses String/trim which seems to define whitespace as any character less than or equal '\u0020'. The isWhitespace() definition is slightly different and includes other Unicode space characters.

Approach: The attached patch changes trim to use Character/isWhitespace. The isWhitespace version seems generally newer and more Unicode considerate so this was chosen over changing triml and trimr to match trim.

A few alternative implementations were considered with respect to longs, ints, etc. The patch opts to use the simplest possible code, eschewing any extreme performance measures. See the comments for more info if desired.

The patch also changes triml to only call .length on s once.

Patch: clj935-3.patch

Screened by: Stuart Sierra

Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Rich: do we want to be consistent across the different trim fns (per this patch) or consistent with Java, which uses one definition of whitespace in Character/isWhitespace, and a different definition in String/trim?

Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Andy: why the (int ...) forms, when Clojure only works with primitive longs?

Comment by Andy Fingerhut [ 08/Jun/12 10:33 AM ]

Answer for Stuart: Looks like I was preserving the (int ...) form that was in triml before, perhaps somewhat mindlessly. Depending on Rich's answer, I can update the patch if desired.

Comment by Aaron Bedra [ 11/Apr/13 5:34 PM ]

Bump on the discussion. This ticket seems blocked until we make a decision.

Comment by Alex Miller [ 30/Aug/13 3:47 PM ]

In response to the questions regarding the "int" calls in the prior patch, I did some analysis and looked at alternative impls.

Prior patch: the calls to (int) were indeed causing primitive ints to be used in some cases (like letting the length). However, the loop index/rindex is always a prim long at best (and there is no way to type hint or alter that afaict). Because of this, casts get inserted to upcast the length to a long for comparison anyways. Using criterium and calling ltrim on a string of 1M spaces, I got a mean of 2.9 ms.

I tried a dirty trick of using a mutable int array for the index and that (while ugly) was 2.25 ms, so definitely faster. It did not seem fast enough to justify the dirtiness.

I also benchmarked a pure Java int version:

static public CharSequence triml(CharSequence s) {
  int len = s.length();
    for(int index=0; index < len; index++) {
      if(! Character.isWhitespace(s.charAt(index)))
        return s.subSequence(index, len).toString();
  return "";

This code is actually smaller and faster - 0.92 ms.

I pondered this and in the end decided to leave it as the most straightforward Clojure impl. I think for the vast majority of trim calls, the difference will be negligible. For users needing higher trim performance, they should probably write an optimized version and tune their whitespace set. My second choice would be adding the Java implementations.

Marking screened.

Comment by Rich Hickey [ 25/Oct/13 7:42 AM ]

did you try

(set! *unchecked-math* true)

Comment by Alex Miller [ 02/Dec/13 10:55 PM ]

Testing perf of ltrim on a 1M character blank string (tested on criterium w/ good JVM settings):

clojure.string/ltrim = 2.785597 ms
clj935-2.patch ltrim = 2.880528 ms
same as clj935-2 with unchecked-inc = 2.225098 ms
same as clj935-2 with unchecked-inc-int = 2.198700 ms

I tested the last 3 with unchecked-math both true and false and saw no difference between them.

Based on this, I will update the patch to use unchecked-inc as that seems to give even better performance than the prior impl.

Comment by Rich Hickey [ 03/Jan/14 9:28 AM ]

Is there a gist or anything of the perf tests?

Generated at Fri Jul 25 11:08:30 CDT 2014 using JIRA 4.4#649-r158309.