<< Back to previous view

[CLJS-814] clojure.string/reverse breaks surrogate pairs Created: 12/Jun/14  Updated: 12/Jun/14

Status: Open
Project: ClojureScript
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Francis Avila Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: string
Environment:

r2227, NOT under Rhino.


Attachments: Text File cljs-814.patch    
Patch: Code and Test

 Description   

clojure.string/reverse will reverse a string by code units instead of code points which causes surrogate pairs to become reversed and unmatched (i.e., invalid utf-16).

For example, in clojurescript (clojure.core/reverse "a\uD834\uDD1Ec") will produce the invalid "c\uDD1E\uD834a" instead of "c\uD834\uDD1Ea". Clojure produces the correct result because the underlying Java String.reverse() keeps surrogate pairs together.

Note that clojurescript running under Rhino will produce the same (correct) result as clojure, probably because Rhino is using String.reverse() internally.

Attached patch gives clojure.string/reverse the exact same behavior in clj and cljs (including non-commutativity for strings with unmatched surrogates).

(Also, neither clojure nor clojurescript nor java reverse combining characters correctly--the combining character will "move" over a different letter. I suspect this is a WONTFIX for both clojure and clojurescript, but it is fixable with another regex replacement.)



 Comments   
Comment by Francis Avila [ 12/Jun/14 1:27 AM ]

Forget what I said about Rhino, it's just as broken there. I think I got my repls confused or something.





[CLJS-801] str macro emits unoptimizable js code Created: 27/Apr/14  Updated: 11/May/14  Resolved: 11/May/14

Status: Closed
Project: ClojureScript
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Francis Avila Assignee: Unassigned
Resolution: Completed Votes: 0
Labels: string
Environment:

r2202


Attachments: Text File cljs-801.patch     Text File cljs-801-v2.patch     Text File cljs-801-v3.patch     Text File cljs-801-v4.patch    
Patch: Code and Test

 Description   

Clojurescript's str macro emits javascript code which is inefficient and which the closure compiler cannot optimize.

Currently it emits code like {{[cljs.core.str(arg1),cljs.core.str(arg2)].join('')}}. The problems with this:

  1. The emitted function is the arity-dispatch str wrapper instead of the 1-arg implementation of str. The closure compiler cannot eliminate the dispatch.
  2. An intermediate array is always created; the closure compiler cannot optimize it out.
  3. The closure compiler can evaluate constant string expressions (e.g. 'a'+1 to 'a1'), but cannot in this case because it cannot eliminate the str call.

The attached patch rewrites the str macro to generate js code that looks like this:

(str arg1 "constant" \space true nil 123 456.78)

(""+cljs.core.str.cljs$core$IFn$_invoke$arity$1(arg1)+"constant "+true+123+456.78)

This has a number of benefits:

  1. No short-lived array or Array.join operation.
  2. No arity dispatch is invoked. I have also observed that it can (but won't necessarily) inline the function body.
  3. The compiler can perform constant evaluation. For example, in advanced mode the compiler will emit the above example as (""+w.c(a)+"constant true123456.78") where w.c is the munged cljs.core.str.cljs$core$IFn$_invoke$arity$1


 Comments   
Comment by Francis Avila [ 28/Apr/14 12:17 AM ]

Updated patch adds booleans to the str test case, and does not eagerly stringify bools anymore. (It emits as literals and lets the closure compiler decide to stringify at compile time if it wants.)

Comment by David Nolen [ 10/May/14 2:23 PM ]

Need the patch rebased to master.

Comment by Francis Avila [ 11/May/14 1:38 AM ]

Updated patch.

Comment by David Nolen [ 11/May/14 10:52 AM ]

Sorry because of the last commit to master this ticket will not apply, please rebase again. I'll refrain from adding tests until this one gets merged in

Comment by Francis Avila [ 11/May/14 11:22 AM ]

No problem, updated patch.

Comment by David Nolen [ 11/May/14 11:27 AM ]

fixed https://github.com/clojure/clojurescript/commit/c63db9544738fc094cc04680bbee1534cf436204





[CLJ-1483] Clarify the usage of replace(-first) with a function Created: 27/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Trivial
Reporter: Bozhidar Batsov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: docstring, string

Attachments: Text File 0001-Clarify-the-usage-of-replace-first-with-pattern-func.patch    
Patch: Code
Approval: Triaged

 Description   

The documentation of replace and replace-first didn't feature any example usage of the pattern + function combo so I've added one.






[CLJ-1449] Add starts-with? ends-with? contains? to clojure.string Created: 19/Jun/14  Updated: 02/Oct/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Bozhidar Batsov Assignee: Unassigned
Resolution: Unresolved Votes: 11
Labels: string

Attachments: Text File clj-1449-basic-v1.patch     Text File clj-1449-more-v1.patch    
Patch: Code
Approval: Triaged

 Description   

Add clojure.string/starts-with? ends-with? and contains?, similar to java.lang.String's startsWith/endsWith/contains. In addition to making these easier to find and use, this provides a place to add a portable ClojureScript variant.



 Comments   
Comment by Alex Miller [ 19/Jun/14 12:53 PM ]

Re substring, there is a clojure.core/subs for this (predates the string ns I believe).

clojure.core/subs
([s start] [s start end])
Returns the substring of s beginning at start inclusive, and ending
at end (defaults to length of string), exclusive.

Comment by Jozef Wagner [ 20/Jun/14 3:21 AM ]

As strings are collection of characters, you can use Clojure's sequence facilities to achieve such functionality:

user=> (= (first "asdf") \a)
true
user=> (= (last "asdf") \a)
false
Comment by Alex Miller [ 20/Jun/14 8:33 AM ]

Jozef, String.startsWith() checks for a prefix string, not just a prefix char.

Comment by Bozhidar Batsov [ 20/Jun/14 9:42 AM ]

Re substring, I know about subs, but it seems very odd that it's not in the string ns. After all most people will likely look for string-related functionality in clojure.string. I think it'd be best if `subs` was added to clojure.string and clojure.core/subs was deprecated.

Comment by Pierre Masci [ 01/Aug/14 5:27 AM ]

Hi, I was thinking the same about starts-with and .ends-with, as well as (.indexOf s "c") and (.lastIndexOf "c").

I read the whole Java String API recently, and these 4 functions seem to be the only ones that don't have an equivalent in Clojure.
It would be nice to have them.

Andy Fingerhut who maintains the Clojure Cheatsheet told me: "I maintain the cheatsheet, and I put .indexOf and .lastIndexOf on there since they are probably the most common thing I saw asked about that is in the Java API but not the Clojure API, for strings."
Which shows that there is a demand.

Because Clojure is being hosted on several platforms, and might be hosted on more in the future, I think these functions should be part of the de-facto ecosystem.

Comment by Andy Fingerhut [ 30/Aug/14 3:39 PM ]

Updating summary line and description to add contains? as well. I can back this off if it changes your mind about triaging it, Alex.

Comment by Andy Fingerhut [ 30/Aug/14 3:40 PM ]

Patch clj-1449-basic-v1.patch dated Aug 30 2014 adds starts-with? ends-with? contains? functions to clojure.string.

Patch clj-1449-more-v1.patch is the same, except it also replaces several Java method calls with calls to these Clojure functions.

Comment by Andy Fingerhut [ 05/Sep/14 1:02 PM ]

Patch clj-1449-basic-v1.patch dated Sep 5 2014 is identical to the patch I added recently called clj-1149-basic-v1.patch. It is simply renamed without the typo'd ticket number in the file name.





[CLJ-1393] clojure.string/trim doesn't trim null character Created: 28/Mar/14  Updated: 28/Mar/14  Resolved: 28/Mar/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Ryan Fowler Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


 Description   

CLJ-935 changed clojure.string/trim to not trim all the characters less than or equal to \u0020 as Java does.

I noticed this because base64 uses null characters to pad the end of encoding blocks.

Clojure 1.6.0's trim leaves the null character in:
user=> (.length (clojure.string/trim "\u0000"))
1

java.lang.String's trim takes it out:
user=> (.length (.trim "\u0000"))
0

Here are the first 21 unicode characters and what Character/isWhitespace says about them.

(dotimes [n 0x20] (printf "
u%04x - %b\n" n (Character/isWhitespace n)))
\u0000 - false
\u0001 - false
\u0002 - false
\u0003 - false
\u0004 - false
\u0005 - false
\u0006 - false
\u0007 - false
\u0008 - false
\u0009 - true
\u000a - true
\u000b - true
\u000c - true
\u000d - true
\u000e - false
\u000f - false
\u0010 - false
\u0011 - false
\u0012 - false
\u0013 - false
\u0014 - false
\u0015 - false
\u0016 - false
\u0017 - false
\u0018 - false
\u0019 - false
\u001a - false
\u001b - false
\u001c - true
\u001d - true
\u001e - true
\u001f - true



 Comments   
Comment by Alex Miller [ 28/Mar/14 12:27 PM ]

The choice was made in CLJ-935 to consistently define whitespace as Character.isWhitespace() across trim, triml, and trimr. There are many possible ways to define "space" (at least two as we see here). If your trimming needs differ from the standard library, then you'll probably need to define your own functions to trim your data. You can still use Java interop to call String.trim() directly if that happens to match your needs.

Comment by Ryan Fowler [ 28/Mar/14 1:03 PM ]

Indeed, it's an easy workaround to use Java interop once you figure out what your problem is.

It's just unintuitive that the character generally used for string termination isn't trimmed by clojure.string/trim.





[CLJ-1312] clojure.string/split on empty string includes empty string in results Created: 21/Dec/13  Updated: 07/Sep/14  Resolved: 21/Dec/13

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Russell Dunphy Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


 Description   

Splitting a string using clojure.string/split with an empty regex includes the empty string in the results - is this expected behaviour?

Example:

Unable to find source-code formatter for language: clojure. Available languages are: javascript, sql, xhtml, actionscript, none, html, xml, java
user=> (clojure.string/split "abc" #"")
["" "a" "b" "c"]


 Comments   
Comment by Alex Miller [ 21/Dec/13 8:05 AM ]

Yes, I think so. This is a case where Clojure defers to the host (Java) for behavior. I think the way to interpret this is that the empty pattern matches all strings. Split checks left to right whether there is a next chunk of string that matches the pattern. The empty pattern matches at the beginning to a string of length 0. Something like that.

Comment by Mark Engelberg [ 07/Sep/14 12:27 PM ]

This bug is a real problem, because it works differently on Windows than on Linux. On Windows, clojure.string/split behaves exactly as you'd expect:

user=> (clojure.string/split "abc" #"")
["a" "b" "c"]

Only on Linux do you get the strange behavior where the empty string shows up at the beginning of the list.

I recently had a student that got burned by this in some webserver code that relied on splitting using the empty regex. It performed flawlessly on her local Windows machine, but mysteriously broke when she uploaded the uberwar to the cloud. The bug was very difficult to track down.

If this were a bug on both Windows and Linux, at least you could plan around it. But right now, it's an obstacle to Clojure's capability of running consistently across platforms.

Comment by Mark Engelberg [ 07/Sep/14 12:40 PM ]

Upon further research, I've found that this is not a Windows/Linux issue, rather it's a difference between Java 7 and Java 8. On Java 8, splitting with the empty string no longer produces a sequence that begins with an empty string.

As you said before, this is just a gotcha relating to Java, not a Clojure issue.





[CLJ-935] clojure.string/trim uses different defn of whitespace as triml, trimr Created: 21/Feb/12  Updated: 31/Jan/14  Resolved: 31/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: Release 1.6

Type: Defect Priority: Minor
Reporter: Andy Fingerhut Assignee: Unassigned
Resolution: Completed Votes: 0
Labels: string

Attachments: Text File clj935-2.patch     Text File clj935-3.patch     Text File fix-trim-fns-different-whitespace-patch.txt    
Patch: Code and Test
Approval: Ok

 Description   

clojure.string/triml and trimr use Character/isWhitespace to determine whether a character is whitespace, but trim uses some other definition of white space character. For example:

user=> (use 'clojure.string)
nil
user=> (def s "  \u2002  foo")
#'user/s
user=> (trim s)
"?  foo"
user=> (triml s)
"foo"

Cause: triml and trimr use Character/isWhitespace. trim uses String/trim which seems to define whitespace as any character less than or equal '\u0020'. The isWhitespace() definition is slightly different and includes other Unicode space characters.

Approach: The attached patch changes trim to use Character/isWhitespace. The isWhitespace version seems generally newer and more Unicode considerate so this was chosen over changing triml and trimr to match trim.

A few alternative implementations were considered with respect to longs, ints, etc. The patch opts to use the simplest possible code, eschewing any extreme performance measures. See the comments for more info if desired.

The patch also changes triml to only call .length on s once.

Patch: clj935-3.patch

Screened by: Stuart Sierra



 Comments   
Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Rich: do we want to be consistent across the different trim fns (per this patch) or consistent with Java, which uses one definition of whitespace in Character/isWhitespace, and a different definition in String/trim?

Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Andy: why the (int ...) forms, when Clojure only works with primitive longs?

Comment by Andy Fingerhut [ 08/Jun/12 10:33 AM ]

Answer for Stuart: Looks like I was preserving the (int ...) form that was in triml before, perhaps somewhat mindlessly. Depending on Rich's answer, I can update the patch if desired.

Comment by Aaron Bedra [ 11/Apr/13 5:34 PM ]

Bump on the discussion. This ticket seems blocked until we make a decision.

Comment by Alex Miller [ 30/Aug/13 3:47 PM ]

In response to the questions regarding the "int" calls in the prior patch, I did some analysis and looked at alternative impls.

Prior patch: the calls to (int) were indeed causing primitive ints to be used in some cases (like letting the length). However, the loop index/rindex is always a prim long at best (and there is no way to type hint or alter that afaict). Because of this, casts get inserted to upcast the length to a long for comparison anyways. Using criterium and calling ltrim on a string of 1M spaces, I got a mean of 2.9 ms.

I tried a dirty trick of using a mutable int array for the index and that (while ugly) was 2.25 ms, so definitely faster. It did not seem fast enough to justify the dirtiness.

I also benchmarked a pure Java int version:

static public CharSequence triml(CharSequence s) {
  int len = s.length();
    for(int index=0; index < len; index++) {
      if(! Character.isWhitespace(s.charAt(index)))
        return s.subSequence(index, len).toString();
    }
  return "";
}

This code is actually smaller and faster - 0.92 ms.

I pondered this and in the end decided to leave it as the most straightforward Clojure impl. I think for the vast majority of trim calls, the difference will be negligible. For users needing higher trim performance, they should probably write an optimized version and tune their whitespace set. My second choice would be adding the Java implementations.

Marking screened.

Comment by Rich Hickey [ 25/Oct/13 7:42 AM ]

did you try

(set! *unchecked-math* true)
?

Comment by Alex Miller [ 02/Dec/13 10:55 PM ]

Testing perf of ltrim on a 1M character blank string (tested on criterium w/ good JVM settings):

clojure.string/ltrim = 2.785597 ms
clj935-2.patch ltrim = 2.880528 ms
same as clj935-2 with unchecked-inc = 2.225098 ms
same as clj935-2 with unchecked-inc-int = 2.198700 ms

I tested the last 3 with unchecked-math both true and false and saw no difference between them.

Based on this, I will update the patch to use unchecked-inc as that seems to give even better performance than the prior impl.

Comment by Rich Hickey [ 03/Jan/14 9:28 AM ]

Is there a gist or anything of the perf tests?





Generated at Fri Oct 31 07:22:32 CDT 2014 using JIRA 4.4#649-r158309.