<< Back to previous view

[CLJ-1483] Clarify the usage of replace(-first) with a function Created: 27/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Trivial
Reporter: Bozhidar Batsov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: docstring, string

Attachments: Text File 0001-Clarify-the-usage-of-replace-first-with-pattern-func.patch    
Patch: Code
Approval: Triaged

 Description   

The documentation of replace and replace-first didn't feature any example usage of the pattern + function combo so I've added one.






[CLJ-1449] Add starts-with? ends-with? contains? to clojure.string Created: 19/Jun/14  Updated: 30/Aug/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Bozhidar Batsov Assignee: Unassigned
Resolution: Unresolved Votes: 6
Labels: string

Attachments: Text File clj-1149-basic-v1.patch     Text File clj-1449-more-v1.patch    
Approval: Triaged

 Description   

Add clojure.string/starts-with? ends-with? and contains?, similar to java.lang.String's startsWith/endsWith/contains. In addition to making these easier to find and use, this provides a place to add a portable ClojureScript variant.



 Comments   
Comment by Alex Miller [ 19/Jun/14 12:53 PM ]

Re substring, there is a clojure.core/subs for this (predates the string ns I believe).

clojure.core/subs
([s start] [s start end])
Returns the substring of s beginning at start inclusive, and ending
at end (defaults to length of string), exclusive.

Comment by Jozef Wagner [ 20/Jun/14 3:21 AM ]

As strings are collection of characters, you can use Clojure's sequence facilities to achieve such functionality:

user=> (= (first "asdf") \a)
true
user=> (= (last "asdf") \a)
false
Comment by Alex Miller [ 20/Jun/14 8:33 AM ]

Jozef, String.startsWith() checks for a prefix string, not just a prefix char.

Comment by Bozhidar Batsov [ 20/Jun/14 9:42 AM ]

Re substring, I know about subs, but it seems very odd that it's not in the string ns. After all most people will likely look for string-related functionality in clojure.string. I think it'd be best if `subs` was added to clojure.string and clojure.core/subs was deprecated.

Comment by Pierre Masci [ 01/Aug/14 5:27 AM ]

Hi, I was thinking the same about starts-with and .ends-with, as well as (.indexOf s "c") and (.lastIndexOf "c").

I read the whole Java String API recently, and these 4 functions seem to be the only ones that don't have an equivalent in Clojure.
It would be nice to have them.

Andy Fingerhut who maintains the Clojure Cheatsheet told me: "I maintain the cheatsheet, and I put .indexOf and .lastIndexOf on there since they are probably the most common thing I saw asked about that is in the Java API but not the Clojure API, for strings."
Which shows that there is a demand.

Because Clojure is being hosted on several platforms, and might be hosted on more in the future, I think these functions should be part of the de-facto ecosystem.

Comment by Andy Fingerhut [ 30/Aug/14 3:39 PM ]

Updating summary line and description to add contains? as well. I can back this off if it changes your mind about triaging it, Alex.

Comment by Andy Fingerhut [ 30/Aug/14 3:40 PM ]

Patch clj-1449-basic-v1.patch dated Aug 30 2014 adds starts-with? ends-with? contains? functions to clojure.string.

Patch clj-1449-more-v1.patch is the same, except it also replaces several Java method calls with calls to these Clojure functions.





[CLJ-1393] clojure.string/trim doesn't trim null character Created: 28/Mar/14  Updated: 28/Mar/14  Resolved: 28/Mar/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Ryan Fowler Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


 Description   

CLJ-935 changed clojure.string/trim to not trim all the characters less than or equal to \u0020 as Java does.

I noticed this because base64 uses null characters to pad the end of encoding blocks.

Clojure 1.6.0's trim leaves the null character in:
user=> (.length (clojure.string/trim "\u0000"))
1

java.lang.String's trim takes it out:
user=> (.length (.trim "\u0000"))
0

Here are the first 21 unicode characters and what Character/isWhitespace says about them.

(dotimes [n 0x20] (printf "
u%04x - %b\n" n (Character/isWhitespace n)))
\u0000 - false
\u0001 - false
\u0002 - false
\u0003 - false
\u0004 - false
\u0005 - false
\u0006 - false
\u0007 - false
\u0008 - false
\u0009 - true
\u000a - true
\u000b - true
\u000c - true
\u000d - true
\u000e - false
\u000f - false
\u0010 - false
\u0011 - false
\u0012 - false
\u0013 - false
\u0014 - false
\u0015 - false
\u0016 - false
\u0017 - false
\u0018 - false
\u0019 - false
\u001a - false
\u001b - false
\u001c - true
\u001d - true
\u001e - true
\u001f - true



 Comments   
Comment by Alex Miller [ 28/Mar/14 12:27 PM ]

The choice was made in CLJ-935 to consistently define whitespace as Character.isWhitespace() across trim, triml, and trimr. There are many possible ways to define "space" (at least two as we see here). If your trimming needs differ from the standard library, then you'll probably need to define your own functions to trim your data. You can still use Java interop to call String.trim() directly if that happens to match your needs.

Comment by Ryan Fowler [ 28/Mar/14 1:03 PM ]

Indeed, it's an easy workaround to use Java interop once you figure out what your problem is.

It's just unintuitive that the character generally used for string termination isn't trimmed by clojure.string/trim.





[CLJ-1312] clojure.string/split on empty string includes empty string in results Created: 21/Dec/13  Updated: 21/Dec/13  Resolved: 21/Dec/13

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Russell Dunphy Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


 Description   

Splitting a string using clojure.string/split with an empty regex includes the empty string in the results - is this expected behaviour?

Example:

Unable to find source-code formatter for language: clojure. Available languages are: javascript, sql, xhtml, actionscript, none, html, xml, java
user=> (clojure.string/split "abc" #"")
["" "a" "b" "c"]


 Comments   
Comment by Alex Miller [ 21/Dec/13 8:05 AM ]

Yes, I think so. This is a case where Clojure defers to the host (Java) for behavior. I think the way to interpret this is that the empty pattern matches all strings. Split checks left to right whether there is a next chunk of string that matches the pattern. The empty pattern matches at the beginning to a string of length 0. Something like that.





[CLJ-935] clojure.string/trim uses different defn of whitespace as triml, trimr Created: 21/Feb/12  Updated: 31/Jan/14  Resolved: 31/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: Release 1.6

Type: Defect Priority: Minor
Reporter: Andy Fingerhut Assignee: Unassigned
Resolution: Completed Votes: 0
Labels: string

Attachments: Text File clj935-2.patch     Text File clj935-3.patch     Text File fix-trim-fns-different-whitespace-patch.txt    
Patch: Code and Test
Approval: Ok

 Description   

clojure.string/triml and trimr use Character/isWhitespace to determine whether a character is whitespace, but trim uses some other definition of white space character. For example:

user=> (use 'clojure.string)
nil
user=> (def s "  \u2002  foo")
#'user/s
user=> (trim s)
"?  foo"
user=> (triml s)
"foo"

Cause: triml and trimr use Character/isWhitespace. trim uses String/trim which seems to define whitespace as any character less than or equal '\u0020'. The isWhitespace() definition is slightly different and includes other Unicode space characters.

Approach: The attached patch changes trim to use Character/isWhitespace. The isWhitespace version seems generally newer and more Unicode considerate so this was chosen over changing triml and trimr to match trim.

A few alternative implementations were considered with respect to longs, ints, etc. The patch opts to use the simplest possible code, eschewing any extreme performance measures. See the comments for more info if desired.

The patch also changes triml to only call .length on s once.

Patch: clj935-3.patch

Screened by: Stuart Sierra



 Comments   
Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Rich: do we want to be consistent across the different trim fns (per this patch) or consistent with Java, which uses one definition of whitespace in Character/isWhitespace, and a different definition in String/trim?

Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Andy: why the (int ...) forms, when Clojure only works with primitive longs?

Comment by Andy Fingerhut [ 08/Jun/12 10:33 AM ]

Answer for Stuart: Looks like I was preserving the (int ...) form that was in triml before, perhaps somewhat mindlessly. Depending on Rich's answer, I can update the patch if desired.

Comment by Aaron Bedra [ 11/Apr/13 5:34 PM ]

Bump on the discussion. This ticket seems blocked until we make a decision.

Comment by Alex Miller [ 30/Aug/13 3:47 PM ]

In response to the questions regarding the "int" calls in the prior patch, I did some analysis and looked at alternative impls.

Prior patch: the calls to (int) were indeed causing primitive ints to be used in some cases (like letting the length). However, the loop index/rindex is always a prim long at best (and there is no way to type hint or alter that afaict). Because of this, casts get inserted to upcast the length to a long for comparison anyways. Using criterium and calling ltrim on a string of 1M spaces, I got a mean of 2.9 ms.

I tried a dirty trick of using a mutable int array for the index and that (while ugly) was 2.25 ms, so definitely faster. It did not seem fast enough to justify the dirtiness.

I also benchmarked a pure Java int version:

static public CharSequence triml(CharSequence s) {
  int len = s.length();
    for(int index=0; index < len; index++) {
      if(! Character.isWhitespace(s.charAt(index)))
        return s.subSequence(index, len).toString();
    }
  return "";
}

This code is actually smaller and faster - 0.92 ms.

I pondered this and in the end decided to leave it as the most straightforward Clojure impl. I think for the vast majority of trim calls, the difference will be negligible. For users needing higher trim performance, they should probably write an optimized version and tune their whitespace set. My second choice would be adding the Java implementations.

Marking screened.

Comment by Rich Hickey [ 25/Oct/13 7:42 AM ]

did you try

(set! *unchecked-math* true)
?

Comment by Alex Miller [ 02/Dec/13 10:55 PM ]

Testing perf of ltrim on a 1M character blank string (tested on criterium w/ good JVM settings):

clojure.string/ltrim = 2.785597 ms
clj935-2.patch ltrim = 2.880528 ms
same as clj935-2 with unchecked-inc = 2.225098 ms
same as clj935-2 with unchecked-inc-int = 2.198700 ms

I tested the last 3 with unchecked-math both true and false and saw no difference between them.

Based on this, I will update the patch to use unchecked-inc as that seems to give even better performance than the prior impl.

Comment by Rich Hickey [ 03/Jan/14 9:28 AM ]

Is there a gist or anything of the perf tests?





Generated at Tue Sep 02 04:53:48 CDT 2014 using JIRA 4.4#649-r158309.