<< Back to previous view

[CLJ-1889] Add optional predicate to string trim functions that determines if a character should be trimmed Created: 27/Jan/16  Updated: 15/May/17

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.7, Release 1.8
Fix Version/s: None

Type: Feature Priority: Minor
Reporter: Tamas Szabo Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: string

Attachments: Text File CLJ-1889-trim-enhancement.patch    
Approval: Triaged


The proposal is that the trim functions (trim, triml, and trimr) would get a second arity with a function trim?:

[trim? ^CharSequence s]

trim? comes first to support partial.

New doc string would be:

"Removes characters from both ends of string. 
 If trim? is omitted white space is removed. When supplied it accepts 
 a character and returns true if the character should be removed."

Example test:

(deftest t-trim 
  (is (= "foo" (s/trim "  foo  \r\n"))) 
  (is (= "bar" (s/trim "\u2000bar\t \u2002"))) 

  ;; Additional test 
  (is (= "bar" (s/trim "$%#\u2000bar\t \u2002%$#" 
                       #(or (Character/isWhitespace %) ((set "$#%") %))))))

Similar to Python's strip - https://docs.python.org/2/library/stdtypes.html#str.strip

Approach: The proposed solution isn't very DRY but it follows the design guidelines at the top of the file, more exactly point 3:

"3. Functions take advantage of String implementation details to
write high-performing loop/recurs instead of using higher-order
functions. (This is not idiomatic in general-purpose application

First I had a solution in which I replaced Character/isWhitespace from the current implementation by calling pred. pred was defaulted to an is-whitespace? function.
That code is of course nicer, even trim-newline could just call into trimr, removing a lot of duplication, but it adds the overhead of always calling a function, instead of calling Character/isWhitespace directly.

The only way I can see to have optimised and DRYer code is to use macros, but I don't think that it will necessary lead to nicer code.

Given the existing design style of the other functions in string.clj I felt that the best solution would be to just simply duplicate in favour of optimised code.

Comment by Tamas Szabo [ 27/Jan/16 2:44 PM ]

Proposed solution. Code + tests.

Comment by Tamas Szabo [ 27/Jan/16 3:42 PM ]

Added new patch that renames pred to trim?

Comment by Andy Fingerhut [ 27/Jan/16 6:27 PM ]

Note that Java, and thus Clojure/Java, uses UTF-16 encoding for strings in memory. Thus if you wanted to trim a set of Unicode code points from the beginning and/or end of a string, the API of trim? taking a single 16-bit Java character is not enough information to determine whether it should be trimmed or not.

If you want to handle that generality, it would require a more complex implementation, which checks whether the first/last character is one half of a code point that is encoded as 2 16-bit Java characters, and pass a 32-bit int to trim?, or something similar to that.

I have no objections if these API enhancements are made without enabling testing against an arbitrary Unicode code point. In the past, similar suggestions have been rejected in Clojure's built-in lib, e.g. CLJ-945

Comment by Tamas Szabo [ 28/Jan/16 1:56 AM ]

Yes, the UTF-16 encoding and Character representing either a codepoint or a half-codepoint is a bit of a mess, isn't it?

In the Java String and Character API's the methods that accept char, handle only characters in the Basic Multilingual Plane.
trim? accepts a character, so following the same behavior it will work only for removing characters in the Basic Multilingual Plane.

I think even this would be fine, but additionally because the high/low surrogates and the BMP characters are disjoint, you could actually use the same implementation to remove Unicode code points that aren't in the BMP. You can just say that both the high and low code unit of the codepoint are "unwanted".

𝄞 is "\uD834\uDD1E"

user=> (trimr (set " \uD834\uDD1E") "example string  𝄞  ")
"example string"
Comment by Andy Fingerhut [ 28/Jan/16 5:11 AM ]

Agreed, but probably better to anti-recommend such an implementation of trimr for removing such things, because it would also remove only one UTF-16 Java character out of 2 high/low surrogates if it matched a member of the set, even if the other surrogate didn't match anything in the set, which would leave behind a malformed UTF-16 string.

Again, probably best to either not include this in the implementation at all, and at most warn about it in the docs, or to handle it in the implementation by checking for high/low surrogates in the loop(s).

Comment by Tamas Szabo [ 28/Jan/16 6:00 AM ]

Yes, you're right. That solution won't work in all cases, so it can't be recommended.

I am slightly inclined towards having trim? accept chars and work only for removing BMP characters. This will arguably be enough for the majority of the use cases.
The other solution can be used for all use cases, but then trim? will have to accept int, or 2 chars, or a string, so trim? would be less intuitive (although closer to the real world ), and writing those trim? functions would be less user friendly.

That being said, I am happy to change the implementation to do that if it is required.

Currently, I'm not even sure if the enhancement will be accepted or rejected or what the process for that is.

[CLJ-1857] clojure.string/split docstring does not match the behavior of parameter "limit" Created: 27/Nov/15  Updated: 09/Feb/16

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.7
Fix Version/s: None

Type: Enhancement Priority: Trivial
Reporter: Miikka Koskinen Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: string

Attachments: File CLJ_1857_split_docs_update.diff    

([s re] [s re limit])
  Splits string on a regular expression.  Optional argument limit is
  the maximum number of splits. Not lazy. Returns vector of the splits.

What happens is that limit is the maximum number of parts returned, not the number of splits done. If limit is 1, no splits are done, while I'd expect at most one split to be done. It's a bit of a matter of terminology, but I think that the text could be clarified. Based on ClojureDocs examples, I'm not the only one who was confused.

user=> (str/split "1 2 3" #" ")
["1" "2" "3"]
user=> (str/split "1 2 3" #" " 1)
["1 2 3"]
user=> (str/split "1 2 3" #" " 2)
["1" "2 3"]

Comment by Alex Miller [ 29/Nov/15 2:52 PM ]

To me, the last sentence indicates that "split" (as a noun) is being used to refer to the parts resulting from splitting but there is some ambiguity in the prior sentence. Would "parts" be better?

I don't get your point on the clojuredocs examples - those make sense to me.

Comment by Stephen Hopper [ 09/Feb/16 10:00 PM ]

The docs are a bit ambiguous as "splits" is a verb in the first sentence, but a noun in the other two occurrences. I believe the ClojureDocs example being referred to is likely this one (mostly because of the comment in it):

; Note that the 'limit' arg is the maximum number of strings to
; return (not the number of splits)
user=> (str/split "q1w2e3r4t5y6u7i8o9p0" #"\d+" 5)
["q" "w" "e" "r" "t5y6u7i8o9p0"]

Because split is the name of the function and the action being performed, I think it makes sense to leave it as the verb in the first sentence and replace the other two occurrences with "parts". Does that sound reasonable?

[CLJ-1608] add split-at to clojure.string Created: 03/Dec/14  Updated: 03/Dec/14  Resolved: 03/Dec/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Dmitr Sotnikov Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string

Attachments: Text File string.clj.patch    
Patch: Code


Add clojure.string/split-at similar to clojure.core/split-at that accepts a string and a number indicating the position where the string should be split. The function returns a vector containing two strings, first containing the characters from 0-n-1, and second n-length.

Comment by Alex Miller [ 03/Dec/14 9:48 PM ]

I do not think this is an operation that is fundamental (it can be easily composed from existing functions like count and subs) or represents a portability opportunity by being a function available on jvm and js with host performance benefits. It is a non-goal for clojure.string to contain every potentially useful string function.

Comment by Dmitr Sotnikov [ 03/Dec/14 11:04 PM ]

Makes sense, thanks for the clarification.

[CLJ-1483] Clarify the usage of replace(-first) with a function Created: 27/Jul/14  Updated: 29/Jul/14

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Enhancement Priority: Trivial
Reporter: Bozhidar Batsov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: docstring, string

Attachments: Text File 0001-Clarify-the-usage-of-replace-first-with-pattern-func.patch    
Patch: Code
Approval: Triaged


The documentation of replace and replace-first didn't feature any example usage of the pattern + function combo so I've added one.

[CLJ-1449] Add clojure.string functions for portability to ClojureScript Created: 19/Jun/14  Updated: 03/Sep/15  Resolved: 03/Sep/15

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: Release 1.8

Type: Enhancement Priority: Major
Reporter: Bozhidar Batsov Assignee: Nola Stowe
Resolution: Completed Votes: 29
Labels: ft, string

Attachments: Text File add_functions_to_strings-2.patch     Text File add_functions_to_strings-3.patch     Text File add_functions_to_strings-4.patch     Text File add_functions_to_strings-5.patch     Text File add_functions_to_strings-6.patch     Text File add_functions_to_strings-7.patch     Text File add_functions_to_strings.patch     Text File clj-1449-more-v1.patch    
Patch: Code and Test
Approval: Ok


It would be useful if a few common functions from Java's String were available as Clojure functions for the purposes of increasing portability to other Clojure platforms like ClojureScript.

The functions below also cover the vast majority of cases where Clojure users currently drop into Java interop for String calls - this tends to be an issue for discoverability and learning. While the goal of this ticket is increased portability, improving that is a nice secondary benefit.

Proposed clojure.string fn java.lang.String method
index-of indexOf
last-index-of lastIndexOf
starts-with? startsWith
ends-with? endsWith
includes? contains


  • clj-1449-7.patch - uses nil to indicate not-found in index-of

Performance: Tested the following with criterium for execution mean time (all times in ns).

Java Clojure Java Clojure (-4 patch) Clojure (-6 patch)
(.indexOf "banana" "n") (clojure.string/index-of "banana" "n") 8.70 9.03 9.27
(.indexOf "banana" "n" 1) (clojure.string/index-of "banana" "n" 1) 7.29 7.61 7.66
(.indexOf "banana" (int \n)) (clojure.string/index-of "banana" \n) 5.34 6.20 6.20
(.indexOf "banana" (int \n) 1) (clojure.string/index-of "banana" \n 1) 5.38 6.19 6.24
(.startsWith "apple" "a") (clojure.string/starts-with? "apple" "a") 4.84 5.71 5.65
(.endsWith "apple" "e") (clojure.string/ends-with? "apple" "e") 4.82 5.68 5.70
(.contains "baked apple pie" "apple") (clojure.string/includes? "baked apple pie" "apple") 10.78 11.99 12.17

Screened by: Stu, who prefers nil over -1 - add_functions_to_strings-6.patch

Note: In both Java and JavaScript, indexOf will return -1 to indicate not-found, so this is (at least in these two cases) the status quo. The benefit of nil return in Clojure is that it can be used as a found/not-found condition. The benefit of a -1 return is that it matches Java/JavaScript and that the return type can be hinted as a primitive long, allowing you to feed it into some other arithmetic or string operation without boxing.

Comment by Alex Miller [ 19/Jun/14 12:53 PM ]

Re substring, there is a clojure.core/subs for this (predates the string ns I believe).

([s start] [s start end])
Returns the substring of s beginning at start inclusive, and ending
at end (defaults to length of string), exclusive.

Comment by Jozef Wagner [ 20/Jun/14 3:21 AM ]

As strings are collection of characters, you can use Clojure's sequence facilities to achieve such functionality:

user=> (= (first "asdf") \a)
user=> (= (last "asdf") \a)
Comment by Alex Miller [ 20/Jun/14 8:33 AM ]

Jozef, String.startsWith() checks for a prefix string, not just a prefix char.

Comment by Bozhidar Batsov [ 20/Jun/14 9:42 AM ]

Re substring, I know about subs, but it seems very odd that it's not in the string ns. After all most people will likely look for string-related functionality in clojure.string. I think it'd be best if `subs` was added to clojure.string and clojure.core/subs was deprecated.

Comment by Pierre Masci [ 01/Aug/14 5:27 AM ]

Hi, I was thinking the same about starts-with and .ends-with, as well as (.indexOf s "c") and (.lastIndexOf "c").

I read the whole Java String API recently, and these 4 functions seem to be the only ones that don't have an equivalent in Clojure.
It would be nice to have them.

Andy Fingerhut who maintains the Clojure Cheatsheet told me: "I maintain the cheatsheet, and I put .indexOf and .lastIndexOf on there since they are probably the most common thing I saw asked about that is in the Java API but not the Clojure API, for strings."
Which shows that there is a demand.

Because Clojure is being hosted on several platforms, and might be hosted on more in the future, I think these functions should be part of the de-facto ecosystem.

Comment by Andy Fingerhut [ 30/Aug/14 3:39 PM ]

Updating summary line and description to add contains? as well. I can back this off if it changes your mind about triaging it, Alex.

Comment by Andy Fingerhut [ 30/Aug/14 3:40 PM ]

Patch clj-1449-basic-v1.patch dated Aug 30 2014 adds starts-with? ends-with? contains? functions to clojure.string.

Patch clj-1449-more-v1.patch is the same, except it also replaces several Java method calls with calls to these Clojure functions.

Comment by Andy Fingerhut [ 05/Sep/14 1:02 PM ]

Patch clj-1449-basic-v1.patch dated Sep 5 2014 is identical to the patch I added recently called clj-1149-basic-v1.patch. It is simply renamed without the typo'd ticket number in the file name.

Comment by Yehonathan Sharvit [ 02/Dec/14 3:09 PM ]

What about an implementation that works also in cljs?

Comment by Bozhidar Batsov [ 02/Dec/14 3:11 PM ]

Once this is added to Clojure it will be implemented in ClojureScript as well.

Comment by Yehonathan Sharvit [ 02/Dec/14 3:22 PM ]

Great! Any idea when it will be added to Clojure?
Also, will it be automatically added to Clojurescript or someone will have to write a particular code for it.
The suggested patch relies on Java so I am curious to understand who is going to port the patch to cljs.

Comment by Bozhidar Batsov [ 02/Dec/14 3:27 PM ]

No idea when/if this will get merged. Upvote the ticket to improve the odds of this happening sooner.
Someone on the ClojureScript team will have to implement this in terms of JavaScript.

Comment by Alex Miller [ 02/Dec/14 4:01 PM ]

Some things that would be helpful:

1) It would be better to combine the two patches into a single patch - I think changing current uses into new users is a good thing to include. Also, please keep track of the "current" patch in the description.
2) Patch needs tests.
3) Per the instructions at the top of the clojure.string ns (and the rest of the functions), the majority of these functions are implemented to take the broader CharSequence interface. Similar to those implementations, you will need to provide a CharSequence implementation while also calling into the String functions when you actually have a String.
4) Consider return type hints - I'm not sure they're necessary here, but I would examine bytecode for typical calling situations to see if it would be helpful.
5) Check performance implications of the new versions vs the old with a good tool (like criterium). You've put an additional var invocation and (soon) type check in the calling path for these functions. I think providing a portable target is worth a small cost, but it would be good to know what the cost actually is.

I don't expect we will look at this until after 1.7 is released.

Comment by Andy Fingerhut [ 02/Dec/14 8:05 PM ]

Alex, all your comments make sense.

If you think a ready-and-waiting patch that does those things would improve the odds of the ticket being vetted by Rich, please let us know.

My guess is that his decision will be based upon the description, not any proposed patches. If that is your belief also, I'll wait until he makes that decision before working on a patch. Of course, any other contributor is welcome to work on one if they like.

Comment by Alex Miller [ 02/Dec/14 8:40 PM ]

Well nothing is certain of course, but I keep a special report of things I've "screened" prior to vetting that makes possible moving something straight from Triaged all the way through into Screened/Ok when Rich is able to look at them. This is a good candidate if things were in pristine condition.

That said, I don't know whether Rich will approve it or not, so it's up to you. I think the argument for portability is a strong one and complements the feature expression.

Comment by Alex Miller [ 14/May/15 8:55 AM ]

I'd love to have a really high-quality patch on this ticket to consider for 1.8 that took into account my comments above.

Also, it occurred to me that I don't think this should be called "contains?", overlapping the core function with a different meaning (contains value vs contains key). Maybe "includes?"?

Comment by Andy Fingerhut [ 14/May/15 9:14 AM ]

clojure.string already has 2 name conflicts with clojure.core, for which most people probably do something like (require '[clojure.string :as str]) to avoid that:

user=> (use 'clojure.string)
WARNING: reverse already refers to: #'clojure.core/reverse in namespace: user, being replaced by: #'clojure.string/reverse
WARNING: replace already refers to: #'clojure.core/replace in namespace: user, being replaced by: #'clojure.string/replace
Comment by Alex Miller [ 14/May/15 10:05 AM ]

I'm not concerned about overlapping the name. I'm concerned about overlapping it and meaning something different, particularly vs one of the most confusing functions in core already. clojure.core/contains? is better than linear time key search. clojure.string/whatever will be a linear time subsequence match.

Comment by Alex Miller [ 14/May/15 10:18 AM ]

Ruby uses "include?" for this.

Comment by Devin Walters [ 19/May/15 4:56 PM ]

I agree with Alex's comment about the overlap. Personally, I prefer the way "includes?" reads over "include?", but IMO either one is better than adding to the "contains?" confusion.

Comment by Bozhidar Batsov [ 20/May/15 12:10 AM ]

I'm fine with `includes?`. Ruby is famous for the bad English used in its core library.

Comment by Andrew Rosa [ 17/Jun/15 12:29 PM ]

Andy Fingerhut, since you are the author of the original patch do you desire to continue the work on it? If you prefer (or even need) to focus on different stuff, I would like to tackle this issue.

Comment by Andy Fingerhut [ 17/Jun/15 1:08 PM ]

Andrew Rosa, I am perfectly happy if someone else wants to work on a revised patch for this. If you are interested, go for it! And from Alex's comments, I believe my initial patch covers such a small fraction of what he wants, do not worry about keeping my name associated with any patch you submit – if yours is accepted, my patch will not have helped you in getting there.

Comment by Nola Stowe [ 03/Jul/15 9:01 AM ]

Andrew Rosa, I am interested on working on this. Are you currently working it?

Comment by Andrew Rosa [ 03/Jul/15 12:08 PM ]

Thanks Andy Fingerhut, didn't saw the answer here. I'm happy that Nola Stowe could take this one.

Comment by Nola Stowe [ 08/Jul/15 12:46 PM ]

Updated patch to rename methods as specified and add tests.

Comment by Bozhidar Batsov [ 08/Jul/15 3:26 PM ]

A few remarks:

1. added should say 1.8, not 1.9.
2. The docstrings could be improved a bit - normally they should end with a . and refer to all the params of the function in question.

Comment by Nola Stowe [ 08/Jul/15 3:29 PM ]

Thanks Bozhidar Batsov .. I will, and I also got feedback from the slack clojure-dev group (Alex, Ghadi, AndrewHr) so I will be submitting an improved patch soon

Comment by Andy Fingerhut [ 09/Jul/15 11:57 AM ]

Also, I am not sure, but Alex Miller made this comment earlier: "Per the instructions at the top of the clojure.string ns, functions should take the broader CharSequence interface instead of String. Similar to existing clojure.string functions, you will need to provide a CharSequence implementation while also calling into the String functions when you actually have a String."

Looking at the Java classes and interfaces, one of the classes that implements CharSequence is CharBuffer, and it has no indexOf or lastIndexOf methods. If one were to call the proposed index-of or last-index-of with a CharBuffer, I believe you would get an exception for an unknown method.

I don't know if Alex's sentence is essential to any patch to be considered, but wanted to point out that it seems like it effectively means writing your own implementation of each of these methods for some of the classes implementing the CharSequence interface.

Comment by Nola Stowe [ 09/Jul/15 4:15 PM ]

Andy, yes you are right. Alex helped me with my current changes (not posted yet) and where needed I cast the value to a string so that it would have the indexOf methods.

Comment by Nola Stowe [ 10/Jul/15 7:43 AM ]

Added type hints, using StringBuffer instead of string in tests.

Comment by Nola Stowe [ 10/Jul/15 11:53 PM ]

Use critium and did some bench marking, I am not familiar enough to say "hey its good or no, too slow". I tried an optimization but it didn't make much difference.

notes here:


Appreciate any feedback

Comment by Alex Miller [ 13/Jul/15 10:37 AM ]

Nola, some comments:

1) In index-of and last-index-of, I think we should use (instance? Character value) instead of char? - this will avoid a function invocation.

2) In index-of and last-index-of, in the branches where you know you have a Character, we should just invoke the proper function on the character instance directly rather than calling int. In this case that is (.charValue ^Character value). However, this function returns char so you still want the ^int hint.

3) In index-of and last-index-of, we should do a couple of things to maximally utilize primitive support for from-index. We should type-hint the incoming value as a ^long (only long and double are specialized cases in Clojure, not int). We should then use the function unchecked-int to optimally convert the long primitive to an int primitive.

4) We should type-hint the return value to a primitive long to avoid boxing the return.

Putting 1+2+3+4 together:

(defn index-of
  "Return index of value (string or char) in s, optionally searching forward from from-index."
  {:added "1.8"}
  (^long [^CharSequence s value]
   (if (instance? Character value)
     (.indexOf (.toString s) ^int (.charValue ^Character value))
     (.indexOf (.toString s) ^String value)))
  (^long [^CharSequence s value ^long from-index]
  (if (instance? Character value)
    (.indexOf (.toString s) ^int (.charValue ^Character value) (unchecked-int from-index))
    (.indexOf (.toString s) ^String value (unchecked-int from-index)))))

Similar changes in last-index-of.

5) In starts-with? and ends-with?, I would move the type hint for substr to the parameter declaration. This is more of a style preference on my part, no functional difference I believe.

6) On includes? I would do the same as #5, except .contains actually takes a CharSequence for the substr, so I would change the type hint from ^String to the broader ^CharSequence.

7) Can you edit the description and add a table with the "execution time mean" number for direct Java vs new Clojure fn for the patch after these changes? Please add tests for the from-index variants of index-of and last-index-of as well.

Comment by Nola Stowe [ 14/Jul/15 5:46 PM ]

optimizations as suggested by Alex Miller

Comment by Andy Fingerhut [ 14/Jul/15 5:50 PM ]

Nola, not a big deal, but Rich has requested that patch file names end with ".patch" or ".diff", I think because whatever he uses to view them recognizes that suffix and puts it in a different viewing/editing mode.

Comment by Nola Stowe [ 14/Jul/15 6:03 PM ]

Andy, so no need to keep around patches when used in comparisons?

Comment by Andy Fingerhut [ 14/Jul/15 8:31 PM ]

It is ok to have multiple versions of patches attached, but names like add_functions_to_strings-2.patch etc. would be preferred over names like add_functions_to_strings.patch-2, so the suffix is ".patch".

Comment by Nola Stowe [ 14/Jul/15 8:55 PM ]

original patch submitted by Nola

Comment by Nola Stowe [ 14/Jul/15 8:56 PM ]

Optimizations suggested by Alex

Comment by Alex Miller [ 14/Jul/15 9:44 PM ]

Nola, I re-did the timings on my machine - these were done with Java 1.8 and no Leiningen involved.

Comment by Alex Miller [ 14/Jul/15 9:57 PM ]

-4 patch is identical to -3 but removes one errant ^long type hint

Otherwise, looks good to me!

Comment by Alex Miller [ 17/Jul/15 10:24 PM ]

Rich, since you didn't put any comments on here, I'm not sure if there's anything else you wanted, so marking screened.

Comment by Stuart Halloway [ 19/Jul/15 7:36 AM ]

I dislike the approach here for several reasons:

  • floats Java-isms (-1 for missing instead of nil?) up into the Clojure API
  • makes narrow string fns out of things that could/should be seq fns

Stepping back, this is all library work. Why should it be in core? If this started in a contrib, it could evolve much more freely.

Comment by Alex Miller [ 19/Jul/15 8:49 AM ]

re Java-isms: totally fair comment worth changing

re narrow string fns: if these were seq fns, they would presumably take seqables of chars and return seqables of chars. When I have strings, I want to work with strings (as with clojure.string/join and clojure.string/reverse vs their seq equivalents).

re library: these particular functions are the most commonly used string functions where current users drop to Java interop. Note that all 5 of these functions have over time been added to the cheatsheet (http://clojure.org/cheatsheet) because they are commonly used. I believe there is significant value in including them in core and standardizing their name and behavior across to CLJS.

Comment by Bozhidar Batsov [ 19/Jul/15 10:29 AM ]

I'm obviously biased, but I totally agree with everything Alex said. Having one core string library and a separate contrib string library is just impractical. With clojure.string already part of the language it's much more sensible to extend it where/when needed instead of encouraging the proliferation of tons of alternative libraries (there are already a couple of those out there, mostly because essential functions are missing). Initial designs are rarely perfect, there's nothing wrong in revisiting and improving them.

I believe this is one of the top voted tickets, which clearly indicates that the Clojure community is quite supportive of this additions.

Comment by Michael Blume [ 19/Jul/15 12:26 PM ]

None of these functions return Strings so they could very easily be seq functions which use the String methods as a fast path when called with Strings, the trouble is then they wouldn't necessarily belong in clojure.string.

Comment by Bozhidar Batsov [ 19/Jul/15 1:49 PM ]

Such an argument can be made for existing functions like clojure.core/subs and clojure.string/reverse - we could have just went with generic functions that operate on seqs and converted the results to strings, but we didn't. It might be just me, but I'm really thinking this is being overanalysed. Sometimes it's really preferable to deal with some things as strings (not to mention more efficient).

Comment by Rich Hickey [ 20/Jul/15 9:39 AM ]

I completely agree with Alex and Bozhidar re: strings vs seqs. Other langs that have only lists and use them for strings are really bad at strings. So - string-specific fns are ok. What about the Java-isms?

Comment by Bozhidar Batsov [ 20/Jul/15 10:08 AM ]

We should address them for sure. Let's wait for a couple of days for Nola to update the patch. If she's too busy I'll do the update myself.

Comment by Nola Stowe [ 20/Jul/15 10:50 AM ]

I will be happy to work on it Saturday. Is nil an acceptable response when index-of "apple" "p" is not found? The other functions return booleans and I think those are ok.

Comment by Bozhidar Batsov [ 20/Jul/15 10:55 AM ]

Yep, we should go with nil.

Comment by Nola Stowe [ 25/Jul/15 4:33 PM ]

Changed index-of and last-index-of to return nil instead of -1 (java-ism), thus had to remove the type hint for the return value.

Seems to have slowed down the function compared to having the function return -1. My benchmarks: https://www.evernote.com/l/ABL2wead73BJZYAkpH5yxsfX9JM8cpUiNhw

Comment by Nola Stowe [ 25/Jul/15 9:05 PM ]

Updated patch to include return value in doc string for index-of last-index-of, otherwise same as last update.

Comment by Alex Miller [ 26/Jul/15 8:40 AM ]

Removing the hint causes the return type to be Object, which forces boxing in the "found" cases.

Comment by Nola Stowe [ 26/Jul/15 8:56 AM ]

Is it worth it to remove the java-ism of returning -1 when not found? Personally, I don't really like a function returning one of two things... found index or nil ... I prefer it being an int at all times.

Comment by Alex Miller [ 26/Jul/15 9:00 AM ]

There is one whitespace issue in the -5 patch:

[~/code/clojure (master)]$ git apply ~/Downloads/add_functions_to_strings-5.patch
/Users/alex/Downloads/add_functions_to_strings-5.patch:34: trailing whitespace.
  (let [result
warning: 1 line adds whitespace errors.

I updated the timings in the description to include both -4 and -5 - the boxing there does have a significant impact.

Comment by Nola Stowe [ 26/Jul/15 9:28 AM ]

Fixed whitespace issue.

Comment by Alex Miller [ 26/Jul/15 9:32 AM ]

Added -6 patch and timings that address most of the performance impact.

Comment by Stuart Halloway [ 27/Jul/15 9:35 AM ]

Alex: Note that all of these functions return numbers and booleans, not strings nor seqs of anything. This is partially why they were left out of the original string implementation.

It isn't clear to me why index-of should be purely a string fn, I regularly need index-of with other things, and both my book and JoC use index-of as an example of a nice-to-have.

Comment by Nicola Mometto [ 27/Jul/15 9:44 AM ]

Stuart, I don't think the fact that those functions don't return string is really relevant, the important fact is that they operate on strings (just like c.set/superset? or subset? don't return sets) and are frequently used on them.

Comment by Bozhidar Batsov [ 27/Jul/15 9:51 AM ]

I'm with Nicola on this one - we have to be practical. There's a reason why this is the top-voted JIRA ticket - Clojure programmers want those functions. That's says plenty.

Anyways, having a generic `index-of` is definitely not a bad idea.

Comment by Stuart Halloway [ 31/Jul/15 11:27 AM ]

How does this compare to making them all seq fns and adding them to clojure.core?

Comment by Andy Fingerhut [ 31/Jul/15 11:39 AM ]

Given Rich's comment earlier "I completely agree with Alex and Bozhidar re: strings vs seqs. Other langs that have only lists and use them for strings are really bad at strings. So - string-specific fns are ok.", do we need to write a separate patch that makes these all into seq fns and do performance comparisons between those vs. the string-specific ones?

Comment by Stuart Halloway [ 31/Jul/15 1:26 PM ]


Definitely not! I want to have a design conversation.

We have protocols, and could use them to provide core fns that are specialized to strings, so it isn't clear to me why we would have to be really bad at it.

Comment by Stuart Halloway [ 07/Aug/15 2:19 PM ]

Discussed with Rich. He does not want sequence fns with these performance characteristics, so string-specific fns it shall be.

Comment by Rich Hickey [ 08/Aug/15 10:11 AM ]

So which patch is screened?

Comment by Alex Miller [ 08/Aug/15 11:40 AM ]

Stu screened add_functions_to_strings-6.patch where index-of returns nil on not-found. The add_functions_to_strings-4.patch is almost identical but returns -1 from index-of (which matches Java and JavaScript index-of behavior and may be slightly faster for callers that can be monomorphic on primitive longs?).

Comment by Rich Hickey [ 24/Aug/15 11:27 AM ]

nil-returning index fns should document that return when not found

Comment by Nola Stowe [ 24/Aug/15 11:56 AM ]

adds better doc string to indicated nil is return when not found for index-of and last-index-of.

Comment by Alex Miller [ 24/Aug/15 11:59 AM ]

Added clj-1449-7.patch that is identical to -6 but adds statement of return expectation on index-of and last-index-of.

Comment by Alex Miller [ 24/Aug/15 12:19 PM ]

jinx - same patch! Replacing with Nola's.

Comment by Colin Taylor [ 30/Aug/15 6:48 PM ]

Bit late on this - but .equalsIgnoreCase (=ic? equals-ic ? ) is currently a visible wart for our portability.

Comment by Alex Miller [ 30/Aug/15 10:03 PM ]

Hi Colin, that's a good idea but you have missed the window on this ticket - feel free to file a different one.

Comment by Tom Marble [ 03/Sep/15 3:46 PM ]

I have just proposed an analogous patch to ClojureScript under

[CLJ-1393] clojure.string/trim doesn't trim null character Created: 28/Mar/14  Updated: 28/Mar/14  Resolved: 28/Mar/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Ryan Fowler Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


CLJ-935 changed clojure.string/trim to not trim all the characters less than or equal to \u0020 as Java does.

I noticed this because base64 uses null characters to pad the end of encoding blocks.

Clojure 1.6.0's trim leaves the null character in:
user=> (.length (clojure.string/trim "\u0000"))

java.lang.String's trim takes it out:
user=> (.length (.trim "\u0000"))

Here are the first 21 unicode characters and what Character/isWhitespace says about them.

(dotimes [n 0x20] (printf "
u%04x - %b\n" n (Character/isWhitespace n)))
\u0000 - false
\u0001 - false
\u0002 - false
\u0003 - false
\u0004 - false
\u0005 - false
\u0006 - false
\u0007 - false
\u0008 - false
\u0009 - true
\u000a - true
\u000b - true
\u000c - true
\u000d - true
\u000e - false
\u000f - false
\u0010 - false
\u0011 - false
\u0012 - false
\u0013 - false
\u0014 - false
\u0015 - false
\u0016 - false
\u0017 - false
\u0018 - false
\u0019 - false
\u001a - false
\u001b - false
\u001c - true
\u001d - true
\u001e - true
\u001f - true

Comment by Alex Miller [ 28/Mar/14 12:27 PM ]

The choice was made in CLJ-935 to consistently define whitespace as Character.isWhitespace() across trim, triml, and trimr. There are many possible ways to define "space" (at least two as we see here). If your trimming needs differ from the standard library, then you'll probably need to define your own functions to trim your data. You can still use Java interop to call String.trim() directly if that happens to match your needs.

Comment by Ryan Fowler [ 28/Mar/14 1:03 PM ]

Indeed, it's an easy workaround to use Java interop once you figure out what your problem is.

It's just unintuitive that the character generally used for string termination isn't trimmed by clojure.string/trim.

[CLJ-1360] clojure.string/split strips trailing delimiters Created: 18/Feb/14  Updated: 24/Mar/16

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Tim McCormack Assignee: Unassigned
Resolution: Unresolved Votes: 2
Labels: docstring, string

Attachments: Text File clj-1360.patch    
Patch: Code
Approval: Prescreened


clojure.string/split and clojure.string/split-lines inherit the bizarre default behavior of java.lang.String#split(String,int) in stripping trailing consecutive delimiters:

(clojure.string/split "banana" #"an")
⇒ ["b" "" "a"]
(clojure.string/split "banana" #"na")
⇒ ["ba"]
(clojure.string/split "nanabanana" #"na")
⇒ ["" "" "ba"]

In the case of split-lines, processing a file line by line and rejoining results in truncation of trailing newlines in the file. In both cases, the behavior is surprising and cannot be inferred from the docstrings. A workaround for split is to pass a limit of -1.

Proposed: As current users may be relying on the current behavior, the attached merely updates the docstring to warn of this behavior and suggest use of -1 as a limit to workaround.

Patch: clj-1360.patch

Comment by Andy Fingerhut [ 18/Feb/14 10:51 AM ]

Probably documenting would be safer than changing the behavior at this point, given that some people may actually rely on the current behavior after testing, deploying, etc.

I don't currently have a suggestion for a modified doc string, but note that there are examples of this behavior and how one can use an extra "-1" limit argument at the end to get all split strings: http://clojuredocs.org/clojure_core/clojure.string/split

Comment by Crispin Wellington [ 21/May/15 10:46 PM ]

This bug just bit me. +1 to be fixed. If we just document and leave the behavior as is, then we have a surprising and inconsistent behaving split (why are inner empty values kept, but outer ones stripped?) that is different to every other split you've ever used. The optional -1 limit argument looks hacky but a fix could keep this -1 argument working.

EDIT: this looks to be java's string class behavior: http://stackoverflow.com/questions/2170557/split-method-of-string-class-does-not-include-trailing-empty-strings
Would be nice if limit defaulted to -1 on that type of clojure.string/split call.

Comment by Stuart Halloway [ 19/Jul/15 8:03 AM ]

This is really gross, and the original developer has been punched in the neck. (Ow.)

I hate the Java leakage, but given that this is already out there, and that people are likely already relying on both the default and the negative-arg behavior, I think the least bad bet is to document precisely the semantics we have.

[CLJ-1312] clojure.string/split on empty string includes empty string in results Created: 21/Dec/13  Updated: 17/Jun/16  Resolved: 21/Dec/13

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Russell Dunphy Assignee: Unassigned
Resolution: Declined Votes: 0
Labels: string


Splitting a string using clojure.string/split with an empty regex includes the empty string in the results - is this expected behaviour?


Unable to find source-code formatter for language: clojure. Available languages are: javascript, sql, xhtml, actionscript, none, html, xml, java
user=> (clojure.string/split "abc" #"")
["" "a" "b" "c"]

Comment by Alex Miller [ 21/Dec/13 8:05 AM ]

Yes, I think so. This is a case where Clojure defers to the host (Java) for behavior. I think the way to interpret this is that the empty pattern matches all strings. Split checks left to right whether there is a next chunk of string that matches the pattern. The empty pattern matches at the beginning to a string of length 0. Something like that.

Comment by Mark Engelberg [ 07/Sep/14 12:27 PM ]

This bug is a real problem, because it works differently on Windows than on Linux. On Windows, clojure.string/split behaves exactly as you'd expect:

user=> (clojure.string/split "abc" #"")
["a" "b" "c"]

Only on Linux do you get the strange behavior where the empty string shows up at the beginning of the list.

I recently had a student that got burned by this in some webserver code that relied on splitting using the empty regex. It performed flawlessly on her local Windows machine, but mysteriously broke when she uploaded the uberwar to the cloud. The bug was very difficult to track down.

If this were a bug on both Windows and Linux, at least you could plan around it. But right now, it's an obstacle to Clojure's capability of running consistently across platforms.

Comment by Mark Engelberg [ 07/Sep/14 12:40 PM ]

Upon further research, I've found that this is not a Windows/Linux issue, rather it's a difference between Java 7 and Java 8. On Java 8, splitting with the empty string no longer produces a sequence that begins with an empty string.

As you said before, this is just a gotcha relating to Java, not a Clojure issue.

Comment by Ahmed Fasih [ 17/Jun/16 1:08 PM ]

This happens in ClojureScript 1.8.51:

(clojure.string/split "qwe" #"") ; => ["" "q" "w" "e"]

[CLJ-935] clojure.string/trim uses different defn of whitespace as triml, trimr Created: 21/Feb/12  Updated: 31/Jan/14  Resolved: 31/Jan/14

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6
Fix Version/s: Release 1.6

Type: Defect Priority: Minor
Reporter: Andy Fingerhut Assignee: Unassigned
Resolution: Completed Votes: 0
Labels: string

Attachments: Text File clj935-2.patch     Text File clj935-3.patch     Text File fix-trim-fns-different-whitespace-patch.txt    
Patch: Code and Test
Approval: Ok


clojure.string/triml and trimr use Character/isWhitespace to determine whether a character is whitespace, but trim uses some other definition of white space character. For example:

user=> (use 'clojure.string)
user=> (def s "  \u2002  foo")
user=> (trim s)
"?  foo"
user=> (triml s)

Cause: triml and trimr use Character/isWhitespace. trim uses String/trim which seems to define whitespace as any character less than or equal '\u0020'. The isWhitespace() definition is slightly different and includes other Unicode space characters.

Approach: The attached patch changes trim to use Character/isWhitespace. The isWhitespace version seems generally newer and more Unicode considerate so this was chosen over changing triml and trimr to match trim.

A few alternative implementations were considered with respect to longs, ints, etc. The patch opts to use the simplest possible code, eschewing any extreme performance measures. See the comments for more info if desired.

The patch also changes triml to only call .length on s once.

Patch: clj935-3.patch

Screened by: Stuart Sierra

Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Rich: do we want to be consistent across the different trim fns (per this patch) or consistent with Java, which uses one definition of whitespace in Character/isWhitespace, and a different definition in String/trim?

Comment by Stuart Halloway [ 08/Jun/12 10:22 AM ]

Question for Andy: why the (int ...) forms, when Clojure only works with primitive longs?

Comment by Andy Fingerhut [ 08/Jun/12 10:33 AM ]

Answer for Stuart: Looks like I was preserving the (int ...) form that was in triml before, perhaps somewhat mindlessly. Depending on Rich's answer, I can update the patch if desired.

Comment by Aaron Bedra [ 11/Apr/13 5:34 PM ]

Bump on the discussion. This ticket seems blocked until we make a decision.

Comment by Alex Miller [ 30/Aug/13 3:47 PM ]

In response to the questions regarding the "int" calls in the prior patch, I did some analysis and looked at alternative impls.

Prior patch: the calls to (int) were indeed causing primitive ints to be used in some cases (like letting the length). However, the loop index/rindex is always a prim long at best (and there is no way to type hint or alter that afaict). Because of this, casts get inserted to upcast the length to a long for comparison anyways. Using criterium and calling ltrim on a string of 1M spaces, I got a mean of 2.9 ms.

I tried a dirty trick of using a mutable int array for the index and that (while ugly) was 2.25 ms, so definitely faster. It did not seem fast enough to justify the dirtiness.

I also benchmarked a pure Java int version:

static public CharSequence triml(CharSequence s) {
  int len = s.length();
    for(int index=0; index < len; index++) {
      if(! Character.isWhitespace(s.charAt(index)))
        return s.subSequence(index, len).toString();
  return "";

This code is actually smaller and faster - 0.92 ms.

I pondered this and in the end decided to leave it as the most straightforward Clojure impl. I think for the vast majority of trim calls, the difference will be negligible. For users needing higher trim performance, they should probably write an optimized version and tune their whitespace set. My second choice would be adding the Java implementations.

Marking screened.

Comment by Rich Hickey [ 25/Oct/13 7:42 AM ]

did you try

(set! *unchecked-math* true)

Comment by Alex Miller [ 02/Dec/13 10:55 PM ]

Testing perf of ltrim on a 1M character blank string (tested on criterium w/ good JVM settings):

clojure.string/ltrim = 2.785597 ms
clj935-2.patch ltrim = 2.880528 ms
same as clj935-2 with unchecked-inc = 2.225098 ms
same as clj935-2 with unchecked-inc-int = 2.198700 ms

I tested the last 3 with unchecked-math both true and false and saw no difference between them.

Based on this, I will update the patch to use unchecked-inc as that seems to give even better performance than the prior impl.

Comment by Rich Hickey [ 03/Jan/14 9:28 AM ]

Is there a gist or anything of the perf tests?

Generated at Tue Sep 19 08:30:33 CDT 2017 using JIRA 4.4#649-r158309.