ClojureScript

clojure.string/split behavior when string and regex are the same

Details

  • Type: Defect
  • Status: Closed
  • Priority: Minor
  • Resolution: Completed
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels: None

Description

clojure.string/split does not behave the same as in Clojure when the regex matches the entire input string.

in Clojure

user=> (clojure.string/split "aa" #"aa")
[]

in ClojureScript

user=> (clojure.string/split "aa" #"aa")
["" ""]

I am not expecting equivalent behavior when dealing with regexes in general, but this case seems quite simple and common.

Not sure whether this should be addressed via documentation or an ad hoc fix?

Activity

David Nolen added a comment -

A little more information would be helpful - is this because of JavaScript or something else? Many thanks.

Julien Eluard added a comment -

Sure! Sorry about that.

Yes, it's due to the JavaScript split implementation. Executing "aa".split("aa") returns ["", ""].

I couldn't find any resource documenting this specific behavior. It behaves this way on Chrome, Chrome Canary, Firefox, Safari and PhantomJS, so this must be some kind of standard!
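
For reference, the host behavior described above can be reproduced in any JavaScript console. This is only a minimal sketch; the variable names are illustrative and not part of the ticket:

```javascript
// A separator that matches the whole input leaves one empty string on each
// side of the single match, whether it is given as a string or as a regex.
const byString = "aa".split("aa");
const byRegex = "aa".split(/aa/);

console.log(byString); // [ '', '' ]
console.log(byRegex);  // [ '', '' ]
```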

Julien Eluard added a comment -

Some more related observations:

for (clojure.string/split "ab" #"a")
=> both Clojure and ClojureScript return ["" "b"]

for (clojure.string/split "ab" #"b")
=> ClojureScript returns ["a" ""] while Clojure returns ["a"]

I guess it would be too expensive to test for those variations and correct them.

Is there a list detailing some known differences between Clojure and ClojureScript when dealing with regular expressions? Sounds like the right way to go. I would be interested in contributing to it.

David Nolen added a comment -

Julien, the place to do this is probably the ClojureScript wiki on GitHub, where other differences are listed.

Julien Eluard added a comment -

I created a simplistic project to track the differences I notice: https://github.com/jeluard/cloclo.

David Nolen added a comment - edited

Reconciling this regex difference between hosts is not in scope, but we need more information on this ticket to move forward.

Julien Eluard added a comment -

I did some more research.

Note that the differences here are not directly related to regexp behaviour (the pattern is composed only of literal characters) but rather to how empty strings are handled, both in the input and in the result.

Given that the contract for clojure.string/split is not very well defined, I guess it is safe to assume Java behavior is the contract (the Clojure version relies directly on String#split). Unfortunately, the String#split contract is not much more helpful, especially for the edge cases we are considering here.

The ECMAScript 5 definition of split (section 15.5.4.14, p. 148), which the ClojureScript version of clojure.string/split uses directly, is pretty well defined but hard to decipher. To complicate things further, IE is known to strip empty strings from the result (the GWT implementation has some more details).

That said, I found an ECMAScript test that validates the ClojureScript result for (clojure.string/split "ab" #"b").

In general, it looks like Clojure split results differ depending on whether a limit is provided or not (for the cases detailed here). This is not the case in ClojureScript.

One thing to consider is that ClojureScript split has 2 different implementations depending on the limit usage. Maybe an option would be to only rely on the custom implementation and make sure it matches Java split behavior (not considering regexp differences)? I doubt performance would be a problem here.

PS: I discovered a bug when mixing empty strings and limit: (clojure.string/split "abc" #"" 5) => ["" "" "" "" "abc"]
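
To make the difference concrete, here is a hedged sketch of how the Java behavior for the no-limit case could be emulated on top of the JavaScript split. The helper name splitLikeJava is hypothetical; this is not the ClojureScript implementation, and it assumes a pattern without capture groups and ignores the empty-regex case:

```javascript
// Hypothetical helper: emulate Java's String#split(regex) (no limit) by
// dropping the trailing empty strings that JavaScript's split keeps.
function splitLikeJava(s, re) {
  const parts = s.split(re);
  // No match at all: Java returns the input as the only element.
  if (parts.length === 1) {
    return parts;
  }
  let end = parts.length;
  while (end > 0 && parts[end - 1] === "") {
    end -= 1;
  }
  return parts.slice(0, end);
}

console.log(splitLikeJava("aa", /aa/)); // []
console.log(splitLikeJava("ab", /a/));  // [ '', 'b' ]
console.log(splitLikeJava("ab", /b/));  // [ 'a' ]
```

With this trimming step, the three cases discussed in this ticket give the same results as the Clojure REPL output quoted above.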

David Nolen added a comment -

Thanks for doing the research. I'm OK with taking a patch that unifies clojure.string/split behavior.

Julien Eluard added a comment -

The attached fixsplit.diff unifies both behaviors and fixes:

  • split with limit=0 (trailing empty strings are discarded)
  • split with empty regex

I introduced an extra function for the empty regex case, as it can't work with the current split implementation: an empty match consumes nothing, so the value of s in the loop/recur stays the same and the loop keeps adding empty strings until limit reaches 1.
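
The stall described above can be illustrated with a JavaScript analogue of a limit-driven split loop. This is a sketch under assumptions, not the ClojureScript source or the attached patch: splitWithLimit is a hypothetical name, the regex must be non-global, and limit is assumed to be at least 1.

```javascript
// Hypothetical analogue of a limit-driven split loop.
// Assumes a non-global regex and limit >= 1.
function splitWithLimit(s, re, limit) {
  const parts = [];
  let rest = s;
  let remaining = limit;
  while (remaining !== 1) {
    const m = re.exec(rest);
    if (m === null) {
      break;
    }
    parts.push(rest.slice(0, m.index));
    // An empty match at index 0 consumes nothing, so `rest` never shrinks.
    rest = rest.slice(m.index + m[0].length);
    remaining -= 1;
  }
  parts.push(rest);
  return parts;
}

console.log(splitWithLimit("a-b-c", /-/, 2));  // [ 'a', 'b-c' ]
console.log(splitWithLimit("abc", /(?:)/, 5)); // [ '', '', '', '', 'abc' ]
```

The second call reproduces the shape of the result reported earlier for (clojure.string/split "abc" #"" 5), which is why the empty regex needs its own code path.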

