ClojureScript

re-matches is incorrect

Details

  • Type: Defect Defect
  • Status: Open Open
  • Priority: Minor Minor
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None

Description

The re-matches function does not have the correct semantics: it performs a search (not match) against the string and returns nil if the string and matched-string are unequal. This is not the same as true matching, which is like inserting "^" and "$" at the beginning and end of the pattern.

Example in Clojure:

user=> (re-find #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
"0"
user=> (re-matches #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
"0x1"

Compare Clojurescript:

ClojureScript:cljs.user> (re-find  #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
"0"
ClojureScript:cljs.user> (re-matches #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
nil

This bug is (one of the) reasons why CLJS-775.

I'm not completely sure what to do here. My first thought is to have re-matches inspect the -source property of its regex input, wrap the string with "^$", then carefully copy all flags over to a new regexp.

Questions:

  1. Are there any valid patterns where this is not safe? E.g., where we could not put ^ first? Is "^^abc$$" ok?
  2. Can we avoid cloning if ^ and $ are already the first and last chars of the pattern?
  3. How does multiline mode play in to this, if at all?
  4. regexinstance.lastIndex is a piece of mutability on regex instances (or the RegExp global on older browsers) which is used as a string offset for multiple invocations of exec() on the same string. I have no idea what to do if re-* gets a regex with the global flag set. (BTW, this is a very good reason to reject CLJS-150: allowing clojure to accept the global flag makes regular expression objects stateful, and would completely screw up re-seq for example.)

Activity

David Nolen made changes -
Field Original Value New Value
Priority Major [ 3 ] Minor [ 4 ]
Hide
Francis Avila added a comment - - edited

I would like to propose a somewhat radical suggestion that would: fix this issue and CLJS-810, put us in a better position to resolve CLJS-485 CLJS-746 CLJS-794 (clojure.string/replace woes), allow us to add some regex-as-a-value niceties to patterns in js (CLJS-67 and CLJS-68), and bring clojurescript's regular expression handling closer to clojure's by implementing more of the re-* functions.

Example implementation (not a patch) at this cljsfiddle: http://cljsfiddle.net/fiddle/favila.regexp

Essential points:

  1. Create a Pattern object, created by re-pattern, which provides methods to create regexps for search (re-find) or exact match (re-matches) or repeated searches (re-seq, re-matcher + re-find). Each of these must be a different RegExp object in javascript even though they are similar regular expression strings. The re-find and re-matches patterns can be cached. All can generate RegExps lazily.
  2. regular expression literals emit these Pattern objects instead of RegExp objects.
  3. Create a Matcher object to correspond to the currently-unimplemented re-matcher. It combines a global-flagged RegExp object, a search string, and a done flag. If it keeps the last match (similar to java), cljs can also implement re-groups.
  4. Make re-seq use the Matcher object and thus the .lastIndex that native RegExps provide for global matches. (Its implementation no longer requires string slicing after every match.)
  5. If re-find is given a native RegExp object instead of a pattern, it will use it as-is. This matches current behavior.
  6. If re-matches is given a native RegExp object and it isn't suitable for exact-matching, a new RegExp is cloned from the input RegExp with ^ and $ prepended and appended and the global flag added. (This technique is used in clojure.string/replace, but incorrectly.)

Thoughts?

Show
Francis Avila added a comment - - edited I would like to propose a somewhat radical suggestion that would: fix this issue and CLJS-810, put us in a better position to resolve CLJS-485 CLJS-746 CLJS-794 (clojure.string/replace woes), allow us to add some regex-as-a-value niceties to patterns in js (CLJS-67 and CLJS-68), and bring clojurescript's regular expression handling closer to clojure's by implementing more of the re-* functions. Example implementation (not a patch) at this cljsfiddle: http://cljsfiddle.net/fiddle/favila.regexp Essential points:
  1. Create a Pattern object, created by re-pattern, which provides methods to create regexps for search (re-find) or exact match (re-matches) or repeated searches (re-seq, re-matcher + re-find). Each of these must be a different RegExp object in javascript even though they are similar regular expression strings. The re-find and re-matches patterns can be cached. All can generate RegExps lazily.
  2. regular expression literals emit these Pattern objects instead of RegExp objects.
  3. Create a Matcher object to correspond to the currently-unimplemented re-matcher. It combines a global-flagged RegExp object, a search string, and a done flag. If it keeps the last match (similar to java), cljs can also implement re-groups.
  4. Make re-seq use the Matcher object and thus the .lastIndex that native RegExps provide for global matches. (Its implementation no longer requires string slicing after every match.)
  5. If re-find is given a native RegExp object instead of a pattern, it will use it as-is. This matches current behavior.
  6. If re-matches is given a native RegExp object and it isn't suitable for exact-matching, a new RegExp is cloned from the input RegExp with ^ and $ prepended and appended and the global flag added. (This technique is used in clojure.string/replace, but incorrectly.)
Thoughts?

People

Vote (0)
Watch (1)

Dates

  • Created:
    Updated: