<< Back to previous view

[CLJS-776] re-matches is incorrect Created: 28/Feb/14  Updated: 02/Dec/14

Status: Open
Project: ClojureScript
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Minor
Reporter: Francis Avila Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

The re-matches function does not have the correct semantics: it performs a search (not match) against the string and returns nil if the string and matched-string are unequal. This is not the same as true matching, which is like inserting "^" and "$" at the beginning and end of the pattern.

Example in Clojure:

user=> (re-find #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
"0"
user=> (re-matches #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
"0x1"

Compare Clojurescript:

ClojureScript:cljs.user> (re-find  #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
"0"
ClojureScript:cljs.user> (re-matches #"0|[1-9][0-9]+|0[xX][0-9a-zA-Z]+" "0x1")
nil

This bug is (one of the) reasons why CLJS-775.

I'm not completely sure what to do here. My first thought is to have re-matches inspect the -source property of its regex input, wrap the string with "^$", then carefully copy all flags over to a new regexp.

Questions:

  1. Are there any valid patterns where this is not safe? E.g., where we could not put ^ first? Is "^^abc$$" ok?
  2. Can we avoid cloning if ^ and $ are already the first and last chars of the pattern?
  3. How does multiline mode play in to this, if at all?
  4. regexinstance.lastIndex is a piece of mutability on regex instances (or the RegExp global on older browsers) which is used as a string offset for multiple invocations of exec() on the same string. I have no idea what to do if re-* gets a regex with the global flag set. (BTW, this is a very good reason to reject CLJS-150: allowing clojure to accept the global flag makes regular expression objects stateful, and would completely screw up re-seq for example.)


 Comments   
Comment by Francis Avila [ 24/Jun/14 7:37 AM ]

I would like to propose a somewhat radical suggestion that would: fix this issue and CLJS-810, put us in a better position to resolve CLJS-485 CLJS-746 CLJS-794 (clojure.string/replace woes), allow us to add some regex-as-a-value niceties to patterns in js (CLJS-67 and CLJS-68), and bring clojurescript's regular expression handling closer to clojure's by implementing more of the re-* functions.

Example implementation (not a patch) at this cljsfiddle: http://cljsfiddle.net/fiddle/favila.regexp

Essential points:

  1. Create a Pattern object, created by re-pattern, which provides methods to create regexps for search (re-find) or exact match (re-matches) or repeated searches (re-seq, re-matcher + re-find). Each of these must be a different RegExp object in javascript even though they are similar regular expression strings. The re-find and re-matches patterns can be cached. All can generate RegExps lazily.
  2. regular expression literals emit these Pattern objects instead of RegExp objects.
  3. Create a Matcher object to correspond to the currently-unimplemented re-matcher. It combines a global-flagged RegExp object, a search string, and a done flag. If it keeps the last match (similar to java), cljs can also implement re-groups.
  4. Make re-seq use the Matcher object and thus the .lastIndex that native RegExps provide for global matches. (Its implementation no longer requires string slicing after every match.)
  5. If re-find is given a native RegExp object instead of a pattern, it will use it as-is. This matches current behavior.
  6. If re-matches is given a native RegExp object and it isn't suitable for exact-matching, a new RegExp is cloned from the input RegExp with ^ and $ prepended and appended and the global flag added. (This technique is used in clojure.string/replace, but incorrectly.)

Thoughts?

Comment by David Nolen [ 02/Dec/14 5:46 AM ]

This sounds interesting but I'm somewhat concerned about the interop story. I think people will expect functions to take regular RegExps as well as Pattern. You haven't mentioned this issue here?

Comment by Francis Avila [ 02/Dec/14 1:16 PM ]

I apologize if some of my thoughts are vague; I haven't thought about this in a while.

First note that a narrow class of RegExps are effectively "pure": If they do a full-string match (e.g. start with ^ and end with $) and have the global flag set to false then their lastIndex will always seem to be 0.

Interop possibilities:

  • Patterns and RegExps can be created from one another, so coercion is always an option. E.g. re-pattern can accept a RegExp and some other (cljs-specific) function can coerce from Pattern or Matcher to RegExp. (Or maybe re-matcher can return a RegExp-compatible object--see below.)
  • RegExp given to cljs re-*: "Pure" regexes can be used directly, otherwise we create a Pattern and/or Matcher. (I don't remember the details, but the fiddle should cover them.)
  • Pattern used as a RegExp: Patterns can expose all the properties of RegExp instances. If the pattern is pure, it can implement .test and .exec. .lastIndex will always be 0. Not sure what to do about impure patterns: throw exception, act pure anyway, return a new object?
  • Matcher used as a RegExp: A Matcher can exactly replicate a RegExp instance, maybe even use the same prototype. Using it like a RegExp will mutate the object and disturb its internal state, but as long as it's either used consistently as a RegExp or consistently as a Matcher this won't matter. Notes:
    • Matcher holds the matched string in Java. Javascript trusts you to always supply the same string (e.g. in a while loop).
    • Java's Matcher holds the last match (used by re-groups). Javascript's RegExp does not.
    • Javascript's RegExp will automatically reset when lastIndex reaches the end of the source string. Java's Matcher won't.
    • Matcher must be a wrapper and not a normal RegExp because of these three extra bits of state.
    • The return value of re-matcher is only consumed by the 1-arg form of re-find and re-groups.
    • re-seq can use a matcher internally, but since it is private it doesn't have to.
    • Should other Java methods of Matcher be implemented?
  • Pattern given to String.prototype.match, .replace, .search, and .split: I haven't thought about this. Considerations:
    • Problem code is any cljs code using an object created via pattern literals or re-pattern and using it as an argument to these String methods. If they use clojure.string methods instead they would be fine.
    • Such code is also impossible in java clojure: only (.split s "pattern-str") is the same in java/clj and js/cljs and it will continue to work (without flags) on both platforms. Possibly we could just make people fix such code since it is platform-specific, but I need to see how widespread this is.
    • The fix for such code is to either:
      • Use a pattern->regexp coercion function we will provide.
      • Construct those regexps directly with js/RegExp.
      • Use clojure.string functions instead of String methods. This also has the advantage of being portable between clj and cljs.
    • Possibly we can patch the RegExp constructor or mess with the String prototype chain to do pattern->regexp coercion automatically, but this is a violent solution.

Correct me if I'm wrong, but in Clojure (java) code it is extremely uncommon to use Pattern and Match methods or to use them with String methods directly. For the most part they are treated as opaque objects used via re-* and clojure.string/*. Code written in the same style in cljs would be unaffected, and we can declare any other use as platform-specific and require explicit creation of RegExps (and don't bother to make Matcher or Pattern act like RegExps). This is my preferred approach for interop if there isn't too much use of RegExp.prototype.test, .exec, and String.prototype.match, .replace, .search, and .split.

Generated at Thu Dec 18 08:33:56 CST 2014 using JIRA 4.4#649-r158309.