tools.reader

Reader supports poorly defined regexes that break code

Details

  • Type: Defect Defect
  • Status: Closed Closed
  • Priority: Minor Minor
  • Resolution: Declined
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None

Description

I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

(str #"/")   => "\/"
(str #"\/")  => "\\/"
(str #"\\/") => "\\\\/"

But what does

"\\/"
mean here?

Looking at Clojure execution of these regexes:

(re-find #"/"   "\/") => "/"
(re-find #"\/"  "\/") => "/"
(re-find #"\\/" "\/") => "\\/"

ie.

#"\/"
behaves exactly like
#"/"

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for

"\/"

Alternatively, if

 #"\/" 
remains supported (for familiarity with users used to /.../ syntax), then the reader should emit
"/"
not
"\\/"
as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399

Activity

Chris Truter made changes -
Field Original Value New Value
Description I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

(str #"/") => "\/"
(str #"\/") => "\\/"
(str #"\\/") => "\\\\/"

But what does "\\/" mean here?

Looking at Clojure execution of these regexes:

{code}
(re-find #"/" "\/") => "/"
(re-find #"\/" "\/") => "/"
(re-find #"\\/" "\/") => "\\/"
{code}

ie. #"\/" behaves exactly like #"/".

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for "\/".

Alternatively, if #"\/" remains supported (for familiarity with users used to /.../ syntax), then the reader should emit "/" not "\\/" as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399
I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

{code}
(str #"/") => "\/"
(str #"\/") => "\\/"
(str #"\\/") => "\\\\/"
{code}

But what does "\\/" mean here?

Looking at Clojure execution of these regexes:

{code}
(re-find #"/" "\/") => "/"
(re-find #"\/" "\/") => "/"
(re-find #"\\/" "\/") => "\\/"
{code}

ie. #"\/" behaves exactly like #"/".

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for "\/".

Alternatively, if #"\/" remains supported (for familiarity with users used to /.../ syntax), then the reader should emit "/" not "\\/" as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399
Hide
Chris Truter added a comment -

Happy to provide the patch, whichever way you lean for the solution.

Note that JIRA has done some strange formatting on some strings here, not sure how to solve this (perhaps best to read the markup directly)

Show
Chris Truter added a comment - Happy to provide the patch, whichever way you lean for the solution. Note that JIRA has done some strange formatting on some strings here, not sure how to solve this (perhaps best to read the markup directly)
Nicola Mometto made changes -
Description I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

{code}
(str #"/") => "\/"
(str #"\/") => "\\/"
(str #"\\/") => "\\\\/"
{code}

But what does "\\/" mean here?

Looking at Clojure execution of these regexes:

{code}
(re-find #"/" "\/") => "/"
(re-find #"\/" "\/") => "/"
(re-find #"\\/" "\/") => "\\/"
{code}

ie. #"\/" behaves exactly like #"/".

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for "\/".

Alternatively, if #"\/" remains supported (for familiarity with users used to /.../ syntax), then the reader should emit "/" not "\\/" as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399
I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

{code}
(str #"/") => "\/"
(str #"\/") => "\\/"
(str #"\\/") => "\\\\/"
{code}

But what does {noformat}"\\/"{noformat} mean here?

Looking at Clojure execution of these regexes:

{code}
(re-find #"/" "\/") => "/"
(re-find #"\/" "\/") => "/"
(re-find #"\\/" "\/") => "\\/"
{code}

ie. {noformat}#"\/"{noformat} behaves exactly like {noformat}#"/"{noformat}

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for {noformat}"\/"{noformat}.

Alternatively, if {noformat} #"\/" {noformat} remains supported (for familiarity with users used to /.../ syntax), then the reader should emit {noformat}"/"{noformat} not {noformat}"\\/"{noformat} as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399
Nicola Mometto made changes -
Description I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

{code}
(str #"/") => "\/"
(str #"\/") => "\\/"
(str #"\\/") => "\\\\/"
{code}

But what does {noformat}"\\/"{noformat} mean here?

Looking at Clojure execution of these regexes:

{code}
(re-find #"/" "\/") => "/"
(re-find #"\/" "\/") => "/"
(re-find #"\\/" "\/") => "\\/"
{code}

ie. {noformat}#"\/"{noformat} behaves exactly like {noformat}#"/"{noformat}

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for {noformat}"\/"{noformat}.

Alternatively, if {noformat} #"\/" {noformat} remains supported (for familiarity with users used to /.../ syntax), then the reader should emit {noformat}"/"{noformat} not {noformat}"\\/"{noformat} as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399
I ran into a strange case where CLJS emitted invalid code based on a poorly formatted regex that escaped / incorrectly.

Looking at such a regex, along with two similar but well formed regexes, passing through tools.reader:

{code}
(str #"/") => "\/"
(str #"\/") => "\\/"
(str #"\\/") => "\\\\/"
{code}

But what does {noformat}"\\/"{noformat} mean here?

Looking at Clojure execution of these regexes:

{code}
(re-find #"/" "\/") => "/"
(re-find #"\/" "\/") => "/"
(re-find #"\\/" "\/") => "\\/"
{code}

ie. {noformat}#"\/"{noformat} behaves exactly like {noformat}#"/"{noformat}

Things get more unfortunate once CLJS get's involved, it does not expect the "heisen" regex - and the "dangling escape" ends up capturing the forward slash's escape, ie. an prematurely terminating regex is emitted.

Despite Clojure's existing "fortuitous" behaviour, perhaps the correct behaviour is to throw a reader exception for such regexes, as it does for {noformat}"\/"{noformat}

Alternatively, if {noformat} #"\/" {noformat} remains supported (for familiarity with users used to /.../ syntax), then the reader should emit {noformat}"/"{noformat} not {noformat}"\\/"{noformat} as the string value of the literal, ie. this "tolerance" should be part of the reader semantics rather than a concern for emitters.

See http://dev.clojure.org/jira/browse/CLJS-1399
Hide
Nicola Mometto added a comment -

I'm not sure I understand what's being proposed here. The implementation of the regex reader in cljs is identical to the one in clj and all the regex examples you posted should be valid regex literals. If cljs doesn't compile the latter that's an issue in the cljs emitter, I can't change the clj syntax to workaround a bug in the cljs emitter.

Show
Nicola Mometto added a comment - I'm not sure I understand what's being proposed here. The implementation of the regex reader in cljs is identical to the one in clj and all the regex examples you posted should be valid regex literals. If cljs doesn't compile the latter that's an issue in the cljs emitter, I can't change the clj syntax to workaround a bug in the cljs emitter.
Nicola Mometto made changes -
Status Open [ 1 ] Closed [ 6 ]
Resolution Completed [ 1 ]
Nicola Mometto made changes -
Resolution Completed [ 1 ]
Status Closed [ 6 ] Reopened [ 4 ]
Nicola Mometto made changes -
Status Reopened [ 4 ] Closed [ 6 ]
Resolution Declined [ 2 ]
Hide
Nicola Mometto added a comment -

Note that it is a feature of the clj reader that

#"\/"
is a valid regex while
 "\/"
is not a valid string..

Show
Nicola Mometto added a comment - Note that it is a feature of the clj reader that
#"\/"
is a valid regex while
 "\/"
is not a valid string..
Hide
Chris Truter added a comment -

Hi Nicola - thanks for taking the time to reformat the issue and consider it.

Initially planned to amend the cljs emitter, but David was of the opinion this should be "fixed" in tools.reader.

The situation where a basic literal, valid in Clojure, will silently break an entire cljs build is a very poor experience. Since it would be simple to avert through a simple reader or emitter change, I feel that one of those is necessary.

When you say that reading the form is a feature, that's what I refering to by tolerance. But if practically

#"\/"
means #"/", it seems more natural to normalize during reading. Perhaps I'm missing an understanding of why Clojure normalizes during emission rather, but to me this seems to blur semantics of the regex's underlying string (4 backslashes is a literal, 2 backslashes is a character escape, but forward slash is not an escapable character in Java or Javascript).

I'll submit the emitter patch back at http://dev.clojure.org/jira/browse/CLJS-1399. If you remain resolute that these regexes must be read in a backwards compatible manner, perhaps you could help make the argument to vett that ticket instead?

Show
Chris Truter added a comment - Hi Nicola - thanks for taking the time to reformat the issue and consider it. Initially planned to amend the cljs emitter, but David was of the opinion this should be "fixed" in tools.reader. The situation where a basic literal, valid in Clojure, will silently break an entire cljs build is a very poor experience. Since it would be simple to avert through a simple reader or emitter change, I feel that one of those is necessary. When you say that reading the form is a feature, that's what I refering to by tolerance. But if practically
#"\/"
means #"/", it seems more natural to normalize during reading. Perhaps I'm missing an understanding of why Clojure normalizes during emission rather, but to me this seems to blur semantics of the regex's underlying string (4 backslashes is a literal, 2 backslashes is a character escape, but forward slash is not an escapable character in Java or Javascript). I'll submit the emitter patch back at http://dev.clojure.org/jira/browse/CLJS-1399. If you remain resolute that these regexes must be read in a backwards compatible manner, perhaps you could help make the argument to vett that ticket instead?

People

Vote (0)
Watch (0)

Dates

  • Created:
    Updated:
    Resolved: