Clojure

Clarify and align valid symbol and keyword rules for Clojure (and edn)

Details

  • Type: Enhancement Enhancement
  • Status: Open Open
  • Priority: Major Major
  • Resolution: Unresolved
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
  • Approval:
    Triaged

Description

Known areas of under-specificity (http://clojure.org/reader#The%20Reader--Reader%20forms):

  • symbols (and keywords) description do not mention constituent characters that are currently in use by Clojure functions such as <, >, =, $ (for Java inner classes), & (&form and &env in macros), % (stated to be valid in edn spec)
  • keywords currently accept leading numeric characters which is at odds with the spec - see CLJ-1286

References:

Activity

Hide
Andy Fingerhut added a comment - - edited

The Clojure reader documentation also does not mention the following symbols as valid constituent characters. They are all mentioned as valid symbol constituent characters in the EDN readme here: https://github.com/edn-format/edn#symbols

dollar sign - used in Clojure/JVM to separate Java subclass names from class names, e.g. java.util.Map$Entry
percent sign - not sure why this is part of edn spec. In Clojure it seems only to be used inside #() for args like % %1 %&
ampersand - like in &form and &env in macro definitions
equals - clojure.core/= and many others
less-than - clojure.core/< clojure.core/<=
greater-than - clojure.core/> clojure.core/>=

I don't know whether Clojure and edn specs should be the same in this regard, but it seemed worth mentioning for this ticket.

Show
Andy Fingerhut added a comment - - edited The Clojure reader documentation also does not mention the following symbols as valid constituent characters. They are all mentioned as valid symbol constituent characters in the EDN readme here: https://github.com/edn-format/edn#symbols dollar sign - used in Clojure/JVM to separate Java subclass names from class names, e.g. java.util.Map$Entry percent sign - not sure why this is part of edn spec. In Clojure it seems only to be used inside #() for args like % %1 %& ampersand - like in &form and &env in macro definitions equals - clojure.core/= and many others less-than - clojure.core/< clojure.core/<= greater-than - clojure.core/> clojure.core/>= I don't know whether Clojure and edn specs should be the same in this regard, but it seemed worth mentioning for this ticket.
Hide
Andy Fingerhut added a comment -

Alex, Rich made this comment on CLJ-17 in 2011: "Runtime validation off the table for perf reasons. cemerick's suggestion that arbitrary symbol support will render them valid is sound, but arbitrary symbol support is a different ticket/idea." I am not aware of any tickets that propose the enhancement of allowing arbitrary symbols to be supported by Clojure, e.g. via a syntax like

#|white space and arbitrary #$@)$~))@ chars here|

Do you think it is reasonable to create an enhancement ticket for supporting arbitrary characters in symbols and keywords?

Show
Andy Fingerhut added a comment - Alex, Rich made this comment on CLJ-17 in 2011: "Runtime validation off the table for perf reasons. cemerick's suggestion that arbitrary symbol support will render them valid is sound, but arbitrary symbol support is a different ticket/idea." I am not aware of any tickets that propose the enhancement of allowing arbitrary symbols to be supported by Clojure, e.g. via a syntax like
#|white space and arbitrary #$@)$~))@ chars here|
Do you think it is reasonable to create an enhancement ticket for supporting arbitrary characters in symbols and keywords?
Hide
Alex Miller added a comment -

Sure. I looked into this a bit as a digression off of feature expressions and #| has been reserved for this potential use. However, there are many tricky issues with it and I do not expect this to happen soon - more likely to be something we're pushed to do when necessary for some other reason.

Show
Alex Miller added a comment - Sure. I looked into this a bit as a digression off of feature expressions and #| has been reserved for this potential use. However, there are many tricky issues with it and I do not expect this to happen soon - more likely to be something we're pushed to do when necessary for some other reason.
Hide
Herwig Hochleitner added a comment -

Wrong ticket, but to anybody thinking about #|arbitrary symbols (or strings)|, please do consider making the delimiters configurable, as in mime multipart.

Show
Herwig Hochleitner added a comment - Wrong ticket, but to anybody thinking about #|arbitrary symbols (or strings)|, please do consider making the delimiters configurable, as in mime multipart.
Hide
Andy Fingerhut added a comment - - edited

I've created a design page for now. I'm sure it does not list many of the tricky issues you have found. I'd be happy to take a shot at documenting them if you have any notes you are willing to share.

http://dev.clojure.org/pages/viewpage.action?pageId=11862058

Show
Andy Fingerhut added a comment - - edited I've created a design page for now. I'm sure it does not list many of the tricky issues you have found. I'd be happy to take a shot at documenting them if you have any notes you are willing to share. http://dev.clojure.org/pages/viewpage.action?pageId=11862058
Hide
Andy Fingerhut added a comment -

Herwig, can you edit the design page linked in my previous comment, to add a reference or example to precisely how mime multipart allows delimiters to be configurable, and why you believe fixed delimeters would be a bad idea?

Show
Andy Fingerhut added a comment - Herwig, can you edit the design page linked in my previous comment, to add a reference or example to precisely how mime multipart allows delimiters to be configurable, and why you believe fixed delimeters would be a bad idea?
Hide
Herwig Hochleitner added a comment -

I've commented on the design page.

Show
Herwig Hochleitner added a comment - I've commented on the design page.
Hide
Alex Miller added a comment -

Removed a couple of issues that have been clarified on the reader page and are no longer issues.

Show
Alex Miller added a comment - Removed a couple of issues that have been clarified on the reader page and are no longer issues.
Hide
Nicola Mometto added a comment -

Related to CLJ-1530

Show
Nicola Mometto added a comment - Related to CLJ-1530
Hide
Adam Frey added a comment -

Related to this: The Clojure reader will not accept symbols and keywords that contain consecutive colons (See LispReader.java), although that is permitted by the current EDN spec. Here is a GitHub issue regarding consecutive colons. I would like to qualify why consecutive colons are disallowed, and sync up the Clojure Reader and the EDN spec on this.

Show
Adam Frey added a comment - Related to this: The Clojure reader will not accept symbols and keywords that contain consecutive colons (See LispReader.java), although that is permitted by the current EDN spec. Here is a GitHub issue regarding consecutive colons. I would like to qualify why consecutive colons are disallowed, and sync up the Clojure Reader and the EDN spec on this.
Hide
Herwig Hochleitner added a comment -

The updated reader spec says that a symbol can contain a single / to separate the namespace. It also mentions a bare / to be the division function.
So what about clojure.core//? That still got to be a readable symbol right? So is that an exception to the 'single /' rule?
Will foo.bar// also be readable? What about foo//bar?

Show
Herwig Hochleitner added a comment - The updated reader spec says that a symbol can contain a single / to separate the namespace. It also mentions a bare / to be the division function. So what about clojure.core//? That still got to be a readable symbol right? So is that an exception to the 'single /' rule? Will foo.bar// also be readable? What about foo//bar?
Hide
Francis Avila added a comment - - edited

Another source of ambiguity I see is that it's unclear whether the first colon of a keyword is the first character of the keyword (and therefore of the symbol) or whether it is something special and the spec really describes what happens from the second character onward. This matters because the specification for a keyword is (in both edn and reader specs) given in terms of differences from symbols. I think many of the strange keyword edge cases (including legality of :1 vs :a/1) stem from this ambiguity, and different tickets/patches seem to choose one or the other underlying assumption. See this comment for more examples.

Possibly we can use tagged literals for keywords and symbols to create or print these forms when they are not readable and simplify the reader spec for their literal forms. E.g. instead of producing complicated parse rules to ensure clojure.core// or :1 are legal, just make the literal form simple and have users write something like #sym["clojure.core" "/"] or #kyw "1" (and have the printer print these) when they hit these edge cases.

Show
Francis Avila added a comment - - edited Another source of ambiguity I see is that it's unclear whether the first colon of a keyword is the first character of the keyword (and therefore of the symbol) or whether it is something special and the spec really describes what happens from the second character onward. This matters because the specification for a keyword is (in both edn and reader specs) given in terms of differences from symbols. I think many of the strange keyword edge cases (including legality of :1 vs :a/1) stem from this ambiguity, and different tickets/patches seem to choose one or the other underlying assumption. See this comment for more examples. Possibly we can use tagged literals for keywords and symbols to create or print these forms when they are not readable and simplify the reader spec for their literal forms. E.g. instead of producing complicated parse rules to ensure clojure.core// or :1 are legal, just make the literal form simple and have users write something like #sym["clojure.core" "/"] or #kyw "1" (and have the printer print these) when they hit these edge cases.
Hide
Alex Miller added a comment -

I would say : (and : are syntactic markers and the spec describes the characters following it. But I agree it would be nice for this to be more explicit. The (incorrect) regex in LispReader does not help either.

The tagged literal idea is an interesting alternative to the | | syntax that has been reserved for possible future support for invalid characters in keywords and symbols. But I think the idea is out of scope for this ticket, which is really about clarifying the spec.

Show
Alex Miller added a comment - I would say : (and : are syntactic markers and the spec describes the characters following it. But I agree it would be nice for this to be more explicit. The (incorrect) regex in LispReader does not help either. The tagged literal idea is an interesting alternative to the | | syntax that has been reserved for possible future support for invalid characters in keywords and symbols. But I think the idea is out of scope for this ticket, which is really about clarifying the spec.
Hide
Steven Yi added a comment -

Coming to this late, I had mentioned on the user mailing list in:

https://groups.google.com/forum/#!topic/clojure/CwZHu1Eszbk

that # is currently allowed as part of a symbol name, such that:

(let a# 4 b#a 3 (println a# b#a))

will print "4 3".

  1. is also employed in auto-gensyms and discussed in http://clojure.org/reference/reader#syntax-quote as part of a symbol's name. From the mailing list thread, # was noted as "may be allowed now, but could be changed later". I would appreciate if it is more clearly described as a special case/reserved, and would ask that its use be restricted in the reader to prevent users from using it now and potentially have code break later.
Show
Steven Yi added a comment - Coming to this late, I had mentioned on the user mailing list in: https://groups.google.com/forum/#!topic/clojure/CwZHu1Eszbk that # is currently allowed as part of a symbol name, such that: (let a# 4 b#a 3 (println a# b#a)) will print "4 3".
  1. is also employed in auto-gensyms and discussed in http://clojure.org/reference/reader#syntax-quote as part of a symbol's name. From the mailing list thread, # was noted as "may be allowed now, but could be changed later". I would appreciate if it is more clearly described as a special case/reserved, and would ask that its use be restricted in the reader to prevent users from using it now and potentially have code break later.

People

Vote (20)
Watch (8)

Dates

  • Created:
    Updated: