The problem

Clojure as of 2015 and earlier allows the creation of symbols and keywords that can be printed, but not read back in.  For example:

    user=> (-> "a@b" keyword pr-str)
    ":a@b"
    user=> (-> "a@b" keyword pr-str read-string)
    :a
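
The mismatch shown in this session can be detected mechanically. The following is a minimal sketch, not part of Clojure: the helper name `round-trips?` is hypothetical. It prints a value with `pr-str`, reads the text back with `read-string`, and compares the result with the original. `*read-eval*` is bound to false as a precaution, since an arbitrary name could print as text containing `#=`.

```clojure
(defn round-trips?
  "Hypothetical helper: true if x survives a pr-str / read-string round trip."
  [x]
  (binding [*read-eval* false]
    (try
      (= x (read-string (pr-str x)))
      ;; Some names (e.g. the empty string) make the reader throw
      ;; instead of returning a different value.
      (catch Exception _ false))))

(round-trips? (keyword "ab"))   ; a plain keyword reads back unchanged
(round-trips? (keyword "a@b"))  ; reads back as :a, as in the session above
(round-trips? (symbol "a b"))   ; the reader stops at the space
```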

Only a few characters are officially documented as supported in symbols and keywords at http://clojure.org/reader, as of Jun 1 2015:

    * + ! - _ ? allowed
    non repeating : in the middle allowed
    . or : at beginning or end reserved for Clojure only
    / by itself names division function

However, Clojure 1.6.0 itself uses characters that are not on this list, e.g. in symbols such as inc', <=, and >=.
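
To make the gap concrete, here is a rough sketch of a predicate for the documented character set. It rests on assumptions: only ASCII letters and digits are treated as alphanumeric, the rules for `.`, `/` and `:` are left out, and the names `documented-name-pattern` and `documented-name?` are hypothetical.

```clojure
(def documented-name-pattern
  "Assumption: letters, digits, and * + ! - _ ? only, not starting with a digit.
  The rules for . / and : are deliberately left out of this sketch."
  #"[A-Za-z*+!\-_?][A-Za-z0-9*+!\-_?]*")

(defn documented-name?
  "True if the string s uses only the characters in the sketch pattern above."
  [s]
  (boolean (re-matches documented-name-pattern s)))

;; Names from clojure.core next to an ordinary name.
(map documented-name? ["inc'" "<=" ">=" "some-fn?"])
```

Under these assumptions, the three clojure.core names fail the check and only the last name passes.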

An approach 

First, document a larger set of characters as officially supported in symbols and keywords, so that Clojure itself is limited by its own documented restrictions.  See http://dev.clojure.org/jira/browse/CLJ-1527

Next, document and implement support for a much larger set of characters in symbols and keywords, perhaps arbitrary, using #|| syntax, inspired by Common Lisp's || syntax for including arbitrary characters in its symbols.

Clojure 1.7.0's reader gives an error when given #|| as input. One way to preserve some print/read compatibility across versions is to introduce read support for #|| in one version of Clojure, together with print support that is optional and off by default.

In a later release, e.g. 1 or 2 years later, change the print behavior to use #|| by default when the symbol or keyword includes characters that are not on the short list of approved characters.  The cost of detecting which symbols and keywords need this treatment would be a factor, so an internal flag in the Symbol and Keyword classes indicating whether they need #|| when printed may be warranted.  The flag could be set when the symbol/keyword is constructed or, if that would slow down construction too much, computed when the symbol/keyword is first printed and cached for future printing.
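
The compute-on-first-print option can be approximated outside the Symbol and Keyword classes. The sketch below is only an illustration: it uses `memoize` keyed by the name string rather than a flag stored in the object, `approved-name-pattern` is a stand-in for whatever list is eventually approved, and `scan-count` exists only to make the caching visible.

```clojure
(def approved-name-pattern
  ;; Assumption: a stand-in for whatever "approved" list is finally documented.
  #"[A-Za-z*+!\-_?<>=][A-Za-z0-9*+!\-_?<>=']*")

;; Counts full scans of a name, only to make the caching visible.
(def scan-count (atom 0))

(defn scan-name
  "Scans the whole name; true if it would need #|| when printed."
  [s]
  (swap! scan-count inc)
  (nil? (re-matches approved-name-pattern s)))

(def needs-bars?
  "Memoized wrapper: a repeated query for the same name reuses the cached answer."
  (memoize scan-name))

(needs-bars? "a@b")   ; scans the name
(needs-bars? "a@b")   ; served from the cache, no second scan
```

A real implementation inside the Symbol and Keyword classes would avoid the unbounded cache that `memoize` keeps.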

In yet another later release, change the behavior of reading so that if a symbol/keyword has non-allowed characters outside of #||, it is an error.

Only printing and reading would be affected here, not the behavior of functions like symbol, keyword, intern, etc., which would continue to behave as they do now, allowing everything.

Syntax

The new printed-as-readable syntax for symbols and keywords with normally-not-allowed characters begins with #|.  All characters up to the next | character are part of the symbol or keyword's name.  A vertical bar in the keyword/symbol's name is escaped as backslash then vertical bar, and a backslash is escaped as two backslashes.  No other characters between #| and the final | require or allow escaping.
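
The escaping rules can be sketched as a pair of string functions. This is an illustration of the proposal, not an existing Clojure API: `bar-quote` and `read-barred` are hypothetical names, and `read-barred` works on a string rather than being wired into the reader. It ignores any text after the closing bar.

```clojure
(require '[clojure.string :as str])

(defn bar-quote
  "Wraps name in #|...|, escaping backslash and vertical bar."
  [name]
  (str "#|" (str/escape name {\\ "\\\\", \| "\\|"}) "|"))

(defn read-barred
  "Parses text of the form #|...| and returns the unescaped name."
  [text]
  (when-not (.startsWith ^String text "#|")
    (throw (ex-info "expected #|" {:text text})))
  (loop [cs (seq (subs text 2)), acc []]
    (let [[c nxt & more] cs]
      (cond
        ;; Ran out of input before the closing bar.
        (nil? c) (throw (ex-info "unterminated #|...|" {:text text}))
        ;; An unescaped bar ends the name.
        (= c \|) (apply str acc)
        ;; A backslash may only be followed by a bar or another backslash.
        (= c \\) (if (#{\| \\} nxt)
                   (recur more (conj acc nxt))
                   (throw (ex-info "invalid escape" {:text text})))
        :else (recur (rest cs) (conj acc c))))))

(bar-quote "a@b")
(read-barred (bar-quote "odd|name\\here"))
```

Rejecting every other escape sequence follows the rule that no other characters require or allow escaping.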

TBD: Should the #|| syntax be allowed independently for the namespace and name portions?  For example, should #|my weird namespace|/#|my weird name| be supported, even allowing / characters within the namespace and name?

TBD: Probably still need restrictions such as no leading : in symbol names, reserving those to designate keywords, and a few others documented at clojure.org/reader.  Those should be explicitly documented here.

  1. Jun 01, 2015

    Rather than escaping the delimiter \| if it occurs within the symbol, I'd much rather see a way to have a configurable delimiter, similar to how it's done in mime multipart. We already have a couple of data types with fixed delimiters: () [] {} #{} "". I don't see the point of adding one specially for symbols, since non-standard symbols can be easily supported with a reader tag #symbol "like this" or with a #symbol ["namespace" "like this"]. However, I am interested in syntax with configurable delimiters, since that opens the door to embedding arbitrary data within a clojure file.

    An imaginary syntax for this: #[delim|this is #[| |] delimited binary data|delim]

    EDIT: As requested by Andy here: http://dev.clojure.org/jira/browse/CLJ-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=39220#comment-39220, a short description of how MIME Multipart works:

    The message is a subset of RFC-822 with the following content-type header, and structured as follows:

    This allows arbitrary binary data to be encoded in emails, without the need to quote it.

    EDIT: Another question was why I think fixed delimiters would be a bad idea. I don't think they are a bad idea in general; quoting in "\"strings\"" works reasonably well. IIUC, #|sym bols| have been proposed because .NET uses names that don't fall into Clojure's symbol syntax. The reason I'm critical of that proposal is twofold:

    1.) What will happen when we want to support a platform that needs | within its symbols for some reason? Sure, those can be \| quoted there, but this option already exists by using the " delimiter + a reader tag.

    2.) configurable delimiters would add an equally substantial feature with much more generality and would resolve similar discussions towards alternate string delimiters.

    1. Jun 01, 2015

      Herwig, if you want arbitrary binary data embedded in a text stream in a format that can be printed and read by Clojure, while that has some similarities to allowing all of Unicode to be a part of symbols and keywords, it seems to me like a separate desire best fulfilled without the restrictions that will likely be necessary for symbols and keywords (even with explicit support for a larger character set).

      1. Jun 01, 2015

        If the discussion should be about arbitrary characters in just symbols and keywords, I'll say add reader tags and be done with it. Sure, even #sym"" and #key"" will be more characters to type than #|| and :#||, but to me, that advantage doesn't offset the added complexity. Observe that Clojure currently has just one data literal with escaping rules different from Clojure's regular \c \h \a \r \a \c \t \e \r escaping: "strings\nwith different syntax for \\newlines and other control characters"

        What's the point of adding another one, with distinct rules? How much will ClojureCLR gain from this, over a #sym "weird symbol reader tag"?
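
For reference, the reader-tag alternative discussed in this thread can be tried with Clojure's existing tagged-literal support. The sketch below makes some assumptions: the tags `example/sym` and `example/key` and the helper names are hypothetical, and the tags are namespace-qualified because unqualified tags are reserved for Clojure itself, so a bare #sym or #symbol would have to be added by Clojure.

```clojure
(defn read-tagged-symbol
  "Accepts either \"name\" or [\"namespace\" \"name\"]."
  [form]
  (if (vector? form)
    (apply symbol form)
    (symbol form)))

(def example-readers
  ;; Hypothetical tags, chosen only for this illustration.
  {'example/sym read-tagged-symbol
   'example/key keyword})

(defn read-with-tags
  "Reads text with the example tags installed for this call only."
  [text]
  (binding [*data-readers* example-readers]
    (read-string text)))

(read-with-tags "#example/sym \"weird symbol\"")
(read-with-tags "#example/sym [\"my weird namespace\" \"my weird name\"]")
(read-with-tags "#example/key \"a@b\"")
```

This covers only the read side. Printing such symbols with the tag would still need a change to the printer or a custom print function.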

        EDIT: reparented as a reply to Andy's message