This page is work-in-progress. Until finalization, normative language in this page should be considered a proposal. https://github.com/bendlas/data.xml

Revisions

[r1] 20160804 Encode xml namespaces directly into keyword namespaces

The runtime-global registry for xmlns <-> cljns mappings, provided by declare-ns and alias-ns, poses a problem of governance: if two libraries want to be composable (a very basic requirement), they need to agree on their declare-ns clauses. Even worse, user source code is hardcoded against whatever mapping their library chose to provide. The only "fix" for that would be maintaining a mapping of "known" uris to clojure namespaces within data.xml, but that is unsatisfactory: the work of assigning unique names is already done by various registries, such as iana, and the list of xml uris that people might want to use with data.xml is quite large.

At the same time, we want to keep using ::xmlns-alias/keywords. This can be achieved by encoding the uri directly into the keyword, after substituting clojure's syntax characters. The substitution uses percent-encoding, along with the substitution rules given in https://www.ietf.org/rfc/rfc3987.txt

https://groups.google.com/forum/#!topic/clojure/Txj3suj2B3s

[r2] 20161107 Justification for qname encoding

Premise. We want a 1:1 mapping between our xml encoding and the xml infoset, for any bit contributing to value equality.
=> This demands encoding the full xmlns uri into every qname, as opposed to some registry scheme. [r1] is driven by this.

Premise. We want to keep keywords because of ::alias/shorthands.
=> The ns uri will have to fit into a keyword namespace because of the natural mapping premise ; )

Premise. We want to stay within valid edn for (comp read-string pr-str) compatibility.
=> This requires some sort of encoding for ns uris. 

Premise. We want to retain some readability for debugging purposes.
=> This rules out base64 and other formats geared towards efficiently encoding binary data.
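The edn premise above can be checked at the REPL. The following is a minimal sketch, assuming the xmlns. prefix and the example uri "NO:NO/NO" used further down this page; the uri is percent-encoded by hand here and the var names are only illustrative.

```clojure
(require '[clojure.edn :as edn])

;; Illustrative name only: namespace uri "NO:NO/NO", local name "foo".
;; The uri is percent-encoded by hand: ":" becomes %3A, "/" becomes %2F.
(def encoded-tag (keyword "xmlns.NO%3ANO%2FNO" "foo"))

;; pr-str prints the keyword, the edn reader parses it back.
(def round-tripped (edn/read-string (pr-str encoded-tag)))

;; The round trip yields an equal keyword, with the encoded uri
;; still sitting in the keyword namespace.
(= encoded-tag round-tripped)
(namespace round-tripped)
```

Both the keyword namespace and the local name should survive the trip unchanged, which is what the (comp read-string pr-str) premise asks for.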

This leaves us with either shifting problematic characters into unicode or using an escaping scheme.

Unicode shifting is appealing, because it retains a maximum of readability.
On the other hand, it leaves endless possibilities for bikeshedding: target alphabet, target alphabet of target alphabet, ...; lookalike characters vs not. Whatever scheme we choose based on this, it would be unfamiliar to anybody but heavy data.xml users, and in sum this doesn't look like a winnable game.

For escaping schemes, there are a couple of well known choices, most of them based on \ as an escape character, which rules them out for use in edn keywords.
Arguably the best known escaping scheme, though, is rfc3987 percent-encoding, also known as url-encoding. This happens to be an almost perfect fit for transcribing human-readable strings into clojure keyword namespaces:
1.) It retains a moderate amount of readability, which gets better through reusable knowledge. You probably don't even have to check when I tell you that %2F means /
2.) Reserved characters in a uri segment seem to be a happy superset of clojure's reserved characters. In particular, it reserves : and /, but leaves . (which leads to a curious, but not entirely unappealing, mapping to java package trees.)
     Hence, using urlencoding will even save us from shipping codecs (we need to make sure, though, that java's URLEncoder fully agrees with javascript's encodeURIComponent)
3.) % got allowed in clojure 1.5.0. The jury on http://dev.clojure.org/jira/browse/CLJ-1527 is still out, but Rich Hickey's talk about not breaking APIs could be interpreted to mean that it will stay allowed.
     Bumping the required clojure version from 1.4.0 to 1.5.0 is a slight drawback, but that will amortize as people keep upgrading their software. It should also be possible to work with a useful subset of data.xml on 1.4.0, even with some namespacing support, as % won't be readable there but keywords containing it are still constructable.
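To make point 2 concrete, here is a small sketch of the JVM side of the codec, using only java.net.URLEncoder and URLDecoder. The function names encode-uri and decode-uri are hypothetical helpers for this page, not part of data.xml.

```clojure
(import '(java.net URLEncoder URLDecoder))

(defn encode-uri
  "Hypothetical helper: percent-encodes an xmlns uri as UTF-8."
  [^String uri]
  (URLEncoder/encode uri "UTF-8"))

(defn decode-uri
  "Hypothetical helper: inverse of encode-uri."
  [^String s]
  (URLDecoder/decode s "UTF-8"))

;; ":" and "/" are escaped, "." is left alone.
(encode-uri "http://www.w3.org/1999/xhtml")
;; => "http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml"

;; Two places where URLEncoder is known to differ from
;; javascript's encodeURIComponent:
(encode-uri "a b") ;; => "a+b"   (encodeURIComponent gives "a%20b")
(encode-uri "a~b") ;; => "a%7Eb" (encodeURIComponent leaves "~" alone)
```

So the two built-in encoders do not fully agree on every character. Within the JVM the encoding still round-trips through decode-uri, but a cross-platform implementation would presumably have to normalize cases like these.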

Runtime data structures

canonical representation

Even if the emitter accepts a slightly larger set of representations, the parser should produce a very uniform data structure, which should map xml infoset equality to clojure equality and match what a user would write by hand.

Unfortunately, percent-encoded uri namespaces don't quite fit the bill on user-friendliness, but outside of clojure's kw-aliasing facilities, this can still be fixed by using reader tags.

xml elements

Elements are represented as maps with keys #{:tag :attrs :content}. The canonical representation is a clojure.data.xml.node/Element defrecord, exposed through the constructors element and element*.
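As an illustration of the shape described above, the following sketch defines a stand-in record with the same three keys. It is not the clojure.data.xml.node/Element implementation; the record and both constructors are re-declared here as assumptions so that the snippet runs without data.xml on the classpath, and the real signatures may differ.

```clojure
;; Stand-in for the real record, for illustration only.
(defrecord Element [tag attrs content])

(defn element*
  "Sketch of a constructor taking content as a single collection."
  [tag attrs content]
  (->Element tag (or attrs {}) (remove nil? content)))

(defn element
  "Sketch of a variadic constructor taking content as rest args."
  ([tag] (element* tag nil nil))
  ([tag attrs] (element* tag attrs nil))
  ([tag attrs & content] (element* tag attrs content)))

;; A record behaves like a map with exactly the declared keys.
(set (keys (element :foo)))
;; => #{:tag :attrs :content}

;; Equal arguments produce equal values.
(= (element :foo) (element :foo {}))
;; => true
```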

xml names

̶I̶n̶ ̶t̶h̶e̶ ̶g̶e̶n̶e̶r̶a̶l̶ ̶c̶a̶s̶e̶,̶ ̶x̶m̶l̶ ̶n̶a̶m̶e̶s̶ ̶a̶r̶e̶ ̶r̶e̶p̶r̶e̶s̶e̶n̶t̶e̶d̶ ̶a̶s̶ ̶(̶Q̶N̶a̶m̶e̶s̶)̶[̶h̶t̶t̶p̶:̶/̶/̶d̶o̶c̶s̶.̶o̶r̶a̶c̶l̶e̶.̶c̶o̶m̶/̶j̶a̶v̶a̶e̶e̶/̶1̶.̶4̶/̶a̶p̶i̶/̶j̶a̶v̶a̶x̶/̶x̶m̶l̶/̶n̶a̶m̶e̶s̶p̶a̶c̶e̶/̶Q̶N̶a̶m̶e̶.̶h̶t̶m̶l̶]̶ ̶o̶r̶,̶ ̶i̶f̶ ̶t̶h̶e̶y̶ ̶h̶a̶v̶e̶ ̶n̶o̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶ ̶u̶r̶i̶,̶ ̶a̶s̶ ̶k̶e̶y̶w̶o̶r̶d̶.̶
̶d̶a̶t̶a̶.̶x̶m̶l̶ ̶h̶a̶s̶ ̶a̶ ̶f̶a̶c̶i̶l̶i̶t̶y̶ ̶t̶o̶ ̶a̶s̶s̶o̶c̶i̶a̶t̶e̶ ̶c̶l̶o̶j̶u̶r̶e̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶x̶m̶l̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶ ̶u̶r̶i̶s̶.̶ ̶W̶h̶i̶c̶h̶ ̶a̶l̶l̶o̶w̶s̶ ̶c̶l̶o̶j̶u̶r̶e̶'̶s̶ ̶s̶h̶o̶r̶t̶h̶a̶n̶d̶-̶s̶y̶n̶t̶a̶x̶ ̶f̶o̶r̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶d̶ ̶k̶e̶y̶w̶o̶r̶d̶s̶ ̶t̶o̶ ̶b̶e̶ ̶u̶s̶e̶d̶:̶

Xml qnames are uniformly encoded into keywords by urlencoding the xmlns uri into the keyword namespace (prefixed with xmlns., as in the second example below). For names in the empty namespace, non-namespaced keywords are used.

<foo/> => {:tag :foo}

<n:foo xmlns:n="NO:NO/NO" /> => {:tag :xmlns.NO%3ANO%2FNO/foo}
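The mapping in the two examples above could be sketched as a pair of functions. qname->keyword and keyword->uri are hypothetical names chosen for this page, and the sketch assumes java's URLEncoder as the codec; the functions data.xml actually ships may look different.

```clojure
(import '(java.net URLEncoder URLDecoder))

;; Prefix taken from the example above.
(def kw-ns-prefix "xmlns.")

(defn qname->keyword
  "Hypothetical helper: encodes namespace uri + local name into a keyword.
   Names in the empty namespace become non-namespaced keywords."
  [^String uri ^String local]
  (if (empty? uri)
    (keyword local)
    (keyword (str kw-ns-prefix (URLEncoder/encode uri "UTF-8")) local)))

(defn keyword->uri
  "Hypothetical helper: recovers the namespace uri from an encoded keyword.
   Returns nil for keywords that carry no encoded uri."
  [kw]
  (when-let [^String kw-ns (namespace kw)]
    (when (.startsWith kw-ns kw-ns-prefix)
      (URLDecoder/decode (subs kw-ns (count kw-ns-prefix)) "UTF-8"))))

(qname->keyword "" "foo")         ;; => :foo
(qname->keyword "NO:NO/NO" "foo") ;; => :xmlns.NO%3ANO%2FNO/foo
(keyword->uri (qname->keyword "NO:NO/NO" "foo")) ;; => "NO:NO/NO"
```

Note that the prefix n from the xml source does not appear anywhere in the keyword. Two documents that bind different prefixes to the same uri therefore yield equal tags, in line with the infoset equality premise.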

Similar to xml serialization, the kw-ns :xmlns/... and :xml/... are given special treatment: even though you can still emit them by giving their full namespace uri, their canonical representation is the short form.
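A sketch of how a parser might pick the canonical form, assuming the two namespace uris fixed by the xml namespaces specification. The names special-uris and canonical-keyword are hypothetical.

```clojure
(import '(java.net URLEncoder))

;; The two uris reserved by the xml namespaces specification,
;; mapped to their short keyword namespaces.
(def special-uris
  {"http://www.w3.org/2000/xmlns/"        "xmlns"
   "http://www.w3.org/XML/1998/namespace" "xml"})

(defn canonical-keyword
  "Hypothetical helper: canonical keyword for a namespace uri + local name."
  [^String uri ^String local]
  (cond
    (empty? uri)                 (keyword local)
    (contains? special-uris uri) (keyword (special-uris uri) local)
    :else (keyword (str "xmlns." (URLEncoder/encode uri "UTF-8")) local)))

(canonical-keyword "http://www.w3.org/XML/1998/namespace" "lang")
;; => :xml/lang
(canonical-keyword "http://www.w3.org/2000/xmlns/" "n")
;; => :xmlns/n
```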

Additionally accepted qname types in the emitter:

xml attributes

Attributes are stored in hash-maps. The parser removes xmlns attributes from the attr hash and stores them in metadata (accessible via clojure.data.xml/element-nss).

The namespace environment can be augmented by associating :xmlns and :xmlns/<prefix> attributes.
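The handling of xmlns attributes described above can be sketched on plain maps. The metadata key :sketch/nss and both function names are assumptions made for this example; the real parser exposes the namespace environment through clojure.data.xml/element-nss, whose internal layout is not shown here.

```clojure
(defn xmlns-attr?
  "True for :xmlns and for any :xmlns/<prefix> attribute key."
  [k]
  (or (= :xmlns k) (= "xmlns" (namespace k))))

(defn split-xmlns-attrs
  "Hypothetical helper: returns attrs without namespace declarations.
   The declarations are attached as metadata under :sketch/nss,
   as a map of prefix -> uri (the default namespace under \"\")."
  [attrs]
  (let [nss   (into {} (for [[k uri] attrs :when (xmlns-attr? k)]
                         [(if (= :xmlns k) "" (name k)) uri]))
        plain (into {} (remove (fn [[k _]] (xmlns-attr? k)) attrs))]
    (with-meta plain {:sketch/nss nss})))

(def attrs (split-xmlns-attrs {:xmlns/n "NO:NO/NO" :id "1"}))

attrs                      ;; => {:id "1"}
(:sketch/nss (meta attrs)) ;; => {"n" "NO:NO/NO"}
```

Because clojure ignores metadata when comparing values, keeping the prefix bindings there means that two elements differing only in their choice of prefixes still compare equal.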
