This page is work-in-progress. Until finalization, normative language in this page should be considered a proposal. https://github.com/bendlas/data.xml

Revisions

[r1] 20160804 Encode xml namespaces directly into keyword namespaces

The runtime-global registry for xmlns <-> cljns mappings, provided by declare-ns and alias-ns, poses a problem of governance: if two libraries want to be composable (a very basic requirement), they need to agree on their declare-ns clauses. Even worse, user source code is hardcoded against whatever mapping their library chose to provide. The only "fix" for that would be maintaining a mapping of "known" uris to clojure namespaces within data.xml, but that is unsatisfactory: the work of assigning unique names is already done by various registries, such as iana, and the list of xml uris that people might want to use with data.xml is quite large.

At the same time, we want to keep using ::xmlns-alias/keywords. This can be achieved by encoding the uri directly into the keyword, after substituting clojure's syntax characters. The substitution uses percent-encoding, along with the substitution rules given in https://www.ietf.org/rfc/rfc3987.txt

https://groups.google.com/forum/#!topic/clojure/Txj3suj2B3s

[r2] 20161107 Justification for qname encoding

Premise. We want a 1:1 mapping between our xml encoding and the xml infoset, for any bit contributing to value equality.
=> This demands encoding the full xmlns uri into every qname, as opposed to some registry scheme. [r1] is driven by this.

Premise. We want to keep keywords because of ::alias/shorthands.
=> The ns uri will have to fit into a keyword namespace because of the natural mapping premise ; )

Premise. We want to stay within valid edn for (comp read-string pr-str) compatibility.
=> This requires some sort of encoding for ns uris. 

Premise. We want to retain some readability for debugging purposes.
=> This rules out base64 and other formats geared towards efficiently encoding binary data.

This leaves us with either shifting problematic characters into unicode or using an escaping scheme.

Unicode shifting is appealing, because it retains a maximum of readability.
On the other hand, it leaves endless possibilities for bikeshedding: target alphabet, target alphabet of target alphabet, ...; lookalike characters vs not. Whatever scheme we chose based on this would be unfamiliar to anybody but heavy data.xml users, and in sum this doesn't look like a winnable game.

For escaping schemes, there are a couple of well known choices, most of them based on \ as an escape character, which rules them out for use in edn keywords.
Arguably the best known escaping scheme, though, is rfc3987 percent-encoding, also known as url-encoding. This happens to be an almost perfect fit for transcribing human-readable strings into clojure keyword namespaces:
1.) It retains a moderate amount of readability, which gets better through reusable knowledge. You probably don't even have to check when I tell you that %2F means /
2.) The reserved characters in a uri segment seem to be a happy superset of clojure's reserved characters. In particular, it reserves : and /, but leaves . (which leads to a curious, but not entirely unappealing, mapping to java package trees.)
     Hence, using urlencoding will even save us from shipping codecs (we need to make sure, though, that java's URLEncoder fully agrees with javascript's encodeURIComponent)
3.) % got allowed in clojure 1.5.0. The jury on http://dev.clojure.org/jira/browse/CLJ-1527 is still out, but Rich Hickey's talk about not breaking APIs could be interpreted to mean that it will stay allowed.
     Bumping the required clojure version from 1.4.0 to 1.5.0 is a slight drawback, but that will amortize as people keep upgrading their software. It should also be possible to work with a useful subset of data.xml on 1.4.0, even with some namespacing support, as % won't be readable there but is still constructable.
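
The encoding argued for above can be sketched in a few lines. This is a minimal sketch, not the data.xml implementation: the helper names uri->kw-ns and qname-kw are hypothetical, and the "xmlns." prefix is taken from the examples elsewhere on this page. It relies only on java.net.URLEncoder and assumes clojure 1.5.0 or later for reading % in keywords.

```clojure
(import '[java.net URLEncoder])

;; Hypothetical helper, not part of data.xml: percent-encode an xmlns uri
;; and prefix it with "xmlns." to form a keyword namespace.
(defn uri->kw-ns [^String uri]
  ;; URLEncoder implements form encoding, so a space would become "+".
  ;; That is one place where it differs from javascript's encodeURIComponent.
  (str "xmlns." (URLEncoder/encode uri "UTF-8")))

;; Hypothetical helper: build the keyword for a qualified xml name.
(defn qname-kw [uri local]
  (keyword (uri->kw-ns uri) local))

(def rdf-nil
  (qname-kw "http://www.w3.org/1999/02/22-rdf-syntax-ns#" "nil"))

;; The encoded namespace contains no ":" or "/", so the keyword is expected
;; to survive a print/read round trip.
(def round-tripped (read-string (pr-str rdf-nil)))
```

Here (namespace rdf-nil) keeps the dots of the host name intact while : / and # are escaped, which is what makes the result usable as a keyword namespace.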

[r3] 20171226 :xmlns "..." attributes transform non-namespaced content

Manually setting an :xmlns attribute for the emitter (the parser will never generate such an attribute) now behaves exactly as in xml: it transforms non-namespaced tags within the current element into a default xmlns.

Effectively, this specifies a second representation for elements that is not canonical and is useful mainly for emitting. For QNames, there is already a precedent: the emitter accepts QName instances, keywords and strings.

https://dev.clojure.org/jira/browse/DXML-52

This motivates a normalization function, to make equal fragments clojure.core/=
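
To illustrate what such a normalization might do for the :xmlns case, here is a minimal sketch. It is not the data.xml implementation: resolve-default-xmlns and uri->kw-ns are hypothetical names, elements are assumed to be plain maps with keyword tags, and dropping the :xmlns attribute from the result is an assumption of this sketch.

```clojure
(import '[java.net URLEncoder])

;; Hypothetical helper: percent-encode a uri into a keyword namespace.
(defn uri->kw-ns [^String uri]
  (str "xmlns." (URLEncoder/encode uri "UTF-8")))

(defn resolve-default-xmlns
  "Hypothetical normalization step: rewrites non-namespaced tags into the
  default xmlns that is in scope, as set by an :xmlns attribute."
  ([el] (resolve-default-xmlns el nil))
  ([el inherited-uri]
   (if-not (map? el)
     el ; strings and other non-element content pass through unchanged
     (let [attrs (:attrs el)
           ;; an :xmlns attribute overrides the inherited default;
           ;; an empty string resets it, as in xml
           uri   (get attrs :xmlns inherited-uri)
           tag   (:tag el)
           tag   (if (and (seq uri) (nil? (namespace tag)))
                   (keyword (uri->kw-ns uri) (name tag))
                   tag)]
       (cond-> (assoc el :tag tag)
         (contains? attrs :xmlns) (update-in [:attrs] dissoc :xmlns)
         (seq (:content el))      (assoc :content
                                         (mapv #(resolve-default-xmlns % uri)
                                               (:content el))))))))

(def shorthand
  {:tag :foo :attrs {:xmlns "NO:NO/NO"} :content [{:tag :bar}]})

(def canonical
  {:tag :xmlns.NO%3ANO%2FNO/foo
   :attrs {}
   :content [{:tag :xmlns.NO%3ANO%2FNO/bar}]})
```

With this, (= canonical (resolve-default-xmlns shorthand)) holds under clojure.core/=, which is the property the normalization function is meant to provide. Attributes are left alone, since a default xmlns does not apply to them in xml.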

Runtime data structures

;; <rdf:nil xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
;; would be represented in clojure as 

;; [r1] no more 
;; (declare-ns :xml.rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#") ; globally associates the clojure namespace xml.rdf with the rdf xmlns
;; {:tag :xml.rdf/nil} ; now denotes an rdf element with the qualified name {http://www.w3.org/1999/02/22-rdf-syntax-ns#}nil
;; [r1] instead of declare-ns and alias-ns, we now have
(alias-uri :rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#") 
{:tag ::rdf/nil}
;; which the reader expands to {:tag :xmlns.http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23/nil}

;; a reader macro can be used to refer to an xmlns through clojure's alias facilities (alias, refer :as), e.g.
(require '[#xml/ns "http://www.w3.org/1999/02/22-rdf-syntax-ns#" :as rdf])
{:tag ::rdf/nil} ; the alias can then be used to introduce shorthands
;; in clojurescript, this is the only way, as there is no alias-uri there
;; what makes this awkward is that, in this case, http%3A%2F%2Fwww/w3/org%2F1999%2F02%2F22-rdf-syntax-ns%23.clj, plus one with all the % replaced by _PERCENT_ for .cljs, need to exist on the classpath.
;; there is hope for this, though: http://dev.clojure.org/jira/browse/CLJ-2030

canonical representation

Even if the emitter accepts a slightly larger set of representations, the parser should produce a very uniform data structure, which should map xml infoset equality to clojure equality and match what a user would write by hand.

Unfortunately, percent-encoded uri namespaces don't quite fit the bill on user-friendliness, but outside of clojure's kw-aliasing facilities, this can still be fixed by using reader tags.

Since the emitter accepts a larger set of representations, there is a normalization function for xml fragments, called canonicalize. Additionally, there is clojure.data.xml/= as a possibly more efficient version of #(clojure.core/= (canonicalize %1) (canonicalize %2))

xml elements

Elements are represented as maps with keys #{:tag :attrs :content}. The canonical representation is a clojure.data.xml.node/Element defrecord, exposed through the constructors element and element*.

clojure.data.xml.node/Element implements a custom equality, compatible with maps. It does not, however, use clojure.data.xml/=, in order to preserve commutativity.

element* takes tag name, attributes, a content list, and optional metadata. It can be used to construct non-canonical representations.

element takes content varargs and canonicalizes its tag, attributes and content maps. It won't canonicalize content elements.

xml names

̶I̶n̶ ̶t̶h̶e̶ ̶g̶e̶n̶e̶r̶a̶l̶ ̶c̶a̶s̶e̶,̶ ̶x̶m̶l̶ ̶n̶a̶m̶e̶s̶ ̶a̶r̶e̶ ̶r̶e̶p̶r̶e̶s̶e̶n̶t̶e̶d̶ ̶a̶s̶ ̶(̶Q̶N̶a̶m̶e̶s̶)̶[̶h̶t̶t̶p̶:̶/̶/̶d̶o̶c̶s̶.̶o̶r̶a̶c̶l̶e̶.̶c̶o̶m̶/̶j̶a̶v̶a̶e̶e̶/̶1̶.̶4̶/̶a̶p̶i̶/̶j̶a̶v̶a̶x̶/̶x̶m̶l̶/̶n̶a̶m̶e̶s̶p̶a̶c̶e̶/̶Q̶N̶a̶m̶e̶.̶h̶t̶m̶l̶]̶ ̶o̶r̶,̶ ̶i̶f̶ ̶t̶h̶e̶y̶ ̶h̶a̶v̶e̶ ̶n̶o̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶ ̶u̶r̶i̶,̶ ̶a̶s̶ ̶k̶e̶y̶w̶o̶r̶d̶.̶
̶d̶a̶t̶a̶.̶x̶m̶l̶ ̶h̶a̶s̶ ̶a̶ ̶f̶a̶c̶i̶l̶i̶t̶y̶ ̶t̶o̶ ̶a̶s̶s̶o̶c̶i̶a̶t̶e̶ ̶c̶l̶o̶j̶u̶r̶e̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶s̶ ̶w̶i̶t̶h̶ ̶x̶m̶l̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶ ̶u̶r̶i̶s̶.̶ ̶W̶h̶i̶c̶h̶ ̶a̶l̶l̶o̶w̶s̶ ̶c̶l̶o̶j̶u̶r̶e̶'̶s̶ ̶s̶h̶o̶r̶t̶h̶a̶n̶d̶-̶s̶y̶n̶t̶a̶x̶ ̶f̶o̶r̶ ̶n̶a̶m̶e̶s̶p̶a̶c̶e̶d̶ ̶k̶e̶y̶w̶o̶r̶d̶s̶ ̶t̶o̶ ̶b̶e̶ ̶u̶s̶e̶d̶:̶

Xml qnames are uniformly encoded into keywords by urlencoding the xmlns uri into the keyword namespace, prefixed with xmlns. as in the examples below. For names in the empty namespace, non-namespaced keywords are used.

<foo/> => {:tag :foo}

<n:foo xmlns:n="NO:NO/NO" /> => {:tag :xmlns.NO%3ANO%2FNO/foo}

<foo xmlns="NO:NO/NO" /> => {:tag :xmlns.NO%3ANO%2FNO/foo}

Similar to xml serialization, the kw-ns :xmlns/... and :xml/... are given special treatment: even though you can still emit them by giving their full namespace uri, their canonical representation is the short form.
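
Going the other way, an emitter has to turn such keywords back into qualified names. The following is a minimal sketch under stated assumptions, not the data.xml implementation: kw->qname is a hypothetical name, and mapping the short forms :xml/... and :xmlns/... to the uris from javax.xml.XMLConstants is an assumption about what "special treatment" means here.

```clojure
(import '[java.net URLDecoder]
        '[javax.xml XMLConstants]
        '[javax.xml.namespace QName])

(defn kw->qname
  "Hypothetical inverse of the keyword encoding: returns a
  javax.xml.namespace.QName for a keyword-encoded xml name."
  [kw]
  (let [kw-ns (namespace kw)
        local (name kw)]
    (cond
      ;; names in the empty namespace are plain keywords
      (nil? kw-ns)      (QName. local)
      ;; assumed handling of the canonical short forms
      (= "xml" kw-ns)   (QName. XMLConstants/XML_NS_URI local)
      (= "xmlns" kw-ns) (QName. XMLConstants/XMLNS_ATTRIBUTE_NS_URI local)
      ;; everything else carries the percent-encoded uri after "xmlns."
      (.startsWith ^String kw-ns "xmlns.")
      (QName. (URLDecoder/decode (subs kw-ns (count "xmlns.")) "UTF-8") local)
      :else
      (throw (IllegalArgumentException. (str "Not an xml name: " kw))))))
```

For example, (str (kw->qname :xmlns.NO%3ANO%2FNO/foo)) gives the name in {uri}local notation, and (kw->qname :foo) yields a QName with an empty namespace uri.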

Additional, non-canonical qname types accepted by the emitter: QName instances and strings.

xml attributes

Attributes are stored in hash-maps. The parser removes xmlns attributes from the attr hash and stores them in metadata (accessible via clojure.data.xml/element-nss).
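
The split between ordinary attributes and xmlns declarations could look roughly like this. It is a minimal sketch, not the parser's actual code: separate-xmlns and xmlns-attr? are hypothetical names, and the metadata key :hypothetical/nss as well as the prefix -> uri map shape are assumptions made for illustration.

```clojure
;; Hypothetical predicate: is this attribute key an xmlns declaration?
(defn xmlns-attr? [k]
  (or (= :xmlns k)
      (= "xmlns" (namespace k))))

(defn separate-xmlns
  "Hypothetical sketch: moves :xmlns and :xmlns/<prefix> attributes out of
  :attrs and into element metadata as a prefix -> uri map."
  [el]
  (let [attrs (:attrs el)
        nss   (into {} (for [[k uri] attrs
                             :when (xmlns-attr? k)]
                         ;; the default namespace is stored under the empty prefix
                         [(if (= :xmlns k) "" (name k)) uri]))
        plain (into {} (remove (comp xmlns-attr? key) attrs))]
    (with-meta (assoc el :attrs plain)
      {:hypothetical/nss nss})))

(def parsed
  (separate-xmlns {:tag   :xmlns.NO%3ANO%2FNO/foo
                   :attrs {:xmlns/n "NO:NO/NO" :id "a"}}))
```

Because the declarations live in metadata, they do not take part in clojure equality: (:attrs parsed) contains only :id, while (meta parsed) still records which prefix the source document used.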

The namespace environment can be augmented by associating :xmlns and :xmlns/<prefix> attributes.

Associating attributes :xmlns or :xmlns/<prefix> denotes a non-canonical representation for namespaced xml, in which you can scope tag names, similar to xml. This is akin to the 0.0.8 API, but only for the emitter.