This page is work-in-progress. Until finalization, normative language in this page should be considered a proposal.

Runtime data structures

data.xml works with two kinds of xml representations: The representation tier and the model tier.

Representation Tier (raw)

This tier maps directly to serialized xml. Namespace declarations are plain xmlns attributes, and name prefixes are keyword namespaces. This representation closely resembles the traditional clojure xml format. Readers and writers need not be namespace aware; they just need to translate prefixes correctly.

Example
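As a hedged illustration, a raw element could look like the map below. This sketch assumes the traditional `{:tag :attrs :content}` element shape; the WebDAV names and the helpers `raw-name->str` and `raw-declarations` are hypothetical and not part of data.xml.

```clojure
;; A raw (representation tier) element, assuming the traditional
;; {:tag :attrs :content} shape. Prefixes are keyword namespaces and
;; namespace declarations are ordinary xmlns attributes.
(def raw-propfind
  {:tag     :D/propfind
   :attrs   {:xmlns/D "DAV:"}            ; serializes as xmlns:D="DAV:"
   :content [{:tag     :D/prop
              :attrs   {}
              :content [{:tag :D/getetag :attrs {} :content []}]}]})

;; Hypothetical helper: the serialized (prefixed) form of a raw name.
(defn raw-name->str [kw]
  (if-let [prefix (namespace kw)]
    (str prefix ":" (name kw))
    (name kw)))

;; Hypothetical helper: prefix->uri declarations found in an attribute map.
;; A bare :xmlns attribute declares the default namespace (prefix "").
(defn raw-declarations [attrs]
  (into {}
        (keep (fn [[k v]]
                (cond
                  (= k :xmlns)              ["" v]
                  (= (namespace k) "xmlns") [(name k) v])))
        attrs))

(raw-name->str (:tag raw-propfind))       ; "D:propfind"
(raw-declarations (:attrs raw-propfind))  ; {"D" "DAV:"}
```

Nothing here resolves a prefix to a namespace. The tag `:D/propfind` only carries the prefix `D`, which is why a reader or writer at this tier does not need to be namespace aware.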

Model Tier (resolved)

This tier models the xml infoset. The namespace of tags and attributes is resolved from their prefix, namespace declarations are separated from other attributes, and bound prefixes are tracked per element.

There are two popular interpretations of the infoset: the XPath data model and the DOM. With respect to namespacing, they differ in one major way: the DOM considers xmlns declarations to be attribute nodes, while XPath does not represent xmlns declarations at all and only keeps track of in-scope prefix/namespace bindings. XPath's deep-equal does not take xmlns declarations into account.

The default model tier representation in data.xml is the XPath data model. Others might be added as parser options and tree passes, and/or transparently consumed by the emitter.

Clojure's = function will implement deep-equal on XPath-flavored model tier data, so in-scope namespaces, prefixes and other information that doesn't contribute to deep-equal will be kept in metadata. All data.xml functions will preserve metadata where appropriate.
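The effect of keeping namespace bookkeeping in metadata can be sketched with plain Clojure maps. The element shape and the `:in-scope-ns` metadata key are assumptions made for illustration; resolved names use `javax.xml.namespace.QName`, whose `equals` ignores the prefix.

```clojure
(import 'javax.xml.namespace.QName)

;; Two resolved elements with the same name in the DAV: namespace,
;; but with different prefixes and different in-scope bindings.
(def a
  (with-meta {:tag (QName. "DAV:" "propfind" "D") :attrs {} :content []}
    {:in-scope-ns {"D" "DAV:"}}))     ; :in-scope-ns is a hypothetical key

(def b
  (with-meta {:tag (QName. "DAV:" "propfind" "dav") :attrs {} :content []}
    {:in-scope-ns {"dav" "DAV:"}}))

(= a b)                 ; true:  = ignores metadata and QName prefixes
(= (meta a) (meta b))   ; false: the bookkeeping itself still differs

;; Ordinary map operations keep the metadata attached.
(meta (assoc a :content ["text"]))    ; {:in-scope-ns {"D" "DAV:"}}
```

Because `=` never looks at metadata, two documents that differ only in prefix choice or declaration placement compare as equal, which matches the deep-equal behaviour described above.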

Example

The ::dav/ convention for denoting resolved names is explained below. For the purpose of this example, ::dav/name denotes `name` in the namespace DAV:, and ::xmlns/ denotes names in the namespace uri bound to the prefix xmlns.

Tier transitions

data.xml will offer transitions between the various representations. The reference implementation for those transitions will be tree transformers that can be applied to one representation to yield another. The baseline parser and emitter will deal with representation tier data only. Higher-level parsing and emitting functions are defined as their baseline equivalents composed with tree transformers, though they may be implemented directly for added efficiency.
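A minimal sketch of one such tree transformer, going from raw to resolved, is given below. It is not the data.xml implementation: the function names, the `{:tag :attrs :content}` element shape and the `:in-scope-ns` metadata key are all assumptions.

```clojure
(import 'javax.xml.namespace.QName)

(defn- xmlns-attr? [k]
  (or (= k :xmlns) (= (namespace k) "xmlns")))

;; prefix->uri bindings declared in one attribute map ("" = default namespace)
(defn- declarations [attrs]
  (into {} (for [[k v] attrs :when (xmlns-attr? k)]
             [(if (= k :xmlns) "" (name k)) v])))

;; Resolve a raw keyword against the in-scope bindings. Unprefixed element
;; names use the default namespace; unprefixed attributes have no namespace.
(defn- resolve-name [env kw element?]
  (let [prefix (namespace kw)
        uri    (cond
                 prefix   (or (get env prefix)
                              (throw (ex-info "Unbound prefix" {:name kw})))
                 element? (get env "" "")
                 :else    "")]
    (QName. uri (name kw) (or prefix ""))))

(defn resolve-tree
  "Raw element -> resolved element. In-scope bindings go into metadata."
  ([el] (resolve-tree {} el))
  ([parent-env el]
   (if-not (map? el)
     el                                        ; text nodes pass through
     (let [env   (merge parent-env (declarations (:attrs el)))
           attrs (into {} (for [[k v] (:attrs el)
                                :when (not (xmlns-attr? k))]
                            [(resolve-name env k false) v]))]
       (with-meta
         {:tag     (resolve-name env (:tag el) true)
          :attrs   attrs
          :content (map #(resolve-tree env %) (:content el))} ; stays lazy
         {:in-scope-ns env})))))
```

Under these assumptions, a higher-level parser would simply be the composition of the baseline parser with this pass, for example `(comp resolve-tree parse-raw)`, where `parse-raw` stands for a hypothetical baseline parser.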

Assigning prefixes when emitting

There is an added complication when going from model tier to representation tier: which prefixes should be assigned, and how.

Ideally, a single prefix for every used namespace would be declared at the root element, plus a default namespace above areas that mainly use one namespace. Unfortunately, neither can be done automatically without scanning the whole document, which means giving up laziness. data.xml should provide a transformation pass that does this and can be used on smallish documents.
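The root-declaration part of such a pass could look roughly like the sketch below. It assumes resolved elements of the shape `{:tag QName :attrs {QName value} :content [...]}` and a hypothetical `:in-scope-ns` metadata key; the generated `ns0`, `ns1`, ... prefixes are likewise only an illustration, and default namespace selection is left out.

```clojure
(import 'javax.xml.namespace.QName)

(defn used-namespaces
  "All distinct non-empty namespace uris used by tags and attributes.
  Walks the entire tree, so any laziness in :content is lost."
  [root]
  (->> (tree-seq map? :content root)
       (filter map?)
       (mapcat (fn [el] (cons (:tag el) (keys (:attrs el)))))
       (map #(.getNamespaceURI ^QName %))
       (remove empty?)
       distinct))

(defn declare-at-root
  "Attaches one generated prefix per used namespace to the root element."
  [root]
  (let [env (into {}
                  (map-indexed (fn [i uri] [(str "ns" i) uri]))
                  (used-namespaces root))]
    (vary-meta root assoc :in-scope-ns env)))
```

The call to `tree-seq` is what forces the full scan: no declaration can be written at the root before the last element has been seen.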

The emitter, not having the luxury of multiple passes, will err on the side of caution and emit the full namespace environment found in metadata, removing only redundant prefixes. When there is no metadata, or an element from an unbound namespace is found, it will fall back to introducing the namespace on first use. This could mean considerable overhead in some cases, so the emitter should provide some kind of warning.
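The two emitter behaviours described above can be sketched as small pure functions. Everything here is an assumption for illustration: the `:in-scope-ns` metadata key, the function names and the `nsN` naming scheme for introduced prefixes.

```clojure
(import 'javax.xml.namespace.QName)

(defn new-declarations
  "Bindings from the element's metadata that are not already in scope,
  i.e. the environment with redundant declarations removed."
  [emitted-env el]
  (into {}
        (remove (fn [[prefix uri]] (= uri (get emitted-env prefix))))
        (:in-scope-ns (meta el))))

(defn- fresh-prefix [env]
  (first (remove #(contains? env %) (map #(str "ns" %) (range)))))

(defn bind-tag
  "Finds a prefix for the tag's namespace, introducing one on first use.
  :introduced? tells the caller that it may want to issue a warning.
  Simplification: a name without namespace always gets the empty prefix."
  [env ^QName tag]
  (let [uri (.getNamespaceURI tag)]
    (if (empty? uri)
      {:env env :prefix "" :introduced? false}
      (if-let [prefix (some (fn [[p u]] (when (= u uri) p)) env)]
        {:env env :prefix prefix :introduced? false}
        (let [p (fresh-prefix env)]
          {:env (assoc env p uri) :prefix p :introduced? true})))))
```

If many sibling elements use an unbound namespace, `bind-tag` introduces a declaration on each of them, which is the overhead the warning is meant to flag.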

Mixed content

To guarantee consistent results, there are some constraints to be observed when combining raw and resolved xml. There is still open debate on whether this should be allowed at all, since users should mostly stay within the model tier and can always resolve raw xml manually, with full control over the namespace environment.

The main use case for raw xml is literals within code. The ::ns/name convention helps with referencing namespaced names in a stable way. Still, at least the use of unprefixed keywords for attribute names needs to be allowed in the model tier, because unprefixed attributes always resolve to the empty namespace.

XML Name Literals

Representation tier (raw) names

As detailed above, raw names just have a prefix with no canonical translation to a namespace. Thus the emitter can only check whether the prefix exists, not whether the namespace is right. Because improper handling can lead to collisions, disallowing raw prefixes embedded in the model tier is under consideration.

Raw names that have a corresponding clojure namespace (pseudo-raw)

The notational convention ::ns/name, used above, expands in clojure's reader to a raw name that has the clojure namespace aliased to 'ns as its prefix. We take the opportunity to define a canonical translation from raw names that have an associated clojure namespace to an xml namespace. They are called pseudo-raw because they are technically just prefixed names whose namespace binding is defined by a (clojure ns-)global prefix map.

data.xml maintains a per-(clojure) namespace map of xml prefix->namespace, which can be configured similarly to namespace aliases:
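A minimal sketch of such a registry follows. The names `prefix-maps`, `declare-prefix!` and `pseudo-raw-uri` are hypothetical, as is the `example.dav` namespace; the actual data.xml configuration API may look different.

```clojure
;; Hypothetical registry: clojure ns name -> {xml prefix -> namespace uri}
(defonce prefix-maps (atom {}))

(defn declare-prefix!
  "Binds `prefix` to the xml namespace `uri` within clojure namespace `ns-sym`."
  [ns-sym prefix uri]
  (swap! prefix-maps assoc-in [ns-sym prefix] uri))

(defn pseudo-raw-uri
  "Xml namespace for a prefixed keyword as seen from `ns-sym`, or nil."
  [ns-sym kw]
  (get-in @prefix-maps [ns-sym (namespace kw)]))

;; An alias makes the reader expand ::dav/propfind to :example.dav/propfind,
;; so the clojure namespace name serves as a stable prefix.
(create-ns 'example.dav)
(alias 'dav 'example.dav)

(def here (ns-name *ns*))
(declare-prefix! here "example.dav" "DAV:")

;; Read at runtime so that the alias above is already in place;
;; in a source file one would write ::dav/propfind directly.
(def propfind (read-string "::dav/propfind"))

(pseudo-raw-uri here propfind)   ; "DAV:"
```

Keying the registry by clojure namespace mirrors how aliases work: the same prefix may be bound differently in different source files without interfering.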

With this mechanism the user can easily write down xml names with a stable namespace in clojure files, without having to specify the uri every time. For security (and simplicity) reasons, the integration of this convention will be limited:

Security considerations
If every raw name were subject to this translation, an attacker could inject prefixes named like clojure namespaces, knowing they would get translated to a different namespace than specified.
Therefore implicit translation will only happen when emitting raw content in the context of resolved xml. When the prefix is not known to the translation mechanism, the emitter could either raise an error or silently emit the prefix, as per the discussion above.
There will also be a specialized resolve-* function for resolving pseudo-raw names. It should not be used on arbitrary input and exists mostly to support a #xml/resolve reader tag.

Model tier (resolved) names

These are represented by javax.xml.namespace.QName.

To aid readability and writeability, QName has a reader tag #xml/name and an associated print-dup implementation in data.xml.
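How a reader tag and a print-dup method fit together can be sketched as follows. The textual form `#xml/name "{uri}local"` is an assumption chosen because `QName/valueOf` and `QName.toString` already use it; the form data.xml settles on may differ, and `read-qname` is a hypothetical name.

```clojure
(import 'javax.xml.namespace.QName)

;; Assumed printed form: #xml/name "{uri}local"
(defmethod print-dup QName [^QName q ^java.io.Writer w]
  (.write w (str "#xml/name " (pr-str (str q)))))

;; Hypothetical reader function for the tag.
(defn read-qname [s]
  (QName/valueOf s))

(def printed
  (binding [*print-dup* true]
    (pr-str (QName. "DAV:" "propfind"))))
;; printed is the string: #xml/name "{DAV:}propfind"

(def round-tripped
  (binding [*data-readers* {'xml/name read-qname}]
    (read-string printed)))

(= round-tripped (QName. "DAV:" "propfind"))   ; true
```

In this form the prefix is not printed, so it does not survive a round trip. That loses nothing for equality, since `QName.equals` ignores the prefix.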

When a resolved name is used in a raw context, its prefix is used, while the namespace is checked to match the prefix in the context (possible in StAX and the custom treewalkers).

Otherwise, the prefix is reassigned by the emitter. 
