<< Back to previous view

[TRDR-9] Add more source metadata to read objects Created: 27/Sep/13  Updated: 02/Sep/14  Resolved: 20/Nov/13

Status: Closed
Project: tools.reader
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Alexander Redington Assignee: Alexander Redington
Resolution: Completed Votes: 0
Labels: None

Attachments: File TRDR-9-source-logging.diff    
Patch: Code and Test


I'm working on a static analyzer for Clojure, where I'd like to have a reader which provides precise form coordinates and source representations.

Specifically this enables analyses where it's straightforward to bind an object like:

(fn* [p1__3615#] (inc (conj p1__3615#)))

to the source form which it was read from:

"#(inc (conj %))"

Would this kind of functionality be considered appropriate to tools.reader? Specifically I'd like to make a change such that

  • Read data is unchanged
  • When an IMeta is read, additional information is embedded in the metadata, but no existing metadata is changed
  • The additional metadata would include the source character string, the ending column, and the ending line from the source stream

As an example, reading a stream that contained the following on the first line:

#(inc (conj %))

Would return a list as follows:

(fn* [p1__3615#] (inc (conj p1__3615#)))

And would change the metadata from:

{:line 1, :column 1}

to augment it with more metadata, as follows:

{:line 1, :column 1, :end-line 1, :end-column 15, :source "#(inc (conj %))"}

I'd happily contribute a patch to include this behavior if you'd be inclined to accept it. I'd like to first see if it would be considered appropriate, and also discuss any other concerns you might have (such as having this behavior be optionally enabled, etc)

Comment by Nicola Mometto [ 28/Sep/13 1:48 PM ]

I would definitely take a patch to add such a behaviour, with the condition that this should not degrade the performances too much (a 2x slowdown would be acceptable though).

I don't feel like this needs to be optional behaviour, we already provide column/line info by default, if it's not needed the user might just ignore it.

Comment by Alexander Redington [ 18/Oct/13 12:49 PM ]

Just like to touch base as I've been quiet for a little while, but haven't forgotten about this. It's been my primary goal on Friday time.

I currently have a naive implementation that works, but blows our performance goal by a 5x factor. I think this is solvable with a little more finesse in source logging.

It appears that end-line and end-column information can be obtained for almost no additional computational cost. (less than 2% additional processor time).

I've committed my WIP to: https://github.com/aredington/tools.reader/tree/metadata-logging

Also worth discussing at some point in time: some tests around syntax quoting and metadata reading start failing. For syntax quote, the forms returned are significantly more verbose, including metadata assertions with the source attribution. For metadata, the metadata values read now have an additional :source key included that specifies the source string the forms were read from. In both cases, the expectations of the tests are violated, but I'm not inclined to say that the end result is incorrect without discussion.

Comment by Nicola Mometto [ 22/Oct/13 11:57 AM ]

Awesome, I haven't had a chance to read through the implementation but I'll do so as soon as possible.
If I'm reading the last commit message correctly, it looks like since your comment here you were able to reduce performance loss to 2x, that's cool.

Can you please make the end-line/end-column modifications into another ticket? I can merge that as soon as you upload the patch.

Regarding the failing tests – looks like those are inevitable, it's definitely not a bug and those tests should be fixed to match the new behaviour (fixing those should be just a matter of manually attatching the :source metadata to the forms to be read by the clojure reader)

Comment by Nicola Mometto [ 24/Oct/13 11:39 AM ]

After some more thoughts, I think it should be better to have a "InfoPushBackReader" (not sure if it's the best name) wrapping an IndexingPushBackReader that provides source data, this way we only get the performance hit when we effectively need the :source info instead of making every Reader collect source data even when we don't want/need it.

Would this approach work for your use case?

Comment by Alexander Redington [ 25/Oct/13 8:04 AM ]

Absolutely. I'll make two patches and we will have two paths for metadata:

  • Default case; if line numbering information is available, end-line and end-column will be captured. The performance hit from this is marginal (2-3% slower), but it does provide sufficient information for, e.g. editors, IDEs to precisely indicate what portions of a source file originate what forms. This patch will be submitted in a new ticket.
  • Enhanced case; using a new InfoPushbackReader (SourceInfoPushbackReader? SourceLoggingPushbackReader?) we'll log precise source information on each form read. The performance hit from this will be substantial (100% slower), but it isolates the performance impacts to those users who deliberately want more information and are willing to pay the performance cost. This patch will be attached to this issue (TRDR-9)
Comment by Alexander Redington [ 25/Oct/13 8:27 AM ]

Created TRDR-10 for end-line and end-column info.

Comment by Nicola Mometto [ 06/Nov/13 4:04 PM ]

Looks like CLJS will also beenfit from this, see http://dev.clojure.org/jira/browse/CLJS-658

+1 for SourceLoggingPushbackReader

Comment by Alexander Redington [ 15/Nov/13 1:35 PM ]

On https://github.com/aredington/tools.reader I have it nearly working save that nested collections (e.g. [1 2 [3 4 5]]) fail to read correctly.

The top collection will read correctly with the following as the source string: "[1 2 [3 4 5]]", the nested vector reads as "3 4 5]".

I tried taking the macro dispatching char and prepending it to the source log at the appropriate source logging frame, and that resolves nested collections, but causes metadata to read erroneously, e.g. {:foo :baz} 'zot will read with source string "^{:foo :baz} 'zot".

Gonna keep digging to try and resolve this because source logging is very close, but just wanted to update on current status as I'd been quiet for a while.

Comment by Nicola Mometto [ 15/Nov/13 8:02 PM ]

I'll take a look at your current implementation and see if I can help you spot what's going wrong, thanks again for the effort you're putting into this.

Comment by Nicola Mometto [ 16/Nov/13 8:49 AM ]

I opened a pull request https://github.cam/aredington/tools.reader/pull/1 that should fix all the issues you were having, as well as fixing another minor bug.

If you find my edits reasonable please merge it and then squash the commits and I'll merge the patch upstream

Comment by Alexander Redington [ 20/Nov/13 1:23 PM ]

Squashed commit with working source logging

Comment by Nicola Mometto [ 20/Nov/13 3:17 PM ]

Fixed: https://github.com/clojure/tools.reader/commit/f9e55071d82a443db3fac1c2feda059d0215bb90


Comment by Nikita Beloglazov [ 02/Sep/14 3:40 PM ]

As I see current tools.reader still not able to return position info (line, column numbers) for strings, numbers, booleans and other primitives that don't implement IMeta interface. Is there plans to fix it somehow? I was thinking how it can be fixed and only idea I came up with was to read input data into intermediate datastructure like:

{:type :vector
:children [{:type :string
:value "hello"
:line 1
:column 2}]
:line 1
:column 4}

And then convert it to clojure datastructures: ["hello"] so changes are invisible from outside. Additionally you can pass additional option like :raw-format true and it will return those internal datastructure.

Would it be the change that will be accepted or is it too drastic? I can start working it. I think this functionality is really needed for libraries linter-like libraris that checks whether code is formatted correctly and need to know position of each lexem.

Comment by Nicola Mometto [ 02/Sep/14 4:21 PM ]

I thought about this issue a while ago and I'd be definitely more than happy to take a patch for this.
What I'd like to see is a tools.reader.parser namespace with a parse function that returns a parse tree, and then tools.reader be just a compiler from that parse tree back to clojure, however I'm worried about the performance implications of that; I've deliberately avoided writting tools.reader with multimethods in favour of direct dispatch via `case` for performance reasons, this would probably involve an unacceptable performance hit.

So probably keeping the parser and the reader separated, even at the cost of having to copy&paste most of the current implementation is the way to go; obviously if you want to try to unify the two feel free to do so.

Can you open a new enhancement ticket for this?

Generated at Sun Dec 21 18:47:03 CST 2014 using JIRA 4.4#649-r158309.