tools.reader

Carriage returns in files can cause Pushback buffer overflow exception

Details

  • Type: Defect Defect
  • Status: Closed Closed
  • Priority: Minor Minor
  • Resolution: Completed
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None

Description

With latest master version of tools.reader (1.3.0), some files with carriage returns that are not followed by a formfeed or newline character can cause the default pushback buffer size of 1 to be overflowed, resulting in an exception thrown during reading. The root cause is that peek-char is implemented with a read-char followed by unread for some types of reader objects, and read-char ends with a call to peek-char when such an isolated carriage return is found.

One example to reproduce:

user=> (require '[clojure.java.io :as io])
nil
user=> (require '[clojure.tools.reader.edn :as tre])
nil
user=> (require '[clojure.tools.reader.reader-types :as rt])
nil
user=> (spit "test.edn" "[a\rb]")
nil
user=> (def rdr (rt/indexing-push-back-reader (io/reader "test.edn")))
#'user/rdr
user=> (tre/read rdr)

IOException Pushback buffer overflow  java.io.PushbackReader.unread (PushbackReader.java:155)

user=> (def rdr (clojure.lang.LineNumberingPushbackReader. (io/reader "test.edn")))
#'user/rdr
user=> (clojure.edn/read rdr)
[a b]

If anyone is experiencing this problem and wants a workaround, if you can replace all carriage returns with newlines in the input file/string/reader, you should avoid this problem, although if you are using a line numbering/indexing reader, that may change the line numbers of lines in the data.

Activity

Hide
Andy Fingerhut added a comment -

Thinking a little bit about possible changes to the code that might be made to avoid this issue:

(a) quick and dirty might be to just add 1 to whatever pushback buffer size is given when creating a pushback reader, with a minimum of 2 instead of 1.

(b) Maybe there is a way to change the definition of read-token so that its first operation on the reader is not unread, but I haven't carefully checked all of the places read-token is called from to see if that could be made to work.

(c) Implement the desired behavior of normalize-newline in a different way, by putting the state that the last character read was a carriage-return into the places where normalize-newline is called. That might get a little messy looking.

Show
Andy Fingerhut added a comment - Thinking a little bit about possible changes to the code that might be made to avoid this issue: (a) quick and dirty might be to just add 1 to whatever pushback buffer size is given when creating a pushback reader, with a minimum of 2 instead of 1. (b) Maybe there is a way to change the definition of read-token so that its first operation on the reader is not unread, but I haven't carefully checked all of the places read-token is called from to see if that could be made to work. (c) Implement the desired behavior of normalize-newline in a different way, by putting the state that the last character read was a carriage-return into the places where normalize-newline is called. That might get a little messy looking.
Hide
Andy Fingerhut added a comment - - edited

Other cases that throw exceptions when trying to read them from a file with an indexing-push-back-reader, given as strings:

"[a\r b]"
"[a\r  b]"   ;; the number of spaces after the carriage return doesn't seem to matter.  Still exception
"{a\rb}"     ;; also for maps
"#{a\rb}"    ;; and sets
"a\rb"       ;; it doesn't need to be inside of a collection.  Maybe (symbol or number) immediately followed by \r followed by at least one character that isn't newline or formfeed.
"1\r "       ;; but no exception if space inserted after 1 before \r

Some cases with no exception:

"\ra"      ;; I would have guessed from casual code inspection that this might be a problem, but perhaps read-delimited must be involved
"[a\r\nb]"   ;; carriage return followed by newline is not a problem, as I would have guessed.
"\"a\rb\""   ;; no problem in the middle of strings, although I find it semi-surprising that both clojure.edn/read and tools.reader change the \r to \n in the returned string.
"1 \r "
"[a \rb]"
"a \rb"      ;; seems to require no space character before \r to cause exception

In all of these cases I have given, there is a java.io.BufferedReader returned by clojure.java.io/reader. rt/indexing-push-back-reader first wraps that in a java.io.PushbackReader (not the PushbackReader deftype defined in reader-types, if that raises a red flag), then a reader-types IndexingPushbackReader.

user=> (class (rt/indexing-push-back-reader (io/reader "test.edn")))
clojure.tools.reader.reader_types.IndexingPushbackReader

user=> (class (rt/to-pbr (io/reader "test.edn") 1))
java.io.PushbackReader

user=> (class (io/reader "test.edn"))
java.io.BufferedReader
Show
Andy Fingerhut added a comment - - edited Other cases that throw exceptions when trying to read them from a file with an indexing-push-back-reader, given as strings:
"[a\r b]"
"[a\r  b]"   ;; the number of spaces after the carriage return doesn't seem to matter.  Still exception
"{a\rb}"     ;; also for maps
"#{a\rb}"    ;; and sets
"a\rb"       ;; it doesn't need to be inside of a collection.  Maybe (symbol or number) immediately followed by \r followed by at least one character that isn't newline or formfeed.
"1\r "       ;; but no exception if space inserted after 1 before \r
Some cases with no exception:
"\ra"      ;; I would have guessed from casual code inspection that this might be a problem, but perhaps read-delimited must be involved
"[a\r\nb]"   ;; carriage return followed by newline is not a problem, as I would have guessed.
"\"a\rb\""   ;; no problem in the middle of strings, although I find it semi-surprising that both clojure.edn/read and tools.reader change the \r to \n in the returned string.
"1 \r "
"[a \rb]"
"a \rb"      ;; seems to require no space character before \r to cause exception
In all of these cases I have given, there is a java.io.BufferedReader returned by clojure.java.io/reader. rt/indexing-push-back-reader first wraps that in a java.io.PushbackReader (not the PushbackReader deftype defined in reader-types, if that raises a red flag), then a reader-types IndexingPushbackReader.
user=> (class (rt/indexing-push-back-reader (io/reader "test.edn")))
clojure.tools.reader.reader_types.IndexingPushbackReader

user=> (class (rt/to-pbr (io/reader "test.edn") 1))
java.io.PushbackReader

user=> (class (io/reader "test.edn"))
java.io.BufferedReader
Hide
Andy Fingerhut added a comment -

Attached patch trdr-54-approach-1.diff that takes the perhaps hacky way out of this problem, by adding 1 to the pushback buffer size when creating a java.io.PushbackReader instance.

Show
Andy Fingerhut added a comment - Attached patch trdr-54-approach-1.diff that takes the perhaps hacky way out of this problem, by adding 1 to the pushback buffer size when creating a java.io.PushbackReader instance.
Hide
Nicola Mometto added a comment -

Hi Andy, thanks a lot for the detailed analysis and patch! I'll take a closer look at this in a few days

Show
Nicola Mometto added a comment - Hi Andy, thanks a lot for the detailed analysis and patch! I'll take a closer look at this in a few days
Hide
Andy Fingerhut added a comment -

Sure, no rush on this. It has a fairly straightforward workaround if you know the reason for the problem (e.g. change all carriage returns to newlines).

Show
Andy Fingerhut added a comment - Sure, no rush on this. It has a fairly straightforward workaround if you know the reason for the problem (e.g. change all carriage returns to newlines).
Hide
Andy Fingerhut added a comment - - edited

Another potential issue I realized while looking into this, which I can create a separate ticket for if you are interested: If you create any kind of reader that is layered over a java.io.PushbackReader, as the example cases above do, and the input reader has a long sequence of carriage returns in it, the java.io.PushbackReader implementation of read-char and normalize-newline can call each other as deeply nested as there are consecutive carriage return characters.

This is similar to this old issue for reading many consecutive spaces, although many consecutive carriage returns seems quite a bit less likely to crop up in practice: https://dev.clojure.org/jira/browse/TRDR-11

Show
Andy Fingerhut added a comment - - edited Another potential issue I realized while looking into this, which I can create a separate ticket for if you are interested: If you create any kind of reader that is layered over a java.io.PushbackReader, as the example cases above do, and the input reader has a long sequence of carriage returns in it, the java.io.PushbackReader implementation of read-char and normalize-newline can call each other as deeply nested as there are consecutive carriage return characters. This is similar to this old issue for reading many consecutive spaces, although many consecutive carriage returns seems quite a bit less likely to crop up in practice: https://dev.clojure.org/jira/browse/TRDR-11
Hide
Nicola Mometto added a comment -

Just a heads up, I started looking into this and it's a bit more involved than I originally thought. As you say, the "hacky" solution doesn't really work in the presence of consecutive carriage returns. I think we need a bigger refactor although I'm still not clear what that'd involve. I'm on vacation now tho so I should have time to work on it.

Show
Nicola Mometto added a comment - Just a heads up, I started looking into this and it's a bit more involved than I originally thought. As you say, the "hacky" solution doesn't really work in the presence of consecutive carriage returns. I think we need a bigger refactor although I'm still not clear what that'd involve. I'm on vacation now tho so I should have time to work on it.
Hide
Nicola Mometto added a comment -

Andy Fingerhut I've pushed a fix to https://github.com/clojure/tools.reader/tree/TRDR-54, could you give it a go before I merge it?

Show
Nicola Mometto added a comment - Andy Fingerhut I've pushed a fix to https://github.com/clojure/tools.reader/tree/TRDR-54, could you give it a go before I merge it?
Hide
Andy Fingerhut added a comment -

Attached patch TRDR-54-part2-v1.patch is on top of the existing changes already committed to the TRDR-54 branch of tools.reader.

One of the code changes is a pretty uncontroversial fix, I believe. The one with the additional code for IndexingPushbackReader's unread method is a bit weirder, and I would guess there may be better ways to do it.

Show
Andy Fingerhut added a comment - Attached patch TRDR-54-part2-v1.patch is on top of the existing changes already committed to the TRDR-54 branch of tools.reader. One of the code changes is a pretty uncontroversial fix, I believe. The one with the additional code for IndexingPushbackReader's unread method is a bit weirder, and I would guess there may be better ways to do it.
Hide
Nicola Mometto added a comment -

Thanks a lot for the patch and the second set of eyes on this, I think all the changes make sense. I've applied the patch to the TRDR-54 branch, I'm going to merge it tomorrow unless I find unexpected issues with the approach

Show
Nicola Mometto added a comment - Thanks a lot for the patch and the second set of eyes on this, I think all the changes make sense. I've applied the patch to the TRDR-54 branch, I'm going to merge it tomorrow unless I find unexpected issues with the approach

People

Vote (0)
Watch (0)

Dates

  • Created:
    Updated:
    Resolved: