
[DJSON-28] Escape control characters 0-1F even if :escape-unicode false Created: 11/May/17  Updated: 11/May/17

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Major
Reporter: Christopher Brown Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: data, string
Environment:

Observed on CentOS 7, Mac OS X


Attachments: Text File djson-28.patch    
Patch: Code and Test

 Description   

The 32 control characters U+0000 through U+001F are never allowed in raw form in JSON strings.

From ECMA-404:

All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F.

From RFC 7159:

A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).

When :escape-unicode true (the default), all characters outside the 32-127 range are escaped using \uCAFE syntax (or for the special whitespace cases, using named escapes).

However, when :escape-unicode false is supplied to the write or write-str functions, some of the control characters are written in raw form, resulting in invalid JSON. This is improper behavior; the library should never produce JSON that violates the specification(s), no matter what options the user supplies.

This patch escapes the control characters even when :escape-unicode false is supplied.

There is a bit of special handling to exclude the named escapes in the control character range: the write-string function always escapes the characters that have named escapes (8, 9, 10, 12, 13, i.e. \b, \t, \n, \f, \r), so they require separate treatment.
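
The escaping rule described above can be sketched roughly as follows. This is a minimal illustration, not the attached patch: escape-json-string is a hypothetical name, and the function only models the :escape-unicode false case (non-ASCII characters pass through raw, while the quotation mark, the reverse solidus, and U+0000 through U+001F are always escaped).

```clojure
(defn escape-json-string
  "Hypothetical sketch (not data.json's write-string): quotes s as a JSON
  string, leaving non-ASCII characters raw but always escaping the
  quotation mark, the reverse solidus, and control characters 0-1F."
  [^String s]
  (let [sb (StringBuilder.)]
    (.append sb \")
    (doseq [c s]
      (let [cp (int c)]
        (case cp
          34 (.append sb "\\\"")   ; quotation mark
          92 (.append sb "\\\\")   ; reverse solidus
          8  (.append sb "\\b")    ; named escapes in the control range
          9  (.append sb "\\t")
          10 (.append sb "\\n")
          12 (.append sb "\\f")
          13 (.append sb "\\r")
          (if (< cp 32)
            ;; remaining control characters get the \uXXXX form
            (.append sb (format "\\u%04x" cp))
            ;; everything else, including non-ASCII, is written raw
            (.append sb (char c))))))
    (.append sb \")
    (str sb)))
```

The named escapes are matched before the generic check for code points below 32, which is why they need their own branches.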

I did not add any control character validation to the parsing functionality, following Postel's law:

[TCP] implementations should follow a general principle of robustness: be conservative in what you do, be liberal in what you accept from others.


Why use :escape-unicode false at all if I'm worried about compliance? Well, Unicode is a really good idea, and pairs very nicely with the UTF-8 character encoding, which is also a really good idea. UTF-8 encodes text much more efficiently than spelling out literal escapes. The default (:escape-unicode true) does not leverage the compression benefits of UTF-8 — which is a trade-off, since ASCII is nearly impossible to screw up, compared to UTF-8, if you aren't expecting UTF-8 (but you should be expecting UTF-8).

So, in short, I want to be able to leverage UTF-8 and remain confident that I'll get valid JSON output, without having to sanitize the (unusual) control characters out of all the strings in my data.






[DJSON-27] Separator punctuation is treated as whitespace in arrays and objects Created: 11/May/17  Updated: 11/May/17

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Moritz Heidkamp Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

The parser treats separator punctuation (commas between array and object members, colons between object keys and values) as whitespace, similar to how EDN does. This is not in accordance with any of the JSON specs, in particular the one that the library intends to follow (i.e. http://json.org/).

Some examples:

user> (json/read-str "{,,,\"w\"\"x\"\"y\"\"z\",,,}")
{"w" "x", "y" "z"}
user> (json/read-str "{\"x\"::::\"y\"}")
{"x" "y"}
user> (json/read-str "[1 2 3 4 5]")
[1 2 3 4 5]
user> (json/read-str "[1,,,5]")
[1 5]


 Comments   
Comment by Moritz Heidkamp [ 11/May/17 9:00 AM ]

OK, at least according to the two JSON RFCs, this behavior is permissible for a conforming implementation. See https://tools.ietf.org/html/rfc4627#section-4 and https://tools.ietf.org/html/rfc7159#section-9. Perhaps this behavior should at least be documented?





[DJSON-26] write-object can retain head of collections Created: 02/May/17  Updated: 02/May/17

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Brian Craft Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

If serializing a large data structure, the write-object function will retain head references, causing memory pressure.

The problem code looks to me like a typo:

https://github.com/clojure/data.json/blob/master/src/main/clojure/clojure/data/json.clj#L322

calling seq on the parameter of the function, rather than the loop variable. That keeps the parameter in-scope during the recursion.
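
A minimal sketch of the pattern in question, using a hypothetical write-pairs function rather than the library's actual write-object. The assumption illustrated here is that the loop should test and advance only its own loop variable, so that the parameter m is not referenced again once the loop has started.

```clojure
(defn write-pairs
  "Hypothetical stand-in for write-object: writes the entries of m as
  comma-separated key:value pairs. Values are rendered with pr-str
  purely for illustration."
  [m ^java.io.Writer out]
  (.write out "{")
  (loop [x (seq m)]
    ;; Test the loop variable x here. Testing (seq m) instead would keep
    ;; the parameter m referenced for the whole iteration.
    (when x
      (let [[k v] (first x)]
        (.write out (str (pr-str k) ":" (pr-str v)))
        (when-let [nxt (next x)]
          (.write out ",")
          (recur nxt)))))
  (.write out "}"))
```

For example, (write-pairs (array-map "a" 1 "b" 2) writer) writes {"a":1,"b":2} to the writer.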






[DJSON-25] Escaped backslash in strings throws Created: 24/Sep/16  Updated: 24/Sep/16

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Vitalie Spinu Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Clojure 1.8.0, Java 1.8.0_74, data.json "0.2.6"



 Description   

Strings with escaped backslash throw

(json/read-str "{\"aaa\":\" \\> \"}")

Throws a "No matching clause: 62" exception.



 Comments   
Comment by Vitalie Spinu [ 24/Sep/16 12:00 PM ]

Never mind. It's improper JSON that I have to deal with, not a data.json bug.





[DJSON-24] clojure.data.json should handle non-breaking whitespace Created: 05/Jul/16  Updated: 05/Jul/16

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Lukas Havemann Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File non-breaking-whitespace0001.patch    
Patch: Code

 Description   

clojure.data.json should work with non-breaking whitespace (http://www.fileformat.info/info/unicode/char/00a0/index.htm), as all JSON linters do.
The exception text should also state what is wrong.



 Comments   
Comment by Lukas Havemann [ 05/Jul/16 1:27 PM ]

signed the CA





[DJSON-22] Improper parsing of numbers - leading zeroes should be disallowed Created: 28/Jul/15  Updated: 30/Jul/15

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Matthew Gilliard Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File djson-22.patch    

 Description   

Handling of numeric literals doesn't perform according to the JSON spec.

Example:

(require '[clojure.data.json :as json])
(json/read-str "0123")
(json/read-str "{\"num\": 0123}")

Both of these examples parse the number as 123. According to the spec, this should actually be an invalid number and throw an exception. NB this restriction does not seem to apply to a number in the exponent, so a number like 1e0003 should be parsed as 1000.0. We handle this case correctly now.
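
The number grammar in RFC 7159 (an optional minus, then either a single zero or a non-zero digit followed by digits, then an optional fraction and exponent) can be expressed as a regular expression. The sketch below only illustrates the rule; json-number-pattern and valid-json-number? are hypothetical names and are not taken from the attached patch.

```clojure
(def json-number-pattern
  ;; [minus] int [frac] [exp], where int is "0" or a digit 1-9 followed
  ;; by any digits. Exponent digits may have leading zeroes.
  #"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

(defn valid-json-number?
  "True when the whole string s is a JSON number literal."
  [^String s]
  (boolean (re-matches json-number-pattern s)))
```

Under this pattern "0123" is rejected, while "0", "-0.5" and "1e0003" are accepted.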



 Comments   
Comment by Matthew Gilliard [ 30/Jul/15 10:34 AM ]

Fix + Tests. Feedback welcome.





[DJSON-21] Improper parsing of literals Created: 07/Jul/15  Updated: 28/Jul/15

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Mike Sukmanowsky Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

Handling of numeric literals doesn't perform according to the JSON spec.

Example:

(require '[clojure.data.json :as json])
(json/read-str "123abc")

Returns the number 1232. According to the spec, this should actually be an invalid literal and throw an exception.



 Comments   
Comment by Matthew Gilliard [ 28/Jul/15 5:39 PM ]

(I assume there's a typo in the description - 123 is returned, not 1232)

It's not just literal values. Non-whitespace at the end of any input is silently ignored and should be rejected:

(json/read-str "{}xxx")   =>  {}
(json/read-str "[]yyy")   =>  []
(json/read-str "\"\"zzz") =>  ""

NB This behaviour agrees with the docstring ("Reads a single item of JSON data from ...").





[DJSON-18] Fast way to print indented json Created: 15/Dec/14  Updated: 15/Dec/14

Status: Open
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Enhancement Priority: Major
Reporter: Nikita Prokopov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None

Attachments: Text File djson_18_fast_indent.patch    

 Description   

Hi!

Formatted JSON is very handy for human consumption, for example while debugging or exploring a JSON API. data.json offers formatting in the form of pprint-json. The problem is that pprint-json is dead slow because it tries to fit everything within a line width limit. In practice, pprint-json takes 20-100 times longer than write-str, to the point where it just cannot be used in production:

clojure.data.json=> (def data (read-string (slurp "sample.edn")))
#'clojure.data.json/data
clojure.data.json=> (count data)
4613
clojure.data.json=> (time (do (clojure.data.json/write-str data) nil))
"Elapsed time: 219.33 msecs"
clojure.data.json=> (time (do (with-out-str (clojure.data.json/pprint-json data)) nil))
"Elapsed time: 25271.549 msecs"

The proposed enhancement is very simple: indent new keys and array elements, but do not try to fit values into a line width limit. For a human, JSON formatted this way is still easy to read, and the structure is evident. The only downside is that some lines might become very long.

In the attached patch, I modified write-array and write-object and added a new :indent option to write. To print indented JSON, one can now write: (write-str data :indent true)

There's some performance penalty, of course, but relatively small:

clojure.data.json=> (time (do (clojure.data.json/write-str data :indent true) nil))
"Elapsed time: 250.18 msecs"

I also fixed a small bug: the (seq m) in write-object should be (seq x).
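
The indentation scheme proposed above (one key or array element per line, no attempt to fit a line width) can be sketched as a standalone function. This is an illustration only, not the attached patch: indent-json is a hypothetical name, map keys are assumed to be strings or keywords, and scalars are rendered with pr-str, which matches JSON only for simple strings, integers and booleans.

```clojure
(require '[clojure.string :as string])

(defn indent-json
  "Hypothetical sketch: renders nested maps and sequential collections
  with one key or element per line, indenting two spaces per level."
  ([x] (indent-json x 0))
  ([x depth]
   (let [pad   (fn [n] (apply str (repeat (* 2 n) \space)))
         ;; Wraps already-rendered items in open/close delimiters,
         ;; one item per line, indented one level deeper than depth.
         block (fn [open close items]
                 (if (empty? items)
                   (str open close)
                   (str open "\n"
                        (string/join ",\n"
                                     (map #(str (pad (inc depth)) %) items))
                        "\n" (pad depth) close)))]
     (cond
       (map? x)        (block "{" "}"
                              (for [[k v] x]
                                (str (pr-str (name k)) ": "
                                     (indent-json v (inc depth)))))
       (sequential? x) (block "[" "]"
                              (map #(indent-json % (inc depth)) x))
       (nil? x)        "null"
       :else           (pr-str x)))))
```

Because each value is emitted exactly once with no width calculation, the work is linear in the size of the output, which is the property the proposal relies on.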






Generated at Mon May 29 02:34:49 CDT 2017 using JIRA 4.4#649-r158309.