<< Back to previous view

[DJSON-3] Error writing strings with characters outside the BMP Created: 16/Dec/11  Updated: 10/Jan/14  Resolved: 07/Mar/12

Status: Closed
Project: data.json
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Defect Priority: Major
Reporter: Matthew Phillips Assignee: Stuart Sierra
Resolution: Completed Votes: 1
Labels: None
Environment:

Clojure 1.2 and 1.3, Mac OS X, Java 1.6


Attachments: Text File djson-3-non-bmp-chars-patch.txt     Text File fix_unicode_non_bmp_2.patch     Text File fix_unicode_non_bmp_3.patch     Text File fix_unicode_non_bmp.patch    
Patch: Code and Test

 Description   

[This ticket filed as a follow on from pull request 2 at github]

One of my users decided to start sending Emoji happy faces over a network protocol encoded using data.json, which caused it to promptly roll over and die. It turned out that write-json-string, while aware of the code point vs. character issue, was still iterating over a character count not a code point count.

I have demonstrated the problem in a test, and fixed this: will file a patch when my contributor agreement is in place.



 Comments   
Comment by Matthew Phillips [ 05/Jan/12 5:57 PM ]

Patch to fix DJSON-3.

Comment by Andy Fingerhut [ 13/Jan/12 1:29 AM ]

I'm not an expert on JSON, so feel free to correct me if I am wrong, but it seems from reading RFC 4627 that strings in JSON, if they are encoded with the \uxxxx format for hex digits xxxx, should have their 16-bit UTF-16 code units encoded in that way. There should always be exactly 4 hex digits after the \u.

Here are a couple of test cases with the current code, without either of the patches fix_unicode_non_bmp.patch or fix_unicode_non_bmp_2.patch:

user=> (in-ns 'clojure.data.json)
#<Namespace clojure.data.json>

;; Incorrect.
clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false))
("22" "61" "62" "63" "d834" "dd1e" "dd1e" "22")

;; Also incorrect.
clojure.data.json=> (json-str "abc\ud834\udd1e")
"\"abc\\u1d11e\\udd1e\""

Here is the behavior with patch fix_unicode_non_bmp.patch:

; This is correct
clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false))
("22" "61" "62" "63" "d834" "dd1e" "22")

;; but this is not. JSON specifies that UTF-16 code units should be
;; included in string in form \uxxxx, for hex encoding xxxx of 16-bit
;; code units.
clojure.data.json=> (json-str "abc\ud834\udd1e")
"\"abc\\u1d11e\""

Here is the behavior with patch fix_unicode_non_bmp_2.patch:

;; correct
clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false))
("22" "61" "62" "63" "d834" "dd1e" "22")

;; also correct.
clojure.data.json=> (json-str "abc\ud834\udd1e")
"\"abc\\ud834\\udd1e\""

Comment by Andy Fingerhut [ 13/Jan/12 3:18 PM ]

fix_unicode_non_bmp_3.patch is same as fix_unicode_non_bmp_2.patch, except it includes the unit test Matthew Phillips had in his patch, plus another one to test the :escape-unicode true case.

Comment by Andy Fingerhut [ 20/Feb/12 4:03 PM ]

Eventually I may even get the hang of this. djson-3-non-bmp-chars-patch.txt contains patch in proper format.

Comment by Stuart Sierra [ 07/Mar/12 7:58 AM ]

patch applied.

Comment by Matthew Phillips [ 08/Mar/12 6:39 PM ]

Hi Stuart, it looks like you've only applied Andy Fingerhut's last patch. My fix is in the first one: "fix_unicode_non_bmp.patch"

Comment by Andy Fingerhut [ 09/Mar/12 11:04 AM ]

Matthew, did you see my comment from Jan 12, 2012? I show a test case there that appears not to work with your patch, unless I am misunderstanding the correct JSON desired behavior. That same test does work with my latest patch, I think. Take a look and see what you think.

Comment by Matthew Phillips [ 09/Mar/12 7:41 PM ]

Oh I see - I misunderstood what your were doing there, thinking it only applied to the escape-unicode == true case.

And of course the Clojure string is already in UTF-16, so there's no reason to turn it into 32 bit code points and then expand it back to UTF-16 pairs.

Comment by Stuart Sierra [ 10/Jan/14 11:10 AM ]

Marking old issues as 'closed'

Generated at Fri Jul 25 13:33:36 CDT 2014 using JIRA 4.4#649-r158309.