[DJSON-3] Error writing strings with characters outside the BMP Created: 16/Dec/11 Updated: 09/Mar/12 Resolved: 07/Mar/12
|Reporter:||Matthew Phillips||Assignee:||Stuart Sierra|
Clojure 1.2 and 1.3, Mac OS X, Java 1.6
|Attachments:||djson-3-non-bmp-chars-patch.txt fix_unicode_non_bmp_2.patch fix_unicode_non_bmp_3.patch fix_unicode_non_bmp.patch|
|Patch:||Code and Test|
[This ticket filed as a follow on from pull request 2 at github]
One of my users decided to start sending Emoji happy faces over a network protocol encoded using data.json, which caused it to promptly roll over and die. It turned out that write-json-string, while aware of the code point vs. character issue, was still iterating over a character count not a code point count.
I have demonstrated the problem in a test, and fixed this: will file a patch when my contributor agreement is in place.
|Comment by Matthew Phillips [ 05/Jan/12 5:57 PM ]|
Patch to fix
|Comment by Andy Fingerhut [ 13/Jan/12 1:29 AM ]|
I'm not an expert on JSON, so feel free to correct me if I am wrong, but it seems from reading RFC 4627 that strings in JSON, if they are encoded with the \uxxxx format for hex digits xxxx, should have their 16-bit UTF-16 code units encoded in that way. There should always be exactly 4 hex digits after the \u.
Here are a couple of test cases with the current code, without either of the patches fix_unicode_non_bmp.patch or fix_unicode_non_bmp_2.patch:
user=> (in-ns 'clojure.data.json)
;; Also incorrect.
Here is the behavior with patch fix_unicode_non_bmp.patch:
; This is correct
;; but this is not. JSON specifies that UTF-16 code units should be
Here is the behavior with patch fix_unicode_non_bmp_2.patch:
;; also correct.
|Comment by Andy Fingerhut [ 13/Jan/12 3:18 PM ]|
fix_unicode_non_bmp_3.patch is same as fix_unicode_non_bmp_2.patch, except it includes the unit test Matthew Phillips had in his patch, plus another one to test the :escape-unicode true case.
|Comment by Andy Fingerhut [ 20/Feb/12 4:03 PM ]|
Eventually I may even get the hang of this. djson-3-non-bmp-chars-patch.txt contains patch in proper format.
|Comment by Stuart Sierra [ 07/Mar/12 7:58 AM ]|
|Comment by Matthew Phillips [ 08/Mar/12 6:39 PM ]|
Hi Stuart, it looks like you've only applied Andy Fingerhut's last patch. My fix is in the first one: "fix_unicode_non_bmp.patch"
|Comment by Andy Fingerhut [ 09/Mar/12 11:04 AM ]|
Matthew, did you see my comment from Jan 12, 2012? I show a test case there that appears not to work with your patch, unless I am misunderstanding the correct JSON desired behavior. That same test does work with my latest patch, I think. Take a look and see what you think.
|Comment by Matthew Phillips [ 09/Mar/12 7:41 PM ]|
Oh I see - I misunderstood what your were doing there, thinking it only applied to the escape-unicode == true case.
And of course the Clojure string is already in UTF-16, so there's no reason to turn it into 32 bit code points and then expand it back to UTF-16 pairs.