[DJSON-3] Error writing strings with characters outside the BMP Created: 16/Dec/11 Updated: 09/Mar/12 Resolved: 07/Mar/12 |
|
| Status: | Resolved |
| Project: | data.json |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Defect | Priority: | Major |
| Reporter: | Matthew Phillips | Assignee: | Stuart Sierra |
| Resolution: | Completed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Clojure 1.2 and 1.3, Mac OS X, Java 1.6 |
||
| Attachments: |
|
| Patch: | Code and Test |
| Description |
|
[This ticket filed as a follow on from pull request 2 at github] One of my users decided to start sending Emoji happy faces over a network protocol encoded using data.json, which caused it to promptly roll over and die. It turned out that write-json-string, while aware of the code point vs. character issue, was still iterating over a character count not a code point count. I have demonstrated the problem in a test, and fixed this: will file a patch when my contributor agreement is in place. |
| Comments |
| Comment by Matthew Phillips [ 05/Jan/12 5:57 PM ] |
|
Patch to fix |
| Comment by Andy Fingerhut [ 13/Jan/12 1:29 AM ] |
|
I'm not an expert on JSON, so feel free to correct me if I am wrong, but it seems from reading RFC 4627 that strings in JSON, if they are encoded with the \uxxxx format for hex digits xxxx, should have their 16-bit UTF-16 code units encoded in that way. There should always be exactly 4 hex digits after the \u. Here are a couple of test cases with the current code, without either of the patches fix_unicode_non_bmp.patch or fix_unicode_non_bmp_2.patch: user=> (in-ns 'clojure.data.json) ;; Incorrect. ;; Also incorrect. Here is the behavior with patch fix_unicode_non_bmp.patch: ; This is correct ;; but this is not. JSON specifies that UTF-16 code units should be Here is the behavior with patch fix_unicode_non_bmp_2.patch: ;; correct ;; also correct. |
| Comment by Andy Fingerhut [ 13/Jan/12 3:18 PM ] |
|
fix_unicode_non_bmp_3.patch is same as fix_unicode_non_bmp_2.patch, except it includes the unit test Matthew Phillips had in his patch, plus another one to test the :escape-unicode true case. |
| Comment by Andy Fingerhut [ 20/Feb/12 4:03 PM ] |
|
Eventually I may even get the hang of this. djson-3-non-bmp-chars-patch.txt contains patch in proper format. |
| Comment by Stuart Sierra [ 07/Mar/12 7:58 AM ] |
|
patch applied. |
| Comment by Matthew Phillips [ 08/Mar/12 6:39 PM ] |
|
Hi Stuart, it looks like you've only applied Andy Fingerhut's last patch. My fix is in the first one: "fix_unicode_non_bmp.patch" |
| Comment by Andy Fingerhut [ 09/Mar/12 11:04 AM ] |
|
Matthew, did you see my comment from Jan 12, 2012? I show a test case there that appears not to work with your patch, unless I am misunderstanding the correct JSON desired behavior. That same test does work with my latest patch, I think. Take a look and see what you think. |
| Comment by Matthew Phillips [ 09/Mar/12 7:41 PM ] |
|
Oh I see - I misunderstood what your were doing there, thinking it only applied to the escape-unicode == true case. And of course the Clojure string is already in UTF-16, so there's no reason to turn it into 32 bit code points and then expand it back to UTF-16 pairs. |