data.json

Error writing strings with characters outside the BMP

Details

  • Type: Defect Defect
  • Status: Closed Closed
  • Priority: Major Major
  • Resolution: Completed
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels:
    None
  • Environment:
    Clojure 1.2 and 1.3, Mac OS X, Java 1.6
  • Patch:
    Code and Test

Description

[This ticket filed as a follow on from pull request 2 at github]

One of my users decided to start sending Emoji happy faces over a network protocol encoded using data.json, which caused it to promptly roll over and die. It turned out that write-json-string, while aware of the code point vs. character issue, was still iterating over a character count not a code point count.

I have demonstrated the problem in a test, and fixed this: will file a patch when my contributor agreement is in place.

  1. fix_unicode_non_bmp.patch
    05/Jan/12 5:57 PM
    4 kB
    Matthew Phillips
  2. fix_unicode_non_bmp_2.patch
    13/Jan/12 1:29 AM
    0.5 kB
    Andy Fingerhut
  3. fix_unicode_non_bmp_3.patch
    13/Jan/12 3:18 PM
    1 kB
    Andy Fingerhut
  4. djson-3-non-bmp-chars-patch.txt
    20/Feb/12 4:03 PM
    2 kB
    Andy Fingerhut

Activity

Hide
Matthew Phillips added a comment -

Patch to fix DJSON-3.

Show
Matthew Phillips added a comment - Patch to fix DJSON-3.
Hide
Andy Fingerhut added a comment -

I'm not an expert on JSON, so feel free to correct me if I am wrong, but it seems from reading RFC 4627 that strings in JSON, if they are encoded with the \uxxxx format for hex digits xxxx, should have their 16-bit UTF-16 code units encoded in that way. There should always be exactly 4 hex digits after the \u.

Here are a couple of test cases with the current code, without either of the patches fix_unicode_non_bmp.patch or fix_unicode_non_bmp_2.patch:

user=> (in-ns 'clojure.data.json)
#<Namespace clojure.data.json>

;; Incorrect.
clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false))
("22" "61" "62" "63" "d834" "dd1e" "dd1e" "22")

;; Also incorrect.
clojure.data.json=> (json-str "abc\ud834\udd1e")
"\"abc\\u1d11e\\udd1e\""

Here is the behavior with patch fix_unicode_non_bmp.patch:

; This is correct
clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false))
("22" "61" "62" "63" "d834" "dd1e" "22")

;; but this is not. JSON specifies that UTF-16 code units should be
;; included in string in form \uxxxx, for hex encoding xxxx of 16-bit
;; code units.
clojure.data.json=> (json-str "abc\ud834\udd1e")
"\"abc\\u1d11e\""

Here is the behavior with patch fix_unicode_non_bmp_2.patch:

;; correct
clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false))
("22" "61" "62" "63" "d834" "dd1e" "22")

;; also correct.
clojure.data.json=> (json-str "abc\ud834\udd1e")
"\"abc\\ud834\\udd1e\""

Show
Andy Fingerhut added a comment - I'm not an expert on JSON, so feel free to correct me if I am wrong, but it seems from reading RFC 4627 that strings in JSON, if they are encoded with the \uxxxx format for hex digits xxxx, should have their 16-bit UTF-16 code units encoded in that way. There should always be exactly 4 hex digits after the \u. Here are a couple of test cases with the current code, without either of the patches fix_unicode_non_bmp.patch or fix_unicode_non_bmp_2.patch: user=> (in-ns 'clojure.data.json) #<Namespace clojure.data.json> ;; Incorrect. clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false)) ("22" "61" "62" "63" "d834" "dd1e" "dd1e" "22") ;; Also incorrect. clojure.data.json=> (json-str "abc\ud834\udd1e") "\"abc\\u1d11e\\udd1e\"" Here is the behavior with patch fix_unicode_non_bmp.patch: ; This is correct clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false)) ("22" "61" "62" "63" "d834" "dd1e" "22") ;; but this is not. JSON specifies that UTF-16 code units should be ;; included in string in form \uxxxx, for hex encoding xxxx of 16-bit ;; code units. clojure.data.json=> (json-str "abc\ud834\udd1e") "\"abc\\u1d11e\"" Here is the behavior with patch fix_unicode_non_bmp_2.patch: ;; correct clojure.data.json=> (map #(format "%x" (int %)) (json-str "abc\ud834\udd1e" :escape-unicode false)) ("22" "61" "62" "63" "d834" "dd1e" "22") ;; also correct. clojure.data.json=> (json-str "abc\ud834\udd1e") "\"abc\\ud834\\udd1e\""
Hide
Andy Fingerhut added a comment -

fix_unicode_non_bmp_3.patch is same as fix_unicode_non_bmp_2.patch, except it includes the unit test Matthew Phillips had in his patch, plus another one to test the :escape-unicode true case.

Show
Andy Fingerhut added a comment - fix_unicode_non_bmp_3.patch is same as fix_unicode_non_bmp_2.patch, except it includes the unit test Matthew Phillips had in his patch, plus another one to test the :escape-unicode true case.
Hide
Andy Fingerhut added a comment -

Eventually I may even get the hang of this. djson-3-non-bmp-chars-patch.txt contains patch in proper format.

Show
Andy Fingerhut added a comment - Eventually I may even get the hang of this. djson-3-non-bmp-chars-patch.txt contains patch in proper format.
Hide
Stuart Sierra added a comment -

patch applied.

Show
Stuart Sierra added a comment - patch applied.
Hide
Matthew Phillips added a comment -

Hi Stuart, it looks like you've only applied Andy Fingerhut's last patch. My fix is in the first one: "fix_unicode_non_bmp.patch"

Show
Matthew Phillips added a comment - Hi Stuart, it looks like you've only applied Andy Fingerhut's last patch. My fix is in the first one: "fix_unicode_non_bmp.patch"
Hide
Andy Fingerhut added a comment -

Matthew, did you see my comment from Jan 12, 2012? I show a test case there that appears not to work with your patch, unless I am misunderstanding the correct JSON desired behavior. That same test does work with my latest patch, I think. Take a look and see what you think.

Show
Andy Fingerhut added a comment - Matthew, did you see my comment from Jan 12, 2012? I show a test case there that appears not to work with your patch, unless I am misunderstanding the correct JSON desired behavior. That same test does work with my latest patch, I think. Take a look and see what you think.
Hide
Matthew Phillips added a comment -

Oh I see - I misunderstood what your were doing there, thinking it only applied to the escape-unicode == true case.

And of course the Clojure string is already in UTF-16, so there's no reason to turn it into 32 bit code points and then expand it back to UTF-16 pairs.

Show
Matthew Phillips added a comment - Oh I see - I misunderstood what your were doing there, thinking it only applied to the escape-unicode == true case. And of course the Clojure string is already in UTF-16, so there's no reason to turn it into 32 bit code points and then expand it back to UTF-16 pairs.
Hide
Stuart Sierra added a comment -

Marking old issues as 'closed'

Show
Stuart Sierra added a comment - Marking old issues as 'closed'

People

Vote (1)
Watch (2)

Dates

  • Created:
    Updated:
    Resolved: