[CLJ-945] clojure.string/capitalize can give wrong result if first char is supplementary Created: 05/Mar/12 Updated: 01/Mar/13 Resolved: 01/Mar/13
|Affects Version/s:||Release 1.2, Release 1.3, Release 1.4|
|Fix Version/s:||Release 1.5|
|Patch:||Code and Test|
When the first Unicode code point of a string is supplementary (i.e. it requires two 16-bit Java chars to represent in UTF-16), and converting that code point to upper case changes it, clojure.string/capitalize gives the wrong answer.
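To make the precondition concrete, here is a minimal sketch of what a supplementary code point looks like from Clojure. The choice of U+10428 (DESERET SMALL LETTER LONG I, whose upper-case form is U+10400) and the names `small-long-i`, `char-count` and `code-point-count` are assumptions made for illustration; they are not taken from the ticket.

```clojure
;; Assumed example: U+10428 lies above U+FFFF, so UTF-16 needs a surrogate pair.
(def small-long-i (String. (Character/toChars 0x10428)))

(defn char-count
  "Number of UTF-16 code units (Java chars) in s."
  [^String s]
  (.length s))

(defn code-point-count
  "Number of Unicode code points in s."
  [^String s]
  (.codePointCount s 0 (.length s)))

;; One code point, but two Java chars:
(char-count small-long-i)
(code-point-count small-long-i)

;; The code point is supplementary and has a distinct upper-case mapping:
(Character/isSupplementaryCodePoint (int 0x10428))
(Character/toUpperCase (int 0x10428))
```

Any string that begins with such a character satisfies both conditions in the description: the first code point spans two chars, and upper-casing changes it.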
|Comment by Rich Hickey [ 20/Jul/12 7:43 AM ]|
Isn't this a Java bug?
|Comment by Andy Fingerhut [ 20/Jul/12 12:36 PM ]|
If using UTF-16 to encode Unicode strings, and making every UTF-16 code unit (i.e. Java char) individually indexable as a separate entity in strings, is such a bad design choice that you consider it a bug, then yes, this is a Java bug (and a bug in all the other systems that use UTF-16 in this way).
clojure.string/capitalize isn't using some Java capitalization method that has a bug, though. By calling (.toUpperCase (subs s 0 1)) it is not giving enough information to .toUpperCase for any implementation, Java or otherwise, to do the job correctly. It is analogous to calling toupper on the least significant 4 bits of the ASCII encoding of a letter and expecting it to return the correct answer.
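The point can be illustrated with a short sketch. Only the expression `(.toUpperCase (subs s 0 1))` comes from the comment above; the rest of `naive-capitalize` is an assumed reconstruction, and `capitalize-by-code-point` is a hypothetical alternative that splits at the first code point boundary. It is not presented as the attached patch.

```clojure
(defn naive-capitalize
  "Assumed shape of the problematic logic: splits after the first Java char."
  [^String s]
  (if (< (count s) 2)
    (.toUpperCase s)
    (str (.toUpperCase (subs s 0 1))      ; may be a lone high surrogate
         (.toLowerCase (subs s 1)))))     ; may start with a lone low surrogate

(defn capitalize-by-code-point
  "Hypothetical variant: splits after the first Unicode code point."
  [^String s]
  (if (empty? s)
    s
    ;; Index just past the first code point: 2 for a surrogate pair, else 1.
    (let [n (.offsetByCodePoints s 0 1)]
      (str (.toUpperCase (subs s 0 n))
           (.toLowerCase (subs s n))))))

;; Assumed test input: U+10428 followed by "BC".
(def input (str (String. (Character/toChars 0x10428)) "BC"))

(naive-capitalize input)          ; first letter is left in lower case
(capitalize-by-code-point input)  ; first letter becomes U+10400
```

In the naive version, .toUpperCase receives only half of a surrogate pair, so it has no way to know which letter it was asked to convert and returns the lone surrogate unchanged.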