[CLJ-1177] clojure.java.io/resource and non-ASCII characters Created: 07/Mar/13 Updated: 10/Mar/13 |
|
| Status: | Open |
| Project: | Clojure |
| Component/s: | None |
| Affects Version/s: | Release 1.5 |
| Fix Version/s: | None |
| Type: | Enhancement | Priority: | Minor |
| Reporter: | Trevor Wennblom | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | bug, enhancement |
| Attachments: |
|
| Patch: | Code |
| Description |
|
clojure.java.io/resource corrupts paths containing non-ASCII (UTF-8 encoded) characters without issuing a warning.

    user=> (System/getProperty "java.runtime.version")
    "1.8.0-ea-b79"
    user=> (clojure-version)
    "1.5.0"
    user=> (System/getProperty "user.dir")
    "/dir/déf"
    user=> (clojure.java.io/resource "myfile.txt")
    #<URL file:/dir/d%c3%a9f/resources/myfile.txt>
    user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
    FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory)  java.io.FileInputStream.open (FileInputStream.java:-2) |
| Comments |
| Comment by Andy Fingerhut [ 08/Mar/13 12:10 AM ] |
|
Below is a workaround, at least.

I don't know, but perhaps the as-file method for URLs in io.clj of Clojure, the part that converts %hh sequences to a character with code point in the range 0 through 255, is at least partly at fault here. I don't know right now whether that code can be modified to handle the general case of whatever character encoding munging goes on when .getResource creates the URL object.

clojure.java.io/resource is documented to return a Java object of type java.net.URL, which seems to do %hh escaping of many characters. Reference [1] is to a Java bug from 2001 in which a Java user was surprised by the then-recent change in behavior of the getResource method [2].

Doing a little searching I found this StackOverflow question [3], which has what might be a workaround. I tried it on my Mac OS X 10.6 system running JDK 1.6 and it seemed to work:

    (slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

getContent is a method of class java.net.URL [4].

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4466485 |
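To make the suspected mismatch concrete, here is a minimal sketch that decodes the escaped path from the description in two ways. The helper names decode-per-byte and decode-utf8 are hypothetical and are not part of clojure.java.io; the first only imitates the per-character behavior described in the comment above, and it is an assumption that this is what produces the failure.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: turn each %hh escape into one character whose
;; code point is the byte value (0 through 255).
(defn decode-per-byte [s]
  (str/replace s #"%(..)"
               (fn [[_ hex]] (str (char (Integer/parseInt hex 16))))))

;; Hypothetical helper: treat the %hh escapes as UTF-8 encoded bytes.
;; Note that URLDecoder also turns "+" into a space.
(defn decode-utf8 [s]
  (java.net.URLDecoder/decode s "UTF-8"))

(decode-per-byte "/dir/d%c3%a9f") ; => "/dir/dÃ©f"
(decode-utf8 "/dir/d%c3%a9f")     ; => "/dir/déf"
```

If the per-character decoding is what happens internally, the file lookup would be attempted against a two-character sequence where the directory name has a single é, which would be consistent with the FileNotFoundException reported above.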
| Comment by Trevor Wennblom [ 08/Mar/13 9:56 AM ] |
|
Hi Andy,

Thanks for the background and suggestions, that's very helpful. I'm gradually learning Clojure with no Java experience. In this case I was searching for the preferred Clojure way to access items in directories declared under :resource-paths in a Leiningen project.clj file. Perhaps clojure.java.io/resource isn't the best way to do this, as it's possibly too tied to the expectation of a URI instead of a more general IRI.

Your suggested workaround did work for my use case:

    (slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

but hopefully there will eventually be a more native/direct Clojure way to accomplish the same thing.

I don't know if java.net.IDN would be useful internally as a fix in clojure.java.io/resource; I'm assuming not, since it wasn't added until Java 6. [1]

    user=> (import 'java.net.IDN)
    java.net.IDN
    user=> (java.net.IDN/toASCII "/dir/déf")
    "xn--/dir/df-gya"
    user=> (java.net.IDN/toUnicode "xn--/dir/df-gya")
    "/dir/déf"

[1]: http://docs.oracle.com/javase/6/docs/api/java/net/IDN.html |
| Comment by Andy Fingerhut [ 08/Mar/13 1:30 PM ] |
|
Patch clj-1177-patch-v1.txt dated Mar 8 2013 is an attempt to solve this issue, in what I think may be a correct way. As specified in RFC 3986, when a Unicode string is made into a URL, it should first be encoded in UTF-8, and then each individual byte is subject to the %HH hex encoding. This patch reverses that process to turn URLs back into file names.

Tested on Mac OS X 10.6 with a command line like this (it doesn't work without the -Dfile.encoding=UTF-8 option on my Mac, probably because the default encoding is MacRoman):

    % java -cp clojure.jar:path/to/resource -Dfile.encoding=UTF-8 clojure.main |
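The two directions described above can be sketched as follows. This is an illustration only and is not the content of the attached patch: percent-encode-path and url->path are hypothetical names, the set of characters left unescaped is a simplification chosen for this example, and the reverse direction assumes java.net.URI, whose getPath decodes escaped octets as UTF-8.

```clojure
;; Hypothetical helper: encode the string as UTF-8, then %HH-escape every
;; byte outside a small safe set (unreserved characters plus "/").
(defn percent-encode-path [^String s]
  (apply str
         (for [b (.getBytes s "UTF-8")]
           (let [n (bit-and b 0xff)   ; signed byte -> 0..255
                 c (char n)]
             (if (re-matches #"[A-Za-z0-9._~/-]" (str c))
               c
               (format "%%%02x" n))))))

;; Hypothetical helper: reverse direction, from a file: URL back to a path.
(defn url->path [^java.net.URL u]
  (.getPath (.toURI u)))

(percent-encode-path "/dir/déf")
;; => "/dir/d%c3%a9f"

(url->path (.toURL (java.net.URI. "file:/dir/d%c3%a9f/resources/myfile.txt")))
;; => "/dir/déf/resources/myfile.txt"
```

Both helpers name UTF-8 explicitly or rely on a UTF-8 based API, so the string conversion itself does not depend on -Dfile.encoding. Whether the resulting file name can then be opened may still depend on the platform's file name encoding.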