<< Back to previous view

[CLJ-1177] clojure.java.io/resource and non-ASCII characters Created: 07/Mar/13  Updated: 10/Mar/13

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: None

Type: Enhancement Priority: Minor
Reporter: Trevor Wennblom Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: bug, enhancement

Attachments: Text File clj-1177-patch-v1.txt    
Patch: Code

 Description   

clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning.

user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory)  java.io.FileInputStream.open (FileInputStream.java:-2)


 Comments   
Comment by Andy Fingerhut [ 08/Mar/13 12:10 AM ]

Below is a workaround, at least. I don't know, but perhaps the as-file method for URLs in io.clj of Clojure, the part that converts %hh sequences to a character with code point in the range 0 through 255, is at least partly at fault here. I don't know right now if it is possible to modify that code to handle the general case of whatever character encoding munging is going on here to when .getResource creates the URL object.

clojure.java.io/resource is documented to return a Java object of type java.net.URL, which seems like it does %hh escaping of many characters. Reference [1] to a Java bug from 2001 where a Java user was surprised by the then-recent change in behavior of the getResource method [2].

Doing a little searching I found this StackOverflow question [3], which has what might be a workaround. I tried it on my Mac OS X 10.6 system running JDK 1.6 and it seemed to work:

(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

That getContent is a method for class java.net.URL [4]

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4466485
[2] http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Class.html#getResource%28java.lang.String%29
[3] http://stackoverflow.com/questions/13013629/best-international-alternative-to-javas-getclass-getresource
[4] http://docs.oracle.com/javase/1.5.0/docs/api/java/net/URL.html#getContent%28%29

Comment by Trevor Wennblom [ 08/Mar/13 9:56 AM ]

Hi Andy,

Thanks for the background and suggestions, that's very helpful.

I'm gradually learning Clojure with no Java experience. In this case I was searching for the preferred Clojure way to access items in directories declared under :resource-paths in a Leiningen project.clj file. Perhaps clojure.java.io/resource isn't the best way to do this as it's possibly too tied to the expectation for a URI instead of a more general IRI.

You're suggested workaround did work for my use case:

(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

but hopefully there would be more native/direct Clojure way to accomplish the same eventually.

I don't know if java.net.IDN would be useful internally as a fix in clojure.java.io/resource — I'm assuming not since it wasn't added until Java 6.[1]

user=> (import 'java.net.IDN)
java.net.IDN
user=> (java.net.IDN/toASCII "/dir/déf")
"xn--/dir/df-gya"
user=> (java.net.IDN/toUnicode "xn--/dir/df-gya")
"/dir/déf"

[1]: http://docs.oracle.com/javase/6/docs/api/java/net/IDN.html

Comment by Andy Fingerhut [ 08/Mar/13 1:30 PM ]

Patch clj-1177-patch-v1.txt dated Mar 8 2013 is an attempt to solve this issue, in what I think may be a correct way. As specified in RFC 3986, when taking a Unicode string and making a URL of it, it should be encoded in UTF-8 and then each individual byte is subject to the %HH hex encoding. This patch reverses that to turn URLs into file names.

Tested on Mac OS X 10.6 with a command line like this (it doesn't work without the -Dfile.encoding=UTF-8 option on my Mac, probably because the default encoding is MacRoman):

% java -cp clojure.jar:path/to/resource -Dfile.encoding=UTF-8 clojure.main
user=> (require '[clojure.java.io :as io])
nil
user=> (io/resource "abcíd/foo.txt")
#<URL file:/Users/jafinger/clj/clj-ns-browser/resource/abc%c3%add/foo.txt>
user=> (slurp (io/resource "abcíd/foo.txt"))
"The quick brown fox jumped over the lázy dög!\n"

Generated at Thu May 23 19:06:34 CDT 2013 using JIRA 4.4#649-r158309.