[CLJ-1177] java.io URL to File coercion and encoding of non-ASCII characters Created: 07/Mar/13 Updated: 28/Dec/14 Resolved: 25/Oct/13
|Affects Version/s:||Release 1.5|
|Fix Version/s:||Release 1.6|
|Attachments:||clj-1177-patch-v1.txt clj-1177-patch-v2.diff clj-1177-patch-v2.txt|
|Patch:||Code and Test|
clojure.java.io/resource corrupts paths containing non-ASCII characters (percent-encoded as UTF-8) without issuing a warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)
The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.
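To make the difference concrete, here is a small sketch. The helper decode-per-byte is hypothetical and only mimics the faulty behavior described above; it is not the code from io.clj. The input "abc%c3%add" assumes that í (U+00ED) was percent-encoded as its two UTF-8 bytes C3 AD.

```clojure
(require '[clojure.string :as str])
(import '(java.net URLDecoder))

(defn decode-per-byte
  "Hypothetical helper that mimics the faulty behavior:
  every %xy escape becomes one separate character."
  [s]
  (str/replace s #"%([0-9a-fA-F]{2})"
               (fn [[_ hex]] (str (char (Integer/parseInt hex 16))))))

;; Faulty: %c3 and %ad become two characters (U+00C3 and U+00AD).
(def wrong (decode-per-byte "abc%c3%add"))

;; Correct: the two bytes C3 AD are decoded together as one
;; UTF-8 code point, U+00ED.
(def right (URLDecoder/decode "abc%c3%add" "UTF-8"))

(println (count wrong) (count right))
```

The faulty result is one character longer than the correct one, so the resulting file name no longer matches the file on disk.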
Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.
The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.
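A minimal sketch of this approach follows. It is written from the description above and is not necessarily the exact code in the patch; the function name decode-keeping-plus is an assumption made for illustration.

```clojure
(require '[clojure.string :as str])
(import '(java.net URLDecoder URLEncoder))

(defn decode-keeping-plus
  "Hypothetical helper: decodes %xy escapes as UTF-8 while leaving
  literal plus signs untouched."
  [s]
  (-> s
      ;; Protect each "+" by encoding it first ("+" becomes "%2B") ...
      (str/replace "+" (URLEncoder/encode "+" "UTF-8"))
      ;; ... so that decode turns it back into "+" rather than a space.
      (URLDecoder/decode "UTF-8")))

(decode-keeping-plus "bar+baz")            ; plus sign is preserved
(decode-keeping-plus "abc%c3%add/foo.txt") ; UTF-8 escapes are decoded
```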
Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
Screened by: Alex Miller
|Comment by Andy Fingerhut [ 08/Mar/13 12:10 AM ]|
Below is a workaround, at least. I don't know for certain, but the as-file method for URLs in Clojure's io.clj, specifically the part that converts each %hh sequence to a character with a code point in the range 0 through 255, may be at least partly at fault here. I don't know right now whether that code can be modified to handle, in general, whatever character encoding munging happens when .getResource creates the URL object.
clojure.java.io/resource is documented to return a Java object of type java.net.URL, which seems to apply %hh escaping to many characters. See the Java bug report from 2001 in which a Java user was surprised by the then-recent change in behavior of the getResource method.
Doing a little searching, I found a StackOverflow question that has what might be a workaround. I tried it on my Mac OS X 10.6 system running JDK 1.6 and it seemed to work:
(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))
getContent is a method of class java.net.URL.
|Comment by Trevor Wennblom [ 08/Mar/13 9:56 AM ]|
Thanks for the background and suggestions, that's very helpful.
I'm gradually learning Clojure with no Java experience. In this case I was searching for the preferred Clojure way to access items in directories declared under :resource-paths in a Leiningen project.clj file. Perhaps clojure.java.io/resource isn't the best way to do this as it's possibly too tied to the expectation for a URI instead of a more general IRI.
Your suggested workaround did work for my use case:
(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))
but hopefully there will eventually be a more native/direct Clojure way to accomplish the same.
I don't know if java.net.IDN would be useful internally as a fix in clojure.java.io/resource; I'm assuming not, since it wasn't added until Java 6.
|Comment by Andy Fingerhut [ 08/Mar/13 1:30 PM ]|
Patch clj-1177-patch-v1.txt dated Mar 8 2013 is an attempt to solve this issue, in what I think may be a correct way. As specified in RFC 3986, when taking a Unicode string and making a URL of it, it should be encoded in UTF-8 and then each individual byte is subject to the %HH hex encoding. This patch reverses that to turn URLs into file names.
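As a rough illustration of the encoding direction described here, the sketch below UTF-8 encodes a string and writes every byte as %HH. The helper percent-encode-all is hypothetical and deliberately simplistic: a real URL encoder would leave unreserved ASCII characters unescaped.

```clojure
(defn percent-encode-all
  "Hypothetical helper: UTF-8 encode s, then write each byte as %HH."
  [^String s]
  (apply str
         (map #(format "%%%02X" (bit-and % 0xff)) ; mask to an unsigned byte
              (.getBytes s "UTF-8"))))

;; U+00ED is two bytes in UTF-8, so it yields two %HH escapes.
(percent-encode-all "\u00ed")
```

Turning a URL back into a file name has to reverse both steps: collect the bytes from consecutive %HH escapes first, then decode the byte sequence as UTF-8.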
Tested on Mac OS X 10.6 with a command line like this (it doesn't work without the -Dfile.encoding=UTF-8 option on my Mac, probably because the default encoding is MacRoman):
% java -cp clojure.jar:path/to/resource -Dfile.encoding=UTF-8 clojure.main
|Comment by Alex Miller [ 24/Jul/13 10:08 PM ]|
I think the original code and all of these suggestions are missing more obvious answers already in the JDK (and better).
1. URLs can be converted to URIs which can be passed to the File constructor:
2. Or we could also leverage URLDecoder instead of that nasty escaping mess currently in the code.
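A minimal sketch of what these two options could look like, assuming an absolute file: URL. The function names url->file-via-uri and url->file-via-decoder are hypothetical and are not taken from the ticket.

```clojure
(import '(java.io File)
        '(java.net URL URLDecoder))

;; Option 1: convert the URL to a URI and use the File(URI) constructor.
(defn url->file-via-uri [^URL u]
  (File. (.toURI u)))

;; Option 2: decode the path portion of the URL with URLDecoder.
(defn url->file-via-decoder [^URL u]
  (File. (URLDecoder/decode (.getFile u) "UTF-8")))

(url->file-via-uri (URL. "file:/tmp/a%20b"))
(url->file-via-decoder (URL. "file:/tmp/a%20b"))
```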
|Comment by Alex Miller [ 24/Jul/13 10:41 PM ]|
One big caveat: the alternatives I gave above only work for absolute URLs. Relative URLs would need some massaging. I think to cover those, #2 would be better as it gives you a hook to look at the output of getFile and decide whether it's relative.
|Comment by Andy Fingerhut [ 25/Jul/13 8:46 PM ]|
On my system (Mac OS X 10.8.4, JVM 1.7.0_15):
#1 has the same problem of munging characters as the current code does. At least, I got errors trying to open a file with an accented "a" in it, because it tried to open a file with a name that had two characters in place of the accented "a".
#2 is better, but it fails with one of the tests that calls (clojure.java.io/as-file (URL. "file:bar+baz")). With your version #2, URLDecoder/decode changes the plus to a space, and the test comparison to the expected result of (File. "bar+baz") fails. I don't know if that is a good test or not, but if it is, the documentation I read for URLDecoder/decode suggests that it will always change plus to space, regardless of whether it is an absolute or relative URL.
|Comment by Andy Fingerhut [ 01/Sep/13 10:51 AM ]|
Patch clj-1177-patch-v2.txt dated Sep 1 2013 uses URLDecoder/decode to do the decoding of the URL, but only after encoding any plus signs in the URL first, so that they remain plus signs in the returned file name, and are not changed to spaces.
This patch also adds one new test for as-file.
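A test along these lines might look like the sketch below. It is not the test from the patch; the test name as-file-url-decoding is an assumption, and the assertions are expected to hold only on a Clojure version that includes the fix (1.6 or later).

```clojure
(require '[clojure.java.io :as io]
         '[clojure.test :refer [deftest is]])
(import '(java.io File)
        '(java.net URL))

(deftest as-file-url-decoding
  ;; A literal plus sign must survive the conversion.
  (is (= (File. "bar+baz") (io/as-file (URL. "file:bar+baz"))))
  ;; %c3%ad is the UTF-8 encoding of U+00ED and must become one character.
  (is (= (File. "abc\u00edd") (io/as-file (URL. "file:abc%c3%add")))))
```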
|Comment by Chris Ford [ 24/Dec/14 8:02 AM ]|
I'm a little late to this party, but is there a reason not to use .getResourceAsStream() (which returns an InputStream) instead of .getResource() (which returns a URL)?
We wouldn't have to worry about reversing encoding if we avoided encoding in the first place. This change is compatible with io/reader, though a more conservative approach would be to add a new stream-resource function.
|Comment by Andy Fingerhut [ 24/Dec/14 12:15 PM ]|
Chris, I may be missing something in your question, but this bug was due to clojure.java.io/resource returning a value that was incorrect when the resource name contained non-ASCII characters.
After getting a correct return value from clojure.java.io/resource, you can choose to call clojure.java.io/reader on it if you want to read it as text, with UTF-8, UTF-16, etc. encoding, or you can choose instead to call clojure.java.io/input-stream on it if you want to read it as a byte sequence.
However, neither of those second steps can work unless the resource can be found by name somehow.
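The two ways of reading can be sketched as follows. To keep the example self-contained it uses a temporary file converted to a URL rather than a real classpath resource; the names tmp, url, text and byte-count are illustrative only.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayOutputStream File))

;; Stand-in for a resource: a temp file holding 5 characters, 6 UTF-8 bytes.
(def tmp (doto (File/createTempFile "clj-1177" ".txt") (.deleteOnExit)))
(spit tmp "h\u00e9llo" :encoding "UTF-8")
(def url (io/as-url tmp))

;; Read as text, naming the encoding explicitly.
(def text
  (with-open [r (io/reader url :encoding "UTF-8")]
    (slurp r)))

;; Read as raw bytes and count them.
(def byte-count
  (with-open [in  (io/input-stream url)
              out (ByteArrayOutputStream.)]
    (io/copy in out)
    (.size out)))
```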
If that doesn't address your question, please try again.
|Comment by Chris Ford [ 26/Dec/14 5:06 AM ]|
My understanding of the reason for io/resource returning a bad value is that the file path is URL-encoded in the return value, which is of class URL. This is because the Java .getResource() (http://docs.oracle.com/javase/7/docs/api/java/lang/ClassLoader.html#getResource(java.lang.String)) method called by io/resource returns a URL, so the encoding happens even before we get back to Clojure-land.
.getResourceAsStream() (http://docs.oracle.com/javase/7/docs/api/java/lang/ClassLoader.html#getResourceAsStream(java.lang.String)) is a similar method to .getResource(), but it returns an InputStream. As it doesn't return a URL, the URL-encoding that causes our issue never happens, and so does not need to be decoded.
As it happens, io/reader works with either an InputStream or a URL, so it happily consumes both the output of .getResource() and .getResourceAsStream().
Avoiding unwanted encoding seems like a more robust solution than encoding and decoding, especially in cases where e.g. the path appears to already have been encoded, perhaps already containing a %20.
|Comment by Chris Ford [ 26/Dec/14 6:26 AM ]|
I checked whether there would be a problem with paths already containing escape sequences e.g. "strange%20namespace.clj", but Clojure 1.6 does the right thing.
Here's a proof-of-concept for how we could use .getResourceAsStream():
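A minimal sketch of the idea, using the stream-resource name suggested earlier in this thread. The function is hypothetical and is not part of clojure.java.io; its arities are assumed to mirror those of io/resource.

```clojure
(require '[clojure.java.io :as io])

(defn stream-resource
  "Hypothetical function: returns an InputStream for the named resource,
  or nil if the resource cannot be found."
  ([n] (stream-resource n (.getContextClassLoader (Thread/currentThread))))
  ([^String n ^ClassLoader loader] (.getResourceAsStream loader n)))

;; Usage: io/reader accepts the InputStream directly, so no URL
;; (and no percent-encoding) is involved.
(comment
  (with-open [r (io/reader (stream-resource "abcíd/foo.txt"))]
    (slurp r)))
```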
|Comment by Andy Fingerhut [ 27/Dec/14 11:04 AM ]|
So you are not saying that there is a bug in the current implementation in Clojure 1.6.0, but that with some new functions implemented and published as part of the API, a developer could get from a resource name to an input stream more efficiently than with the current API?
|Comment by Alex Miller [ 28/Dec/14 10:40 AM ]|
I'm not sure why this discussion is here - if there is a request for enhancement, please file a new ticket that we can assess and target.