[CLJ-1177] java.io URL to File coercion and encoding of non-ASCII characters
Created: 07/Mar/13  Updated: 25/Oct/13  Resolved: 25/Oct/13

Status: Closed
Project: Clojure
Component/s: None
Affects Version/s: Release 1.5
Fix Version/s: Release 1.6

Type: Defect
Priority: Minor
Reporter: Trevor Wennblom
Assignee: Unassigned
Resolution: Completed
Votes: 1
Labels: io

Attachments: clj-1177-patch-v1.txt, clj-1177-patch-v2.diff, clj-1177-patch-v2.txt
Patch: Code and Test
Approval: Ok

 Description   

clojure.java.io/resource corrupts paths containing non-ASCII characters without issuing a warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is also seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory)  java.io.FileInputStream.open (FileInputStream.java:-2)

Analysis:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.
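The difference described in the analysis can be illustrated with a minimal sketch. The helper names decode-per-escape and decode-as-utf8 are hypothetical and are not taken from io.clj or from either patch; the first only imitates the one-character-per-%xy behavior, and the second delegates to java.net.URLDecoder.

```clojure
(require '[clojure.string :as str])
(import '(java.net URLDecoder))

;; Hypothetical helper: turns every %xy escape into its own character
;; (code point 0 through 255), the behavior the analysis calls incorrect.
(defn decode-per-escape [s]
  (str/replace s #"%([0-9a-fA-F]{2})"
               (fn [[_ hex]] (str (char (Integer/parseInt hex 16))))))

;; Hypothetical helper: treats runs of %xy escapes as UTF-8 encoded bytes.
(defn decode-as-utf8 [s]
  (URLDecoder/decode s "UTF-8"))

(decode-per-escape "/dir/d%c3%a9f") ; => "/dir/dÃ©f" (two characters replace é)
(decode-as-utf8 "/dir/d%c3%a9f")    ; => "/dir/déf"
```

The first result contains U+00C3 and U+00A9 as two separate characters, which is why the file lookup in the description fails.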

Patch: clj-1177-patch-v2.diff

Approach:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29
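The encode-then-decode approach above can be sketched as follows. This is only an illustration of the idea, not the text of clj-1177-patch-v2.diff; the function name decode-keeping-plus is hypothetical.

```clojure
(require '[clojure.string :as str])
(import '(java.net URLDecoder URLEncoder))

;; Hypothetical helper: protect plus signs by encoding them first,
;; then decode the whole string as UTF-8.
(defn decode-keeping-plus [s]
  (-> s
      (str/replace "+" (URLEncoder/encode "+" "UTF-8")) ; "+" becomes "%2B"
      (URLDecoder/decode "UTF-8")))

(URLDecoder/decode "bar+baz" "UTF-8")   ; => "bar baz" (plus turned into a space)
(decode-keeping-plus "bar+baz")         ; => "bar+baz"
(decode-keeping-plus "/dir/d%c3%a9f")   ; => "/dir/déf"
```

Because each plus sign is already "%2B" by the time decode runs, decode never sees a literal plus sign to turn into a space.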

Other approaches:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

Screened by: Alex Miller



 Comments   
Comment by Andy Fingerhut [ 08/Mar/13 12:10 AM ]

Below is a workaround, at least. I don't know, but perhaps the as-file method for URLs in io.clj of Clojure, the part that converts %hh sequences to a character with code point in the range 0 through 255, is at least partly at fault here. I don't know right now if it is possible to modify that code to handle the general case of whatever character encoding munging is going on when .getResource creates the URL object.

clojure.java.io/resource is documented to return a Java object of type java.net.URL, which seems to do %hh escaping of many characters. Reference [1] is a Java bug report from 2001 in which a Java user was surprised by the then-recent change in behavior of the getResource method [2].

Doing a little searching I found this StackOverflow question [3], which has what might be a workaround. I tried it on my Mac OS X 10.6 system running JDK 1.6 and it seemed to work:

(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

getContent is a method of class java.net.URL [4].

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4466485
[2] http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Class.html#getResource%28java.lang.String%29
[3] http://stackoverflow.com/questions/13013629/best-international-alternative-to-javas-getclass-getresource
[4] http://docs.oracle.com/javase/1.5.0/docs/api/java/net/URL.html#getContent%28%29

Comment by Trevor Wennblom [ 08/Mar/13 9:56 AM ]

Hi Andy,

Thanks for the background and suggestions, that's very helpful.

I'm gradually learning Clojure with no Java experience. In this case I was searching for the preferred Clojure way to access items in directories declared under :resource-paths in a Leiningen project.clj file. Perhaps clojure.java.io/resource isn't the best way to do this as it's possibly too tied to the expectation for a URI instead of a more general IRI.

Your suggested workaround did work for my use case:

(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

but hopefully there will eventually be a more native/direct Clojure way to accomplish the same.

I don't know if java.net.IDN would be useful internally as a fix in clojure.java.io/resource. I'm assuming not, since it wasn't added until Java 6. [1]

user=> (import 'java.net.IDN)
java.net.IDN
user=> (java.net.IDN/toASCII "/dir/déf")
"xn--/dir/df-gya"
user=> (java.net.IDN/toUnicode "xn--/dir/df-gya")
"/dir/déf"

[1]: http://docs.oracle.com/javase/6/docs/api/java/net/IDN.html

Comment by Andy Fingerhut [ 08/Mar/13 1:30 PM ]

Patch clj-1177-patch-v1.txt dated Mar 8 2013 is an attempt to solve this issue, in what I think may be a correct way. As specified in RFC 3986, when taking a Unicode string and making a URL of it, it should be encoded in UTF-8 and then each individual byte is subject to the %HH hex encoding. This patch reverses that to turn URLs into file names.
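As a rough illustration of the encoding direction described here (UTF-8 first, then %HH for each byte), the sketch below uses a hypothetical helper named percent-encode-all. Unlike a real URL encoder, it escapes every byte, including ASCII characters that would normally be left as they are, purely to keep the example short.

```clojure
(import '(java.net URLDecoder))

;; Hypothetical helper: encode the string as UTF-8, then write every byte as %HH.
;; A real encoder would leave unreserved ASCII characters unescaped.
(defn percent-encode-all [^String s]
  (apply str (map #(format "%%%02X" (bit-and % 0xff))
                  (.getBytes s "UTF-8"))))

(percent-encode-all "é")                              ; => "%C3%A9"
(URLDecoder/decode (percent-encode-all "é") "UTF-8")  ; => "é"
```

Turning a URL back into a file name means reversing these two steps in the opposite order: collect the %HH bytes, then decode them as UTF-8.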

Tested on Mac OS X 10.6 with a command line like this (it doesn't work without the -Dfile.encoding=UTF-8 option on my Mac, probably because the default encoding is MacRoman):

% java -cp clojure.jar:path/to/resource -Dfile.encoding=UTF-8 clojure.main
user=> (require '[clojure.java.io :as io])
nil
user=> (io/resource "abcíd/foo.txt")
#<URL file:/Users/jafinger/clj/clj-ns-browser/resource/abc%c3%add/foo.txt>
user=> (slurp (io/resource "abcíd/foo.txt"))
"The quick brown fox jumped over the lázy dög!\n"

Comment by Alex Miller [ 24/Jul/13 10:08 PM ]

I think the original code and all of these suggestions are missing more obvious (and better) answers already in the JDK.

1. URLs can be converted to URIs which can be passed to the File constructor:

(java.io.File. (.toURI (io/resource "abcíd/foo.txt")))

2. Or we could also leverage URLDecoder instead of that nasty escaping mess currently in the code.

(java.io.File.
  (java.net.URLDecoder/decode
    (.getFile (io/resource "abcíd/foo.txt"))
    "UTF-8"))

Comment by Alex Miller [ 24/Jul/13 10:41 PM ]

One big caveat: the alternatives I gave above only work for absolute URLs. Relative URLs would need some massaging. I think to cover those, #2 would be better as it gives you a hook to look at the output of getFile and decide whether it's relative.
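A small sketch of the caveat for alternative #1, assuming a relative URL such as file:bar+baz. The helper name url->file-via-uri is hypothetical. The java.io.File constructor that takes a URI rejects non-hierarchical URIs with an IllegalArgumentException, which the helper turns into nil.

```clojure
(import '(java.io File) '(java.net URL))

;; Hypothetical helper: try alternative #1 and return nil when the
;; File constructor rejects the URI.
(defn url->file-via-uri [^URL u]
  (try
    (File. (.toURI u))
    (catch IllegalArgumentException _
      nil)))

(url->file-via-uri (URL. "file:bar+baz"))                 ; => nil (relative URL)
(.getName (url->file-via-uri (URL. "file:/tmp/bar+baz"))) ; => "bar+baz"
(.getFile (URL. "file:bar+baz"))                          ; => "bar+baz"
```

The last call shows the hook mentioned for #2: the string returned by getFile can be inspected before deciding how to build the File.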

Comment by Andy Fingerhut [ 25/Jul/13 8:46 PM ]

On my system (Mac OS X 10.8.4, JVM 1.7.0_15):

#1 has the same problem of munging characters as the current code does. At least, I got errors trying to open a file with an accented "a" in it, because it tried to open a file with a name that had two characters in place of the accented "a".

#2 is better, but it fails with one of the tests that calls (clojure.java.io/as-file (URL. "file:bar+baz")). With your version #2, URLDecoder/decode changes the plus to a space, and the test comparison to the expected result of (File. "bar+baz") fails. I don't know if that is a good test or not, but if it is, the documentation I read for URLDecoder/decode suggests that it will always change plus to space, regardless of whether it is an absolute or relative URL.

Comment by Andy Fingerhut [ 01/Sep/13 10:51 AM ]

Patch clj-1177-patch-v2.txt dated Sep 1 2013 uses URLDecoder/decode to do the decoding of the URL, but only after encoding any plus signs in the URL first, so that they remain plus signs in the returned file name, and are not changed to spaces.

This patch also adds one new test for as-file.

Generated at Sun Oct 26 00:58:00 CDT 2014 using JIRA 4.4#649-r158309.