java.io URL to File coercion and encoding of non-ASCII characters

Description

clojure.java.io/resource corrupts paths containing non-ASCII (UTF-8 encoded) characters without issuing a warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is also seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

Analysis:

The implementation of the as-file method of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are ASCII hex digits, into a separate character in the result. The correct behavior is to treat a run of consecutive '%xy' escapes as a byte sequence encoded in UTF-8, in which a single Unicode code point (i.e. a 'Unicode character') is encoded with anywhere from 1 to 4 bytes.
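
The difference between the two decoding strategies can be illustrated with a small Java sketch. It is not the Clojure source: the class and method names (PercentDecodingDemo, decodePerEscape, decodeAsUtf8) are hypothetical, and decodePerEscape only mimics the per-escape behavior described in this analysis.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PercentDecodingDemo {

    // Returns the byte value of a %xy escape starting at index i,
    // or -1 if no complete escape starts there.
    private static int escapeAt(String s, int i) {
        if (s.charAt(i) != '%' || i + 2 >= s.length()) {
            return -1;
        }
        int hi = Character.digit(s.charAt(i + 1), 16);
        int lo = Character.digit(s.charAt(i + 2), 16);
        return (hi < 0 || lo < 0) ? -1 : hi * 16 + lo;
    }

    // Problematic strategy: every %xy escape becomes one character.
    static String decodePerEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int b = escapeAt(s, i);
            if (b >= 0) {
                out.append((char) b);
                i += 3;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Intended strategy: collect each run of escapes as bytes,
    // then decode the whole run as UTF-8.
    static String decodeAsUtf8(String s) {
        StringBuilder out = new StringBuilder();
        ByteArrayOutputStream run = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int b = escapeAt(s, i);
            if (b >= 0) {
                run.write(b);
                i += 3;
            } else {
                flushRun(run, out);
                out.append(s.charAt(i));
                i++;
            }
        }
        flushRun(run, out);
        return out.toString();
    }

    // Appends the pending bytes to out as UTF-8 text and clears the buffer.
    private static void flushRun(ByteArrayOutputStream run, StringBuilder out) {
        if (run.size() > 0) {
            out.append(new String(run.toByteArray(), StandardCharsets.UTF_8));
            run.reset();
        }
    }
}
```

For the input "caf%C3%A9.txt", decodePerEscape produces the two characters U+00C3 and U+00A9 where a single é (U+00E9) was intended, while decodeAsUtf8 produces "café.txt".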

Patch: clj-1177-patch-v2.diff

Approach:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29
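
A minimal Java sketch of this workaround follows, assuming UTF-8 is the encoding passed to both JDK methods. The class and method names (PlusPreservingDecoder, jdkDecode, decodeKeepingPlus) are hypothetical. The patch itself is written in Clojure, so this only shows the sequence of JDK calls described above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class PlusPreservingDecoder {

    // Plain URLDecoder behavior, for comparison: '+' is decoded to a space.
    static String jdkDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not occur.
            throw new IllegalStateException(e);
        }
    }

    // Workaround: percent-encode literal '+' first, then decode everything.
    static String decodeKeepingPlus(String s) {
        try {
            String encodedPlus = URLEncoder.encode("+", "UTF-8"); // "%2B"
            return jdkDecode(s.replace("+", encodedPlus));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, jdkDecode("a+b%20c") gives "a b c", whereas decodeKeepingPlus("a+b%20c") gives "a+b c"; multi-byte escapes such as "%C3%A9" are decoded as UTF-8 in both cases.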

Other approaches:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

Screened by: Alex Miller

Environment

None

Attachments

3

Activity

Alex Miller December 28, 2014 at 4:40 PM

I'm not sure why this discussion is here - if there is a request for enhancement, please file a new ticket that we can assess and target.

Andy Fingerhut December 27, 2014 at 5:04 PM

So you are not saying that there is a bug in the current implementation in Clojure 1.6.0, but that with some new functions implemented and published as part of the API, a developer could get from a resource name to an input stream more efficiently than with the current API?

import December 26, 2014 at 12:26 PM

Comment made by: ctford

I checked whether there would be a problem with paths already containing escape sequences e.g. "strange%20namespace.clj", but Clojure 1.6 does the right thing.

Here's a proof-of-concept for how we could use .getResourceAsStream():

import December 26, 2014 at 11:06 AM

Comment made by: ctford

Hi Andy,

My understanding of the reason for io/resource returning a bad value is that the file path is URL-encoded in the return value, which is of class URL. This is because the Java .getResource() (http://docs.oracle.com/javase/7/docs/api/java/lang/ClassLoader.html#getResource(java.lang.String)) method called by io/resource returns a URL, so the encoding happens even before we get back to Clojure-land.

.getResourceAsStream() (http://docs.oracle.com/javase/7/docs/api/java/lang/ClassLoader.html#getResourceAsStream(java.lang.String)) is a similar method to .getResource(), but it returns an InputStream. As it doesn't return a URL, the URL-encoding that causes our issue never happens, and so does not need to be decoded.

As it happens, io/reader works with either an InputStream or a URL, so it happily consumes both the output of .getResource() and .getResourceAsStream().

Avoiding unwanted encoding seems like a more robust solution than encoding and decoding, especially in cases where e.g. the path appears to already have been encoded, perhaps already containing a %20.
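
The distinction drawn in this comment can be sketched in Java as follows. This is not the proof-of-concept mentioned earlier in the thread; it is a rough illustration that assumes a throwaway temporary directory served through a URLClassLoader, and the class name, method name, and resource name are all hypothetical. A space is used as the character needing escaping, to avoid depending on the file system's handling of non-ASCII names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ResourceLookupDemo {

    // Hypothetical resource name containing a character that URLs must escape.
    static final String NAME = "strange name.txt";

    // Returns {path of the URL from getResource, text read via getResourceAsStream}.
    static String[] lookupBothWays() {
        try {
            Path dir = Files.createTempDirectory("resource-demo");
            Path file = dir.resolve(NAME);
            Files.write(file, "hello".getBytes(StandardCharsets.UTF_8));
            try (URLClassLoader loader =
                     new URLClassLoader(new URL[] {dir.toUri().toURL()}, null)) {
                URL url = loader.getResource(NAME);
                try (InputStream in = loader.getResourceAsStream(NAME)) {
                    return new String[] {url.getPath(), readUtf8(in)};
                }
            } finally {
                // Clean up the temporary resource and its directory.
                Files.deleteIfExists(file);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole stream and decodes it as UTF-8 text.
    private static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In this sketch the URL path returned by getResource carries the escaped form of the name (ending in "strange%20name.txt"), so a caller converting it to a file path has to decode it. The stream from getResourceAsStream yields the file contents directly, with no path decoding required of the caller.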

Andy Fingerhut December 24, 2014 at 6:15 PM

Chris, I may be missing something in your question, but this bug was due to clojure.java.io/resource returning a value that was incorrect when the resource name contained non-ASCII characters.

After getting a correct return value from clojure.java.io/resource, you can choose to call clojure.java.io/reader on it if you want to read it as text, with UTF-8, UTF-16, etc. encoding, or you can choose instead to call clojure.java.io/input-stream on it if you want to read it as a byte sequence.

However, neither of those second steps can work unless the resource can be found by name somehow.

If that doesn't address your question, please try again.

Completed

Created March 7, 2013 at 10:54 PM
Updated December 28, 2014 at 4:40 PM
Resolved December 28, 2014 at 4:40 PM