Clojure

java.io URL to File coercion and encoding of non-ASCII characters

Details

  • Type: Defect
  • Status: Closed
  • Priority: Minor
  • Resolution: Completed
  • Affects Version/s: Release 1.5
  • Fix Version/s: Release 1.6
  • Component/s: None
  • Labels:
  • Patch: Code and Test
  • Approval: Ok

Description

Coercing a URL returned by clojure.java.io/resource to a file corrupts a path that contains non-ASCII characters, which appear in the URL as percent-encoded UTF-8 bytes, and no warning is issued. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0; it also occurs with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory)  java.io.FileInputStream.open (FileInputStream.java:-2)

Analysis:

The implementation of method as-file of protocol Coercions for class java.net.URL converts each occurrence of '%xy', where x and y are ASCII hex digits, into a separate character in the result. The correct behavior is to treat a run of one or more consecutive '%xy' escapes as a byte sequence encoded in UTF-8, in which a single Unicode code point (i.e. a 'Unicode character') is encoded with anywhere from 1 to 4 bytes.
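The difference described in the analysis can be reproduced at the REPL. The following is a minimal sketch, not the code from clojure.java.io: decode-per-escape is a hypothetical helper written only to imitate the per-escape behavior, and it is compared against java.net.URLDecoder/decode with UTF-8.

```clojure
(require '[clojure.string :as string])
(import '(java.net URLDecoder))

;; Hypothetical helper (an assumption for illustration, not the real as-file code):
;; turns every %xy escape into its own character, ignoring UTF-8 byte grouping.
(defn decode-per-escape [s]
  (string/replace s #"%([0-9a-fA-F]{2})"
                  (fn [[_ hex]]
                    (str (char (Integer/parseInt hex 16))))))

;; %c3%a9 is the two-byte UTF-8 encoding of the single code point U+00E9 (é).
(decode-per-escape "/dir/d%c3%a9f")          ;; => "/dir/dÃ©f" (two characters, wrong)
(URLDecoder/decode "/dir/d%c3%a9f" "UTF-8")  ;; => "/dir/déf"  (one character, right)
```

The first result names a directory that does not exist, which is consistent with the FileNotFoundException reported above.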

Patch: clj-1177-patch-v2.diff

Approach:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it converts plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first percent-encode any plus-sign characters in the given URL string using method java.net.URLEncoder.encode, then pass the result to method decode.

http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29
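The two-step approach described above can be sketched as follows. This is an illustration under stated assumptions rather than the contents of the attached patch: the function name decode-url-path is hypothetical, and only the calls to URLEncoder/encode and URLDecoder/decode come from the description.

```clojure
(require '[clojure.string :as string])
(import '(java.net URLDecoder URLEncoder))

;; Hypothetical helper name; sketches the approach, not the actual patch.
(defn decode-url-path [s]
  (let [;; Step 1: replace each literal "+" with its percent-encoded form
        ;; so that URLDecoder does not turn it into a space.
        plus-escaped (string/replace s "+" (URLEncoder/encode "+" "UTF-8"))]
    ;; Step 2: decode all %xy escapes, treating them as UTF-8 bytes.
    (URLDecoder/decode plus-escaped "UTF-8")))

(decode-url-path "/dir/d%c3%a9f/a+b%20c.txt")  ;; => "/dir/déf/a+b c.txt"
```

Escaping the plus sign first means the decoder restores it unchanged, while %20 still becomes a space and multi-byte escapes are combined into single characters.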

Other approaches:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
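For comparison, a hand-rolled decoder in the spirit of this alternate approach might look like the sketch below. It is not taken from clj-1177-patch-v1.txt; the name unescape-utf8 and the regex-based strategy are assumptions chosen for brevity. It collects each run of consecutive escapes into a byte array and decodes that array as UTF-8, leaving every other character, including plus signs, untouched.

```clojure
(require '[clojure.string :as string])

;; Hypothetical sketch, not the code in the v1 patch.
(defn unescape-utf8 [s]
  (string/replace s #"(?:%[0-9a-fA-F]{2})+"
                  (fn [run]
                    ;; Parse each two-digit hex value in the run into a byte.
                    (let [bs (->> (re-seq #"[0-9a-fA-F]{2}" run)
                                  (map #(unchecked-byte (Integer/parseInt % 16)))
                                  byte-array)]
                      ;; Decode the whole run at once so multi-byte
                      ;; sequences become single code points.
                      (String. ^bytes bs "UTF-8")))))

(unescape-utf8 "/dir/d%c3%a9f/a+b.txt")  ;; => "/dir/déf/a+b.txt"
```

Because it never interprets the plus sign, this variant needs no pre-escaping step, at the cost of doing the byte handling itself.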

Screened by: Alex Miller

  1. clj-1177-patch-v1.txt (08/Mar/13 1:30 PM, 3 kB, Andy Fingerhut)
  2. clj-1177-patch-v2.diff (22/Oct/13 9:10 AM, 2 kB, Alex Miller)
  3. clj-1177-patch-v2.txt (01/Sep/13 10:51 AM, 2 kB, Andy Fingerhut)
