Clojure

java.io URL to File coercion and encoding of non-ASCII characters

Details

  • Type: Defect Defect
  • Status: Closed Closed
  • Priority: Minor Minor
  • Resolution: Completed
  • Affects Version/s: Release 1.5
  • Fix Version/s: Release 1.6
  • Component/s: None
  • Labels:
  • Patch:
    Code and Test
  • Approval:
    Ok

Description

clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory)  java.io.FileInputStream.open (FileInputStream.java:-2)

Analysis:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

Patch: clj-1177-patch-v2.diff

Approach:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

Other approaches:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

Screened by: Alex Miller

  1. clj-1177-patch-v1.txt
    08/Mar/13 1:30 PM
    3 kB
    Andy Fingerhut
  2. clj-1177-patch-v2.diff
    22/Oct/13 9:10 AM
    2 kB
    Alex Miller
  3. clj-1177-patch-v2.txt
    01/Sep/13 10:51 AM
    2 kB
    Andy Fingerhut

Activity

Andy Fingerhut made changes -
Field Original Value New Value
Attachment clj-1177-patch-v1.txt [ 11900 ]
Andy Fingerhut made changes -
Patch Code [ 10001 ]
Alex Miller made changes -
Approval Triaged [ 10120 ]
Alex Miller made changes -
Summary clojure.java.io/resource and non-ASCII characters java.io URL to File coercion and encoding of non-ASCII characters
Alex Miller made changes -
Issue Type Enhancement [ 4 ] Defect [ 1 ]
Andy Fingerhut made changes -
Attachment clj-1177-patch-v2.txt [ 12218 ]
Alex Miller made changes -
Labels bug enhancement io
Rich Hickey made changes -
Approval Triaged [ 10120 ] Vetted [ 10003 ]
Fix Version/s Release 1.6 [ 10157 ]
Andy Fingerhut made changes -
Patch Code [ 10001 ] Code and Test [ 10002 ]
Description clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning.

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}
clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Patch*: clj-1177-patch-v2.txt

*Approach*:

In the implementation of method as-file of protocol Coercions to the java.net.URL class, use method java.net.URLDecoder.decode to decode the contents of a URL string that may contain escaped characters in the form "%xy", where x and y are hex digits in ASCII.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The previous implementation of as-file for URL's also did unescaping of such sequences, but it decoded each %xy occurrence to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
Andy Fingerhut made changes -
Description clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Patch*: clj-1177-patch-v2.txt

*Approach*:

In the implementation of method as-file of protocol Coercions to the java.net.URL class, use method java.net.URLDecoder.decode to decode the contents of a URL string that may contain escaped characters in the form "%xy", where x and y are hex digits in ASCII.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The previous implementation of as-file for URL's also did unescaping of such sequences, but it decoded each %xy occurrence to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
Andy Fingerhut made changes -
Description clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

Concern from Alex Miller: One big caveat: this approach only works for absolute URLs. Relative URLs would need some massaging.

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
Andy Fingerhut made changes -
Description clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

Concern from Alex Miller: One big caveat: this approach only works for absolute URLs. Relative URLs would need some massaging.

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

Concern from Alex Miller: One big caveat: this approach only works for absolute URLs. Relative URLs would need some massaging.

Response from Andy Fingerhut: Can someone give an example of a relative URL that fails to work with this approach, and what it should produce instead?

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
Alex Miller made changes -
Approval Vetted [ 10003 ] Screened [ 10004 ]
Description clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

Concern from Alex Miller: One big caveat: this approach only works for absolute URLs. Relative URLs would need some massaging.

Response from Andy Fingerhut: Can someone give an example of a relative URL that fails to work with this approach, and what it should produce instead?

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.
clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

*Screened by:* Alex Miller
Alex Miller made changes -
Description clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.txt

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

*Screened by:* Alex Miller
clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning. (The behavior in the example below is not specific to JDK 8 or Clojure 1.5.0. It is seen with the latest Clojure master as of Sep 15, 2013, and with JDK 6 and JDK 7.)

{code}
user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory) java.io.FileInputStream.open (FileInputStream.java:-2)
{code}

*Analysis*:

The implementation of method as-file of protocol Coercions for class java.net.URL transforms each occurrence of '%xy', where x and y are hex digits in ASCII, to a separate character in the result. The correct behavior is to treat sequences of more than one '%xy' as a byte sequence encoded in UTF-8, where single Unicode code points (i.e. 'Unicode characters') are encoded with anywhere from 1 to 4 bytes.

*Patch*: clj-1177-patch-v2.diff

*Approach*:

Change method as-file for class java.net.URL to use method java.net.URLDecoder.decode to decode the contents of a URL string.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLDecoder.html#decode%28java.lang.String,%20java.lang.String%29

The only issue with java.net.URLDecoder.decode's behavior is that it changes plus-sign characters to spaces, which according to at least one of the existing unit tests should not happen in as-file. To work around this, first explicitly encode any plus-sign characters in the given URL string, using method java.net.URLEncoder.encode. After that, pass the result to method decode.

    http://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html#encode%28java.lang.String,%20java.lang.String%29

*Other approaches*:

Patch clj-1177-patch-v1.txt represents an alternate approach that does its own 'unescaping' of UTF-8 encoded URL strings, without relying on class java.net.URLDecoder. As a result, it is longer and more detailed.

*Screened by:* Alex Miller
Attachment clj-1177-patch-v2.diff [ 12363 ]
Rich Hickey made changes -
Approval Screened [ 10004 ] Ok [ 10007 ]
Stuart Halloway made changes -
Resolution Completed [ 1 ]
Status Open [ 1 ] Closed [ 6 ]

People

Vote (1)
Watch (2)

Dates

  • Created:
    Updated:
    Resolved: