Clojure

clojure.java.io/resource and non-ASCII characters

Details

  • Type: Enhancement Enhancement
  • Status: Open Open
  • Priority: Minor Minor
  • Resolution: Unresolved
  • Affects Version/s: Release 1.5
  • Fix Version/s: None
  • Component/s: None
  • Labels:
  • Patch:
    Code

Description

clojure.java.io/resource corrupts path containing UTF-8 characters without issuing warning.

user=> (System/getProperty "java.runtime.version")
"1.8.0-ea-b79"
user=> (clojure-version)
"1.5.0"
user=> (System/getProperty "user.dir")
"/dir/déf"
user=> (clojure.java.io/resource "myfile.txt")
#<URL file:/dir/d%c3%a9f/resources/myfile.txt>
user=> (slurp (clojure.java.io/resource "myfile.txt") :encoding "UTF-8")
FileNotFoundException /dir/déf/resources/myfile.txt (No such file or directory)  java.io.FileInputStream.open (FileInputStream.java:-2)

Activity

Hide
Andy Fingerhut added a comment -

Below is a workaround, at least. I don't know, but perhaps the as-file method for URLs in io.clj of Clojure, the part that converts %hh sequences to a character with code point in the range 0 through 255, is at least partly at fault here. I don't know right now if it is possible to modify that code to handle the general case of whatever character encoding munging is going on here to when .getResource creates the URL object.

clojure.java.io/resource is documented to return a Java object of type java.net.URL, which seems like it does %hh escaping of many characters. Reference [1] to a Java bug from 2001 where a Java user was surprised by the then-recent change in behavior of the getResource method [2].

Doing a little searching I found this StackOverflow question [3], which has what might be a workaround. I tried it on my Mac OS X 10.6 system running JDK 1.6 and it seemed to work:

(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

That getContent is a method for class java.net.URL [4]

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4466485
[2] http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Class.html#getResource%28java.lang.String%29
[3] http://stackoverflow.com/questions/13013629/best-international-alternative-to-javas-getclass-getresource
[4] http://docs.oracle.com/javase/1.5.0/docs/api/java/net/URL.html#getContent%28%29

Show
Andy Fingerhut added a comment - Below is a workaround, at least. I don't know, but perhaps the as-file method for URLs in io.clj of Clojure, the part that converts %hh sequences to a character with code point in the range 0 through 255, is at least partly at fault here. I don't know right now if it is possible to modify that code to handle the general case of whatever character encoding munging is going on here to when .getResource creates the URL object. clojure.java.io/resource is documented to return a Java object of type java.net.URL, which seems like it does %hh escaping of many characters. Reference [1] to a Java bug from 2001 where a Java user was surprised by the then-recent change in behavior of the getResource method [2]. Doing a little searching I found this StackOverflow question [3], which has what might be a workaround. I tried it on my Mac OS X 10.6 system running JDK 1.6 and it seemed to work: (slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt"))) That getContent is a method for class java.net.URL [4] [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4466485 [2] http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Class.html#getResource%28java.lang.String%29 [3] http://stackoverflow.com/questions/13013629/best-international-alternative-to-javas-getclass-getresource [4] http://docs.oracle.com/javase/1.5.0/docs/api/java/net/URL.html#getContent%28%29
Hide
Trevor Wennblom added a comment -

Hi Andy,

Thanks for the background and suggestions, that's very helpful.

I'm gradually learning Clojure with no Java experience. In this case I was searching for the preferred Clojure way to access items in directories declared under :resource-paths in a Leiningen project.clj file. Perhaps clojure.java.io/resource isn't the best way to do this as it's possibly too tied to the expectation for a URI instead of a more general IRI.

You're suggested workaround did work for my use case:

(slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt")))

but hopefully there would be more native/direct Clojure way to accomplish the same eventually.

I don't know if java.net.IDN would be useful internally as a fix in clojure.java.io/resource — I'm assuming not since it wasn't added until Java 6.[1]

user=> (import 'java.net.IDN)
java.net.IDN
user=> (java.net.IDN/toASCII "/dir/déf")
"xn--/dir/df-gya"
user=> (java.net.IDN/toUnicode "xn--/dir/df-gya")
"/dir/déf"

[1]: http://docs.oracle.com/javase/6/docs/api/java/net/IDN.html

Show
Trevor Wennblom added a comment - Hi Andy, Thanks for the background and suggestions, that's very helpful. I'm gradually learning Clojure with no Java experience. In this case I was searching for the preferred Clojure way to access items in directories declared under :resource-paths in a Leiningen project.clj file. Perhaps clojure.java.io/resource isn't the best way to do this as it's possibly too tied to the expectation for a URI instead of a more general IRI. You're suggested workaround did work for my use case: (slurp (.getContent (clojure.java.io/resource "abcíd/foo.txt"))) but hopefully there would be more native/direct Clojure way to accomplish the same eventually. I don't know if java.net.IDN would be useful internally as a fix in clojure.java.io/resource — I'm assuming not since it wasn't added until Java 6.[1]
user=> (import 'java.net.IDN)
java.net.IDN
user=> (java.net.IDN/toASCII "/dir/déf")
"xn--/dir/df-gya"
user=> (java.net.IDN/toUnicode "xn--/dir/df-gya")
"/dir/déf"
[1]: http://docs.oracle.com/javase/6/docs/api/java/net/IDN.html
Hide
Andy Fingerhut added a comment -

Patch clj-1177-patch-v1.txt dated Mar 8 2013 is an attempt to solve this issue, in what I think may be a correct way. As specified in RFC 3986, when taking a Unicode string and making a URL of it, it should be encoded in UTF-8 and then each individual byte is subject to the %HH hex encoding. This patch reverses that to turn URLs into file names.

Tested on Mac OS X 10.6 with a command line like this (it doesn't work without the -Dfile.encoding=UTF-8 option on my Mac, probably because the default encoding is MacRoman):

% java -cp clojure.jar:path/to/resource -Dfile.encoding=UTF-8 clojure.main
user=> (require '[clojure.java.io :as io])
nil
user=> (io/resource "abcíd/foo.txt")
#<URL file:/Users/jafinger/clj/clj-ns-browser/resource/abc%c3%add/foo.txt>
user=> (slurp (io/resource "abcíd/foo.txt"))
"The quick brown fox jumped over the lázy dög!\n"

Show
Andy Fingerhut added a comment - Patch clj-1177-patch-v1.txt dated Mar 8 2013 is an attempt to solve this issue, in what I think may be a correct way. As specified in RFC 3986, when taking a Unicode string and making a URL of it, it should be encoded in UTF-8 and then each individual byte is subject to the %HH hex encoding. This patch reverses that to turn URLs into file names. Tested on Mac OS X 10.6 with a command line like this (it doesn't work without the -Dfile.encoding=UTF-8 option on my Mac, probably because the default encoding is MacRoman): % java -cp clojure.jar:path/to/resource -Dfile.encoding=UTF-8 clojure.main user=> (require '[clojure.java.io :as io]) nil user=> (io/resource "abcíd/foo.txt") #<URL file:/Users/jafinger/clj/clj-ns-browser/resource/abc%c3%add/foo.txt> user=> (slurp (io/resource "abcíd/foo.txt")) "The quick brown fox jumped over the lázy dög!\n"

People

Vote (1)
Watch (0)

Dates

  • Created:
    Updated: