<< Back to previous view

[CLJ-1766] Serializing+deserializing lists breaks their hash Created: 21/Jun/15  Updated: 04/Aug/15

Status: Open
Project: Clojure
Component/s: None
Affects Version/s: Release 1.6, Release 1.7
Fix Version/s: Release 1.8

Type: Defect Priority: Major
Reporter: Chris Vermilion Assignee: Andrew Rosa
Resolution: Unresolved Votes: 0
Labels: interop
Environment:

Tested on OS X 10.10.3, Clojure 1.6.0, Java 1.8.0_45-b14, but don't think this is a factor.


Attachments: Text File CLJ-1766.patch     File serialization_test_mod.diff     File serialize-test.clj    
Approval: Vetted

 Description   

clojure.lang.PersistentList implements Serializable but a serialization roundtrip results in a hash of 0. This appears to be caused by ASeq marking its _hash property as transient; other collection base classes don't do this. I don't know if there is a rational for the distinction, but the result is that serializing, then deserializing, a list results in its _hash property being 0 instead of either its previous, calculated value, or -1, which would trigger recalculation.

This means that any code that relies on a list's hash value can break in subtle ways.

Examples:

(import '[java.io ByteArrayInputStream ByteArrayOutputStream ObjectInputStream ObjectOutputStream])

(defn obj->bytes [obj]
  (let [bytestream (ByteArrayOutputStream.)]
    (doto (ObjectOutputStream. bytestream) (.writeObject obj))
    (.toByteArray bytestream)))

(defn bytes->obj [bs]
  (.readObject (ObjectInputStream. (ByteArrayInputStream. bs)))

(defn round-trip [x] (bytes->obj (obj->bytes x)))
user=> (hash '(1))
-1381383523
user=> (hash (round-trip '(1)))
0
user=> #{'(1) (round-trip '(1))}
#{(1) (1)}
user=> (def m {'(1) 1})
#'user/m
user=> (= m (round-trip m))
true
user=> (= (round-trip m) m)
false

Approach: Use 0 as the "uncomputed" hash value, not -1. The auto initialized value on serialization construction will then automatically trigger a re-computation.

Alternate approaches:

  • Implement a readObject method that sets the _hash property to -1. This is somewhat complicated in the face of subclasses, etc.

Note: Also need to consider other classes that use -1 instead of 0 as the uncomputed hash value: APersistentMap, APersistentSet, APersistentVector, PersistentQueue - however I believe in those cases they are not transient and thus may avoid the issue. Perhaps they should be transient though.



 Comments   
Comment by Chris Vermilion [ 21/Jun/15 1:10 PM ]

Yikes, sorry about the formatting above; JIRA newbie and looks like I can't edit. Also, I should have noted that the above functions require an import: (import '[java.io ByteArrayInputStream ByteArrayOutputStream ObjectInputStream ObjectOutputStream]).

Comment by Alex Miller [ 22/Jun/15 7:55 AM ]

Yes, this is a bug. My preference would be to use 0 to indicate "not computed" and thus sidestep this issue. The downside of changing from -1 to 0 is that serialization/deserialization is then broken between versions of Clojure (although maybe if it was already broken, that's not an issue). -1 is used like this for lazily computed hashes in many classes in Clojure so the issue should really be fixed across the board.

Comment by Alex Miller [ 22/Jun/15 8:22 AM ]

There are some serialization round-trip tests in clojure.test-clojure.serialization - seems like those should be updated to compare the .hashCode and hash before/after, which should catch this. I attached a diff (not a proper patch) with that change just as a demonstration, which indeed produces a bunch of failures. That should be incorporated into any fix, possibly along with other failures.

Comment by Alex Miller [ 31/Jul/15 9:04 AM ]

Also, don't worry about crediting me on that test change, please just incorporate it into whatever gets made here.

Comment by Andrew Rosa [ 31/Jul/15 9:31 AM ]

Alex Miller I hope to work on this issue this weekend. There are some conflict with the alpha release schedule?

Comment by Alex Miller [ 31/Jul/15 9:47 AM ]

No conflict - when it's ready we'll look at it.

Comment by Andrew Rosa [ 04/Aug/15 8:21 AM ]

As Alex Miller recomended, I followed the change-default-to-zero path.
Not only it sidesteps the serialization issue, it looks more aligned with
the semantics of transient. Of course, there is no guarantees
about how different JVM implementations will deal with it - sometimes
they will be serialized, sometimes they will not.

So, together with the patch I've made some manual tests between some versions. The script used is attached for further inspection.

Running on a 1.6 REPL has shown on the original described issue:

serialize-test=> *clojure-version*
{:major 1, :minor 6, :incremental 0, :qualifier nil}
serialize-test=> (def f "seq-1-6.dump")
#'serialize-test/f
serialize-test=> (def x {'(1) 1})
#'serialize-test/x
serialize-test=> (dump-bytes f (serialize x))
nil
serialize-test=> (deserialize (slurp-bytes f))
{(1) 1}
serialize-test=> (hash x)
202919476
serialize-test=> (hash (deserialize (serialize x)))
202919476
serialize-test=> (hash (deserialize (slurp-bytes f)))
-1619826937
serialize-test=> (= x (deserialize (slurp-bytes f)))
true
serialize-test=> (= (deserialize (slurp-bytes f)) x)
false

Running on 1.8 master. Despite of the = behavior looking ok, the
hashes are different between roundtrips, like we saw on 1.6:

serialize-test=> *clojure-version*
{:major 1, :minor 8, :incremental 0, :qualifier "master", :interim true}
serialize-test=> (def f "seq-1-8-master.dump")
#'serialize-test/f
serialize-test=> (def x {'(1) 1})
#'serialize-test/x
serialize-test=> (dump-bytes f (serialize x))
nil
serialize-test=> (deserialize (slurp-bytes f))
{(1) 1}
serialize-test=> (hash x)
202919476
serialize-test=> (hash (deserialize (serialize x)))
202919476
serialize-test=> (hash (deserialize (slurp-bytes f)))
-1619826937
serialize-test=> (= x (deserialize (slurp-bytes f)))
true
serialize-test=> (= (deserialize (slurp-bytes f)) x)
true

Running on 1.8 after patch the hashes are correctly computed on both cases:

serialize-test=> *clojure-version*
{:major 1, :minor 8, :incremental 0, :qualifier "master", :interim true}
serialize-test=> (def f "seq-1-8-patch.dump")
#'serialize-test/f
serialize-test=> (def x {'(1) 1})
#'serialize-test/x
serialize-test=> (dump-bytes f (serialize x))
nil
serialize-test=> (deserialize (slurp-bytes f))
{(1) 1}
serialize-test=> (hash x)
202919476
serialize-test=> (hash (deserialize (serialize x)))
202919476
serialize-test=> (hash (deserialize (slurp-bytes f)))
202919476
serialize-test=> (= x (deserialize (slurp-bytes f)))
true
serialize-test=> (= (deserialize (slurp-bytes f)) x)
true

Please note I've dumped each serialized data into different files, so I could test the interoperability between those versions. What I found:

  • 1.6 serialization is already incompatible with 1.8, on both directions;
  • 1.8 data could be exchange between master and patched versions, not affecting version behavior (hashes are computed only on patched REPL).

Did I miss something or everything looks correct for you too? I'm open to do some more exhaustive testing if anyone could give me some directions about where to explore.





Generated at Tue Aug 04 14:58:50 CDT 2015 using JIRA 4.4#649-r158309.