Clojure

Make more datastructures serializable

Details

  • Type: Enhancement Enhancement
  • Status: Closed Closed
  • Resolution: Completed
  • Affects Version/s: None
  • Fix Version/s: Release 1.2
  • Component/s: None
  • Labels:
    None
  • Approval:
    Ok

Description

Make several more core data structures serializable via java-serialization.

The attached patch is a minimal patch- serialVersionUIDs are not created, nor are redundant "implements Serializable" statements added to classes who already implement Serializable from some other interface. This patch also does not make functions, lazy lists, atoms, references, etc. serializable, nor those data structures that depend upon any of the above (so tree maps and tree sets, which depend upon an ordering function, are not serializable, for example). Only those data structures which can be reasonably expected to be serialized by one program and deserialized by another are made serializable. Whether any of these extra functions should be made serializable is a different debate.

The classes made serializable are:
PersistentVector$Node (PersistentVector is already serializable)
PersistentQueue
PersistentList
PersistentList$EmptyList

Activity

Hide
Assembla Importer added a comment -

bhurt said: [file:dwWwVQmG0r37KxeJe5aVNr]: New diff

Show
Assembla Importer added a comment - bhurt said: [file:dwWwVQmG0r37KxeJe5aVNr]: New diff
Hide
Assembla Importer added a comment -

bhurt said: Just added new file serial2.diff, which is a new diff which adds a few more classes to what is serialized:
Cons
PersistentHashMap$ArrayNode
PersistentHashMap$BitmapIndexedNode
PersistentHashMap$HashCollisionNode

Note that serial2.diff implements all the changes serial.diff implements, plus these new classes. So it's a complete replacement.

Show
Assembla Importer added a comment - bhurt said: Just added new file serial2.diff, which is a new diff which adds a few more classes to what is serialized: Cons PersistentHashMap$ArrayNode PersistentHashMap$BitmapIndexedNode PersistentHashMap$HashCollisionNode Note that serial2.diff implements all the changes serial.diff implements, plus these new classes. So it's a complete replacement.
Hide
Assembla Importer added a comment -

dsg said: I am not sure that simply tagging Clojure's data structures as serializable is the best way to handle this issue. I have not taken too much time to think about this, but here are some of the issues I do not think have been addressed in this patch:

1. Given the shared structure of most of Clojure's data types, I think care should be taken to only serialize the logical snapshot of a structure, not the entire history.

2. I am not sure that these patches even work. For example, PersistentVector$Node is marked serializable, but part of its state is an AtomicReference<Thread>. It doesn't make any sense to serialize a reference to a thread. In fact, I don't think Thread is serializable at all. If the reference is set, this will throw a NotSerializableException, I believe.

In the end, I think the best (and perhaps only) way to handle serialization for Clojure's data structures is to use Externalizable.

Show
Assembla Importer added a comment - dsg said: I am not sure that simply tagging Clojure's data structures as serializable is the best way to handle this issue. I have not taken too much time to think about this, but here are some of the issues I do not think have been addressed in this patch: 1. Given the shared structure of most of Clojure's data types, I think care should be taken to only serialize the logical snapshot of a structure, not the entire history. 2. I am not sure that these patches even work. For example, PersistentVector$Node is marked serializable, but part of its state is an AtomicReference<Thread>. It doesn't make any sense to serialize a reference to a thread. In fact, I don't think Thread is serializable at all. If the reference is set, this will throw a NotSerializableException, I believe. In the end, I think the best (and perhaps only) way to handle serialization for Clojure's data structures is to use Externalizable.
Hide
Assembla Importer added a comment -

bhurt said: 1. I thought that the clojure data structures were just immutable- and that thus they didn't keep their history. If I'm wrong, this is a significant problem with clojure, as keeping the history of the data structure would interfere with garbage collecting.

2. The only times the edit reference is set to not null is when the vector is a transient vector- and I'm explicitly not allowing the serialization of transient vectors. But I think you could still hit it if some other thread has converted the vector to being a transient while the current thread was trying to serialize the non-transient version of the vector. Which at least explains why I haven't hit this bug yet. In any case, no serializing the edit field is probably a good idea in general.

I'll write a new patch to address these problems probably over the weekend.

Show
Assembla Importer added a comment - bhurt said: 1. I thought that the clojure data structures were just immutable- and that thus they didn't keep their history. If I'm wrong, this is a significant problem with clojure, as keeping the history of the data structure would interfere with garbage collecting. 2. The only times the edit reference is set to not null is when the vector is a transient vector- and I'm explicitly not allowing the serialization of transient vectors. But I think you could still hit it if some other thread has converted the vector to being a transient while the current thread was trying to serialize the non-transient version of the vector. Which at least explains why I haven't hit this bug yet. In any case, no serializing the edit field is probably a good idea in general. I'll write a new patch to address these problems probably over the weekend.
Hide
Assembla Importer added a comment -

cemerick said: Clojure data structures only retain the portions of their "predecessors" as required by the "current version" of the data structure. So, there should never be any dead nodes, etc. in the "entire history" of a data structure.

The AtomicReference<Thread> field is never set to null if I'm reading the code right – rather, it is set to an AtomicReference containing null (see the NOEDIT static field). I think the most straightforward way to make this serialization-friendly is to mark the edit field (all all similar fields in other data structures) transient, but default the field to the NOEDIT object so that deserialized instances have a sane value there.

Show
Assembla Importer added a comment - cemerick said: Clojure data structures only retain the portions of their "predecessors" as required by the "current version" of the data structure. So, there should never be any dead nodes, etc. in the "entire history" of a data structure. The AtomicReference<Thread> field is never set to null if I'm reading the code right – rather, it is set to an AtomicReference containing null (see the NOEDIT static field). I think the most straightforward way to make this serialization-friendly is to mark the edit field (all all similar fields in other data structures) transient, but default the field to the NOEDIT object so that deserialized instances have a sane value there.
Hide
Assembla Importer added a comment -

bhurt said: The edit member variable (which holds the atomic reference cell) needs to be set up correctly- if I'm reading the code correctly, it's shared among all copies of the data structure. So when we deserialize a persistent vector, we need to share the reference cell among all the nodes.

Actually, the issue of sharing is one I haven't thought about much. Consider the case where you serialize multiple versions of the same vector that share a lot of nodes. Do the deserialized vectors need to share the nodes as well?

Show
Assembla Importer added a comment - bhurt said: The edit member variable (which holds the atomic reference cell) needs to be set up correctly- if I'm reading the code correctly, it's shared among all copies of the data structure. So when we deserialize a persistent vector, we need to share the reference cell among all the nodes. Actually, the issue of sharing is one I haven't thought about much. Consider the case where you serialize multiple versions of the same vector that share a lot of nodes. Do the deserialized vectors need to share the nodes as well?
Hide
Assembla Importer added a comment -

cemerick said: ObjectOutputStream and ObjectInputStream are smart about using references to previously-written objects instead of duplicating serializations. So if you write three different large vectors that share a lot of internal structure to the same OOS, that shared structure will be written only once.

As for the edit field, I think you can just default it to NOEDIT. To be clear, that field should never be serialized to begin with, so whatever the default value is in the code will be the value used by deserialized instances.

Show
Assembla Importer added a comment - cemerick said: ObjectOutputStream and ObjectInputStream are smart about using references to previously-written objects instead of duplicating serializations. So if you write three different large vectors that share a lot of internal structure to the same OOS, that shared structure will be written only once. As for the edit field, I think you can just default it to NOEDIT. To be clear, that field should never be serialized to begin with, so whatever the default value is in the code will be the value used by deserialized instances.
Hide
Assembla Importer added a comment -

richhickey said: This patch does not apply. Please format your patch with git format-patch as described here:

http://clojure.org/patches

Thanks

Show
Assembla Importer added a comment - richhickey said: This patch does not apply. Please format your patch with git format-patch as described here: http://clojure.org/patches Thanks
Hide
Assembla Importer added a comment -

cemerick said: Taking this on, hoping to get it in under the wire for clojure 1.2 RC.

Show
Assembla Importer added a comment - cemerick said: Taking this on, hoping to get it in under the wire for clojure 1.2 RC.
Hide
Assembla Importer added a comment -

cemerick said: [file:bJZ074vWWr34fReJe5cbLA]

Show
Assembla Importer added a comment - cemerick said: [file:bJZ074vWWr34fReJe5cbLA]
Hide
Assembla Importer added a comment -

cemerick said: Patch attached. It further requires the fix in the patch attached to #331.

Show
Assembla Importer added a comment - cemerick said: Patch attached. It further requires the fix in the patch attached to #331.
Hide
Assembla Importer added a comment -
Show
Assembla Importer added a comment - stu said: [file:dZLUVGv2yr35hJeJe5cbCb]
Hide
Assembla Importer added a comment -

stu said: (My attachment simply updates Chas' patch to work with current master.)

Looks good, tests pass. One question: What is the use case for serializing a fn? Deserializing a fn will not work across VMs, as the fn internally refers to anonymous classes that no longer exist. Without knowing the use case I would be tempted to pull this out because of the confusion it could cause.

The only other minor interesting thing to note is that serializing a namespace just serializes the namespace, not its contents. This makes sense once you think about it, and is necessary for serializing structs and records, but might be worth a doc note.

Show
Assembla Importer added a comment - stu said: (My attachment simply updates Chas' patch to work with current master.) Looks good, tests pass. One question: What is the use case for serializing a fn? Deserializing a fn will not work across VMs, as the fn internally refers to anonymous classes that no longer exist. Without knowing the use case I would be tempted to pull this out because of the confusion it could cause. The only other minor interesting thing to note is that serializing a namespace just serializes the namespace, not its contents. This makes sense once you think about it, and is necessary for serializing structs and records, but might be worth a doc note.
Hide
Assembla Importer added a comment -

cemerick said: I've mentioned in #clojure a couple of times (and it's worth adding here "for the record") that all of the support for Java serialization should be and now is only present with the use case in mind of short term storage, not long term archiving – the class hierarchy implementing all of the standard data structures will change, and nothing short of a print-dup-esque mechanism (which could be layered on top of Serializable for interop purposes) will ensure longevity of serialized Clojure objects. More concretely, think sessions, RMI, etc.

Anyway: functions have been serializable for a while now, so I added tests to that end. As you said, the serializability of functions has a much narrower useful lifespan at the moment. I personally don't know of a use case for same-VM serialization of functions; I wonder if Rich does, or if that was just as far as he got in enabling deeper support for fn serialization. If the latter, then yes, we should probably pull it out; I can squash that change and the metadata tweak you added if you like.

I don't know that any documentation of how namespaces are serialized will be anything other than confusing – it's purely an implementation detail that essentially no one will (or should) ever care about. And, FWIW, the automatic interning of namespaces is essentially implied by the docs on create-ns anyway.

Show
Assembla Importer added a comment - cemerick said: I've mentioned in #clojure a couple of times (and it's worth adding here "for the record") that all of the support for Java serialization should be and now is only present with the use case in mind of short term storage, not long term archiving – the class hierarchy implementing all of the standard data structures will change, and nothing short of a print-dup-esque mechanism (which could be layered on top of Serializable for interop purposes) will ensure longevity of serialized Clojure objects. More concretely, think sessions, RMI, etc. Anyway: functions have been serializable for a while now, so I added tests to that end. As you said, the serializability of functions has a much narrower useful lifespan at the moment. I personally don't know of a use case for same-VM serialization of functions; I wonder if Rich does, or if that was just as far as he got in enabling deeper support for fn serialization. If the latter, then yes, we should probably pull it out; I can squash that change and the metadata tweak you added if you like. I don't know that any documentation of how namespaces are serialized will be anything other than confusing – it's purely an implementation detail that essentially no one will (or should) ever care about. And, FWIW, the automatic interning of namespaces is essentially implied by the docs on create-ns anyway.
Hide
Assembla Importer added a comment -

cemerick said: From an email exchange between myself, S. Halloway, and Rich addressing the serializability of functions:

Rich: If the code is AOT compiled and deployed to both VMs, they will have the same classnames for the fns.

S. Halloway: But serialization doesn't know this, and will happily serialize runtime-created fns. If the mangled names happen to exist in the new VM, you would end up calling different fns...

Rich: As you say, for consenting adults.

Chas: I forgot about the (pleasant) AOT caveat. Yessir, that's useful.

Is there any way to tell where a class originated from – a classfile vs. runtime codegen? If so, I could put the necessary check-and-throw in so that non-AOT'd functions would fail to serialize.

Rich: This really isn't different from the versioning problem inherent with the data structures as well, it's just that the unit of change is a compilation, not a release. You don't really protect anything with the AOT check since AOT'ed in two different compilations is potentially different, but might have same name.

I don't think we should spend any time trying to protect anyone on this. We can just document the behavior and limitations. Serializing closures is fancy stuff for experts in any case.

Show
Assembla Importer added a comment - cemerick said: From an email exchange between myself, S. Halloway, and Rich addressing the serializability of functions: Rich: If the code is AOT compiled and deployed to both VMs, they will have the same classnames for the fns. S. Halloway: But serialization doesn't know this, and will happily serialize runtime-created fns. If the mangled names happen to exist in the new VM, you would end up calling different fns... Rich: As you say, for consenting adults. Chas: I forgot about the (pleasant) AOT caveat. Yessir, that's useful. Is there any way to tell where a class originated from – a classfile vs. runtime codegen? If so, I could put the necessary check-and-throw in so that non-AOT'd functions would fail to serialize. Rich: This really isn't different from the versioning problem inherent with the data structures as well, it's just that the unit of change is a compilation, not a release. You don't really protect anything with the AOT check since AOT'ed in two different compilations is potentially different, but might have same name. I don't think we should spend any time trying to protect anyone on this. We can just document the behavior and limitations. Serializing closures is fancy stuff for experts in any case.
Hide
Assembla Importer added a comment -

stu said: Updating tickets (#281, #331)

Show
Assembla Importer added a comment - stu said: Updating tickets (#281, #331)
Hide
Assembla Importer added a comment -

bhurt said: I just wanted to say Thank You to Chas for picking this up.

Show
Assembla Importer added a comment - bhurt said: I just wanted to say Thank You to Chas for picking this up.
Hide
Assembla Importer added a comment -

cemerick said: @Brian: No problem, thanks for getting it started.

Show
Assembla Importer added a comment - cemerick said: @Brian: No problem, thanks for getting it started.

People

Vote (0)
Watch (0)

Dates

  • Created:
    Updated:
    Resolved: