The wiki doesn't mention calling reference(). It is completely out of date as it mentions Serializer having a create method!
Comment #1
Posted on Nov 30, 2012 by Swift HorseHi Nate,
While I totally agree that current semantics of references should be documented, I think something could be done in the core Kryo logic to improve the situation and make using them less error-prone for users and serializer writers. I haven't looked too deep into it yet, but my impression was that Kryo registers references too late (after object was completely read). This leads to problems in situations, where objects have sub-objects, etc.
I think something could be done, if Kryo would "pre-register" objects very early (i.e. almost after reference was read from a stream) and later patch them with proper objects. Or something along these lines. What do you think?
-Leo
Comment #2
Posted on Nov 30, 2012 by Grumpy CamelI'm all for improving it if we can come up with something that works better. :)
The complexity is mostly during deserialization because Serializer#read() both creates the new object and may use Kryo to deserialize other objects that may reference the new object.
I'll list some use cases. I hope I get the current behavior correct!
1) Serializer#read() creates a new object and returns it without deserializing other objects via Kryo. Current behavior: Nothing special needs to be done, Kryo detects reference() was not called and stores a reference to the object.
2) read() creates a new object then deserializes other objects via Kryo. The other objects are sure to never reference the new object. Current behavior: Nothing special needs to be done, Kryo detects reference() was not called and stores a reference to the object.
3) read() creates a new object then deserializes other objects via Kryo. The other objects may reference the new object. Current behavior: reference() has to be called before deserialization of the other objects.
4) read() deserializes other objects via Kryo then creates a new object (eg DefaultSerializers.TreeMapSerializer, but update first). The objects deserialized first cannot reference the new object. Current behavior: other objects are deserialized via Kryo, then the new object created, then see use cases 1, 2, or 3.
Are there other use cases? Let's list them all so we can see where the current behavior is less than ideal.
These seemed to be solved pretty well currently, though there is the downside that if you fail to call reference() for an object and then try to serialize objects that reference the object it will fail. At one point calling reference() before using Kryo to deserialize other objects was enforced with an exception, but this ended up getting removed. I don't remember the reason why, but I think it was too clunky for use case #4, which required references to be disabled temporarily.
I didn't understand what you meant by "pref-register objects very early and later patch them"?
Comment #3
Posted on Nov 30, 2012 by Grumpy Camel"then try to serialize objects" in second to last paragraph should be "then try to deserialize objects".
Comment #4
Posted on Nov 30, 2012 by Swift HorseI didn't understand what you meant by "pre-register objects very early and later patch them"?
IIRC, most of the problems users recently reported regarding references were due to the fact that numbering of references (i.e. assignment of ids to references) when de-serializing objects went wrong and all subsequent reference ids got screwed.
According to my analysis, for this to happen it is not always necessary that sub-objects reference the parent object. It is enough that this parent object does not call reference before reading sub-objects.
As we know, it is too easy to forget calling reference() in a custom serializer at the right place :-(
More over, sometimes a parent object cannot be constructed before all sub-objects are read. In such cases, I used the trick where at the beginning of read() I first "pre-registered" a fake object instead of this parent object, which is not constructed yet. This pre-registration preserved a proper assignment of reference ids for sub-objects. Later, when I read all those sub-objects and constructed the real parent object, I changed/redefined the pre-registered mapping from the parent object reference id to a fake parent object to point to a real parent object. And this last step is what I called "patching" or more precisely back-patching. I use this terminology because it is similar to how you generate code for gotos that refer to forward label "L", whose address is not known at the time when you process a "goto L" statement.
I hope I managed to explain what I had in mind. With all that, my impression is that Kryo could do this pre-registration automatically and after object is read it could "patch" the mapping to point to a properly constructed object, if it has not happened yet.
Comment #5
Posted on Nov 30, 2012 by Swift HorseTo give you an example of what I mean: https://github.com/romix/scala-kryo-serialization/blob/master/src/main/scala/com/romix/scala/serialization/kryo/ScalaCollectionsSerializer.scala
Have a look at the ScalaCollectionSerializer#read method and tricks I use there to pre-register a fake object and later replace the mapping with a proper object.
Comment #6
Posted on Nov 30, 2012 by Swift HorseBTW, the reason for producing a wrong assignment of ids to references when reading is easy to explain: to be correct, this assignment has to happen in exactly the same way as we do reference ids assignments during serialization. And this means, assign an id as soon as you read a reference header from the stream, even before the whole object is read.
But this is not what we do now: Currently we first read the whole object and only then assign a reference id and register a mapping from this id to the object.
Comment #7
Posted on Nov 30, 2012 by Grumpy CamelAh, I see, your proposal is to hold the reference ID at the appropriate time, then later associate it with the correct object.
Let me take a run at describing the current system, as a refresher for myself if anything. When serializing, Kryo writes the reference ID just before calling Serializer#write(). When deserializing, Kryo reads the reference ID just before calling Serializer#read(). So far, so good, these will always match up.
Next, the reference ID that was read is stored in Kryo#readReferenceIds (a stack) because we don't yet have an object to associate with the ID. reference() associates an object with the top ID in the stack. If reference() is not called, Kryo calls it immediately after Serializer#read(). If reference() is not called and Kryo is used to deserialize child objects, the parent reference ID remains unassociated in the readReferenceIds stack, the child reference IDs go on the stack and are popped off, then the parent reference ID is popped off and associated with the right object.
I may be missing something, but there shouldn't be a way for the reference IDs to get off just because reference() is not called. The only ramification of not calling reference() when using Kryo to deserialize child objects should be that ReferenceResolver#getReadObject() will be called with an ID that it has not been added with addReadObject(). In fact, I believe in this case we can have getReadObject() throw an exception saying that an unknown reference ID has been encountered.
You are saying that for use case #2, reference() must be called even if the child objects don't reference the parent? Can you show this in a unit test? I did look at the Scala but it isn't clear as to why you are doing what you are doing. Also, Scala is rough on my poor eyes. ;)
Comment #8
Posted on Nov 30, 2012 by Swift HorseNate, have you followed this thread on the mailing list: https://groups.google.com/forum/?fromgroups=#!searchin/kryo-users/leo/kryo-users/Eu5V4bxCfws/YjHGa59x5QwJ
This is an example of what can happen and I gave a rather long explanation there. Please have a look at the original example and explanations. I think this example is rather reproducible - at least it was for me. And stepping it through under debugger makes some things rather clear.
Comment #9
Posted on Nov 30, 2012 by Swift HorseI did look at the Scala but it isn't clear as to why you are doing what you are doing.
The idea is that in Scala, different from Java, in some cases I cannot construct a collection object before I start reading its elements. I can construct it only after I've read all the elements. But I have to call reference() anyway before reading sub-elements to preserve proper assignments of reference ids. Since a real collection does not exist yet, I pass a fake object. Later I refine this mapping with a proper collection object.
Comment #10
Posted on Nov 30, 2012 by Grumpy CamelYes, I read all the messages on the list, though I don't always have time to respond.
From your explanation on that thread, "This nested call tries to read a reference to the parent list and assign a reference id to it." If deserialization of a child object can have a reference to the parent, then reference() must be called for the parent before deserialization of child objects.
I think this is complicated enough that I need some code that shows the problem. We can whip up some fake list classes that behave like Scala lists, where the items have to be deserialized first. I think this would make for the simplest and easiest to follow test. I'll do this today.
Comment #11
Posted on Nov 30, 2012 by Swift HorseThe code provided by the reporter of the issue on that thread is such an example. Just take his serializer and his code for reproducing the problem. You'll see that it results in exactly the same error that he described. Then try to look at the debug output and step through. You'll see pretty quickly what goes wrong.
Comment #12
Posted on Nov 30, 2012 by Swift HorseThe real problem is most likely the situation where we have nested calls of readObject, e.g. when a sub-object is read using readObject inside Serializer#read. In such cases, if reference() is not called before the nested call, reference ids will get screwed, because the reference for the nested object will get the id that should have been assigned to the outer object.
Comment #13
Posted on Dec 4, 2012 by Grumpy CamelI've added a test to ReferenceTest that shows the issue and committed a fix. By making ReferenceResolver#nextReadId reserve the ID instead of just returning the next one, everything works like it should. As described above, child objects can be read before or after reference() is called, with the only ramification being that child objects read before reference() won't be able to reference the parent object.
I'll leave this issue open as it is a reminder to clean up the wiki. We can still carry on discussion about the references stuff here if needed though. :)
Comment #14
Posted on Dec 4, 2012 by Swift HorseHi Nate,
Good that you looked into this issue and fixed at least the problem mentioned on the mailing list. I'll perform some further tests during this week using my private use-cases that had similar issues and let you know if they work properly now.
In the meantime, I have a question: We have now in Kryo#reference() this snippet:
int id = readReferenceIds.pop(); if (id != NO_REF) referenceResolver.setReadObject(id, object);
As far as I can see, it is not quite idempotent, i.e. if in I call it (may be unintentionally) multiple times from the same serializer for the same object, it may result in a strange behavior due to a call of pop(). IMHO, it would be safer, if multiple calls of reference() when processing the same object would lead to the same results. Basically, would should try to avoid by all means that readReferenceIds and referenceResolver.seenObjects get misaligned and inconsistent with each other. IIRC, in some of my test-cases this was a reason of some problems.
Another question: Do we still need to call reference() in the read method of a serializer for an enclosing object even if sub-objects do not refer to their enclosing object? If we don't call reference(), e.g. because it was forgotten, what happens? It looks like nextReadId is not called then, or? Or reference() is called automatically by e.g. Kryo#readClassAndObject. But this may happen only after the whole enclosing object was read, i.e. it may happen after sub-objects and their references were processed and reference() was called for them. I'm still wondering if it is always OK in such cases that reference() for the enclosing object is called after reference() for sub-objects. Doesn't it change the order as compared to the write order? IMHO, it would be safer if in the following code that we use everywhere, we would invoke reference(fakeObj) automatically right after readReferenceOrNull call and remember the id, because this would be exactly in the same order, as during writes. This would reserve the id at the right time. And later, we can change the call reference(object) in the last line to referenceResolver.setReadObject(id, object) thus creating a proper mapping to a fully build object.
if (references) { int stackSize = readReferenceOrNull(input, type, false); if (stackSize == REF) return readObject; object = registration.getSerializer().read(this, input, type); if (stackSize == readReferenceIds.size) reference(object);}
Of course, with this proposed change, we need also to take care of potential calls to reference() from Serializer#read(). Eventually, Serializer#read needs to know the id of the reference belonging to the object being read, so that it can set a mapping from this id to a proper object.
Overall, I think that in the ideal case, serializer should only call reference() explicitly if sub-objects may refer to the enclosing object. In all other cases Kryo should do everything automatically. This would be the safest approach.
Are my concerns unjustified and I overcomplicate things? I just seem to remember that when I looked into different failing tests I noticed that reference handling problems turned out to be more serious and deeper than they seemed at the first glance.
Comment #15
Posted on Dec 5, 2012 by Grumpy CamelIt is true that calling reference() multiple times will cause an extra pop(), which will crash. This will crash every time though, it will never fail silently. I agree it would be better if reference() could be called multiple times without a problem, but in this case I don't see any easy way to guard against it. Also, I think it would be quite abnormal to call reference() twice.
Do we still need to call reference() in the read method of a serializer for an enclosing object even if sub-objects do not refer to their enclosing object?
No, you should only have to call reference() if you are reading child objects that could reference the parent. If the child objects won't reference the parent, everything will still work, even if the child objects call reference(). It is definitely a tricky piece of code, but I think the implementation is sound. It works basically as you describe: the ID is read from the bytes, a null is stored for the ID in ReferenceResolver#nextReadId(), and later the null is replaced with the actual object for that ID.
You are right that the parent object reference at serialization time may be stored sooner than at read time. This is because for serialization we always have the object and can store it first. During read time we have to create the object, which may involve reading child objects first. However by definition the child objects can't reference the parent object if they are required to create it. If they did, they would appear in the serialized bytes before the parent object (ie, they wouldn't be child objects).
I don't think we ever need to let the serializer know about the reference IDs. Ideally we can keep it hidden and just have a simple rule: if a serializer needs to deserialize child objects that may reference the parent, call reference(). Otherwise, don't worry about it.
I look forward to hearing if your tests are solved by the latest fix. :)
Comment #16
Posted on Dec 6, 2012 by Grumpy CamelNote I've done a couple commits to simplify the stock ReferenceResolvers slightly. Hopefully this doesn't cause any bugs. :)
Comment #17
Posted on Dec 11, 2012 by Swift HorseHi Nate,
I finally found some time to test your fixes. It seems to work fine for my tests and also for my Scala tests in akka-kryo-serialization! Now I don't need those strange workarounds for Scala collections that I mentioned earlier in this thread.
Good job!
Comment #18
Posted on Dec 12, 2012 by Grumpy CamelFantastic news! The references code is some hairy stuff, but I think the current approach is sound and the API is kept nice -- just call reference() before deserializing children, otherwise don't worry about it. Hopefully no more issues come up because changes to this part of the code causes gray hairs. :D
Status: Accepted
Labels:
Type-Defect
Priority-Medium