Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ruby: Encode decode cleanup and behavior normalization #338

Merged
merged 1 commit into from May 15, 2015

Conversation

skippy
Copy link
Contributor

@skippy skippy commented May 3, 2015

Hey @cfallin

This PR may be a bridge too far. I am sending this PR as DO NOT MERGE; if you agree with the direction of the changs, I'll add some tests (in the old-style format) and documentation.

This pr started as just wanting to add common ruby syntactic sugar, specifically #to_json and #to_proto. Then I realized that the jruby .encode_json was broken (gist). But it also had the very nice and useful #to_h which the c-ext did not.

I tried consolidating the json encoding/decoding into pure ruby, using the stdlib JSON module, the JSON gem (using the c-extension), and Yajl. This ripped out a lot of c code, and a bit of java code, and I liked the simplification. Until I ran benchmarks. tl;dr: the native json library (haberman/upb) in this lib was consistantly 10x faster (gist). Yowzers....

I don't want to reopen the conversation in #121, and definitely not the tenor. However, I agree with the concerns that having two fully separate implementations is going to be hard to maintain. I'm finding more inconsitancies than I expected, and I'm just starting to scratch the surface. (e.g., another is that the jruby extension doesn't check STRING encoding during init. like the c-ext does). BUT it is alpha :)

If the JSON performance was closer, I would have submitted the PR with consolidated logic and have tried to make the case that moving some code into ruby is a good thing from a consistancy and maintenance perspective. However, the performance difference is so great I don't feel I should make that argument.

So here is a less-aggressive PR with fewer changes. Again, if this is moving in an acceptable direction, I'll finish it up.

TypedData_Get_Struct(_self, MessageHeader, &Message_type, self);

VALUE hash = rb_hash_new();
rb_gc_register_address(&hash);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed if I use rb_hash_new()? I sort of suspect not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, definitely not! rb_gc_register_address registers a static root with the garbage collector -- e.g., for a global variable. You're telling the GC that some random address on this thread's stack is a root to be kept alive. Probably not what you want :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha! yes; sorry, I should have dived into the docs a bit more on that one. thanks for the explanation.

@skippy skippy changed the title Encode decode cleanup and behavior normalization ruby: Encode decode cleanup and behavior normalization May 3, 2015
@skippy
Copy link
Contributor Author

skippy commented May 4, 2015

your call, but:

  • having a pure ruby method of #to_h that is used by both implementations is 'better' (hard time defining that... more maintainable? I'm not so sure... clearer? yes. dry-er? yes...but does that really matter?
  • but raw-c is 9x faster (gist)

but damn, if the java and jruby versions were a bit closer in performance, then the argument for moving most of the logic to ruby with a little java in the performant spots, and then a bit more of the C code could be made.

@haberman
Copy link
Member

haberman commented May 4, 2015

Until I ran benchmarks. tl;dr: the native json library (haberman/upb) in this lib was consistantly 10x faster (gist). Yowzers....

I would also mention that the upb JSON encoder is not very optimized (it uses snprintf() to format numbers, for example). Once this is optimized, the performance difference will probably be even greater than 10x.

By the way, if you are discovering inconsistencies between implementations, please add what you discovered to the unit tests. Not only does this make everything more robust, it makes refactorings like this much more tractable and lower-risk.

upb_msg_field_next(&it)) {
const upb_fielddef* field = upb_msg_iter_field(&it);
VALUE msg_value = layout_get(self->descriptor->layout, Message_data(self), field);
VALUE msg_key = ID2SYM(upb_fielddef_name(field));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE this makes the hash return symbols for keys. I think this is the right call for 2 reasons:

  • hell of a lot faster
  • primary reason: the message initializer only accepts symbolized keys

@skippy
Copy link
Contributor Author

skippy commented May 4, 2015

@haberman I am adding tests; on every one that I find and fix. Yet I'm eager to dive into tests a bit more as there are a bunch of edge cases that aren't being tested. I hope to start get to that later this week.

nice work on haberman/upb btw.

@cfallin
Copy link
Contributor

cfallin commented May 4, 2015

Hi @skippy,

Thanks for all the work here! I'll give a closer look tomorrow. But one high-level request, especially since the flurry of pull-requests and comments is becoming somewhat hectic: could you put together a design/summary document of sorts listing (i) the discrepancies you're finding between the MRI Ruby extension and JRuby extension, and (ii) the specific semantic and API changes you're proposing? It would be easier to read that first and then review the code as appropriate. A shared Google doc is fine, or a gist in Markdown, or whatnot.

It's been a while since I reviewed the JRuby extension and wrote the MRI one so the summary is useful for me personally, at least :-)

Thanks!

@haberman
Copy link
Member

haberman commented May 4, 2015

@skippy Thanks for adding the tests and thanks for your nice words about upb!

@skippy
Copy link
Contributor Author

skippy commented May 4, 2015

@cfallin of course; thank you for asking. An initial draft is here: https://gist.github.com/skippy/00c23e4e7bdac5124fe4.

No rush on my end!

@cfallin
Copy link
Contributor

cfallin commented May 5, 2015

Thanks for that -- I'll try to take a look in the next few days!

* make consistent between mri and jruby
* create a #to_h and have it use symbols for keys
* add #to_json and #to_proto helpers on the Google::Protobuf message classes
end

end
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these methods existed in both jruby and mri, but I think it is cleaner to just define them once

@skippy
Copy link
Contributor Author

skippy commented May 13, 2015

@cfallin what ya think?

@cfallin
Copy link
Contributor

cfallin commented May 14, 2015

The unification effort in general is really helpful here, thanks!

The only potential concern I have is in adding #to_proto and #to_json methods to message objects. It's definitely more convenient than having to invoke MyClass.encode(msg), but I recall a fairly spirited discussion around this when we designed the API and there were arguments to keep the message objects' method list "clean", i.e., only have overrides to Object methods. I realize the cat is partially out of the bag already with #to_h, and from my perspective today at least, the convenience methods do seem nice (no need to write out the message class's name explicitly; also allows for generic methods that take a message of any type).

@haberman, do you have any opinion here?

(To be clear, this is my only real potential concern, and I'm happy to take the rest of the changes!)

@skippy
Copy link
Contributor Author

skippy commented May 14, 2015

@cfallin very fair. #to_h is a common extension and one worth keeping, esp. after Rails, which I think popularized that helper (though I'm not totally clear on the etymology). #to_json has become more popular as a helper, usually used in the context of creating a data representation (such as a hash) and then, at the last minute, serializing it. It is common pattern in the ruby on rails and api realms. #to_proto is pure syntax sugar, and one that I liked from a few other ruby proto libraries.

As an FYI, and why this pattern has emerged, is that frameworks like Rails, and other ruby gems, have a common short-hand where you can use syntax such as render json: msg, and it will call msg.to_json. And that pattern holds for other helpers such as render proto: msg => msg.to_proto.

The ruby community loves its syntax sugar, that is for sure.

Except for #to_h, which makes retrieving keys and values from the object very convenient, I won't push hard (or perhaps I should say, push my luck) for the other two. :) Let me know what you all decide, and I'll go in and make the changes.

thanks!

@haberman
Copy link
Member

The unfortunate thing about adding methods on the message objects is that is mixes the namespace of field names (from the .proto file) with the namespace of helper/convenience methods. msg.to_json is a helper method but msg.to_jsonx is a message field. This is really unclear to a reader. If a message field is named to_json, the user will need to use msg["to_json"] instead. If we add any convenience methods in the future, we could break existing code for any users who used that method name as a protobuf field name.

All that said, it's becoming apparent that the convenience and convention of helper methods has such strong precedent in Ruby culture that we'd be remiss not to support it. It's clear that we should support #to_proto and #to_json. Others we should evaluate on a case-by-case basis.

@cfallin
Copy link
Contributor

cfallin commented May 15, 2015

OK, I'm in agreement too. Since we're still in alpha, I don't feel too bad about encroaching-by-eminent-domain into field namespace and grabbing a few method names. I agree that we should be careful about this in the future so that we don't break users (especially if we ever want to use a short/common/ambiguous name as a method).

The change itself looks good, so LGTM, and I'll merge!

cfallin added a commit that referenced this pull request May 15, 2015
ruby: Encode decode cleanup and behavior normalization
@cfallin cfallin merged commit a526605 into protocolbuffers:master May 15, 2015
@skippy
Copy link
Contributor Author

skippy commented May 15, 2015

@cfallin I found a bug mapping #to_h through the Map type. I'm tracking down right now. I'll beef up the test cases in that area as well. It will take me, possibly, EOW to get that patch in.

but shit. sorry about that.

@cfallin
Copy link
Contributor

cfallin commented May 15, 2015

@skippy OK, no worries, thanks for letting us know! If you'd like or if you need to pass it off, I'm happy to track it down given a repro, otherwise we're happy to take your fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants