Strings should be unicode by default, if bytes is required make it explicit #60
Comments
Strings returned from JavaScript and strings passed to JavaScript should probably always be UTF-8, and ApplicationSettings.string_encoding should not be used for these conversions (not 100% sure about that, need to test it). Original comment by |
Starting with the next commit, all encoding/decoding of strings will be kept in one file, string_utils.pyx. This will make it easier to make unicode the default string type, but you still have to check every call to CefToPyString(), CharToPyString() and VoidPtrToStr() to see what type the string is assigned to. The documentation on the wiki pages needs to be updated: "str" types need to be replaced with "unicode". That's not all: there are other fixed strings in the code. If we decide to use unicode then we must stick to it, and all strings passed to Python should be unicode. This is going to be a bit of a nightmare. We would be forced to use the u"" syntax (Python 3.0 to 3.2 disallow it, but Cython allows it, so it is a bit easier to write portable code for both Py2/Py3). But what if we pass a normal byte string instead in Py2? Then user code is in for hell: concatenating a byte string with a unicode string makes Py2 decode the bytes implicitly with the 'ascii' codec, which raises UnicodeDecodeError for non-ASCII data, and in Py3 it throws a TypeError such as "can't concat bytes to str". Original comment by |
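The mixing problem from the comment above can be reproduced in a few lines. This is only a sketch for illustration, not cefpython code; `try_concat` is a hypothetical helper and the sample strings are made up.

```python
# Hypothetical demo helper, not part of cefpython. Runs on Py2.7 and Py3.
def try_concat(left, right):
    """Concatenate two strings; return the exception name if that fails."""
    try:
        return left + right
    except (TypeError, UnicodeDecodeError) as exc:
        # Py3 refuses to mix bytes and str (TypeError).
        # Py2 decodes the bytes with 'ascii' first, which fails on
        # non-ASCII data (UnicodeDecodeError).
        return type(exc).__name__


# Same type on both sides: always fine.
same_type = try_concat(u"abc", u"def")

# UTF-8 encoded bytes mixed with unicode: fails on both Py2 and Py3,
# only with a different exception type.
mixed = try_concat(b"caf\xc3\xa9", u" au lait")
```

Note that pure-ASCII byte strings concatenate silently in Py2, so this kind of bug tends to show up only once real user data arrives.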
More discussion on making unicode default in Py27 here: Original comment by |
Marking as Won't Fix. Use Python 3 if you need unified unicode strings. Fixing this would break backwards compatibility. Original comment by
|
Right now in Py2 byte strings are passed to Python, but in Py3 unicode strings are passed. This will cause trouble in the future when upgrading existing Python code from Py2 to Py3.
If you write code in Py2 that decodes the byte strings, it will break in Py3: a Py3 str has no decode() method, so the call raises AttributeError. Calling decode() on a unicode string doesn't make sense in Py2 either. What Python 2 does is encode the string to a byte string with the 'ascii' codec and then decode it back with the requested codec (e.g. 'utf-8'), and the first step raises UnicodeEncodeError for any non-ASCII character.
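One way user code could protect itself from this is to decode only when the value really is a byte string. The following is a minimal sketch under that assumption; `decode_if_bytes` is a hypothetical helper (not a cefpython API) and UTF-8 is an assumed default encoding.

```python
# Hypothetical helper, not part of cefpython. Runs on Py2.7 and Py3.
def decode_if_bytes(value, encoding="utf-8"):
    """Return value as a unicode string, decoding only byte strings."""
    if isinstance(value, bytes):
        # In Py2 `bytes` is an alias of `str`, so this branch covers
        # the byte strings that are passed to Python there.
        return value.decode(encoding)
    # Already unicode (the Py3 case): return unchanged, never call decode().
    return value


# The same page title as a Py2-style byte string and a Py3-style unicode string.
title_as_bytes = b"Za\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"
title_as_text = u"Za\u017c\u00f3\u0142\u0107"

same = decode_if_bytes(title_as_bytes) == decode_if_bytes(title_as_text)
```

With such a wrapper the calling code does not need to know which Python version produced the string.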
Strings that are paths to files should be byte strings in Py2, otherwise you might get into trouble; see this post: https://groups.google.com/d/msg/cython-users/Q1_jyOX4tVM/f8vsYDuUWL0J
In Cython code, do an explicit conversion to 'bytes' when you do not want a unicode string.
See the ApplicationSettings.unicode_to_bytes_encoding option. This option is also used when converting bytes to unicode, so its name is confusing and it should be renamed.
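For file paths, the rule above (bytes in Py2, explicit conversion where unicode is not wanted) could look roughly like this in plain Python. This is a sketch, not the Cython code from the project; `to_native_path` is a hypothetical name, and falling back to the filesystem encoding is an assumption.

```python
# Hypothetical helper, not part of cefpython. Runs on Py2.7 and Py3.
import sys

PY2 = sys.version_info[0] == 2


def to_native_path(path, encoding=None):
    """Return path as a byte string on Py2 and as a unicode str on Py3."""
    # Assumption: use the filesystem encoding unless the caller overrides it.
    encoding = encoding or sys.getfilesystemencoding() or "utf-8"
    if PY2:
        # Py2: explicit conversion to bytes when a unicode path comes in.
        if isinstance(path, bytes):
            return path
        return path.encode(encoding)
    # Py3: explicit conversion to str when a bytes path comes in.
    if isinstance(path, bytes):
        return path.decode(encoding)
    return path


log_path = to_native_path(u"/tmp/debug.log")
```

On either version the result is the native `str` type, which is what the file APIs of that version handle most predictably.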
How do we know which encoding should be used when decoding the JavaScript strings? Web pages might have encodings other than UTF-8. Is there an API in CEF to get the encoding of the current Frame? Setting a fixed encoding through application settings doesn't make much sense, as different websites might use different encodings. Still, the encoding of the strings that users pass to cefpython should be configurable through some option, as it might differ from the encoding that the website in the current context uses.
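If a per-frame encoding were available, the choice between it and the application-wide option could be sketched as below. Everything here is hypothetical: `resolve_encoding`, its parameters and the UTF-8 default are assumptions for illustration, and whether CEF exposes a frame's encoding is exactly the open question above.

```python
# Hypothetical sketch, not part of cefpython. Runs on Py2.7 and Py3.
import codecs

DEFAULT_ENCODING = "utf-8"  # assumed last-resort default


def resolve_encoding(frame_encoding=None, app_encoding=None):
    """Pick the first usable encoding: frame, then app setting, then default."""
    for candidate in (frame_encoding, app_encoding, DEFAULT_ENCODING):
        if not candidate:
            continue
        try:
            # codecs.lookup() raises LookupError for unknown codec names
            # and returns the normalized name otherwise.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return DEFAULT_ENCODING


# An unknown frame encoding falls through to the application setting.
chosen = resolve_encoding("no-such-codec", "UTF-8")
```

The encoding used for strings that users pass in could be resolved the same way from a separate option, so the two directions stay independent.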
Take a look at this video explaining why "Unicode is poison to python performance":
http://www.youtube.com/watch?v=oK3EQH5Wdqo&feature=youtu.be&t=24m26s
Don't stick unicode everywhere.
Original issue reported on code.google.com by
czarek.t...@gmail.com
on 4 Jun 2013 at 2:10