Strings should be unicode by default, if bytes is required make it explicit #60

Closed
GoogleCodeExporter opened this issue Aug 28, 2015 · 4 comments

@GoogleCodeExporter

Right now, in Py2 byte strings are passed to Python, but in Py3 unicode strings are passed. This will cause trouble when upgrading existing Python code from Py2 to Py3.

If you write code in Py2 that decodes the byte strings, it will break in Py3. Calling decode on a unicode string doesn't make sense. In Py2, Python first encodes the unicode string to a byte string using the 'ascii' codec and then decodes it back using the requested codec (for example 'utf-8'), which raises a UnicodeEncodeError for any non-ASCII text. In Py3, str has no decode method at all, so the call raises an AttributeError.
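
A minimal sketch of that failure and one way to guard against it, assuming Python 3. The helper name `to_text` is hypothetical and not part of cefpython; it only illustrates checking the type before decoding.

```python
def to_text(value, encoding="utf-8"):
    """Return a text string for either bytes or text input.

    Hypothetical helper for illustration; not a cefpython API.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# On Python 3, str has no decode() method, so Py2-style code fails here.
try:
    "café".decode("utf-8")
    decode_error = None
except AttributeError as exc:
    decode_error = exc

print(type(decode_error).__name__)
print(to_text(b"caf\xc3\xa9") == to_text("café"))
```

Code written this way accepts whichever string type it is handed, so it would not depend on which type the binding passes.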

Strings that are paths to files should be byte strings in Py2, otherwise you might get into trouble; see this post: https://groups.google.com/d/msg/cython-users/Q1_jyOX4tVM/f8vsYDuUWL0J

In Cython code, do an explicit conversion to 'bytes' when you do not want a unicode string.
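
Such an explicit conversion could look roughly like this, sketched in plain Python 3; a Cython version would follow the same shape. The function name `to_bytes` and the 'utf-8' default are assumptions for illustration, not existing cefpython code.

```python
def to_bytes(value, encoding="utf-8"):
    """Return bytes for text or bytes input (hypothetical helper)."""
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Example: a file path that must reach the lower layer as bytes.
path = to_bytes("données/index.html")
print(type(path).__name__, len(path))
```

Rejecting other types with a TypeError keeps the conversion explicit instead of silently stringifying arbitrary objects.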

See the ApplicationSettings.unicode_to_bytes_encoding option. This option is also used when converting bytes to unicode, so its name is confusing and it should be renamed.

How do we know what encoding should be used when decoding the JavaScript strings? Web pages might have encodings other than utf-8. Is there an API in CEF to get the encoding of the current Frame? Fixing a single encoding through application settings doesn't make much sense, as different websites might use different encodings. Still, the encoding of the strings that users pass to cefpython should be configurable through some option, as it might be different from the encoding that the website in the current context uses.

Take a look at this video explaining why "Unicode is poison to python performance":
http://www.youtube.com/watch?v=oK3EQH5Wdqo&feature=youtu.be&t=24m26s

Don't stick unicode everywhere.

Original issue reported on code.google.com by czarek.t...@gmail.com on 4 Jun 2013 at 2:10

@GoogleCodeExporter

Strings returned from JavaScript and strings passed to JavaScript should probably always be utf-8, and ApplicationSettings.string_encoding should not be used for these conversions (not 100% sure about that, need to test it).

Original comment by czarek.t...@gmail.com on 5 Jun 2013 at 6:52

@GoogleCodeExporter

Starting with the next commit, all encoding/decoding of strings will be kept in one file, string_utils.pyx. This will make it easier to make unicode the default string type, but you still have to check every call to CefToPyString(), CharToPyString(), VoidPtrToStr() to see what type the string is assigned to, what the context of the operation is, and whether unicode would break anything.

The documentation on the wiki pages needs to be updated: "str" types need to be replaced with "unicode".

That's not all: there are other fixed strings in the code. If we decide to use unicode then we must stick to it, and all the strings passed to Python should be unicode. This is going to be a bit of a nightmare. We would be forced to use the u"" syntax (Python 3.0 to 3.2 reject this syntax and Python 3.3 accepts it again, while Cython allows it regardless, so it is a bit easier to write portable code for both Py2/Py3). But what if we pass a normal byte string instead in Py2? Then this is going to be hell in user code: in Py2, concatenating a non-ASCII byte string with a unicode string raises a UnicodeDecodeError, and in Py3 mixing bytes and str raises a TypeError such as "can't concat bytes to str".
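
The Python 3 side of that concatenation problem can be reproduced with a few lines. This is only a sketch; `concat_text` is a hypothetical helper, not something cefpython provides, showing how user code would have to normalize mixed inputs.

```python
def concat_text(*parts, encoding="utf-8"):
    """Join parts into one text string, decoding any bytes first.

    Hypothetical helper for illustration only.
    """
    return "".join(
        p.decode(encoding) if isinstance(p, bytes) else p for p in parts
    )


# Mixing str and bytes directly is rejected on Python 3.
try:
    "title: " + b"caf\xc3\xa9"
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(type(mixed_error).__name__)
print(concat_text("title: ", b"caf\xc3\xa9"))
```

If the library always returned one string type, user code would not need a normalizing step like this.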

Original comment by czarek.t...@gmail.com on 5 Jun 2013 at 7:12

@GoogleCodeExporter

More discussion on making unicode the default in Py27 here:
https://groups.google.com/d/msg/cython-users/VICzhVn-zPw/B0U4_AK36UkJ

Original comment by czarek.t...@gmail.com on 9 Jan 2014 at 7:36

@GoogleCodeExporter

Marking as Won't Fix. Use Python 3 if you need unified unicode strings. Fixing this would break backwards compatibility.

Original comment by czarek.t...@gmail.com on 10 Aug 2014 at 6:03

  • Changed state: WontFix
