Strings should be unicode by default, if bytes is required make it explicit #60

GoogleCodeExporter · 2015-08-28T10:53:52Z

Right now in Py2 byte strings are passed to Python, but in Py3 unicode strings are passed. This will cause trouble when upgrading existing Python code from Py2 to Py3.

If you write code in Py2 that decodes the byte strings, it will break in Py3, because calling decode on a unicode string doesn't make sense. In Py2, calling decode on a unicode string first implicitly encodes it to a byte string using the 'ascii' codec and then decodes it back using the requested codec (e.g. 'utf-8'), which raises an error for any non-ASCII character. In Py3, str has no decode method at all, so the call fails with an AttributeError.
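A minimal sketch of the failure and one way to guard against it, written for Python 3. The helper name `to_text` is hypothetical and not part of cefpython; it only illustrates decoding when the value really is bytes.

```python
def to_text(value, encoding="utf-8"):
    """Return text; decode only when the value is a byte string.

    Hypothetical helper for illustration, not a cefpython API.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Py2-style code that unconditionally calls decode() on what is now
# a unicode string: in Py3, str has no decode() method.
try:
    "zażółć".decode("utf-8")
    decode_failed = False
except AttributeError:
    decode_failed = True

# The guarded helper accepts both kinds of input.
from_bytes = to_text("zażółć".encode("utf-8"))
from_text = to_text("zażółć")
```

Code written this way keeps working whether the caller hands over bytes (Py2 behavior) or unicode (Py3 behavior).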

Strings that are paths to files should be byte strings in Py2, otherwise you might get into trouble. See this post: https://groups.google.com/d/msg/cython-users/Q1_jyOX4tVM/f8vsYDuUWL0J

In Cython code, do an explicit conversion to 'bytes' when you do not want a unicode string.
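As a rough sketch of what such an explicit conversion could look like, here is a plain Python 3 function (it should also be valid in a .pyx file). The name `to_bytes` and the utf-8 default are assumptions made for illustration, not existing cefpython code.

```python
def to_bytes(value, encoding="utf-8"):
    """Return a byte string; encode only when the value is text.

    Hypothetical helper for illustration, not a cefpython API.
    """
    if isinstance(value, str):
        return value.encode(encoding)
    # bytes and bytearray pass through as an immutable bytes object.
    return bytes(value)


encoded = to_bytes("ż")          # text is encoded explicitly
unchanged = to_bytes(b"path")    # bytes are left as they are
```

Making the conversion a named call keeps the places where bytes are required easy to find in the code.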

See the ApplicationSettings.unicode_to_bytes_encoding option. This option is also used when converting bytes to unicode, so its name is confusing and it should be renamed.

How do we know which encoding should be used when decoding the javascript strings? Web pages might have encodings other than utf-8. Is there an API in CEF to get the encoding of the current Frame? Fixing the encoding through application settings doesn't make much sense, as different websites might use different encodings. Still, the encoding of the strings that users pass to cefpython should be configurable through some option, as it might differ from the encoding that the website in the current context uses.
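To illustrate why a single fixed encoding is risky, here is a small Python 3 sketch. The function name `decode_with` and the sample encodings are assumptions chosen for the example; nothing here reflects how CEF or cefpython actually reports a frame's encoding.

```python
def decode_with(data, encoding, errors="strict"):
    """Decode bytes using a caller-supplied encoding.

    Hypothetical helper for illustration, not a cefpython API.
    """
    return data.decode(encoding, errors)


# Bytes as they might arrive from a page that uses cp1250.
raw = "zażółć".encode("cp1250")

# Decoding with the matching encoding round-trips correctly.
matching = decode_with(raw, "cp1250")

# Decoding the same bytes with a fixed utf-8 setting fails.
try:
    decode_with(raw, "utf-8")
    fixed_encoding_failed = False
except UnicodeDecodeError:
    fixed_encoding_failed = True
```

Passing the encoding as a parameter leaves room for both a per-frame value and a user-configured default.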

Take a look at this video explaining why "Unicode is poison to python performance":
http://www.youtube.com/watch?v=oK3EQH5Wdqo&feature=youtu.be&t=24m26s

Don't stick unicode everywhere.

Original issue reported on code.google.com by czarek.t...@gmail.com on 4 Jun 2013 at 2:10

GoogleCodeExporter · 2015-08-28T10:53:52Z

Strings returned from javascript and strings passed to javascript should probably always be utf-8, and ApplicationSettings.string_encoding should not be used for these conversions (not 100% sure about that, need to test it).

Original comment by czarek.t...@gmail.com on 5 Jun 2013 at 6:52

GoogleCodeExporter · 2015-08-28T10:53:52Z

Starting with the next commit, all the encoding/decoding of strings will be kept in one file, string_utils.pyx. This will make it easier to make unicode the default string type, but you still have to go through all the calls to CefToPyString(), CharToPyString() and VoidPtrToStr() and check what type the string is assigned to, what the context of the operation is, and whether unicode would break anything.

The documentation on the wiki pages needs to be updated: "str" types need to be replaced with "unicode".

That's not all, there are other fixed strings in the code. If we decide to use unicode then we must stick to it, and all the strings passed to python should be unicode. This is going to be a bit of a nightmare: we would be forced to use the u"" syntax (in Py3 versions 3.0 to 3.2 such syntax is disallowed, but in Cython it is allowed, so it is a bit easier to write portable code for both Py2/Py3). But what if we pass a normal byte string instead in Py2? Then this is going to be hell in user code. In Py2, concatenating a byte string with a unicode string implicitly decodes the bytes using the 'ascii' codec and raises a UnicodeDecodeError for non-ASCII data, and in Py3 it throws a TypeError.
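A short Python 3 sketch of the mixing problem described above. The variable names and sample data are made up for the example.

```python
text = "title: "
raw = b"caf\xc3\xa9"  # utf-8 bytes for "café"

# Mixing text and bytes in Py3 is refused outright.
mixed_error = None
try:
    text + raw
except TypeError as exc:
    mixed_error = type(exc).__name__

# Decoding explicitly before concatenating works.
joined = text + raw.decode("utf-8")
```

The same discipline applies in the other direction: encode the text first when the result has to be bytes.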

Original comment by czarek.t...@gmail.com on 5 Jun 2013 at 7:12

GoogleCodeExporter · 2015-08-28T10:53:52Z

More discussion on making unicode default in Py27 here:
https://groups.google.com/d/msg/cython-users/VICzhVn-zPw/B0U4_AK36UkJ

Original comment by czarek.t...@gmail.com on 9 Jan 2014 at 7:36

GoogleCodeExporter · 2015-08-28T10:53:52Z

Marking as Won't Fix. Use Python 3 if you need unified unicode strings. Fixing this would break backwards compatibility.

Original comment by czarek.t...@gmail.com on 10 Aug 2014 at 6:03

Changed state: WontFix

GoogleCodeExporter added Priority-Medium labels Aug 28, 2015

GoogleCodeExporter closed this as completed Aug 28, 2015
