Issue 48: dumps with ensure_ascii=False fails with mix of unicode and non-unicode
Status:  Fixed
Owner:  ----
Closed:  Apr 2009
Type-Defect
Priority-Medium


Reported by jabronson, Apr 14, 2009
What steps will reproduce the problem?

>>> s = {'foo': u'bar', 'quux': 'Arr\xc3\xaat sur images'}
>>> simplejson.dumps(s)
'{"quux": "Arr\\u00eat sur images", "foo": "bar"}'
>>> simplejson.dumps(s, ensure_ascii=False)
Traceback (most recent call last):
  ...
  File ".../lib/python2.6/json/encoder.py", line 368, in encode
    return ''.join(chunks)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)



What is the expected output?
u'{"quux": "Arr\xc3\xaat sur images", "foo": "bar"}'



What version of the product are you using?
2.0.9

On what operating system?
Mac OS X
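
The traceback in the report is consistent with Python 2 implicitly decoding a byte `str` as ASCII when `''.join(chunks)` mixes `str` and `unicode` chunks. Below is a minimal sketch of that decode step, written for Python 3, where the decode has to be spelled out. The name `chunk` and the assumption that the encoder has already wrapped the raw string in double quotes are illustrative and not taken from the encoder source; they are chosen because they would explain the reported "position 4".

```python
# Assumption: the value has already been wrapped in double quotes,
# so the first non-ASCII byte (0xc3) sits at index 4 rather than 3.
chunk = b'"Arr\xc3\xaat sur images"'

try:
    # Explicit stand-in for the implicit str -> unicode coercion in Python 2.
    chunk.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Prints the codec error raised by the ASCII decode.
print(error)
```

The byte string is valid UTF-8, so the failure comes from the choice of codec, not from the data.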

Comment 1 by jabronson, Apr 14, 2009
Sorry, the expected output should have been:
u'{"quux": "Arr\u00eat sur images", "foo": "bar"}'

Comment 2 by jabronson, Apr 14, 2009
Or equivalently, u'{"quux": "Arr\xeat sur images", "foo": "bar"}'
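
One way to get the expected output is to decode byte strings explicitly before serializing. The sketch below uses the Python 3 standard-library `json` module, not simplejson 2.0.9 on Python 2.6, and it assumes the bytes are UTF-8. It is not a description of what the eventual fix changed.

```python
import json

raw = b'Arr\xc3\xaat sur images'  # UTF-8 bytes, as in the report

# Decode explicitly so that every string handed to the encoder is text.
data = {'foo': 'bar', 'quux': raw.decode('utf-8')}

text = json.dumps(data, ensure_ascii=False)  # non-ASCII characters kept as-is
escaped = json.dumps(data)                   # default: escaped as \uXXXX

# Both forms parse back to the same value.
assert json.loads(text) == json.loads(escaped) == data
```

With the decode done up front, the encoder never has to guess the encoding of a byte string.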

Comment 3 by jabronson, Apr 14, 2009
Not sure about the choice of name for the ensure_ascii parameter, as JSON is always encoded in utf-something according to http://tools.ietf.org/html/rfc4627. Assuming the parameter is meant to ensure we get back a unicode value, JSONEncoder.encode should be checking for it in the non-basestring case (see http://code.google.com/p/simplejson/source/browse/trunk/simplejson/encoder.py?r=174#181).

Here's a failing test demonstrating one facet of the problem:

Index: tests/test_unicode.py
===================================================================
--- tests/test_unicode.py	(revision 183)
+++ tests/test_unicode.py	(working copy)
@@ -78,4 +78,7 @@
     def test_unicode_preservation(self):
         self.assertEquals(type(json.loads(u'""')), unicode)
         self.assertEquals(type(json.loads(u'"a"')), unicode)
+
+    def test_empty_list(self):
+        self.assertEquals(type(json.dumps([], ensure_ascii=False)), unicode)

======================================================================
FAIL: test_empty_list (simplejson.tests.test_unicode.TestUnicode)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/tmp/json/src/simplejson/simplejson/tests/test_unicode.py", line 84, in test_empty_list
    self.assertEquals(type(json.dumps([], ensure_ascii=False)), unicode)
AssertionError: <type 'str'> != <type 'unicode'>

----------------------------------------------------------------------


Comment 4 by david.novalis.turner, Apr 14, 2009
ensure_ascii is simply an incoherent parameter for a json encoder.  The JSON RFC
specifies that JSON is always encoded in some form of Unicode.  What you might want
is a parameter that chooses between (1) returning a Python unicode object, which
clients can later encode, or (2) returning a UTF-8 encoded Python str, or (3)
returning a Python str in some other Unicode encoding. I would suggest keeping the
current encoding parameter to choose among #2 and #3, deleting the ensure_ascii
parameter and raising an error message containing the text of this comment (or some
other explanation of why it was removed), and adding a new return_unicode parameter,
defaulting to the opposite of whatever ensure_ascii defaults to.  If return_unicode
is True, and there is an encoding parameter, then raise an exception.  All of the
intermediate work of encoding should probably be done in unicode -- that is, when a
Python str is encoded, it should be (first) decoded from ascii (that is, call
unicode(x)).  Also, make sure that all string literals are u'string literals' (see
Josh Bronson's comment, above).

Comment 5 by bob.ippolito, Apr 14, 2009
Re comment #4: *READ THE DOCS*. You are thoroughly confused as to what these parameters do.

Comment 6 by david.novalis.turner, Apr 14, 2009
The docstring says:
    If ``ensure_ascii`` is false, then the return value will be a
    ``unicode`` instance subject to normal Python ``str`` to ``unicode``
    coercion rules instead of being escaped to an ASCII ``str``.

This does not describe what the code actually does -- ensure_ascii=False doesn't
necessarily return unicode.  It does appear on reflection that ensure_ascii=True does
something permissible, but it is somewhat bizarre.
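
For comparison, here is a small sketch of what `ensure_ascii` controls in the Python 3 standard-library `json` module. This is an assumption-laden analogue, not simplejson 2.0.9 on Python 2: in Python 3 the return type is always `str`, so only the escaping differs.

```python
import json

value = ['Arr\u00eat', []]

ascii_only = json.dumps(value)                        # ensure_ascii defaults to True
with_unicode = json.dumps(value, ensure_ascii=False)  # keep non-ASCII characters

# Same type either way; only the ASCII-ness of the text differs.
print(type(ascii_only), ascii_only.isascii())
print(type(with_unicode), with_unicode.isascii())

# The caller chooses a byte encoding afterwards, as a separate step.
utf8_bytes = with_unicode.encode('utf-8')
utf16_bytes = with_unicode.encode('utf-16')
```

Escaping to ASCII and choosing a byte encoding are separate decisions here, which may be the distinction the docstring is trying to draw.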

Oh, I was wrong about what encoding does.  Sorry about that.

Comment 7 by bob.ippolito, Apr 14, 2009
r184
Status: Fixed

Hosted by Google Code