non-ascii CSV data not handled by google.appengine.ext.bulkload (Unicode errors) [35874396]

Fixed

Bug

Description

ku...@gmail.com

created issue #1

Apr 12, 2008 12:15AM

What steps will reproduce the problem?

1. Follow the Bulk Upload article at
http://code.google.com/appengine/articles/bulkload.html and make sure your
CSV has non-ASCII values. You could try "Ivan Krsti\xc4\x87", which is the
UTF-8 encoding of Ivan Krsti\u0107.

2. Somewhere internally this creates Unicode strings so you get a
UnicodeDecodeError in google/appengine/ext/bulkload/__init__.py in Load()
on this line:

for columns in reader:
...

This is because the csv reader is not Unicode-aware; see
http://docs.python.org/lib/module-csv.html

To fix it, you'll need a "wrapper" that temporarily encodes Unicode objects
to UTF-8 byte strings, passes each line to csv, then decodes back into Unicode.

This could be cleaned up some but I got it to work with these two methods:

import csv

def utf_8_encoder(unicode_data):
    """yields utf-8 encoded str objects for each chunk in
    iterable, unicode_data

    each chunk in unicode_data may or may not be unicode
    (this is handled seamlessly)

    Code is from
    http://docs.python.org/lib/csv-examples.html#csv-examples
    """
    for line in unicode_data:
        if isinstance(line, unicode):
            line = line.encode('utf-8')
        yield line

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    """reads a csv file as unicode data

    This is copied from
    http://docs.python.org/lib/csv-examples.html#csv-examples

    You use it just like the stdlib csv.reader
    """
    # csv.py doesn't do Unicode; encode temporarily as UTF-8 str objects:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

...then I changed this line in google/appengine/ext/bulkload/__init__.py:Load()

reader = csv.reader(buffer, skipinitialspace=True)

to

reader = unicode_csv_reader(buffer, skipinitialspace=True)

also see issue155
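For readers on Python 3, where the csv module accepts text directly, the encode/decode round trip above reduces to a single decode step. The following is only a sketch of that analogue, assuming the upload is UTF-8; `read_utf8_csv` is a hypothetical helper name and is not part of the bulkload module.

```python
import csv
import io

def read_utf8_csv(raw_bytes, **kwargs):
    """Decode UTF-8 bytes once, then hand the text to csv.reader (Python 3)."""
    text = raw_bytes.decode('utf-8')
    # newline='' lets the csv module handle line endings itself.
    return csv.reader(io.StringIO(text, newline=''), **kwargs)

# Same sample value as in the report, followed by a second column.
rows = list(read_utf8_csv(b'Ivan Krsti\xc4\x87, 42\n', skipinitialspace=True))
```

Each cell in `rows` is already a Unicode string, so no per-cell decoding is needed afterwards.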

Comments

[Deleted User] <[Deleted User]> #2 Apr 21, 2008 08:29PM

Note that the solution given here is equivalent to the attached patch, and was taken
directly out of

http://docs.python.org/lib/csv-examples.html

IMHO, there should be an argument to the bulkloader script to take a charset, which
can then be added to the Content-Type header on the request. This is handled
automatically by WebOb
(http://pythonpaste.org/webob/reference.html#unicode-variables), and then the
"unicode_csv_reader" can just be a general encoding CSV reader.

Until then, the following patch works for UTF-8, so it will also work for ASCII.

Attachment: issue157.patch (1.1 KB)

wa...@gmail.com <wa...@gmail.com> #3 May 16, 2008 01:27PM

This works for the SDK, but doesn't work for the appengine itself. :(

vi...@gmail.com <vi...@gmail.com> #4 May 20, 2008 06:48AM

Just for the record: a simple workaround is to patch __init__.py as suggested in
issue 157, and then save it under the name bulkload.py in your App Engine
directory. After that, simply replace

from google.appengine.ext import bulkload

with

import bulkload

and Unicode import will work!

ma...@google.com <ma...@google.com> Aug 17, 2008 07:42PM

Assigned to ma...@google.com.

fr...@gmail.com <fr...@gmail.com> #5 Sep 25, 2008 04:03PM

This is my patch for SDK 1.1.3:

http://fred.fivery.com/weblog/entry/362

th...@gmail.com <th...@gmail.com> #6 Oct 6, 2008 01:28PM

After applying all the patches, I can successfully bulk load only on the local
development server; I still can't upload my data to the application on appspot.

Is this bug scheduled to be fixed in the near future?

we...@gmail.com <we...@gmail.com> #7 Oct 26, 2008 05:28PM

The fix is simple:

As described in the initial post, copy the __init__.py file from the
google.appengine.ext.bulkload module into a new folder in your application, e.g.
bulkload. Make the changes recommended in the initial post (the file is attached
with the fixes already applied, to make it easier :) ). Then, in your CSV load
file, add "import bulkload" and remove "from google.appengine.ext import bulkload".
That's it, now it works! Until this fix is made on appspot, this will work
perfectly.

Attachment: __init__.py (16 KB)

sa...@gmail.com <sa...@gmail.com> #8 Nov 13, 2008 10:46AM

I patched __init__.py and changed the bulkload import.

I still got this error message:

line337, in LoadEntities
new_entities = leader.CreateEntity(columns, key_name=key_name)
line233,, in CreateEntity
entity[name] = converter(val)

UnicodeEncodeError: 'ascii' codec can't encode characters u'\ufeff' in position 0:
ordinal not in range(128)

Any solution??

yo...@gmail.com <yo...@gmail.com> #9 Nov 18, 2008 08:48AM

I hit this problem too. It seems that the leading '\ufeff' (the byte order mark at
the start of the file) is not accepted. I stripped it and then it works.
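One way to strip the leading '\ufeff' is sketched below, assuming the CSV is UTF-8 and that you control the code that reads it. Python's standard 'utf-8-sig' codec drops a leading byte order mark while decoding; `decode_csv_bytes` and `strip_bom` are hypothetical helper names.

```python
def decode_csv_bytes(raw_bytes):
    """Decode UTF-8 bytes, dropping a leading byte order mark if present."""
    return raw_bytes.decode('utf-8-sig')

def strip_bom(text):
    """Remove a leading U+FEFF from text that has already been decoded."""
    return text[1:] if text.startswith(u'\ufeff') else text

with_bom = decode_csv_bytes(b'\xef\xbb\xbfname,tel')
without_bom = decode_csv_bytes(b'name,tel')
already_decoded = strip_bom(u'\ufeffname,tel')
```

Input without a byte order mark passes through both helpers unchanged.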

jo...@gmail.com <jo...@gmail.com> #10 Dec 29, 2008 10:26PM

I have the same problem:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 8: ordinal not in range(128)

What do you mean with "truncate it"?

hu...@gmail.com <hu...@gmail.com> #11 Jan 2, 2009 07:32PM

Any progress on this? It is essential, yet easy to fix, and a solution has already
been mentioned. What are we waiting for? It has been 8 months since this issue was
opened.

sa...@gmail.com <sa...@gmail.com> #12 Jan 7, 2009 01:04PM

I gave up on the bulkloader. I use Flash to read the CSV file and send URL requests
to App Engine to update the datastore. Everything is solved with just a few lines.

1. Read the CSV in a Flash SWF (in Flash).
2. Call LoadVariables (in Flash), e.g.
   LoadVariables("http://myaccount_123.appspot.com/cvs_loader/", this, "POST");
3. Check the success flag, go to the next record, then loop back to LoadVariables
   (in Flash).
4. Write a function in Python, e.g.

   def cvs_loader(request):
       if request.method == 'POST':
           form = MyDataForm(request.POST)
           if form.is_valid():
               newdata = MyData(name=form.cleaned_data['name'],
                                tel=form.cleaned_data['tel'],
                                description=form.cleaned_data['description'])
               newdata.put()
               return render_to_response('success.html')

ss...@gmail.com <ss...@gmail.com> #13 Jan 21, 2009 06:44AM

@Websyther,

I tried your __init__.py but the error still occurs:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)
ERROR 2009-01-20 23:39:54,251 bulkload_client.py] Import failed

What is missing?

hu...@gmail.com <hu...@gmail.com> #14 Feb 15, 2009 05:04PM

The SDK release notes for 1.1.8 say that this issue has been fixed. But I am
trying to upload Unicode data with the new bulkloader in 1.1.9 and it still fails
to upload.

So Unicode strings are still a problem.
Could you please check this? The new bulkloader seems like a very useful tool; it
would be a shame if it didn't accept Unicode.

Thanks

ne...@gmail.com <ne...@gmail.com> #15 Feb 20, 2009 05:57AM

[Comment deleted]

ku...@gmail.com <ku...@gmail.com> #16 Feb 20, 2009 04:47PM

neoedmund has described the problem more succinctly.

In SDK 1.1.9, around line 1123 of
google_appengine\google\appengine\api\datastore_types.py I think this code would be
more appropriate:

    if not isinstance(value, unicode):
        # make a unicode object with best-guess for encoding:
        value = value.decode('utf-8')
    pbvalue.set_stringvalue(value.encode('utf-8'))  # make a byte string

I had reported this in

issue 155

but my report was poorly worded.

Note that the above bug cannot be patched on the appengine, since that code is
restricted for patching (as far as I can tell).
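The normalization proposed above can be tried on its own, outside datastore_types.py. This is only a sketch: `to_utf8_bytes` is a hypothetical name, and treating undecoded input as UTF-8 is the same best guess made in the snippet above, not something the SDK guarantees.

```python
def to_utf8_bytes(value):
    """Return UTF-8 encoded bytes for either a byte string or a Unicode string."""
    if isinstance(value, bytes):
        # Best guess (an assumption): incoming byte strings are UTF-8.
        # Decoding first also validates the input.
        value = value.decode('utf-8')
    return value.encode('utf-8')

from_text = to_utf8_bytes(u'Krsti\u0107')
from_bytes = to_utf8_bytes(b'Krsti\xc4\x87')
```

Both calls yield the same byte string, so the caller no longer needs to care which form it was given.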

ke...@gmail.com <ke...@gmail.com> #17 Apr 13, 2009 08:02PM

The new bulkloader can be used to load unicode data, but you need to set up your
Loader subclass properly. You can use something like "lambda x: unicode(x, 'utf-8')"
as your conversion function to make it work. e.g.

class MyModel(db.Model):
    field1 = db.StringProperty()

class MyLoader(Loader):
    def __init__(self):
        Loader.__init__(self, 'MyModel', [('field1', lambda x: unicode(x, 'utf-8'))])

jo...@google.com <jo...@google.com> Apr 13, 2009 08:45PM

Marked as fixed.

fu...@gmail.com <fu...@gmail.com> #18 Jul 17, 2009 01:40AM

I don't think it's been fixed.

The current version is 1.2.3.

The document `Types and Property Classes' says that the value of a StringProperty
field may be either `str' or `unicode'.

Here is a model, Author:

class Author(db.Model):
    name = db.StringProperty()

Its loader class could look like this:

class AuthorLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Author', [('name', unicode)])

Then, when you upload a CSV that contains non-ASCII characters, you'll get the
well-known error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

To fix this you'll need a patch like this:

*** /local/src/google_appengine/google/appengine/tools/bulkloader.py~
--- /local/src/google_appengine/google/appengine/tools/bulkloader.py

on_class = db.class_for_kind(kind_or_class_key)
return implementation_class

***************
*** 3196,3202 ****
      for (name, converter), val in zip(self.__properties, values):
        if converter is bool and val.lower() in ('0', 'false', 'no'):
            val = False
!       properties[name] = converter(val)

      entity = model_class(**properties)
      entities = self.handle_entity(entity)
--- 3195,3204 ----
      for (name, converter), val in zip(self.__properties, values):
        if converter is bool and val.lower() in ('0', 'false', 'no'):
            val = False
!       if converter is unicode:
!         properties[name] = converter(val, 'utf-8')
!       else:
!         properties[name] = converter(val)

      entity = model_class(**properties)
      entities = self.handle_entity(entity)

ke...@gmail.com <ke...@gmail.com> #19 Jul 17, 2009 01:52AM

Your Loader.__init__ call is not correct. In the (name, converter) tuples, the
converter is not a type but a function from str to the appropriate type. You need to
use "lambda x: unicode(x, 'utf-8')" as your conversion function. This will correctly
turn the utf-8 encoded str into a unicode. By specifying just "unicode" as the
conversion function, python uses the default codec (ascii) to try to create a unicode
instance from the str, and fails in this case.

I hope this clarifies things.
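The difference between the two converters can be reproduced without the SDK. The sketch below uses `bytes.decode`, which behaves the same way on Python 2 and 3; `decode_ascii` stands in for what a bare `unicode` does with its default codec, and both function names are illustrative only.

```python
# UTF-8 bytes for one CJK character; the first byte is 0xe5, as in the error above.
raw = b'\xe5\xbc\xa0'

def decode_ascii(value):
    """Mimics a converter that falls back to the ASCII codec."""
    return value.decode('ascii')

def decode_utf8(value):
    """Mimics lambda x: unicode(x, 'utf-8')."""
    return value.decode('utf-8')

try:
    decode_ascii(raw)
    ascii_failed = False
except UnicodeDecodeError:
    # 0xe5 is outside the ASCII range, so decoding stops at position 0.
    ascii_failed = True

decoded = decode_utf8(raw)
```

Naming the encoding explicitly is the only change needed to make the conversion succeed.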

fu...@gmail.com <fu...@gmail.com> #20 Jul 17, 2009 02:49AM

In the above patch, you can see the line;

if converter is bool and val.lower() in ('0', 'false', 'no'):

So the code already expects `bool' as a converter.
Why not accept `unicode' as a converter too?

I think that makes more sense to users than forcing them to use a lambda.

mr...@gmail.com <mr...@gmail.com> #21 Jul 23, 2009 05:15PM

What kevingdon suggested works like a charm :) Thanks.

So, people, use the lambda x: unicode(x, 'utf-8')

pf...@gmail.com <pf...@gmail.com> #22 Jan 27, 2010 03:27PM

I followed the instructions here, but they don't solve the problem for me.

I'm using SDK 1.3.0 and I'm trying to export data to CSV with this:

    bulkloader.Exporter.__init__(self, 'MigrationResult',
                                 [('companyName', lambda x: x.decode('utf-8'), None),
                                  ('failure', lambda x: x.decode('utf-8'), None),
                                  ('email', lambda x: x.decode('utf-8'), None),
                                 ])

and I get this:

  File "/home/getsense/appengine_pyton/google/appengine/tools/bulkloader.py", line 2784, in __EncodeEntity
    writer.writerow(self.__ExtractProperties(entity))
  File "/home/getsense/appengine_pyton/google/appengine/tools/bulkloader.py", line 2763, in __ExtractProperties
    encoding.append(fn(entity[name]))
  File "latamvalley/exporter2.py", line 13, in <lambda>
    ('failure', lambda x: x.decode('utf-8'), None),
AttributeError: 'NoneType' object has no attribute 'decode'

Then I tried this:

    bulkloader.Exporter.__init__(self, 'MigrationResult',
                                 [('companyName', lambda x: unicode(x, 'utf-8'), None),
                                  ('failure', lambda x: unicode(x, 'utf-8'), None),
                                  ('email', lambda x: unicode(x, 'utf-8'), None),
                                 ])

and got this:

  File "/home/getsense/appengine_pyton/google/appengine/tools/bulkloader.py", line 2763, in __ExtractProperties
    encoding.append(fn(entity[name]))
  File "latamvalley/exporter.py", line 12, in <lambda>
    [('companyName', lambda x: unicode(x, 'utf-8'), None),
TypeError: decoding Unicode is not supported

Can you help me, please?

ku...@gmail.com <ku...@gmail.com> #23 Jan 27, 2010 09:15PM

Double-check the values you are passing in the CSV for companyName, failure, email.
It looks like one of them is None when you are expecting it to be a string
(i.e. maybe you forgot the column or the row has a blank value?)

pf...@gmail.com <pf...@gmail.com> #24 Jan 28, 2010 02:11PM

I'm exporting to CSV, so I'm not missing any column or data. It is possible that
some rows contain empty fields, but this should be handled by the exporter. I
think my error is related to this bug. Thanks

ku...@gmail.com <ku...@gmail.com> #25 Jan 28, 2010 03:22PM

It doesn't sound related to the bug at all. You need to change this:

    'failure', lambda x: unicode(x, 'utf-8')

to:

    def convert_failure(value):
        if value is None:
            return value
        else:
            return value.decode('utf-8')

    'failure', convert_failure
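If several fields need the same None check, it can be factored into a small wrapper. This is a sketch only; `none_safe` and `decode_utf8` are hypothetical names and not part of the bulkloader API.

```python
def none_safe(converter):
    """Wrap a converter so that None passes through untouched."""
    def wrapper(value):
        if value is None:
            return None
        return converter(value)
    return wrapper

# Any one-argument converter can be wrapped this way.
decode_utf8 = none_safe(lambda raw: raw.decode('utf-8'))

empty_field = decode_utf8(None)
text_field = decode_utf8(b'caf\xc3\xa9')
```

The wrapped function can then be used wherever a plain converter is expected.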

pf...@gmail.com <pf...@gmail.com> #26 Jan 28, 2010 03:57PM

Sorry, I'm a newbie in Python; I'm just writing code to export data from my Java
application. I put this in my code as shown here, but I got "NameError: global name
'convert_failure' is not defined". Isn't it well defined? And what happened to the
lambda function?

Thanks

from google.appengine.ext import db
from google.appengine.tools import bulkloader

class MigrationResult(db.Model):
    companyName = db.StringProperty()
    failure = db.StringProperty()
    email = db.StringProperty()

class MigrationResultExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'MigrationResult',
                                     [('companyName', convert_failure(x)),
                                      ('failure', convert_failure(x)),
                                      ('email', convert_failure(x)),
                                     ])

    def convert_failure(value):
        if value is None:
            return value
        else:
            return value.decode('utf-8')

exporters = [MigrationResultExporter]

ku...@gmail.com <ku...@gmail.com> #27 Jan 28, 2010 04:31PM

The documentation on App Engine for CSV uploading is very complete. Please read
through it carefully and keep in mind that the issue tracker is *not* a discussion
forum.

As stated in the documentation you need to pass in a callable. So just change
convert_failure(x) to convert_failure and move its definition to above the Exporter
subclass.
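To illustrate what "pass in a callable" means, here is a stand-alone sketch with no SDK dependency. `export_row` is a hypothetical stand-in for whatever the exporter does internally with each (name, converter) pair, and the dictionary stands in for an entity.

```python
def convert_failure(value):
    """Return None unchanged, otherwise encode the text as UTF-8."""
    if value is None:
        return value
    return value.encode('utf-8')

# The function object itself goes in the list: no parentheses, no argument.
# Writing convert_failure(x) here would call it immediately, with x undefined.
properties = [('companyName', convert_failure),
              ('failure', convert_failure)]

def export_row(entity, properties):
    """Call each stored converter later, once per property value."""
    return [converter(entity.get(name)) for name, converter in properties]

row = export_row({'companyName': u'Caf\u00e9', 'failure': None}, properties)
```

Because the converter is defined at module level before it is referenced, the NameError also goes away.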

pf...@gmail.com <pf...@gmail.com> #28 Jan 28, 2010 05:08PM

OK Kumar, I can finally export to CSV. I think this will be useful to somebody
else, so this is the final working code.
Thanks

from google.appengine.ext import db
from google.appengine.tools import bulkloader

class MigrationResult(db.Model):
    companyName = db.StringProperty()
    failure = db.StringProperty()
    email = db.StringProperty()

def convert_failure(value):
    if value is None:
        return value
    else:
        return value.encode('utf-8')

class MigrationResultExporter(bulkloader.Exporter):

    def __init__(self):
        bulkloader.Exporter.__init__(self, 'MigrationResult',
                                     [('companyName', convert_failure, None),
                                      ('failure', convert_failure, None),
                                      ('email', convert_failure, None),
                                     ])

exporters = [MigrationResultExporter]

Invoked with this:

./appcfg.py download_data --config_file=latamvalley/exporter2.py --filename=album_data_archive.csv --kind=MigrationResult --url=http://python.latest.sandbox-getsense-it.appspot.com/remote_api latamvalley