.dump in apsw.Shell broken on non ascii locales in windows #142

Closed
rogerbinns opened this issue Dec 29, 2013 · 11 comments

@rogerbinns
Owner

From adsense@calibre-ebook.com on August 23, 2013 05:25:49

This line in command_dump() http://code.google.com/p/apsw/source/browse/tools/shell.py#1094 uses time.strftime('%c').

This produces a locale-dependent time bytestring on Python 2.x. This bytestring is later coerced to unicode here http://code.google.com/p/apsw/source/browse/tools/shell.py#2404 as unicode(text)

This fails on some windows installs, presumably installs where strftime produces a bytestring with non-ascii bytes.

The call to strftime should be changed to use a locale-independent time string, perhaps:

time.ctime()

For an example of this bug in the field, see https://bugs.launchpad.net/bugs/1215819

Original issue: http://code.google.com/p/apsw/issues/detail?id=142
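The failure described in this report can be reproduced in miniature. The following is a minimal Python 3 sketch, not the shell's actual code: the sample date string and the helper names `ascii_coercion_fails` and `is_ascii` are hypothetical, and the sketch assumes Python 2's default of decoding bytes as ASCII when unicode(text) is called.

```python
import time

# Hypothetical sample: bytes that time.strftime('%c') might return on
# Python 2 under a German Windows locale (cp1252 encoded).
locale_bytes = "Donnerstag, 22. März 2013".encode("cp1252")

def ascii_coercion_fails(data):
    """Mimic Python 2's unicode(data), which decodes bytes as ASCII by default."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False

def is_ascii(text):
    """True when every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

# The non-ASCII byte for "ä" makes the implicit conversion fail.
print(ascii_coercion_fails(locale_bytes))
# time.ctime() does not use locale information, so its output stays ASCII.
print(is_ascii(time.ctime()))
```

The suggested time.ctime() avoids the problem because its output never depends on the locale.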

@ghost ghost assigned rogerbinns Dec 29, 2013
@rogerbinns
Owner Author

From rogerbinns on August 23, 2013 11:00:05

Thanks for the report. I'll need to do some further investigation.

comment() is used in several places. When its input comes directly from the SQLite database (eg table names) then it will already be in unicode, but this example and a few others can be bytestrings (eg getuser) and they may not be valid in the current encoding, so a safe conversion needs to be done.

Status: Accepted
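A safe conversion of the kind mentioned above could look roughly like the following. This is a Python 3 sketch under stated assumptions, not the code that went into APSW: the function name `to_text` and its fallback order (requested or preferred encoding first, then UTF-8 with replacement characters) are illustrative choices.

```python
import locale

def to_text(value, encoding=None):
    """Return text for str or bytes input without raising on bad bytes."""
    if isinstance(value, str):
        # Already text, e.g. a table name that came from SQLite.
        return value
    encoding = encoding or locale.getpreferredencoding(False) or "utf-8"
    try:
        return value.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Last resort: keep what is decodable and mark the rest with U+FFFD.
        return value.decode("utf-8", "replace")
```

The point of the design is that a comment line in a dump is informational only, so a lossy fallback is preferable to an exception that aborts the whole dump.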

@rogerbinns
Owner Author

From adsense@calibre-ebook.com on August 23, 2013 19:25:51

Here is what I use in my python 2 projects:

import codecs
import sys

filesystem_encoding = sys.getfilesystemencoding()
if filesystem_encoding is None:
    filesystem_encoding = 'utf-8'
else:
    try:
        if codecs.lookup(filesystem_encoding).name == 'ascii':
            filesystem_encoding = 'utf-8'
            # On linux, unicode arguments to os file functions are coerced to an ascii
            # bytestring if sys.getfilesystemencoding() == 'ascii', which is
            # just plain dumb. This is fixed by the icu.py module which, when
            # imported changes ascii to utf-8
    except:
        filesystem_encoding = 'utf-8'

def isbytestring(obj):
    return isinstance(obj, (str, bytes))

# preferred_encoding must already be defined at this point (see the next comment)
def force_unicode(obj, enc=preferred_encoding):
    if isbytestring(obj):
        try:
            obj = obj.decode(enc)
        except:
            try:
                obj = obj.decode(filesystem_encoding if enc ==
                        preferred_encoding else preferred_encoding)
            except:
                try:
                    obj = obj.decode('utf-8')
                except:
                    obj = repr(obj)
        if isbytestring(obj):
            obj = obj.decode('utf-8')
    return obj

@rogerbinns
Owner Author

From adsense@calibre-ebook.com on August 23, 2013 19:26:35

Forgot

import codecs
import locale

try:
    preferred_encoding = locale.getpreferredencoding()
    codecs.lookup(preferred_encoding)
except:
    preferred_encoding = 'utf-8'

@rogerbinns
Owner Author

From adsense@calibre-ebook.com on August 23, 2013 22:48:23

A somewhat related issue. If a db is damaged so that a text column contains some random bytes, .dump fails in apsw because apsw tries to coerce the random bytes to utf-8.

.dump works in the sqlite3 command line client, and in pysqlite, by using iterdump(), one can simply ignore the line causing the problem and recover the rest of the database.

Is there any way to do that in apsw? I could just close the apsw connection and use pysqlite to dump and recover, but it would be nice to be able to do it in apsw.

Example of such a damaged database: https://bugs.launchpad.net/calibre/+bug/1215981/+attachment/3784730/+files/metadata.db
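For illustration, the standard library's sqlite3 module (the descendant of pysqlite) lets a connection supply its own text_factory, which is one way to read text columns that contain invalid UTF-8. The sketch below uses Python 3 and a made-up table; the helper name `lenient_rows` is hypothetical, and the approach is shown for sqlite3 only, since nothing in this thread indicates that APSW offers an equivalent.

```python
import sqlite3

def lenient_rows(conn, sql):
    """Run sql, decoding TEXT values with U+FFFD in place of bad bytes."""
    conn.text_factory = lambda raw: raw.decode("utf-8", "replace")
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(v TEXT)")
# CAST of a blob to TEXT stores the bytes unchanged, so this simulates a
# damaged text column: 0xFF is never valid in UTF-8.
conn.execute("INSERT INTO t VALUES (CAST(x'ff41' AS TEXT))")
rows = lenient_rows(conn, "SELECT v FROM t")
print(rows)
```

The replacement character marks where data was lost, so the recovered rows should be treated as salvage rather than as a faithful copy.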

@rogerbinns
Owner Author

From rogerbinns on August 24, 2013 00:18:37

The SQLite shell doesn't actually do unicode. It just dumps the bytes blindly, completely ignorant of their encoding. pysqlite used to be just as bad, but I thought it had changed the defaults. However string factories can override this to let you put in and retrieve garbage.

APSW insists on being unicode correct as SQLite intended. All strings are unicode (a dump bug aside which is interacting with non-SQLite!).

The "bug" is actually happening inside Python. APSW hands it a sequence of bytes that are supposed to be utf8 coming directly from the database, and that turns out not to be the case. There is no way of telling APSW to not treat SQLite strings as unicode. The issue is mentioned on the tips page http://apidoc.apsw.googlecode.com/hg/tips.html#unicode Trying to add a continue on error flag might be quite hard. I'll investigate. (The cursor iteration needs to update internal bookkeeping before raising the exception so that it can be called again. IIRC it currently goes into an error state and cannot get out of it.)

The general advice when the database is corrupt is to give up and use a backup. Running integrity checks and backups as part of Calibre would be a way of helping to do that.
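As a rough sketch of what an integrity check followed by a backup might look like, the following uses the standard library's sqlite3 module (Connection.backup needs Python 3.7 or later). The function name `check_and_backup` is hypothetical, and how Calibre would schedule or store such backups is not covered here.

```python
import sqlite3

def check_and_backup(source, target):
    """Copy source into target only when PRAGMA integrity_check reports ok.

    Returns the list of messages from the integrity check, which is
    ["ok"] for a healthy database.
    """
    problems = [row[0] for row in source.execute("PRAGMA integrity_check")]
    if problems == ["ok"]:
        source.backup(target)
    return problems

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE books(title TEXT)")
source.execute("INSERT INTO books VALUES ('example')")
source.commit()

target = sqlite3.connect(":memory:")
print(check_and_backup(source, target))
```

Checking before copying avoids overwriting a good backup with a damaged database.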

If the underlying issue had been on output (rather than SQLite to Python string conversion) then the .encoding command does allow specifying an error handler - eg .encoding utf8:replace. Sadly of no help here, but should address the original report.

Owner: rogerbinns
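The codec and handler pair in `.encoding utf8:replace` resembles the two arguments of Python's own str.encode. The sketch below shows only that plain Python behaviour; the function name `encode_for_output` is hypothetical, and whether the shell applies the handler in exactly this way is an assumption.

```python
def encode_for_output(text, encoding="ascii", errors="replace"):
    """Encode text for output; errors names the codec error handler."""
    return text.encode(encoding, errors)

# With "replace", characters the codec cannot represent become "?"
# instead of raising UnicodeEncodeError as the default "strict" would.
print(encode_for_output("März"))
print(encode_for_output("März", errors="backslashreplace"))
```

As noted above, this helps only on output; it does nothing for bytes that fail to decode on the way in from SQLite.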

@rogerbinns
Owner Author

From adsense@calibre-ebook.com on August 24, 2013 00:24:37

It's no big deal, calibre maintains a distributed backup of all data in the database from which the database can be rebuilt if necessary. Having the ability to dump and restore is just "nice to have". Here is the pysqlite code I use, which works for the bad db I linked to earlier.

def reinit_db(dbpath, callback=None):
    import os
    from contextlib import closing
    from calibre import as_unicode, prints
    from calibre.ptempfile import TemporaryFile
    from calibre.utils.filenames import atomic_rename
    if callback is None:
        callback = lambda x, y: None
    # We have to use sqlite3 instead of apsw as apsw has no way to discard
    # problematic statements
    import sqlite3
    from calibre.library.sqlite import do_connect
    with TemporaryFile(suffix='_tmpdb.db', dir=os.path.dirname(dbpath)) as tmpdb:
        with closing(do_connect(dbpath)) as src, closing(do_connect(tmpdb)) as dest:
            dest.execute('create temporary table temp_sequence(id INTEGER PRIMARY KEY AUTOINCREMENT)')
            dest.commit()
            uv = int(src.execute('PRAGMA user_version;').fetchone()[0])
            dump = src.iterdump()
            last_restore_error = None
            while True:
                try:
                    statement = next(dump)
                except StopIteration:
                    break
                except sqlite3.OperationalError as e:
                    prints('Failed to dump a line:', as_unicode(e))
                if last_restore_error:
                    prints('Failed to restore a line:', last_restore_error)
                    last_restore_error = None
                try:
                    dest.execute(statement)
                except sqlite3.OperationalError as e:
                    last_restore_error = as_unicode(e)
                    # The dump produces an extra commit at the end, so
                    # only print this error if there are more
                    # statements to be restored
            dest.execute('PRAGMA user_version=%d;'%uv)
            dest.commit()
        callback(1, True)
        atomic_rename(tmpdb, dbpath)
    callback(1, False)
    prints('Database successfully re-initialized')

@rogerbinns
Owner Author

From adsense@calibre-ebook.com on August 25, 2013 09:53:37

Oh and by the way, because I forgot to say it earlier, thank you very much for apsw.

@rogerbinns
Owner Author

From rogerbinns on August 26, 2013 00:52:13

This issue was closed by revision 8a6d9932717c .

Status: Fixed

@rogerbinns
Owner Author

From rogerbinns on August 26, 2013 00:54:16

I'm glad you find APSW useful, and also glad you found bugs :)

The fix will be in APSW 3.8.0 which should happen pretty soon. SQLite 3.8.0 got tagged a few hours ago.

@rogerbinns
Owner Author

From adsense@calibre-ebook.com on September 06, 2013 22:22:49

FYI, dump + restore with apsw also fails with databases with large amounts of data in them, with Out of memory errors. https://bugs.launchpad.net/calibre/+bug/1217988 I've worked around it by using pysqlite to do the dump and restore which does it one statement at a time, thereby avoiding the large memory usage, as shown above.
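The statement-at-a-time approach described here can be sketched with the standard library's sqlite3 module in Python 3. This is a simplified illustration rather than the calibre code: the function name `copy_by_dump` and the sample table are hypothetical, and the sketch leaves out the temporary file and rename steps.

```python
import sqlite3

def copy_by_dump(src, dest):
    """Replay src's SQL dump into dest one statement at a time.

    iterdump() yields statements lazily, so only one statement needs to
    be held in memory at once. Statements that fail to restore are
    collected and returned instead of aborting the copy.
    """
    failures = []
    for statement in src.iterdump():
        try:
            dest.execute(statement)
        except sqlite3.Error as exc:
            failures.append((statement, str(exc)))
    dest.commit()
    return failures

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE t(v INTEGER)")
src.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
src.commit()

dest = sqlite3.connect(":memory:")
failed = copy_by_dump(src, dest)
print(dest.execute("SELECT v FROM t ORDER BY v").fetchall())
```

Collecting failures rather than raising mirrors the goal stated earlier in the thread of skipping problematic statements and recovering the rest.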

@rogerbinns
Owner Author

From rogerbinns on September 06, 2013 22:44:49

This is a separate issue created as https://code.google.com/p/apsw/issues/detail?id=143
