Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental dump (statement by statement) #143

Closed
rogerbinns opened this issue Dec 29, 2013 · 13 comments
Closed

Incremental dump (statement by statement) #143

rogerbinns opened this issue Dec 29, 2013 · 13 comments

Comments

@rogerbinns
Copy link
Owner

From rogerbinns on September 06, 2013 22:42:09

https://bugs.launchpad.net/calibre/+bug/1217988

Original issue: http://code.google.com/p/apsw/issues/detail?id=143

@ghost ghost assigned rogerbinns Dec 29, 2013
@rogerbinns
Copy link
Owner Author

From rogerbinns on September 06, 2013 22:49:36

APSW only does the restore one statement at a time too. More accurately it reads a line and calls sqlite3_complete to see if it is complete which will be true if there is a trailing semicolon.

The maximum size of something added to the statement cache is 16384 bytes and the default size is 100 entries. Add in the size used by the SQLite virtual machine and things could cumulatively add up.

@rogerbinns
Copy link
Owner Author

From rogerbinns on September 06, 2013 22:50:57

I also recommend you use augmented stack traces - ie also print out local variables for each frame. You can see how to do that half way down http://apidoc.apsw.googlecode.com/hg/exceptions.html#augmentedstacktraces

@rogerbinns
Copy link
Owner Author

From adsense@calibre-ebook.com on September 06, 2013 23:21:43

Running the the dump and restore on the database attached to the ticket causes memory usage to climb by about 700MB. I suspect that most of that is simply from storing the dumped sql in memory. iterdump() from pysqlite avoids that since it produces only one statement at a time.

@rogerbinns
Copy link
Owner Author

From rogerbinns on September 06, 2013 23:26:09

Starting with a process only running the APSW shell in Python 2.7 and the database at launchpad on a 64 bit Linux machine, I get the resident set size climbing to 12MB when doing a dump and 23MB when doing a restore.

Virtual size is about double that. This seems to indicate that the APSW shell is the victim of something else exhausting memory, not the cause.

(Incidentally I do run the entire test suite under valgrind multiple times.)

@rogerbinns
Copy link
Owner Author

From rogerbinns on September 06, 2013 23:30:12

The dump is 75MB of UTF-8 encoded text.

Surely pysqlite will also require the same amount of memory to store the whole thing?

@rogerbinns
Copy link
Owner Author

From adsense@calibre-ebook.com on September 06, 2013 23:30:29

The problem isn't that there is a memory leak. The problem is that the dump and restore operate on the full dumped sql, which is stored in memory and can be very large. Is there some API I can use that allows the intermediate sql to not be stored in memory.

@rogerbinns
Copy link
Owner Author

From rogerbinns on September 06, 2013 23:36:55

There is no intermediate SQL. The dump is only in memory if you set stdin/out for the shell to be a StringIO. If you set it to a file then the dump ends up in the file.

During dumping, as each statement is generated it is written (self.write on self.stdout is called). During read, a line at a time is read using Python iteration on files.

The shell itself never does some sort of internal buffering or keeping around intermediate bits of SQL.

I usually use the tempfile module when I need stuff on disk rather than in memory.

@rogerbinns
Copy link
Owner Author

From adsense@calibre-ebook.com on September 06, 2013 23:51:15

Comment #5 on issue 143 by rogerbinns: MemoryError while restoring
http://code.google.com/p/apsw/issues/detail?id=143 >
The dump is 75MB of UTF-8 encoded text.

Yeah python2 does not store strings in UTF-8, it stores them in either
UCS-2 or UCS-4.

Surely pysqlite will also require the same amount of memory to store the
whole thing?

the code is pysqlite looks (basically) like this

for line in conn.iterdump():
newconn.execute(line)

in apsw it looks like this:

shell.process_command('.dump')
sql = buf.getvalue()
newconn.execute(sql)

What I'm asking for, is some API that lets me replace newconn.execute()
with something that can read a statement at a time from the dump. From
your other reply, I'm guessing that is
shell.process_command('.restore')?

@rogerbinns
Copy link
Owner Author

From adsense@calibre-ebook.com on September 06, 2013 23:52:57

Or from reading the apsw shell source code it looks like it needs to be

shell.process_command('.dump') into a disk file

and then

newshell.process_command('.restore') from the disk file.

@rogerbinns
Copy link
Owner Author

From adsense@calibre-ebook.com on September 06, 2013 23:53:33

Sorry make that

newshell.process_command('.read')

@rogerbinns
Copy link
Owner Author

From rogerbinns on September 06, 2013 23:59:05

You never said earlier that you were executing the dump at the same time as generating it! That will just make a backup. There is a perfectly good API exposed for it: http://apidoc.apsw.googlecode.com/hg/backup.html It isn't really possible to change things to provide the dump as a generator. A limitation of Python generators is they are only one stack frame deep.

@rogerbinns
Copy link
Owner Author

From adsense@calibre-ebook.com on September 07, 2013 00:01:36

It isn't at the same time, it is dump first on database 1, then after the dump is completed, execute the dump on database 2. Finally replace database 1 with database2 if the previous operations were successful.

And this is not equivalent to a backup, since it can recover from various types of minor corruption in the database.

@rogerbinns
Copy link
Owner Author

From rogerbinns on September 07, 2013 00:15:11

The only corruption you can recover from with APSW would be in the indices since they are read by the backup API but not by dumping.

The general advice is to fix the root cause of the corruption and make backups. https://www.sqlite.org/howtocorrupt.html You may also be able to do things with timing when certain commands are issued. For example use WAL mode with auto-checkpointing off, and do an integrity check at a useful point (eg start of session). If all is clear do the checkpoint, and if not delete the WAL since it most likely has the corruption.

Summary: Incremental dump (statement by statement) (was: MemoryError while restoring)
Status: WontFix
Labels: -Type-Defect -Priority-Critical Type-Enhancement Priority-Medium

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant