Export to GitHub

dawgdic - issue #8

dawgdic-build gives `error: failed to insert key:` on certain keys


Posted on Oct 11, 2012 by Quick Ox

What steps will reproduce the problem? 1. download test150.txt (attached) 2. run dawgdic-build test150.txt

What is the expected output? What do you see instead?

Trying to get a built dawgdic (traced this issue back from a project I'm working on with the dawg module in python.

Instead I get the error: error: failed to insert key: +.n|+.x|a.x|c.n;b.nAgAmAC0A

What version of the product are you using? On what operating system?

using version 0.4.4 (latest) on mac os X 10.6

Please provide any additional information below.

The attached file is the output of a python process that uses f.write(dawg.BytesDawg()._raw_key(k, v)) to generate the lines. It works for virtually all values I throw at it it, but very rarely (maybe one out of several million keys) fails like this.

I'm hoping to at least find a work-around -- this library is really extraordinary.

Thanks.

Attachments

Comment #1

Posted on Oct 12, 2012 by Quick Bear

Thanks for your report.

I've found that test150.txt is not correctly ordered. In detail, the 65th line is greater than the 66th line.

65: +.n|+.x|a.x|c.n;b.nAgAmACoA 66: +.n|+.x|a.x|c.n;b.nAgAmAC0A

I've checked this as follows:

$ LC_ALL=C sort test150.txt > sorted.txt $ diff test $ diff test150.txt sorted.txt 65,66d64 < +.n|+.x|a.x|c.n;b.nAgAmACoA < +.n|+.x|a.x|c.n;b.nAgAmAC0A 67a66,67

+.n|+.x|a.x|c.n;b.nAgAmAC0A +.n|+.x|a.x|c.n;b.nAgAmACoA

Would you test again after this correction?

Comment #2

Posted on Oct 13, 2012 by Quick Ox

yes sorry it does seem that the file is out of order, thanks.

However now, using my full data set (a 2.3GB text file) I get the following error:

bash-3.2$ dawgdic-build -g testBundleSubs2sorted.txt testBundleSubs2.dawg no. keys: 46112348 no. states: 307452395 no. transitions: 348225617 no. merged states: 1073607010 no. merging states: 39222660 no. merged transitions: 1032833787 error: failed to build Dictionary

I think in fact that this is the problem I was actually tracing back from the python wrapper. The data is quite large, does dawgdic have any kind of size limits that may be the problem? Unfortunately I can't share the full data file.

Thanks

Comment #3

Posted on Oct 14, 2012 by Quick Bear

The number of units in Dictionary is limited to 2^29 (around 536M).

I'm not sure how many keys you can put in a Dictionary, but the number of units must be greater than the number of transitions.

In this case, maybe you can put a half of the data set in one Dictionary.

Comment #4

Posted on Oct 14, 2012 by Quick Ox

Ok, I suspected I was hitting some sort of size limit. Rather than split the data over multiple dictionaries I've switched over to using marisa-trie (I haven't yet hit any upper limits there). Thanks for the help, and for the library, which will almost certainly be useful for the future.

Comment #5

Posted on Oct 15, 2012 by Quick Bear

It's a nice idea.

I heard that 1 billion keys (probably several times larger than your data set) can be stored in one marisa-trie instance.

But please note that marisa-trie is much slower than dawgdic. If you need a fast data structure, it might be a bad choice.

Good luck.

Status: Done

Labels:
Type-Defect Priority-Medium