Export to GitHub

dawgdic - issue #4

Binary keys


Posted on Aug 21, 2012 by Massive Ox

Hello,

It seems there is no way to store binary data containing 0 bytes in DAWG, at least it seems that Completer doesn't support it (completer.key() and completer.length() returns values truncated at first 0 byte).

My use case is to store binary payload as a part of keys. The keys are

<utf8-encoded unicode key> + chr(255) + <binary_value>

this will allow implementing mapping from unicode to a list of possible binary values using DAWG where values will be compressed just like keys; in order to get all values Completer class may be used.

Zero bytes are not an issue for the utf8 keys because utf8 text is guaranteed to not include them. chr(255) also can't be a part of utf8-encoded string so it is fine as a separator. But zeroes in binary data renders this scheme non-working.

Is there a workaround? Thanks.

Comment #1

Posted on Aug 25, 2012 by Massive Ox

Small clarification: I was not correct about utf8: utf8 may include zero byte because NUL unicode codepoint is represented as zero byte in utf8. There are no other possible zero bytes in UTF8 (multibyte codepoints have some bits set in all bytes) so this is usually not an issue.

Comment #2

Posted on Sep 4, 2012 by Massive Ox

In a Python wrapper I've resolved this by encoding to base64. Encoding/decoding speed doesn't become bottleneck (when using http://libb64.sourceforge.net/ decoding is orders of magnitude faster than Completer iteration speed).

Comment #3

Posted on Sep 7, 2012 by Quick Bear

I'm sorry that I couldn't help you.

As you mentioned, Completer uses '\0' as the sentinel and it seems to be impossible to fix this without changing the structure of its auxiliary data. Also, the change will increase the size of the auxiliary data.

So, now I'm planning to implement a new Completer for binary keys.

Comment #4

Posted on Sep 7, 2012 by Massive Ox

Thanks! This would be a very useful feature.

Comment #5

Posted on Sep 14, 2012 by Quick Bear

I'm really sorry. It has turned out that it is impossible to support binary keys without changing its core data structure (major update).

When I posted the last comment, I expected that dawgdic can support binary keys without changing Dictionary. However, I've found that not only Completer but also Dictionary require modification. In addition, the modification degrades the space efficiency and the time efficiency of dawgdic. This drawback may be unacceptable.

In short, base64-encoding may support binary values better than natively supporting binary keys.

Comment #6

Posted on Sep 14, 2012 by Massive Ox

Thanks for all your work on this. Binary keys would be good but base64 works good enough in practice. So maybe the limitation should be documented somewhere?

Comment #7

Posted on Sep 14, 2012 by Quick Bear

Thanks for your comments. I'll update dawgdic to reject binary keys and then update documentation.

Comment #8

Posted on Sep 14, 2012 by Massive Ox

Looks like a good plan.

Comment #9

Posted on Sep 15, 2012 by Quick Bear

I've finished the update of dawgdic. Also, the documentation has been updated.

Status: WontFix

Labels:
Type-Defect Priority-Medium