BinaryDictionaries
Binary dictionary is the dictionary format used by softkeyboard. This page explains how to create one.
Binary Dictionaries
The binary dictionary format is an efficient and compact way to store dictionaries. The format is defined in the LatinIME input method (the default Android keyboard), which is part of the open source Android platform.

XML Dictionary
The XML Dictionary is an intermediate format that can be easily transformed into the binary dictionary format. The XML file should be stored using UTF-8 encoding for maximum compatibility: UTF-8 is supported by practically all XML tools and allows any character to appear in the document. Each w element defines one word, and its f attribute gives the word's frequency. Frequencies don't have to be normalized (different dictionaries use different frequency scales), but they should be integer values of at least 1.

Samples
Some samples of XML dictionaries:

Hebrew (excerpt):
<?xml version="1.0" encoding="UTF-8" ?>
<wordlist>
  <w f="3847">לא</w>
  <w f="3344">את</w>
  <w f="2288">של</w>
  <w f="2114">זה</w>
  <w f="2023">על</w>
  <w f="1890">אני</w>
  <w f="1496">לי</w>
  <w f="1242">כל</w>
  <w f="1143">עם</w>
  <w f="1095">גם</w>
  <w f="1061">מה</w>
  <w f="1058">הוא</w>
  <w f="1013">אבל</w>
  <w f="901">שלי</w>
  <w f="890">יש</w>
  <w f="836">אם</w>
  <w f="763">או</w>
  <w f="703">היא</w>
</wordlist>

Finnish (excerpt):
<?xml version="1.0" encoding="UTF-8" ?>
<wordlist>
  <w f="2">rkp</w>
  <w f="1">ja</w>
  <w f="1">on</w>
  <w f="1">ei</w>
  <w f="1">että</w>
  <w f="1">oli</w>
  <w f="1">se</w>
  <w f="1">hän</w>
  <w f="1">mutta</w>
  <w f="1">ovat</w>
  <w f="1">kuin</w>
  <w f="1">myös</w>
  <w f="1">kun</w>
  <w f="1">ole</w>
  <w f="1">sen</w>
  <w f="1">tai</w>
  <w f="1">joka</w>
  <w f="1">niin</w>
  <w f="1">mukaan</w>
  <w f="1">jo</w>
  <w f="1">vain</w>
  <w f="1">ollut</w>
  <w f="1">jos</w>
  <w f="1">nyt</w>
  <w f="1">olisi</w>
  <w f="1">voi</w>
  <w f="1">hänen</w>
  <w f="1">sitä</w>
</wordlist>

XML->Binary Dictionary conversion
These steps guide you through converting XML dictionaries to the binary format. The guide works on both *nix and Windows platforms, and probably on OS X too.
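Before converting, it may help to check an XML dictionary against the rules described above (a wordlist root, w elements, integer f values of at least 1). The following is only a minimal sketch in Python; the function name check_wordlist is hypothetical and is not part of softkeyboard or LatinIME.

```python
import re
import xml.etree.ElementTree as ET


def check_wordlist(xml_bytes):
    """Return a list of problems found in an XML dictionary (empty if none)."""
    problems = []
    root = ET.fromstring(xml_bytes)  # raises ParseError on malformed XML
    if root.tag != "wordlist":
        problems.append("root element is <%s>, expected <wordlist>" % root.tag)
    for index, w in enumerate(root.iter("w"), start=1):
        word = (w.text or "").strip()
        freq = w.get("f", "")
        if not word:
            problems.append("entry %d has no word text" % index)
        # Frequencies should be plain integers and at least 1.
        if not re.fullmatch(r"[0-9]+", freq) or int(freq) < 1:
            problems.append("entry %d (%r) has invalid f=%r" % (index, word, freq))
    return problems


# Hypothetical sample: the last two entries break the frequency rule.
sample = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    "<wordlist>\n"
    '<w f="2">rkp</w>\n'
    '<w f="1">että</w>\n'
    '<w f="0">ja</w>\n'
    '<w f="1.5">on</w>\n'
    "</wordlist>\n"
).encode("utf-8")

for problem in check_wordlist(sample):
    print(problem)
```

Passing the file contents as bytes lets the XML parser honour the encoding named in the XML declaration.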
The resulting binary dictionary <lang>.dict can be copied into the assets folder.

Old SQLite Database dictionary -> XML Dictionary conversion
Please see Issue 240 before converting old dictionaries.
The resulting XML Dictionary can then be converted to a binary dictionary.

Where to get word-lists (thanks to Jacob Nordfalk)
A good source for word-lists is Wikipedia, although it is not perfect, since it is not an email/SMS/IM kind of source.
bzcat archive.bz2 | grep -v '<[a-z]*\s' | grep -v '&[a-z0-9]*;' | tr '[:punct:][:blank:][:digit:]' '\n' | tr 'A-Z' 'a-z' | tr 'ÆØÅŜĴĤĜŬ' 'æøåŝĵĥĝŭ' | uniq | sort -f | uniq -c | sort -nr | head -50000 | tail -n +2 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > dict.xml

bzcat archive.bz2 | grep -v '<[a-z]*\s' | grep -v '&[a-z0-9]*;' | tr '[:punct:][:blank:][:digit:]' '\n' | tr 'A-Z' 'a-z' | uniq | grep -o '^[a-z]*$' | sort -f | uniq -c | sort -nr | head -50000 | awk '{print "<w f=\""$1"\">"$2"</w>"}' > en.xml
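The generated w lines then need to be placed inside the XML declaration and the wordlist root element:
<?xml version="1.0" encoding="UTF-8" ?>
<wordlist>
</wordlist>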
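For platforms without bzcat, tr and awk, the same idea can be sketched in Python. This is an approximation of the commands above, not an exact port: it skips lines that look like markup, lowercases the text, counts words, and writes the most frequent ones as w elements inside a complete wordlist document. The function name build_wordlist_xml and the max_words parameter are hypothetical.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape


def build_wordlist_xml(lines, max_words=50000):
    """Count words in an iterable of text lines and return XML dictionary text."""
    counts = Counter()
    for line in lines:
        # Skip lines containing markup or entities, roughly like the grep -v filters.
        if re.search(r"<[a-z]*\s", line) or re.search(r"&[a-z0-9]*;", line):
            continue
        # Split on runs of non-word characters, decimal digits and underscores.
        for word in re.split(r"[\W\d_]+", line.lower()):
            if word:
                counts[word] += 1
    out = ['<?xml version="1.0" encoding="UTF-8" ?>', "<wordlist>"]
    # Most frequent words first, limited to max_words entries.
    for word, freq in counts.most_common(max_words):
        out.append('<w f="%d">%s</w>' % (freq, escape(word)))
    out.append("</wordlist>")
    return "\n".join(out) + "\n"


# Hypothetical input; the second line is dropped because it looks like markup.
text = ["The cat and the hat.", "<b class='x'>skip</b>", "Että hän"]
print(build_wordlist_xml(text))
```

For a real dump, the lines could come from bz2.open("archive.bz2", "rt", encoding="utf-8"), and the returned text should be written to a file opened with encoding="utf-8". Scripts that rely on combining marks would likely need a different way of splitting words than the regular expression used here.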
Do the dictionaries/softkeyboard support accented matches on non-accented input? That is, if I made a Polish dictionary and then tried to type "lodz", would it match "łódź"?
I should
Wie also should use
How do I create a language pack for my country? Where can I download the makedict script? Can I just copy the resulting file somewhere?
Thanks
To create a lang pack, you should use our Eclipse template project: http://code.google.com/p/softkeyboard/downloads/detail?name=AnySoftKeyboardLanguagePackTemplate_5.zip&can=2&q=
Hello Menny! I have checked out the source code. Can you tell me where the word-list behind the existing binary-format dictionary (enLarge_binary.mp3) is? I mean the Wikipedia list. I couldn't locate it on http://en.wikipedia.org/wiki/Wikipedia_database PS: Love the project! Finally found a replacement for the onscreen kbd, yay!
Okay I'm confused... How do you open up Eclipse to work on the template? I need a guide here...
How do I convert a .dict file to an .mp3 file?
How can I convert .mp3 to .dict?
How can I run those commands by Jacob Nordfalk on Windows?
For those who ask how to convert dict to mp3: just change the file extension and it will work.
Hi, how can I create a word list from the Wikipedia database for Unicode (complex script) languages? Please help me do it.
The Wikipedia archive file is no longer located at the above link. Try http://en.wikipedia.org/wiki/Wikipedia:Database_download instead.
How can I create my own smilies pack?