DESCRIPTION
This module is a word tokenizer for CJK (Chinese, Japanese, Korean) text. It supports n-gram tokenization, which is useful when building inverted indexes with Xapian or any other search engine tool. The module was originally written for use with Xapian. Please also read this post on the xapian-discuss mailing list.
If you are a Perl user, you can also use the Perl binding.
There is currently no documentation. Please check out the repository and hack on it.
FEATURES
- N-gram tokenization on CJK texts.
- Conversion from Traditional Chinese to Simplified Chinese, and vice versa.
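To illustrate what n-gram tokenization of CJK text means, here is a minimal Python sketch. It is not this module's code or API: the function names `is_cjk` and `cjk_ngrams` are hypothetical, and the CJK check is deliberately simplified to the main CJK Unified Ideographs block (U+4E00 to U+9FFF). Each run of consecutive CJK characters is split into overlapping n-grams, and a run shorter than n is emitted as a single token.

```python
def is_cjk(ch):
    """Simplified check (an assumption for this sketch): True only for
    characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_ngrams(text, n=2):
    """Return overlapping n-grams for every run of consecutive CJK characters.

    Non-CJK characters only act as separators here; a real tokenizer
    would also emit tokens for Latin words, digits, and so on.
    """
    tokens = []
    run = []
    # The trailing space is a sentinel that flushes the final run.
    for ch in text + " ":
        if is_cjk(ch):
            run.append(ch)
            continue
        if run:
            if len(run) < n:
                # Run too short for a full n-gram: keep it whole.
                tokens.append("".join(run))
            else:
                for i in range(len(run) - n + 1):
                    tokens.append("".join(run[i:i + n]))
            run = []
    return tokens


if __name__ == "__main__":
    print(cjk_ngrams("中文分词"))
```

With n=2 (bigrams), every adjacent pair of characters becomes an index term, so a query for any two-character word can be matched without a dictionary-based word segmenter.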
USERS
http://code.google.com/p/cjk-tokenizer/wiki/Users
TODO
- Full-width <-> half-width conversion.
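As a rough idea of what such a conversion could look like, the Python sketch below maps between full-width and half-width forms of the printable ASCII range. It is only an illustration under stated assumptions, not a planned design for this module: the names `to_halfwidth` and `to_fullwidth` are hypothetical, and only U+FF01 to U+FF5E and the ideographic space U+3000 are handled (half-width katakana and other compatibility forms are ignored).

```python
# Full-width forms U+FF01..U+FF5E sit at a fixed offset from ASCII 0x21..0x7E.
OFFSET = 0xFEE0
IDEOGRAPHIC_SPACE = "\u3000"


def to_halfwidth(text):
    """Convert full-width ASCII variants and the ideographic space to ASCII."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == IDEOGRAPHIC_SPACE:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - OFFSET))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


def to_fullwidth(text):
    """Convert printable ASCII and the ASCII space to full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch == " ":
            out.append(IDEOGRAPHIC_SPACE)
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + OFFSET))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


if __name__ == "__main__":
    print(to_halfwidth("ＡＢＣ１２３"))
```

Normalizing width before tokenization would let a query typed in half-width characters match text stored in full-width characters, and the other way around.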