DESCRIPTION

This module is a word tokenizer for CJK text that supports n-gram tokenization. It is handy for building inverted indexes with Xapian or any other search engine tool. The module was originally written for use with Xapian. Please also read this post on the xapian-discuss mailing list.

If you are a Perl user, you can also use the Perl binding.

There is currently no documentation. Please check out the repository and hack on it.

FEATURES

  • N-gram tokenization on CJK texts.
  • Conversion from Traditional Chinese to Simplified Chinese, and vice versa.
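
To illustrate what n-gram tokenization of CJK text means, here is a minimal Python sketch. It is not this module's code or API: the names `is_cjk` and `cjk_ngrams` are hypothetical, the code point ranges cover only a few common blocks, and the handling of runs shorter than n is an assumption made for the example.

```python
def is_cjk(ch):
    """Rough CJK test; covers only a few common Unicode blocks (assumption)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def cjk_ngrams(text, n=2):
    """Return overlapping n-grams for each run of consecutive CJK characters.

    Non-CJK characters act as separators and are dropped. A run shorter
    than n is emitted as a single token (a choice made for this sketch).
    """
    tokens = []
    run = []

    def flush():
        if not run:
            return
        if len(run) < n:
            tokens.append("".join(run))
        else:
            for i in range(len(run) - n + 1):
                tokens.append("".join(run[i:i + n]))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return tokens


if __name__ == "__main__":
    print(cjk_ngrams("我是中国人"))
    print(cjk_ngrams("中文 abc 搜索引擎"))
```

Overlapping bigrams like these are a common choice for indexing CJK text because the languages have no spaces between words, so a query can match wherever its own bigrams occur in a document.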

USERS

http://code.google.com/p/cjk-tokenizer/wiki/Users

TODO

  • Full-width <-> half-width conversion.
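
As a rough idea of what this conversion could look like, the Python sketch below relies on the fact that the full-width ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their ASCII counterparts, and that the ideographic space U+3000 corresponds to an ordinary space. The function names are hypothetical, and half-width Katakana is deliberately not handled; this is not the module's planned implementation.

```python
OFFSET = 0xFEE0  # distance between full-width forms and ASCII


def to_halfwidth(text):
    """Map full-width ASCII variants and the ideographic space to ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= cp <= 0xFF5E:    # full-width '!' through '~'
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def to_fullwidth(text):
    """Map printable ASCII and the space to their full-width forms."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x20:                  # ordinary space
            out.append("\u3000")
        elif 0x21 <= cp <= 0x7E:        # '!' through '~'
            out.append(chr(cp + OFFSET))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_halfwidth("\uff21\uff22\uff23\uff11\uff12\uff13"))
    print(to_fullwidth("A b"))
```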
