Code license: MIT License
Labels: string, cjk, chinese, japanese, korean, cplusplus, perl
Project owners:
  henearkrxern
Project committers:
  fabrice.colin

DESCRIPTION

This module is a word tokenizer for CJK (Chinese, Japanese, Korean) text with support for n-gram tokenization. It is useful when building inverted indexes with Xapian or any other search engine tool. The module was originally written for use with Xapian. Please also read this post on the xapian-discuss mailing list.
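Since the project has no documentation yet, the following is only a minimal sketch of what n-gram tokenization of CJK text means in practice. It is not this module's API: the names `split_utf8` and `ngrams` are hypothetical, and the code assumes well-formed UTF-8 input and treats every code point as one character, with no special handling of whitespace, punctuation, or non-CJK words.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of this module): split a UTF-8 string
// into one std::string per code point. Assumes well-formed UTF-8; a
// stray byte is passed through as a one-byte unit.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        if ((lead & 0xE0) == 0xC0) len = 2;       // 110xxxxx: 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx: 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx: 4-byte sequence
        if (i + len > text.size()) len = text.size() - i;  // truncated tail
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Hypothetical helper: emit every run of n consecutive characters,
// sliding one character at a time (overlapping n-grams).
std::vector<std::string> ngrams(const std::string& text, std::size_t n) {
    std::vector<std::string> tokens;
    const std::vector<std::string> chars = split_utf8(text);
    if (n == 0 || chars.size() < n) return tokens;
    for (std::size_t start = 0; start + n <= chars.size(); ++start) {
        std::string token;
        for (std::size_t k = 0; k < n; ++k) token += chars[start + k];
        tokens.push_back(token);
    }
    return tokens;
}
```

With n = 2, a four-character string yields three overlapping bigrams. Each bigram can then be added to an inverted index as a term, so that a query tokenized the same way matches without needing a dictionary-based word segmenter.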

If you are a Perl user, you can also use the perl binding.

There is currently no documentation. Please check out the repository and explore the source.

FEATURES

USERS

http://code.google.com/p/cjk-tokenizer/wiki/Users

TODO

full-width <-> half-width conversion
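This TODO item is not implemented in the module. As a rough sketch of what it could involve, the code below maps the Unicode full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) to their half-width counterparts and back. The function names are hypothetical, the code works on already-decoded code points (C++11 `char32_t`), and it deliberately leaves half-width katakana and other compatibility forms untouched.

```cpp
#include <string>

// Hypothetical helpers, not part of this module.
// Full-width ASCII variants sit at a fixed offset of 0xFEE0 from ASCII.
char32_t to_half_width(char32_t c) {
    if (c == 0x3000) return U' ';  // ideographic space -> ASCII space
    if (c >= 0xFF01 && c <= 0xFF5E) {
        return static_cast<char32_t>(c - 0xFEE0);  // full-width '!'..'~'
    }
    return c;  // everything else is left unchanged
}

char32_t to_full_width(char32_t c) {
    if (c == U' ') return 0x3000;  // ASCII space -> ideographic space
    if (c >= 0x21 && c <= 0x7E) {
        return static_cast<char32_t>(c + 0xFEE0);  // ASCII '!'..'~'
    }
    return c;
}

// Apply the half-width mapping to every code point of a UTF-32 string.
std::u32string half_width_string(const std::u32string& s) {
    std::u32string out;
    out.reserve(s.size());
    for (char32_t c : s) out.push_back(to_half_width(c));
    return out;
}
```

Normalizing width before tokenizing would let a full-width Latin letter or digit and its ASCII form produce the same index term.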








