分词盒子 is a multilingual word-segmentation toolkit written in pure Java and 100% compatible with Lucene. Because Lucene 3.0 had not yet been released when development started, the Lucene 3.x API is not supported for the time being; a version fully compatible with Lucene 3.x will be released soon.

This toolkit is aimed primarily at search applications, where segmentation speed and recall matter more than precision, so it uses forward maximum/minimum matching and full-segmentation algorithms. It supports user-defined dictionary extensions, recognizes words in any language (e.g. Chinese, Japanese, Korean) as well as English phrases, and allows user-defined part-of-speech tags.
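For readers unfamiliar with the technique, here is a minimal, self-contained sketch of forward maximum matching over a plain word set. It is an illustration of the idea only, not the toolkit's implementation; the ForwardMaxMatch class and its dictionary are hypothetical and independent of WordAnalyzer.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of forward maximum matching; not the toolkit's code.
public class ForwardMaxMatch {
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + maxLen, text.length());
            String match = null;
            // Try the longest candidate first, shrinking until a dictionary hit.
            for (int j = end; j > i; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                // No dictionary entry starts here; emit a single character.
                match = text.substring(i, i + 1);
            }
            tokens.add(match);
            i += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("分词", "盒子", "中文", "工具包"));
        // Prints: [分词, 盒子, ,, 中文, 分词, 工具包]
        System.out.println(segment("分词盒子,中文分词工具包", dict, 3));
    }
}

Minimum matching is the mirror image: grow the candidate from one character upward and accept the first dictionary hit.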

Code example

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

import com.ithezi.analyzer.WordAnalyzer;

WordAnalyzer analyzer = null;

try {
    String stopchar = "+-:/"; // characters to filter out directly
    analyzer = new WordAnalyzer(stopchar);
    analyzer.setIgnoreWhitespace(true); // ignore whitespace between CJK characters
    analyzer.setSegmentationMode(WordAnalyzer.MAX_MATCHING); // maximum matching
    //analyzer.setSegmentationMode(WordAnalyzer.MIN_MATCHING); // minimum matching

    // add words to the user dictionary, with part-of-speech tags
    analyzer.addWord("盒子", "n");
    analyzer.addWord("分词", "adj");
    analyzer.addWord("中文", "n");

    String text = "分词盒子,中文分词工具包";
    TokenStream ts = analyzer.tokenStream("", new StringReader(text));
    long start = System.currentTimeMillis();

    final Token t = new Token();
    while (ts.next(t) != null) {
        System.out.println(t.term() + ", " + t.type());
    }

    long end = System.currentTimeMillis();
    System.out.println("Elapsed: " + (end - start) + " ms");
} catch (IOException ex) {
    System.err.println("Segmentation error: " + ex.getMessage());
}
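Note that the loop above uses the pre-3.0 Lucene TokenStream API (ts.next(Token) and Token.term()), which matches the compatibility note at the top of this page; under Lucene 3.x the attribute-based API would be used instead.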