What is TCC?

A TCC, or Thai Character Cluster (proposed in "Character Cluster Based Thai Information Retrieval"), is a group of inseparable Thai characters. This inseparability derives from the Thai writing system and is independent of any context. As a result, TCCs can be determined by a simple list of rules describing, e.g., which characters must follow or precede other characters.

What is JTCC?

JTCC is a Java library that tokenizes Thai text into a list of TCCs. The rules that determine TCC boundaries are implemented as an ANTLR grammar. JTCC was designed with an emphasis on ease of use: programmers simply supply the Thai text to the provided facade and get the output back as a list of TCCs.

TCC Examples
Note that we only put the delimiter at the end of each TCC.

Applications of TCCs

A TCC by itself is of no direct use to end users. TCCs are mostly used within a larger natural language processing system, as the first step in processing input text. An obvious merit of TCCs is that they can be used to eliminate impossible word boundary positions in running text.

Note

JTCC is not a mature project, nor does it provide a standard way of grouping inseparable Thai characters. The term "inseparable" is, in fact, ambiguous in some cases. For example, given the input "ถุงให้", the original definition of TCC yields the output "ถุ|ง|ให้|". However, some might argue that the delimiter after "ถุ" can be removed without much effort, giving "ถุง|ให้|". One way to do so is to look ahead one more character, which in this case is "ใ". Since "ใ" cannot be grouped with "ง" (i.e., it is impossible to have "งใ"), it is tempting to attach "ง" to the previous TCC, thus forming "ถุง".

I agree that this argument makes sense. However, the goal of this project is to create a library capable of tokenizing input text into TCCs, and the idea above seems to go beyond TCCs (probably to the syllable level). Therefore, we will stick with the global, context-independent TCC tokenizing rules for now. At least, the look-ahead strategy will not be implemented in the near future.
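For readers who want to experiment with the look-ahead argument from the note, here is a minimal post-processing sketch. It is not part of JTCC and does not use JTCC's API. The class name `LookAheadSketch`, the method `mergeWithLookAhead`, and the exact rule (attach a lone consonant to the previous cluster when the next cluster starts with one of the leading vowels เ แ โ ใ ไ) are assumptions made only for illustration. The input is assumed to be a list of TCCs that has already been produced by some tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JTCC: illustrates the look-ahead idea only.
final class LookAheadSketch {

    // Thai leading vowels U+0E40..U+0E44 (sara e, sara ae, sara o,
    // sara ai maimuan, sara ai maimalai). They are written before the
    // consonant they belong to, so a preceding consonant cannot join them.
    private static final String LEADING_VOWELS = "\u0E40\u0E41\u0E42\u0E43\u0E44";

    // Assumption: treat the Unicode range U+0E01..U+0E2E as Thai consonants.
    static boolean isThaiConsonant(char c) {
        return c >= '\u0E01' && c <= '\u0E2E';
    }

    // Merges a single-consonant cluster into the previous cluster when the
    // next cluster begins with a leading vowel. All other clusters are copied.
    static List<String> mergeWithLookAhead(List<String> tccs) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tccs.size(); i++) {
            String cur = tccs.get(i);
            boolean loneConsonant =
                    cur.length() == 1 && isThaiConsonant(cur.charAt(0));
            boolean nextStartsWithLeadingVowel =
                    i + 1 < tccs.size()
                    && !tccs.get(i + 1).isEmpty()
                    && LEADING_VOWELS.indexOf(tccs.get(i + 1).charAt(0)) >= 0;
            if (loneConsonant && nextStartsWithLeadingVowel && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + cur); // attach to previous cluster
            } else {
                out.add(cur);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The three clusters of the example from the note, written as escapes.
        List<String> tccs = Arrays.asList(
                "\u0E16\u0E38", "\u0E07", "\u0E43\u0E2B\u0E49");
        // Print with a delimiter at the end of each cluster.
        System.out.println(String.join("|", mergeWithLookAhead(tccs)) + "|");
    }
}
```

For the three clusters of the example in the note, the method returns two clusters, "ถุง" and "ให้". A rule this simple will misgroup some inputs, which is one reason it belongs above the TCC level rather than in the tokenizer itself.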