
langdet
A language detector, as the name suggests, is a program that detects the language of a given piece of text. The system holds a characteristic pattern for each language and identifies the language of the input by finding the closest matching pattern. In data analysis, we may need to restrict the input to a limited set of languages, which is where a language detector comes in handy.
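To make "closest matching pattern" concrete, here is a minimal sketch assuming the pattern is a ranked list of frequent character n-grams and that closeness is measured by how far each n-gram's rank differs between two lists. This is the general rank-order idea used by n-gram text categorizers, not the actual langdet code. The function names, the tiny sample texts, and the profile size are all hypothetical choices made for illustration; real profiles are trained on far more text.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the most frequent character n-grams (lengths 1..n_max), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[g]) if g in rank else max_penalty
        for i, g in enumerate(doc_profile)
    )


def detect(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical); far too small for real use.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat "
               "sat on the mat with the other animals",
    "german": "der schnelle braune fuchs springt gegen den faulen hund und "
              "die katze sitzt auf der matte mit den anderen tieren",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(detect("the dog and the cat", PROFILES))
    print(detect("der hund und die katze", PROFILES))
```

Restricting a pipeline to a limited set of languages then amounts to keeping only the inputs for which `detect` returns one of the allowed names.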
The existing language detector available for Python is 'oice.langdet'. It lacks several features that a standard language detector is expected to have. A few of its limitations are:
(i) It detects only a small number of languages (currently only 3 languages are supported).
(ii) It performs only a "bi-gram" analysis on the input data, which can lead to wrong predictions in some cases (lower accuracy).
(iii) It is available only as a Python module and is usable only by Python programs, although a detector should be usable from other programming languages as well.
The well-known standard for language-detection systems is "TextCat" by Gertjan van Noord. TextCat supports 76 language/encoding pairs. This project extends the 'Langdet' module to use the TextCat language modules, which will overcome the drawbacks of the existing language detector:
· Support for the 76 language/encoding pairs covered by TextCat.
· An n-gram analysis of the text, for more accurate language detection.
· An interface that starts the application in 'server' mode, listening on any given port, in addition to letting it be imported as a Python module. This allows the application to serve programs written in other programming languages. It can also inspect Unicode data and guess what language it is in.
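The 'server' mode described above could look roughly like the following sketch, which uses only the Python standard library. It assumes a simple line-based protocol (one line of UTF-8 text in, one language name out); the project does not specify its wire format, so the protocol, the handler class, and the stub `detect_language` function are all hypothetical stand-ins rather than the real langdet interface.

```python
import socket
import socketserver
import threading


def detect_language(text):
    """Hypothetical stand-in for the real detector; not the langdet API."""
    return "english" if " the " in f" {text.lower()} " else "unknown"


class DetectHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Assumed protocol: one request per line, UTF-8 text in, language name out.
        for raw in self.rfile:
            text = raw.decode("utf-8", errors="replace").strip()
            self.wfile.write((detect_language(text) + "\n").encode("utf-8"))


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), DetectHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query(host, port, text):
    """Minimal client; any language with TCP sockets could send the same request."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((text + "\n").encode("utf-8"))
        return conn.makefile("r", encoding="utf-8").readline().strip()


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address
    print(query(host, port, "the cat sat on the mat"))
    server.shutdown()
    server.server_close()
```

Because the exchange is plain text over a socket, a client written in another programming language only needs to open a TCP connection, while Python programs can still import the module and call the detector directly.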
Project Information
- License: GNU GPL v2
- 3 stars
- hg-based source control
Labels:
LanguageDetector
Langdet
Language
LanguageGuesser
Detectlanguage