PTStemmer - A Java stemming toolkit for the Portuguese language
FEATURES
- Java implementation of Orengo and Porter stemmers
- Stopword and named entity removal
- Least Recently Used (LRU) stem cache
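
To illustrate the idea behind the LRU stem cache, here is a minimal sketch built on the standard java.util.LinkedHashMap in access-order mode. The class name LruStemCache and its constructor parameter are hypothetical and chosen only for this example. This is not PTStemmer's own cache implementation, which may work differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of an LRU word -> stem cache.
// NOT the class PTStemmer uses internally; names are assumptions.
class LruStemCache extends LinkedHashMap<String, String> {
    private static final long serialVersionUID = 1L;

    // Maximum number of word -> stem pairs kept in memory.
    private final int capacity;

    LruStemCache(int capacity) {
        // accessOrder = true: get() and put() move an entry to the
        // "most recently used" end of the iteration order.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }
}
```

A stemmer could consult such a cache before running the stemming rules and store the result afterwards, so that frequent words are stemmed only once. Note that LinkedHashMap is not thread-safe, so concurrent use would need external synchronization.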
FILES
- ptstemmer.jar
    - The compiled library.
- doc
    - The Javadoc API documentation.
- src
    - The source code.
- data
    - Stopword and named entity lists.
REQUIREMENTS
- Java 5
USAGE
- Console
    java -jar ptstemmer.jar
- API
    Stemmer st = new OrengoStemmer();
    st.ignoreNamedEntities(new NamedEntitiesFromFile("data/namedEntities.txt")); // Optional
    st.ignoreStopWords(new StopWordsFromFile("data/stopwords.txt")); // Optional
    st.enableCaching(1000); // Optional
    System.out.println(st.wordStemming("extremamente"));
TODO
- Improve performance (e.g., Tries)
- Improve natural language processing (e.g., tokenization, normalization, etc.)
- Add more functionality (e.g., grammatical correction, word sense disambiguation (WSD), etc.)
DEVELOPED & SUPPORTED
Pedro Oliveira http://student.dei.uc.pt/~pcoliv