
saphre
Saphre is a Java-based suffix array library inspired by the work of Abouelhoda et al. (2004) and Yamamoto and Church (2001). It implements the Linearized Suffix Tree (Kim et al. 2008). It can be used to analyze large amounts of textual data and is especially suited to linguistic applications, such as corpora with a large alphabet size. Saphre offers a very simple interface for indexing (string) data, serialization, the computation of statistics, etc.
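To make the idea of a suffix array over a large alphabet concrete, here is a minimal sketch that treats each word as one symbol. It is not Saphre's API and not the Linearized Suffix Tree: the class and method names are hypothetical, and the construction is a plain comparison sort chosen for readability rather than speed.

```java
import java.util.Arrays;

// Hypothetical illustration; none of these names come from Saphre.
final class TokenSuffixArraySketch {

    // Compares the suffixes starting at token positions a and b, word by word.
    static int compareSuffixes(String[] t, int a, int b) {
        while (a < t.length && b < t.length) {
            int c = t[a].compareTo(t[b]);
            if (c != 0) return c;
            a++;
            b++;
        }
        // One suffix is a prefix of the other: the shorter one sorts first.
        return Integer.compare(t.length - a, t.length - b);
    }

    // Builds the suffix array by sorting all start positions (naive, not linear time).
    static int[] buildSuffixArray(String[] tokens) {
        Integer[] order = new Integer[tokens.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> compareSuffixes(tokens, a, b));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) sa[i] = order[i];
        return sa;
    }

    // lcp[i] = number of leading words shared by the suffixes at sa[i-1] and sa[i].
    static int[] buildLcp(String[] t, int[] sa) {
        int[] lcp = new int[sa.length];
        for (int i = 1; i < sa.length; i++) {
            int a = sa[i - 1], b = sa[i], k = 0;
            while (a + k < t.length && b + k < t.length && t[a + k].equals(t[b + k])) k++;
            lcp[i] = k;
        }
        return lcp;
    }

    // Returns the longest word sequence that occurs at least twice ("" if none).
    static String longestRepeat(String[] t) {
        int[] sa = buildSuffixArray(t);
        int[] lcp = buildLcp(t, sa);
        int best = 0;
        for (int i = 1; i < lcp.length; i++) {
            if (lcp[i] > lcp[best]) best = i;
        }
        if (lcp.length == 0 || lcp[best] == 0) return "";
        return String.join(" ", Arrays.copyOfRange(t, sa[best], sa[best] + lcp[best]));
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        int[] sa = buildSuffixArray(tokens);
        System.out.println(Arrays.toString(sa));                   // [5, 1, 3, 2, 4, 0]
        System.out.println(Arrays.toString(buildLcp(tokens, sa))); // [0, 1, 0, 0, 0, 2]
        System.out.println(longestRepeat(tokens));                 // to be
    }
}
```

Adjacent entries in the sorted order share prefixes, so the LCP array exposes repeated phrases directly. A production library would replace the comparison sort with a dedicated suffix sorting algorithm.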
Special features of Saphre include:
- the extraction of "gappy" (discontinuous) phrases
- maximal and supermaximal repeats
- long repetitions finder (corpus cleaning)
- computation of different metrics over all words in a text collection (term frequency, document frequency, different variations of mutual information, residual inverse document frequency, log-entropy, etc.)
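As an illustration of the word-level metrics listed above, the following sketch computes term frequency, document frequency, and residual inverse document frequency (RIDF) for single words using hash maps. It is an assumption-laden example, not Saphre's implementation: the class and method names are hypothetical, and RIDF is taken in its usual form as the observed IDF minus the IDF expected under a Poisson model, that is, -log2(df/N) + log2(1 - exp(-tf/N)) for N documents.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; none of these names come from Saphre.
final class TermStatsSketch {

    // Total number of occurrences of each word across all documents.
    static Map<String, Integer> termFrequency(List<String[]> docs) {
        Map<String, Integer> tf = new HashMap<>();
        for (String[] doc : docs) {
            for (String w : doc) tf.merge(w, 1, Integer::sum);
        }
        return tf;
    }

    // Number of documents that contain each word at least once.
    static Map<String, Integer> documentFrequency(List<String[]> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            Set<String> seen = new HashSet<>(Arrays.asList(doc));
            for (String w : seen) df.merge(w, 1, Integer::sum);
        }
        return df;
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Residual IDF: observed IDF minus the IDF predicted by a Poisson model.
    // Assumes tf >= df >= 1 and numDocs >= 1.
    static double ridf(int tf, int df, int numDocs) {
        double observedIdf = -log2((double) df / numDocs);
        double expectedIdf = -log2(1.0 - Math.exp(-(double) tf / numDocs));
        return observedIdf - expectedIdf;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                "a b a".split(" "),
                "b c".split(" "),
                "a".split(" "));
        Map<String, Integer> tf = termFrequency(docs);
        Map<String, Integer> df = documentFrequency(docs);
        for (String w : tf.keySet()) {
            System.out.printf("%s tf=%d df=%d ridf=%.4f%n",
                    w, tf.get(w), df.get(w), ridf(tf.get(w), df.get(w), docs.size()));
        }
    }
}
```

A suffix-array-based library can compute such statistics for all substrings or n-grams at once; the hash-map version here covers only single words and is meant to show the definitions.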
Moreover, Saphre can be used to detect potential plagiarism in a set of documents by computing various statistical measures over all n-grams in the corpus.
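The source does not say which statistical measures Saphre uses for plagiarism detection. As one simple possibility, the sketch below measures n-gram containment: the fraction of a suspect document's distinct word n-grams that also occur in a source document. The class name, method names, and the choice of containment are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not Saphre's API or its actual measures.
final class NgramOverlapSketch {

    // Collects the distinct word n-grams of a text (lowercased, whitespace-tokenized).
    static Set<String> wordNgrams(String text, int n) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }

    // Fraction of the suspect's distinct n-grams that also appear in the source.
    static double containment(String suspect, String source, int n) {
        Set<String> suspectGrams = wordNgrams(suspect, n);
        if (suspectGrams.isEmpty()) return 0.0;
        Set<String> shared = new HashSet<>(suspectGrams);
        shared.retainAll(wordNgrams(source, n));
        return (double) shared.size() / suspectGrams.size();
    }

    public static void main(String[] args) {
        String suspect = "the quick brown fox jumps";
        String source = "a quick brown fox sleeps";
        // Only "quick brown fox" is shared among the suspect's three trigrams.
        System.out.println(containment(suspect, source, 3));
    }
}
```

A high containment score between two documents only flags a candidate pair for manual inspection; it is not proof of copying.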
Saphre was initially implemented by Dale Gerdemann (Universität Tübingen) and is currently developed by Niko Schenk (Universität Frankfurt am Main). Anyone willing to contribute is welcome to join the project!
Project Information
The project was created on Sep 13, 2013.
- License: GNU GPL v3
- svn-based source control
Labels:
SuffixArray
SuffixArrays
Java
Academic
SuffixTree
LinearisedSuffixTree
CorpusLinguistics
DiscontinuousPhrases
DiscontinuousRepeats
GappyPhrases
SuffixSorting
LongestCommonPrefix
BurrowsWheelerTransform
IntervalTree
MaximalPhrases