saphre


Saphre - Suffix Arrays For Phrase Extraction

Saphre is a Java-based suffix array library inspired by the work of Abouelhoda et al. (2004) and Yamamoto and Church (2001). It implements the Linearized Suffix Tree (Kim et al. 2008). It can be used to analyze large amounts of textual data and is especially suited to linguistic purposes (corpora with a large alphabet size). Saphre offers a simple interface for indexing (string) data, serialization, the computation of statistics, etc.
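To illustrate the underlying data structure, here is a minimal sketch of a word-level suffix array with its longest-common-prefix (LCP) array. It is not Saphre's API and not the Linearized Suffix Tree: the class and method names are hypothetical, and construction uses naive comparison sorting rather than an efficient algorithm.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from Saphre.
public class WordSuffixArraySketch {

    /** Lexicographically compares the token suffixes starting at positions a and b. */
    static int compareSuffixes(String[] tokens, int a, int b) {
        int n = tokens.length;
        while (a < n && b < n) {
            int c = tokens[a].compareTo(tokens[b]);
            if (c != 0) return c;
            a++;
            b++;
        }
        // The suffix that ran out of tokens is a prefix of the other and sorts first.
        return Integer.compare(n - a, n - b);
    }

    /** Returns the start positions of all suffixes in sorted order (naive sort). */
    static int[] buildSuffixArray(String[] tokens) {
        Integer[] order = new Integer[tokens.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> compareSuffixes(tokens, x, y));
        int[] sa = new int[order.length];
        for (int i = 0; i < sa.length; i++) sa[i] = order[i];
        return sa;
    }

    /** lcp[i] = number of leading tokens shared by the suffixes at sa[i-1] and sa[i]. */
    static int[] buildLcp(String[] tokens, int[] sa) {
        int[] lcp = new int[sa.length];
        for (int i = 1; i < sa.length; i++) {
            int a = sa[i - 1], b = sa[i], k = 0;
            while (a + k < tokens.length && b + k < tokens.length
                    && tokens[a + k].equals(tokens[b + k])) {
                k++;
            }
            lcp[i] = k;
        }
        return lcp;
    }

    public static void main(String[] args) {
        String[] tokens = "to be or not to be".split(" ");
        int[] sa = buildSuffixArray(tokens);
        int[] lcp = buildLcp(tokens, sa);
        System.out.println(Arrays.toString(sa));
        System.out.println(Arrays.toString(lcp));
    }
}
```

For the input "to be or not to be", the largest LCP value is 2, which corresponds to the repeated phrase "to be". Repeated phrases appear as runs of adjacent suffixes with a non-zero LCP, which is why the structure suits phrase extraction.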

Special features of Saphre include:

* the extraction of "gappy" (discontinuous) phrases
* maximal and supermaximal repeats
* long repetitions finder (corpus cleaning)
* computation of different metrics over all words in a text collection (term frequency, document frequency, different variations of mutual information, residual inverse document frequency, log-entropy, etc.)
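As an example of one of these metrics, the sketch below computes residual inverse document frequency (RIDF) from a term frequency and a document frequency. It uses the common definition of RIDF as the observed IDF minus the IDF expected under a Poisson model. The class and method names are hypothetical and are not part of Saphre; the phrase counter is a plain linear scan, not a suffix array lookup.

```java
// Hypothetical illustration only; none of these names come from Saphre.
public class RidfSketch {

    /** Counts how often the token sequence `phrase` occurs in one tokenized document. */
    static int countOccurrences(String[] doc, String[] phrase) {
        if (phrase.length == 0) return 0;
        int count = 0;
        for (int i = 0; i + phrase.length <= doc.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!doc[i + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) count++;
        }
        return count;
    }

    /**
     * RIDF = observed IDF - expected IDF, where
     *   observed IDF = -log2(df / D)
     *   expected IDF = -log2(1 - exp(-tf / D))   (Poisson assumption)
     * tf: total occurrences in the collection, df: documents containing the term,
     * numDocs: D, the number of documents. Assumes tf >= df >= 1.
     */
    static double ridf(int tf, int df, int numDocs) {
        double ln2 = Math.log(2);
        double observedIdf = -Math.log((double) df / numDocs) / ln2;
        double expectedIdf = -Math.log(1.0 - Math.exp(-(double) tf / numDocs)) / ln2;
        return observedIdf - expectedIdf;
    }

    public static void main(String[] args) {
        // A term occurring 4 times, all within 1 of 4 documents, is "bursty":
        // its RIDF is clearly positive.
        System.out.println(ridf(4, 1, 4));
    }
}
```

A term whose occurrences are concentrated in few documents gets a high RIDF, while a term spread evenly across the collection gets a value near zero.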

Moreover, Saphre can be used to detect potential plagiarism in a set of documents by computing various statistical measures over all n-grams in the corpus.
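The idea of comparing documents through their n-grams can be sketched as follows. This is a simplified illustration using hash sets and a single measure (the fraction of a suspect document's n-grams that also occur in a source document); it is an assumption about one possible measure, not Saphre's actual method, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Saphre.
public class NgramOverlapSketch {

    /** Returns the set of distinct word n-grams in a tokenized document. */
    static Set<String> ngrams(String[] tokens, int n) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    /** Fraction of the suspect's distinct n-grams that also appear in the source. */
    static double containment(String[] suspect, String[] source, int n) {
        Set<String> suspectGrams = ngrams(suspect, n);
        if (suspectGrams.isEmpty()) return 0.0;
        Set<String> sourceGrams = ngrams(source, n);
        int shared = 0;
        for (String gram : suspectGrams) {
            if (sourceGrams.contains(gram)) shared++;
        }
        return (double) shared / suspectGrams.size();
    }

    public static void main(String[] args) {
        String[] suspect = "the quick brown fox jumps".split(" ");
        String[] source = "a quick brown fox sleeps".split(" ");
        System.out.println(containment(suspect, source, 3));
    }
}
```

A suffix array over the whole corpus makes this kind of comparison feasible for all n-gram lengths at once, whereas the hash-set version above must be rerun for each value of n.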

Saphre was initially implemented by Dale Gerdemann (Universität Tübingen) and is currently developed by Niko Schenk (Universität Frankfurt am Main). Anyone willing to contribute is welcome to join the project!