
Lexical substitutes have found use in word sense disambiguation, unsupervised part-of-speech induction, paraphrasing, machine translation, and text simplification. Using a statistical language model to find the most likely substitutes in a given context is a successful approach, but the cost of a naive algorithm is proportional to the vocabulary size. This paper presents the Fastsubs algorithm, which efficiently and correctly identifies the most likely lexical substitutes for a given context based on a statistical language model without going through most of the vocabulary. The efficiency of Fastsubs makes large-scale experiments based on lexical substitutes feasible. For example, it is possible to compute the top 10 substitutes for each of the 1,173,766 tokens in the Penn Treebank in about 6 hours on a typical workstation. The same task would take about 6 days with the naive algorithm. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available from the author's website at
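To make the cost of the naive baseline concrete, here is a minimal sketch of it in Python. It is not the Fastsubs algorithm and not the author's implementation. It assumes a toy add-one-smoothed bigram model, and the names `train_bigram_counts`, `bigram_logprob`, and `naive_substitutes` are hypothetical. It scores every vocabulary word in the target slot, which is why the work per token grows linearly with the vocabulary size.

```python
import heapq
import math
from collections import Counter


def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a token list (toy model, an assumption)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one-smoothed log P(word | prev); the smoothing choice is illustrative only."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def naive_substitutes(left, right, unigrams, bigrams, k=10):
    """Score every vocabulary word w in the context 'left w right' and keep the top k.

    The loop visits the whole vocabulary, so the cost per token is O(|V|).
    """
    scored = []
    for w in unigrams:
        score = (bigram_logprob(left, w, unigrams, bigrams)
                 + bigram_logprob(w, right, unigrams, bigrams))
        scored.append((score, w))
    return heapq.nlargest(k, scored)


# Tiny illustrative corpus (made up for this sketch).
corpus = "the cat sat on the mat the dog sat on the rug".split()
unigrams, bigrams = train_bigram_counts(corpus)
top = naive_substitutes("the", "sat", unigrams, bigrams, k=2)
print([w for _, w in top])
```

On this toy corpus the two best substitutes for the slot in "the _ sat" are "cat" and "dog". A real system would use a higher-order language model with proper smoothing over a vocabulary large enough that this exhaustive loop becomes the bottleneck.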
