Updated Jun 26, 2009 by David.Jurgens
Labels: Featured
LatentSemanticAnalysis  
A description of the Latent Semantic Analysis implementation of the S-Space package.

Introduction

Latent Semantic Analysis (LSA) is an algorithm that uses a collection of documents to construct a semantic space. The algorithm constructs a word-by-document matrix where each row corresponds to a unique word in the document corpus and each column corresponds to a document. The value at each position is the number of times the row's word occurs in the column's document. The Singular Value Decomposition (SVD) of the word-document matrix is then computed, producing three matrices (UΣVᵀ): U, the word space; Σ, the singular values; and V, the document space. (See Wikipedia for more details on the SVD.) U is then truncated to a small number of columns (typically 300), those associated with the largest singular values, and the rows of the truncated matrix are the final semantic vectors.
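
The matrix construction step described above can be sketched in a few lines of Java. This is an illustration only and is not taken from the S-Space source: the class name WordDocumentMatrixSketch, its methods, and the lowercase whitespace tokenization are all assumptions made for the example, and a dense int array is used for readability where a real corpus would need a sparse matrix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; hypothetical names, not part of the S-Space API. */
class WordDocumentMatrixSketch {

    /** Splits text on whitespace after lowercasing (a simplifying assumption). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Assigns each unique word a row index in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> wordToRow = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String token : tokenize(doc)) {
                wordToRow.putIfAbsent(token, wordToRow.size());
            }
        }
        return wordToRow;
    }

    /** Rows are words, columns are documents, entries are occurrence counts. */
    static int[][] countMatrix(List<String> documents, Map<String, Integer> wordToRow) {
        int[][] counts = new int[wordToRow.size()][documents.size()];
        for (int col = 0; col < documents.size(); col++) {
            for (String token : tokenize(documents.get(col))) {
                counts[wordToRow.get(token)][col]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat", "the dog sat", "the cat saw the dog");
        Map<String, Integer> vocab = buildVocabulary(docs);
        int[][] counts = countMatrix(docs, vocab);
        // One row per word, one column per document.
        for (Map.Entry<String, Integer> e : vocab.entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(counts[e.getValue()]));
        }
    }
}
```

In this toy corpus the row for "the" holds the counts 1, 1 and 2 across the three documents. The SVD would then be applied to this matrix, or to a reweighted version of it.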

For more information on LSA, see the Wikipedia page on LSA. Also the following papers give a good introduction to the uses of LSA:

  • T. K. Landauer and S. T. Dumais, "A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge," Psychological Review, vol. 104, pp. 211–240, 1997.
  • T. K. Landauer, P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis," Discourse Processes, vol. 25, pp. 259–284, 1998.

S-Space Implementation

The current S-Space implementation of LSA consists of two files. LatentSemanticAnalysis.java contains the full algorithmic implementation and is suitable for use as a library in other code. LSAMain.java is a command-line front end that uses the LatentSemanticAnalysis class. It is provided as lsa.jar in the release packages.

Software Requirements

The S-Space implementation uses existing software implementations of the SVD. At least one of the following packages should be installed:

  1. SVDLIBC
  2. Matlab
  3. GNU Octave. Note that the required sparse SVD method is in an optional package and requires that the ARPACK bindings for Octave be installed.
  4. JAMA. Note that if JAMA is used, it must be on the CLASSPATH when LSA is run. If LSA is invoked from a .jar (e.g. lsa.jar) and JAMA is to be used for computing the SVD, then the path to the JAMA .jar file must be specified using the system property jama.path. To set this on the command line, use -Djama.path=<.jar location>.

The S-Space implementation will work with any of these implementations. However, note that each has its own scalability limitations. We recommend SVDLIBC as it is the most scalable option.

Preprocessing the word-document matrix

Many studies have shown that preprocessing the word-document matrix can improve the resulting word semantics. The S-Space package provides three preprocessing classes that are commonly used:

In addition, users can supply their own transformation by implementing the MatrixTransformer interface and specifying their class as the transform that LSA should use.
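
As an illustration of what such a preprocessing step computes, the sketch below applies log-entropy weighting, a scheme commonly discussed in the LSA literature. Each count tf is replaced by g · ln(1 + tf), where the global weight for a word is g = 1 + Σ_j p_j ln(p_j) / ln(n), p_j is the fraction of the word's occurrences that fall in document j, and n is the number of documents. This is a standalone sketch under stated assumptions: it does not implement the MatrixTransformer interface (whose methods are not described on this page), the class name LogEntropySketch is hypothetical, and no claim is made that it matches any class shipped with S-Space.

```java
/** Illustrative sketch only; hypothetical class, not part of the S-Space API. */
class LogEntropySketch {

    /**
     * Returns a log-entropy weighted copy of a dense word-by-document count
     * matrix (rows are words, columns are documents).
     */
    static double[][] logEntropy(int[][] counts) {
        int rows = counts.length;
        int cols = rows == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            double total = 0;
            for (int c : counts[i]) {
                total += c;
            }
            // Normalized entropy term: sum_j p_j * ln(p_j) / ln(n).
            // With a single document ln(n) is 0, so the term is left at 0
            // (an arbitrary choice made for this sketch).
            double entropyTerm = 0;
            if (total > 0 && cols > 1) {
                for (int c : counts[i]) {
                    if (c > 0) {
                        double p = c / total;
                        entropyTerm += p * Math.log(p);
                    }
                }
                entropyTerm /= Math.log(cols);
            }
            double global = 1.0 + entropyTerm;
            for (int j = 0; j < cols; j++) {
                // Local weight dampens raw counts; global weight downweights
                // words spread evenly over all documents.
                weighted[i][j] = global * Math.log(1.0 + counts[i][j]);
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        int[][] counts = {{2, 2}, {4, 0}};
        double[][] w = logEntropy(counts);
        for (double[] row : w) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

A word that occurs equally often in every document receives a global weight of zero, while a word concentrated in a single document keeps a global weight of one, which is the intuition behind weighting the matrix before the SVD.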

For further details of preprocessing, see the following two papers:

  • S. Dumais, “Enhancing performance in latent semantic indexing (LSI) retrieval,” Bellcore, Morristown (now Telcordia Technologies), Tech. Rep. TM-ARH-017527, 1990.
  • P. Nakov, A. Popova, and P. Mateev, “Weight functions impact on LSA performance,” in Proceedings of the EuroConference Recent Advances in Natural Language Processing, (RANLP’01), 2001, pp. 187–193.

Running LSA from the command-line

LSA can be invoked either using java edu.ucla.sspace.mains.LSAMain or through the jar release java -jar lsa.jar. Both ways are equivalent.

We provide the following options for changing the behavior of LSA and how the program is run.

  • LSA Options
    • -n | --dimensions <int> how many dimensions to use for the LSA vectors. See LatentSemanticAnalysis for default value
    • -p | --preprocess <class name> specifies an instance of MatrixTransform to use in preprocessing the word-document matrix compiled by LSA prior to computing the SVD.
    • -F | --tokenFilter=FILE[=include|exclude][,FILE...] specifies a list of one or more files to use for filtering the documents. An option flag may be added to each file to specify how the words in the filter file should be used: include if only the words in the filter file should be retained in the document; exclude if only the words not in the filter file should be retained in the document. The default value is include. An example configuration might look like: --tokenFilter=english-dictionary.txt=include,stop-list.txt=exclude
  • Program Options
    • -o | --outputFormat={text|binary} Specifies the output formatting to use when generating the semantic space (.sspace) file. See FileFormats for format details.
    • -t | --threads <int> how many threads to use when processing the documents. The default is one per core.
    • -w | --overwrite <boolean> specifies whether to overwrite the existing output files. The default is true. If set to false, a unique integer is inserted into the file name.
    • -v | --verbose specifies whether to print runtime information to standard output.
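
The include and exclude semantics of --tokenFilter can be made concrete with a short Java sketch. It is not the S-Space parser: the class TokenFilterSketch and its methods are hypothetical, the word sets are passed in directly instead of being read from files, and how S-Space combines several filters is not specified here, so the sketch only parses the option value and applies a single filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; hypothetical names, not part of the S-Space API. */
class TokenFilterSketch {

    /**
     * Parses a value such as "a.txt=include,b.txt=exclude" into an ordered map
     * from file name to a flag that is true for include and false for exclude.
     * A file with no flag defaults to include, as the option text states.
     */
    static Map<String, Boolean> parseSpec(String spec) {
        Map<String, Boolean> filters = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.split("=", 2);
            boolean include = true;
            if (parts.length == 2) {
                if (parts[1].equals("exclude")) {
                    include = false;
                } else if (!parts[1].equals("include")) {
                    throw new IllegalArgumentException("Unknown flag: " + parts[1]);
                }
            }
            filters.put(parts[0], include);
        }
        return filters;
    }

    /**
     * Applies one filter: include keeps only tokens found in the word set,
     * exclude keeps only tokens not found in it.
     */
    static List<String> apply(List<String> tokens, Set<String> words, boolean include) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (words.contains(token) == include) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(parseSpec("english-dictionary.txt=include,stop-list.txt=exclude"));
        List<String> tokens = Arrays.asList("the", "cat", "zzz");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(apply(tokens, stopWords, false));
    }
}
```

With the stop list containing only "the", the exclude filter keeps "cat" and "zzz"; the same word set used as an include filter would keep only "the".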

The program will then produce a file that contains the entire semantic space. See FileFormats for exact details on the output formatting.

Acknowledgments

