Introduction
Latent Semantic Analysis (LSA) is an algorithm that uses a collection of documents to construct a a semantic space. The algorithm constructs a word-by-document matrix where each row corresponds to a unique word in the document corpus and each column corresponds to a document. The value at each position is how many times the row's word occurs in the column's document. Then the Singular Value Decomposition (SVD) is calculated for the word-document matrix to produce three matrices (UΣV), U - the wordspace, Σ - the singular values, and V - the document space. (See Wikipedia for more details on the SVD). The columns of U are then truncated to a small number of dimensions (typically 300), which produces the final semantic vectors.
For more information on LSA, see the Wikipedia page on LSA. Also the following papers give a good introduction to the uses of LSA:
- T. K. Landauer and S. T. Dumais, "A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge," Psychological Review, vol. 104, pp. 211–240, 1997. Available here
- T. K. Landauer, P. W. Foltz, and D. Laham, "Introduction to Latent Semantic Analysis," Discourse Processes, no. 25, pp. 259–284, 1998. Available here.
The regular user should download lsa.jar from the downloads page. LSA should work right out of the box with no configuring and no external software, and has command line documentation on all the options.
S-Space Implementation
The current S-Space implementation of LSA is captured in two files. LatentSemanticAnalysis.java contains all of the algorithmic implementation, and is suitable for use in other code as a library. LSAMain.java is a command-line invokable version of LSA that uses the LatentSemanticAnalysis class. This class is provided as lsa.jar on the release packages.
Software Requirements
The S-Space implementation of LSA uses existing software implementations of the SVD. We currently bundle SVDLIBJ, an open source port of SVDLIBC to Java by Adrian Kuhn and David Erni. In addition, the S-Space Package also supports the following SVD libraries:
- SVDLIBC
- Matlab
- GNU Octave. Note that the required sparse svd method is in an optional package and requires that the ARPACK bindings for Octave are installed.
- JAMA.
- COLT.
Note that should JAMA or COLT are to be used, they need to be specified in the
CLASSPATH variable when LSA is run. If LSA is being invoked from a .jar (e.g.
lsa.jar) you will need to specify the system property
jama.path or
colt.path to point to the appropriate jar. To set this on the command-line, use
-Djama.path=<.jar location>.
The S-Space implementation will work with any of these implementations. However, note that each has its own scalability limitations. We recommend SVDLIBC as it is the most scalable option. However, our performance tests have shown that SVDLIBJ performs around ~92% as fast as SVDLIBC (when compiled with icc on a core2 processor), which is fairly competitive.
Preprocessing the word-document matrix
Many studies have shown that preprocessing the word-document matrix can improve the resulting word semantics. The S-Space package provides three preprocessing classes that are commonly used:
- Log-Entropy - LogEntropyTransform.java.
- Term-Frequency Inverse Document-Frequency - TfIdfTransform.java. See the Wikipedia page for details.
- None - NoTransform.java. Does nothing to the matrix.
In addition, the S-Space implementation also provides the ability for users to provide their own transformation. Users can implement the MatrixTransformer interface, and specify their class as the transform that LSA should use.
For further details of preprocessing, see the following two papers:
- S. Dumais, “Enhancing performance in latent semantic indexing (LSI) retrieval,” Bellcore, Morristown (now Telcordia Technologies), Tech. Rep. TM-ARH-017527, 1990.
- P. Nakov, A. Popova, and P. Mateev, “Weight functions impact on LSA performance,” in Proceedings of the EuroConference Recent Advances in Natural Language Processing, (RANLP’01), 2001, pp. 187–193.
Running LSA from the command-line
LSA can be invoked either using java edu.ucla.sspace.mains.LSAMain or through the jar release java -jar lsa.jar. Both ways are equivalent.
We provide the following options for changing the behavior of LSA and how the program is run.
- Input document sources (must provide at least one)
- -f, --fileList=FILE[,FILE...] one or more files, each containing a list of file names, each of which is treated as a separate document.
- -d, --docFile=FILE[,FILE...] one or more files, in which each line is treated as a separate document. This is the preferred option for LSA operations for large numbers of documents due to reduced I/O demands.
- LSA Options
- -n, --dimensions <int> how many dimensions to use for the LSA vectors. See LatentSemanticAnalysis for default value
- -p, --preprocess <class name> specifies an instance of [MatrixTransform to use in preprocessing the word-document matrix compiled by LSA prior to computing the SVD.
- Tokenizing Options (For more information, see Tokenizing
- -C, --compoundWords=FILE This specifies a file where each line is a recognized compound word. Compount tokenization is greedy and will select the longest compound token present. For example if "bar exam" and "California bar exam" are both compound tokens, the latter will always be returned as a single token, rather than returning the two tokens "California" and "bar exam".
- -F, --tokenFilter=CONFIG A configuration lists sets of files that contain tokens to be included or excluded. The behavior, "include" or "exclude" is specified first, followed by one or more file names, each separated by colons. Multiple behaviors may be specified one after the other using a ',' character to separate them. For example, a typicaly configuration may look like: include=top-tokens.txt:test-words.txt,exclude=stop-words.txt Note behaviors are applied in the order they are presented on the command-line.
- -z, --wordLimit=INT Set the maximum number of words a document can have. (This feature is most useful in artificially truncating the corpus for testing or efficiency.)
- -Z, --stemmingAlgorithm=CLASSNAME This option specifices the stemming algorithm to use on tokens while iterating. The default is no stemming. A stemmer is specified by a fully qualified class name such as org.tartarus.snowball.englishStemmer.
- Program Options
- -o, --outputFormat={text|binary|sparse_text|sparse_binary} Specifies the output formatting to use when generating the semantic space (.sspace) file. See FileFormats for format details. The default is binary.
- -t, --threads <int> how many threads to use when processing the documents. The default is one per core.
- -w, --overwrite <boolean> specifies whether to overwrite the existing output files. The default is true. If set to false, a unique integer is inserted into the file name.
- -v, --verbose specifies whether to print runtime information to standard out
- Advanced Options
- -S, --svdAlgorithm The --svdAlgorithm provides a way to manually specify which algorithm should be used internally. This option should not be used normally, as LSA will select the fastest algorithm available. However, in the event that it is needed, valid options are: SVDLIBC, MATLAB, OCTAVE, JAMA and COLT The compound word option specifies a file whose contents are compount tokens, e.g. white house. Each compound token should be specified on its own line. Compount tokenization is greedy and will select the longest compound token present. For example if "bar exam" and "California bar exam" are both compound tokens, the latter will always be returned as a single token, rather than returning the two tokens "California" and "bar exam".
The program will then produce a file that contains the entire semantic space. See FileFormats for exact details on the output formatting.
Important Note
The LSA program is the definitive authority on the current set of options and their configurations. If you find an option is incorrectly specified on this page, please let us know. Full documentation may be found on the command line by running the lsa.jar program without any options.
Example Command Lines
- Generates a simple .sspace file with the default 300 dimensions.
- java -jar lsa.jar -d corpus.txt my-lsa-output.sspace
- Has the JVM use 4GB of ram when performing LSA (more ram is almost always better)
- java -Xmx8g -jar lsa.jar -d corpus.txt my-lsa-output.sspace
- Removes stop words from the corpus while processing. (Note: LSA doesn't do this in the original papers)
- java -Xmx8g -jar lsa.jar -d corpus.txt -F exclude=stopwords.txt my-lsa-output-no-stopwords.sspace
- Generates an LSA space with 500 dimensions
- java -Xmx8g -jar lsa.jar -d corpus.txt -n 500 my-lsa-output-500dim.sspace
- Generates an LSA space with known compound words
- java -Xmx8g -jar lsa.jar -d corpus.txt -C my-list-of-ngrams.txt my-lsa-output-with-ngrams.sspace
- Runs LSA with SVDLIBJ specifically (Note: the algorithm choice shouldn't affect the final vector values - only the runtime of LSA)
- java -Xmx8g -jar lsa.jar -d corpus.txt -S SVDLIBJ my-lsa-output.sspace
Acknowledgments
- We are grateful for the advice and assistance of Tom Landauer, Walter Kintsch and Praful Mangalath of the Latent Semantic Analysis group at the University of Colorado, Boulder.
- We are grateful to Doug Rohde for making the SVDLIBC program freely available.
- We are very grateful to Adrian Kuhn and David Erni for creating SVDLIBJ by porting SVDLIBC to Java.
i downloaded lsa.jar and created a corpus and tried to run it and got the following error:
Mark-Kleins-Computer-3:Downloads markklein$ java -jar lsa.jar -d E-3JOYFY-52.txt my-lsa-output.sspace Aug 15, 2010 7:58:00 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace INFO: performing log-entropy transform Aug 15, 2010 7:58:00 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace INFO: reducing to 300 dimensions java.lang.UnsupportedOperationException?: No SVD algorithms are available
What can I do to get it to work?
Hi, I get the same error msg. what is the next step taken if such error is encountered. Thanks. SJ
Please email s-space-research-dev@googlegroups.com and we'll start debugging from there. I believe the bug reported earlier is fixed in the trunk and in the current lsa.jar file, so we'll have to see what's causing your error.
How long should conversion from MATLAB to SVDLIBC last? My computer is running for 48h already. processing 450k files (each 2kb)
The conversion process should be fast, even for very large matrices. However, the SVD step may take some time. 48 hours seems like a very long time for either one given the file sizes you listed. Could you email our dev mailing list (s-space-research-dev@googlegroups.com) with more information? We should be able to figure out why the program would be taking so long.
I have sent an exception to dev list, but haven't received an answer. (IndexOutofBounds? in class MatrixIO, row 558) I was surprised to receive it after 2 days of SVD conversion.
Hi, Can we build document classifier based on the class LatentSemanticAnalysis? Something similar to:
LatentSemanticAnalysis lsa=new LatentSemanticAnalysis(); for(String document:documents) lsa.processDocument(new BufferedReader(new StringReader(document.toLowerCase()))); lsa.processSpace(p); double simillarity=0; int simIndex=-1; DoubleVector targetDocument=lsa.getDocumentVector(documents.length-1); for(int i=0;i<documents.length-1;i++){ double sim=Similarity.getSimilarity(Similarity.SimType.COSINE, lsa.getDocumentVector(i), targetDocument); if(sim>simillarity){ simillarity=sim; simIndex=i; } } System.out.println(String.format("Target document:'%s'",documents[documents.length-1])); System.out.println(String.format("Similar document found:'%s'",documents[simIndex]));The problem with the above approach is that I don’t want to recalculate the Semantic Space all the time. Rather, the Semantic Space should be calculated for some training set of documents and after that new documents should be compared with the existing document vectors. Is that possible with the current code?
Hi, it's almost possible with the current code, but you'd need to do some more work. Specifically, the new document you'd like to compare exists in a different geometric space than the document vectors in the LSI/LSA space. (During processSpace, the SVD step is mapping your set of documents to a different basis). To compare a new document, you have to project its document vector into the space where your reference set exists. This actually isn't too difficult to do. (See the Wikipedia page for the full math details.)
Assuming LSA used the SVD to covert your term-document matrix into U, S, and V matrices (U is the word space and V is the document space), you'd need to perform the following multiplication: U-1 ST d, where d is your new document vector. Currently, the only unimplemented part of this procedure in the S-Space Package is finding the matrix inverse of U. To get your use case to work we'd need to implement matrix inversion and then add a few bits to LSA to save the required matrices.
Also, note that this projection doesn't change the original LSI space; so any new vectors that you compute will not affect your training set. If you're going to be comparing a lot of new documents, it may make sense to recompute the LSI space periodically. Also, the projection won't include information for any new terms in the new documents. If your test documents all start referencing terms that aren't in the training set, then recomputing the space also seems necessary.
Hi David,
I understand. The space (SVD) must be recalculated eventually. For document classifier it should be on demand, for example if we flag some document as important or something (Google Priority Inbox), the space will be recalculated for the upcoming classifications. It would be nice if S-Space has out of the box support for this scenario:
Steps 3-5 are repeating until a new document is flagged as semantic “governor” and the process continues from the step 1.
Nice work! I was reading through the code. It is well written and easy to follow.
I’m just curious, is there any specific reason for not using the existing class edu.ucla.sspace.common.DocumentVectorBuilder? for the classification purposes? All that we need is save/load support for the space’s document vectors, the same way we save/load the term vectors. The builder produced vector (the query) will be compared with each space document vector for classification match.
I actually do have support for Serializing the LSA Space to disk in a branch somewhere. Another person had requested it, so we were looking at what the best way to do it is. (There's a few issues with how we treat the document space, since its matrix might exist on disk). I'll see if I can clean those up soon and add them in. This still doesn't resolve the missing matrix inversion functionality for step 4, but it's a step in that direction.
As for the second question, it's definitely possible to use the DocumentVectorBuilder? (DVB) for what you're trying to do. However, given that LSA has an explicit document representation, I think it would make more sense to use that. Also, I'm not certain we've really tested how well the DVB works for classification or clustering. The DVB has a very naive treatment of what it means to represent a document, where it just sums the vectors of the document's tokens. The summation ignores all aspects of compositionality, word order, and collocations. If you decide to try using it, we'd love to hear what you discover.
Also, it's probably easiest to continue this discussion on the developer mailing list: s-space-reserach-dev@googlegroups.com.
I get the following error:
java -d64 -Xms2G -Xmx7G -jar lsa-1.3.jar -d all.txt -n 300 my-lsa-output-300.sspace Apr 7, 2011 6:17:00 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace INFO: performing log-entropy transform Apr 7, 2011 6:17:03 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace INFO: reducing to 300 dimensions Apr 7, 2011 6:17:04 PM edu.ucla.sspace.matrix.MatrixIO matlabToSvdlibcSparseBinary INFO: Converting from Matlab double values to SVDLIBC float values; possible loss of precision Apr 7, 2011 6:17:24 PM edu.ucla.sspace.matrix.SVD svd SEVERE: convertFormat java.io.EOFException
java.lang.UnsupportedOperationException?: Unknown algorithm: ANYAny ideas?
Is there any way to set retainDocumentSpace to "true" during the launching of lsa-1.3.jar?
I have created the lsa space using the lsa.jar . Now I'm able to get the similarity between any two words, and also the number of neighbors for a given word using COSINE distance measure. How do I get similarity between a sentence and a word in the given lsa space, or similarity between document and a word? Please do the needful.
Hello there, I am new to this and never worked on LSA before.
I want to use LSA to compare sentences for semantic similarities and wanted to know if lsa-1.3.jar can be used for this purpose.
I have downloaded it and seen that it works with documents only and I really can't understand what the output means or how do it convert it into a more readable format. Just to test,I inserted a document with some sentences and specified the output as text file,the output was incomprehensible, how do I read the output?
Can someone tell me if is possible to find similarties between sentences and know how similar they are as output?
Hi, i want use the lib to compute the co-occurrence of terms in many docs, which class i should use, sorry, there is no examples, so its a little bit difficult.
If you're wanting to compute the co-occurrence, you'll want to use the GenericWordSpace? class, which records word co-occurrences as features. There's not a .jar file for running that class, but we could probably get one up and running easily if you don't want to write the code. Please email the development team at s-space-reserach-dev@googlegroups.com and we can talk more there.
Just to note, LSA doesn't directly record word co-occurrences (only document co-occurrences), so you probably don't want to use the lsa.jar for what you're describing.
I am trying to have the LSA output go to a text, rather than binary, file. When I run the following command line:
java -jar lsa-1.3.jar -d E-3O6ICQ-61.txt –o TEXT output.txt
the output is in a binary file named "-o". What am I doing wrong?
Mark Klein Principal Research Scientist MIT Center for Collective Intelligence http://cci.mit.edu/klein/
Hi, I Have installed the S-Space package and tried it out on a small test file, but I get following error - any ideas what's going wrong? Tx!!
20-apr-2012 12:10:24 edu.ucla.sspace.util.LoggerUtil? info INFO: performing log-entropy transform 20-apr-2012 12:10:24 edu.ucla.sspace.util.LoggerUtil? info INFO: reducing to 300 dimensions 20-apr-2012 12:10:24 edu.ucla.sspace.matrix.MatrixIO matlabToSvdlibcSparseBinary INFO: Converting from Matlab double values to SVDLIBC float values; possible loss of precision java.lang.NumberFormatException?: For input string: "1,584963"