DocumentSearch
Document Search in SemanticVectors.
It's easy to search for documents in SemanticVectors by telling the search program to use an appropriate vector store. There is no special class or interface for searching documents: the functionality comes from the more general ability to take query terms and search over a variety of different vector stores. The query file is set using the -queryvectorfile option and the search file is set using the -searchvectorfile option.

Searching for Documents using Terms

The default BuildIndex command builds both term vectors (termvectors.bin) and document vectors (docvectors.bin). To search for the document vectors closest to the vector for Abraham, you would therefore use the command:

    java pitt.search.semanticvectors.Search -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin Abraham

Using Documents as Queries

You can also use the document file as a source of queries. For example, to find the terms most closely related to Chapter 1 of Genesis, you'd use:

    java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase bible_chapters/Genesis/Chapter_1

With default settings, this brings up pretty generic terms like "unto", "i", "them", "have". I'm not exactly sure why this is so, but have had better results with traditional LSA, e.g., using the King James Bible corpus:

    $ java pitt.search.semanticvectors.LSA index/
    $ java pitt.search.semanticvectors.Search -queryvectorfile svd_docvectors.bin -searchvectorfile svd_termvectors.bin -matchcase bible_chapters/Luke/Chapter_2
    Opening query vector store from file: svd_docvectors.bin
    Opening search vector store from file: svd_termvectors.bin
    Searching term vectors, searchtype sum
    Found vector for 'bible_chapters/Luke/Chapter_2'
    Search output follows ...
    0.9513617955705869:anna
    0.9513617955705869:cyrenius
    0.9513617955705869:lineage
    0.9513617955705869:phanuel
    0.9513617955705869:pondered
    0.9513617955705869:swaddling
    0.9513617930195419:manger
    0.8963310879414658:taxed

For Document to Document Search, use commands just like those above but set -searchvectorfile to the same file as -queryvectorfile, that is, the path to the document vectors.

Lucene Term Weighting

Any term weighting for documents is computed when the document vectors are created, as part of the index building process. So giving a -luceneindexdir argument when using documents as queries will not help you at all, and can cause SemanticVectors to discard your query terms (since, for example, /files/file1.txt isn't a term that the Lucene index recognizes).

Document Filtering

Filtering search results based on some regular expression can be particularly useful for restricting document searches to specific parts of a corpus. See FilteredSearchResults.

Comparing Two Documents

If you want a pairwise comparison score between two documents, Doc1 and Doc2, then you can use the CompareTerms tool. For documents that are already part of your index, use something like:

    java pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin ./path/to/Doc1 ./path/to/Doc2

See Deswick's comment below for a complete example.

For new documents that were not indexed when the model was built, you can still compare the vectors produced by summing the constituent terms, which is conceptually equivalent. (Note the 'conceptually' part: any differences in term-weighting, stemming, case-normalization, etc., will affect your results.) The terms in each document need to be assembled into query statements, surrounded by double quotes. This could be done on Unix-like systems using something like:

    java pitt.search.semanticvectors.CompareTerms -queryvectorfile termvectors.bin "`cat Doc1`" "`cat Doc2`"

Of course, if there are double quotes inside your documents, this will lead to escaping problems. Be prepared to do a certain amount of cutting, pasting, search and replace, and escaping of special characters using your favorite tools.
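
The idea of comparing two unindexed documents by summing their constituent term vectors can be sketched in a few lines of Java. This is only a toy illustration under stated assumptions: the class SummedVectorSimilarity, its methods, and the three-dimensional vectors are invented for the example and are not part of the SemanticVectors API. Tokenization here is plain whitespace splitting with lowercasing, and no term weighting is applied, so real results will differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration only; this is NOT the SemanticVectors API. */
class SummedVectorSimilarity {

    /** Sums the vectors of all known terms in the text; unknown terms are skipped. */
    static double[] sumTermVectors(String text, Map<String, double[]> termVectors, int dimension) {
        double[] sum = new double[dimension];
        // Assumption: whitespace tokenization and lowercasing stand in for real analysis.
        for (String token : text.toLowerCase().split("\\s+")) {
            double[] v = termVectors.get(token);
            if (v == null) {
                continue; // term has no vector, so it contributes nothing
            }
            for (int i = 0; i < dimension; i++) {
                sum[i] += v[i];
            }
        }
        return sum;
    }

    /** Cosine similarity; returns 0.0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical term vectors, made up for the example.
        Map<String, double[]> termVectors = new HashMap<>();
        termVectors.put("manger", new double[] {1.0, 0.0, 0.0});
        termVectors.put("swaddling", new double[] {0.8, 0.6, 0.0});
        termVectors.put("taxed", new double[] {0.0, 0.0, 1.0});

        double[] doc1 = sumTermVectors("manger swaddling", termVectors, 3);
        double[] doc2 = sumTermVectors("swaddling taxed", termVectors, 3);
        System.out.println("similarity = " + cosine(doc1, doc2));
    }
}
```

Because the sum ignores terms that have no vector, a document made up entirely of unknown words ends up as the zero vector, which this sketch scores as 0.0 against everything.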
Example use case: comparing 2 documents explicitly
Example using Windows and Eclipse (Semantic Vectors 2.2 and Lucene 3.1.0). Compatibility is important: http://code.google.com/p/semanticvectors/wiki/LuceneCompatibility
Make sure the Lucene and Semantic Vectors jar files are referenced in the classpath of the Eclipse project.
Data directory: C:\workspace\SV\src\docsDir (contains file1.txt and file2.txt)
Index directory: C:\workspace\SV\src\indexDir
1) Run the Lucene indexer in Eclipse with arguments: org.apache.lucene.demo.IndexFiles.java -index \workspace\SV\src\indexDir -docs \workspace\SV\src\docsDir
2) Run the Semantic Vectors indexer on the Lucene index: pitt.search.semanticvectors.BuildIndex.java \workspace\SV\src\indexDir
It will create termvectors.bin and docvectors.bin in C:/workspace/SV
3) Run the document comparison in Eclipse: pitt.search.semanticvectors.CompareTerms.java -queryvectorfile docvectors.bin \workspace\SV\src\docsDir\file1.txt \workspace\SV\src\docsDir\file2.txt
It will display the semantic similarity of the 2 documents like this:
    INFO: Outputting similarity of "\workspace\SV\src\docsDir\file1.txt" with "\workspace\SV\src\docsDir\file2.txt" ...
    0.437294
If you encounter an error saying 'No vector found for the file names', you need to find out the exact path information with which Lucene indexed the data files. This can be done using: org.apache.lucene.demo.SearchFiles.java -index \workspace\SV\src\indexDir
The Eclipse console will ask for a query term and will then print the list of documents matching that term, along with their EXACT path information. Use that path information to compare the 2 documents in Semantic Vectors.
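
Why the exact path matters can be illustrated with a toy lookup. In this sketch a plain map stands in for the document vector store, which is an assumption made purely for illustration; the class DocPathLookup, the helper keysForFileName, and the paths are all hypothetical and not part of Semantic Vectors or Lucene. It only shows that a lookup by path is an exact string match, and how you might list the stored keys that end in a given file name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only; a map stands in for the real document vector store. */
class DocPathLookup {

    /** Returns the stored keys whose final path component equals fileName. */
    static List<String> keysForFileName(Map<String, float[]> docVectors, String fileName) {
        List<String> matches = new ArrayList<>();
        for (String key : docVectors.keySet()) {
            // Accept both '/' and '\' as separators; cut is -1 when there is none.
            int cut = Math.max(key.lastIndexOf('/'), key.lastIndexOf('\\'));
            if (key.substring(cut + 1).equals(fileName)) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical keys, as they might have been recorded at indexing time.
        Map<String, float[]> docVectors = new LinkedHashMap<>();
        docVectors.put("C:\\data\\docs\\file1.txt", new float[] {0.1f, 0.9f});
        docVectors.put("C:\\data\\docs\\file2.txt", new float[] {0.7f, 0.3f});

        // A differently spelled path for the same file does not match any key.
        System.out.println(docVectors.containsKey(".\\docs\\file1.txt"));

        // Listing keys by file name reveals the exact string to use instead.
        System.out.println(keysForFileName(docVectors, "file1.txt"));
    }
}
```

The design choice here is deliberately simple: rather than trying to normalize paths, the sketch just surfaces the stored key, which mirrors the advice of copying the path exactly as the index reports it.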
Detailed post: http://groups.google.com/group/semanticvectors/browse_thread/thread/e2c7d9bcf8bf33a0