DocumentSearch  
Document Search in SemanticVectors.
Updated May 8, 2012 by dwidd...@gmail.com

Document Search in SemanticVectors

It's easy to search for documents in SemanticVectors: you just tell the search program to use an appropriate vector store. There is no special class or interface for searching documents. Instead, the search program can take query vectors from one vector store and search over another, and document search is one use of that general ability.

The vector store used to look up query vectors is set using the -queryvectorfile option, and the vector store to be searched is set using the -searchvectorfile option.

Searching for Documents using Terms

The default BuildIndex command builds both term vectors (termvectors.bin) and document vectors (docvectors.bin). To search for document vectors closest to the vector for Abraham, you would therefore use the command:

java pitt.search.semanticvectors.Search -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin Abraham

Using Documents as Queries

You can also use the document file as a source of queries. For example, to find terms most closely related to Chapter 1 of Genesis, you'd use

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase bible_chapters/Genesis/Chapter_1

With default settings, this brings up pretty generic terms like "unto", "i", "them", "have". I'm not exactly sure why this is so, but I have had better results with traditional LSA, e.g., using the King James Bible corpus:

$ java pitt.search.semanticvectors.LSA index/
$ java pitt.search.semanticvectors.Search -queryvectorfile svd_docvectors.bin -searchvectorfile svd_termvectors.bin -matchcase  bible_chapters/Luke/Chapter_2
Opening query vector store from file: svd_docvectors.bin
Opening search vector store from file: svd_termvectors.bin
Searching term vectors, searchtype sum
Found vector for 'bible_chapters/Luke/Chapter_2'
Search output follows ...
0.9513617955705869:anna
0.9513617955705869:cyrenius
0.9513617955705869:lineage
0.9513617955705869:phanuel
0.9513617955705869:pondered
0.9513617955705869:swaddling
0.9513617930195419:manger
0.8963310879414658:taxed

For Document to Document Search, use commands just like those above but set -searchvectorfile to the same file as -queryvectorfile, that is, the path to the document vectors.
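
As a concrete illustration, document-to-document search can be wrapped in a small shell function. This is only a minimal sketch: the function name similar_docs and the JAVA_CMD variable are hypothetical conveniences, not part of SemanticVectors. The flags are the ones used in the commands above, and the function assumes the SemanticVectors and Lucene jars are already on your classpath.

```shell
# Hypothetical helper: find documents closest to a given document by
# using the same vector store for both the query and the search.
# JAVA_CMD is an assumed override hook (e.g. for a custom JVM path).
JAVA_CMD="${JAVA_CMD:-java}"

similar_docs() {
  # $1 = path to the document vector store
  # $2 = document name exactly as it was indexed
  "$JAVA_CMD" pitt.search.semanticvectors.Search \
    -queryvectorfile "$1" -searchvectorfile "$1" \
    -matchcase "$2"
}
```

For example, similar_docs docvectors.bin bible_chapters/Genesis/Chapter_1 would search the document vectors for neighbours of Genesis Chapter 1. Since the query document is itself in the searched store, you would expect it to appear among the top results.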

Lucene Term Weighting

Any term weighting for documents is computed when the document vectors are created, as part of the index building process. So giving a -luceneindexdir argument when using documents as queries will not help you at all, and can cause SemanticVectors to discard your query terms (since, for example, /files/file1.txt isn't a term that the Lucene index recognizes).

Document Filtering

Filtering search results based on some regular expression can be particularly useful for restricting document searches to specific parts of a corpus. See FilteredSearchResults.
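
If you only need a quick filter and don't mind post-processing, the score:name lines printed by Search can also be filtered outside SemanticVectors. The sketch below is a stand-in for, not an implementation of, the built-in filtering described in FilteredSearchResults. The function name filter_results is hypothetical, and it assumes result lines of the form score:name written to standard output.

```shell
# Hypothetical post-filter: keep only result lines whose name (the part
# after the first colon) matches an extended regular expression.
# Assumes Search result lines of the form "score:name".
filter_results() {
  RE="$1" awk '{
    name = substr($0, index($0, ":") + 1)
    if (index($0, ":") > 0 && name ~ ENVIRON["RE"]) print
  }'
}
```

You would pipe the output of a Search command into it, for example ... | filter_results '^bible_chapters/Genesis/'. Note that a filter applied this way only sees the results Search has already selected, so it can return fewer results than requested, which is one reason to prefer filtering inside the search itself.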

Comparing Two Documents

If you want a pairwise comparison score between two documents, Doc1 and Doc2, then you can use the CompareTerms tool. For documents that are already part of your index, use something like:

java pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin ./path/to/Doc1 ./path/to/Doc2

See Deswick's comment below for a complete example.

For new documents that were not indexed when the model was built, you can still compare the vectors produced by summing the constituent terms, which is conceptually equivalent. (Note the 'conceptually' part: any differences in term-weighting, stemming, case-normalization, etc., will affect your results.) The terms in each document need to be assembled into query statements, surrounded by double quotes.

This could be done on Unix-like systems using something like:

java pitt.search.semanticvectors.CompareTerms -queryvectorfile termvectors.bin "`cat Doc1`" "`cat Doc2`"

Of course, if there are double quotes inside your documents, this will lead to escaping problems. Be prepared to do a certain amount of cutting, pasting, search and replace, and escaping of special characters using your favorite tools.
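
One way to reduce the manual cleanup is to normalize each document before handing it to CompareTerms. The following is a minimal sketch under stated assumptions: the function name doc_as_query is hypothetical, and simply deleting double quotes and collapsing whitespace is an assumed cleanup that may not be enough for every corpus.

```shell
# Hypothetical helper: flatten a document into a single query string.
# Deletes double quotes, collapses all whitespace (including newlines)
# into single spaces, then trims leading and trailing spaces.
doc_as_query() {
  tr -d '"' < "$1" | tr -s '[:space:]' ' ' | sed 's/^ *//; s/ *$//'
}
```

It would be used in place of cat, as in: java pitt.search.semanticvectors.CompareTerms -queryvectorfile termvectors.bin "$(doc_as_query Doc1)" "$(doc_as_query Doc2)". The helper does no lowercasing or stemming, so the caveat about term normalization still applies, and very long documents may exceed the shell's limit on command-line length.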

Comment by deswick....@gmail.com, May 24, 2011

Example Use case to compare 2 documents explicitly:

Example using Windows and Eclipse (Semantic Vectors 2.2 and Lucene 3.1.0). Compatibility is important: http://code.google.com/p/semanticvectors/wiki/LuceneCompatibility

Make sure the Lucene and Semantic Vectors jar files are referenced in the classpath of the Eclipse project. Data directory: C:\workspace\SV\src\docsDir (contains file1.txt and file2.txt). Index directory: C:\workspace\SV\src\indexDir.

1) Run the Lucene indexer in Eclipse with arguments: org.apache.lucene.demo.IndexFiles.java -index \workspace\SV\src\indexDir -docs \workspace\SV\src\docsDir

2) Run the Semantic Vectors indexer on the Lucene index: pitt.search.semanticvectors.BuildIndex.java \workspace\SV\src\indexDir. It will create termvectors.bin and docvectors.bin in C:/workspace/SV

3) Run the document comparison in Eclipse: pitt.search.semanticvectors.CompareTerms.java -queryvectorfile docvectors.bin \workspace\SV\src\docsDir\file1.txt \workspace\SV\src\docsDir\file2.txt

It will display the semantic similarity of the 2 documents like this: INFO: Outputting similarity of "\workspace\SV\src\docsDir\file1.txt" with "\workspace\SV\src\docsDir\file2.txt" ... 0.437294

If you encounter an error that 'No vector found for the file names', then you need to ascertain the absolute path information with which Lucene has indexed the data files. This can be done using: org.apache.lucene.demo.SearchFiles.java -index \workspace\SV\src\indexDir. The Eclipse console will ask for a query term and then print the list of documents matching that term along with the EXACT path information. Use that path information to compare the 2 documents in Semantic Vectors.

Detailed post: http://groups.google.com/group/semanticvectors/browse_thread/thread/e2c7d9bcf8bf33a0

