Updated Mar 6, 2013 by dwidd...@gmail.com

Document Search in SemanticVectors

Command-Line Search

It's easy to search for documents in SemanticVectors: just tell the search program to use an appropriate vector store. There is no special class or interface for searching documents; this functionality falls out of the more general freedom to take query terms and search over a variety of different vector stores.

The query file is set using the -queryvectorfile option and the search file is set using the -searchvectorfile option.

Searching for Documents using Terms

The default BuildIndex command builds both term vectors (termvectors.bin) and document vectors (docvectors.bin). To search for document vectors closest to the vector for Abraham, you would therefore use the command:

java pitt.search.semanticvectors.Search -queryvectorfile termvectors.bin -searchvectorfile docvectors.bin Abraham

Using Documents as Queries

You can also use the document file as a source of queries. For example, to find terms most closely related to Chapter 1 of Genesis, you'd use

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase bible_chapters/Genesis/Chapter_1

There have been some reports of good results using SemanticVectors in this way for document keyword and tag recommendation. In most situations, some effort is required to tune the process effectively. Known problems include:

  • With default settings (using random projection with BuildIndex) and the King James Bible corpus, results appear to be pretty generic terms like "unto", "i", "them", "have".
  • With term weighting switched on using -termweight idf or -termweight logentropy, results appear to be very specific terms like "whales", "yielding", "firmament".
  • Results often look promising with traditional LSA and idf term weighting, e.g., using the King James Bible corpus:

$ java pitt.search.semanticvectors.LSA -termweight idf positional_index/
$ java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase "bible_chapters\Luke\Chapter_2"
Opening query vector store from file: docvectors.bin
Opening search vector store from file: termvectors.bin
Searching term vectors, searchtype sum
Found vector for 'bible_chapters\Luke\Chapter_2'
Search output follows ...
0.9894455515331493:anna
0.9894455515331493:cyrenius
0.9894455515331493:lineage
0.9894455515331493:phanuel
0.9894455515331493:pondered
0.9894455515331493:swaddling
0.9894455487129858:manger

For document-to-document search, use commands just like those above, but set -searchvectorfile to the same file as -queryvectorfile, that is, the path to the document vectors. For example:

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile docvectors.bin -matchcase bible_chapters/Genesis/Chapter_1

Lucene Term Weighting

Any term weighting for documents is computed when the document vectors are created, as part of the index building process. So giving a -luceneindexdir argument when using documents as queries will not help you at all, and can cause SemanticVectors to discard your query terms (since, for example, /files/file1.txt isn't a term that the Lucene index recognizes).

Document Filtering

Filtering search results based on some regular expression can be particularly useful for restricting document searches to specific parts of a corpus. See FilteredSearchResults.

Comparing Two Documents

If you want a pairwise comparison score between two documents, Doc1 and Doc2, then you can use the CompareTerms tool. For documents that are already part of your index, use something like:

java pitt.search.semanticvectors.CompareTerms -queryvectorfile docvectors.bin ./path/to/Doc1 ./path/to/Doc2

See Deswick's comment below for a complete example.

For new documents that were not indexed when the model was built, you can still compare the vectors produced by summing the constituent terms, which is conceptually equivalent. (Note the 'conceptually' part: any differences in term-weighting, stemming, case-normalization, etc., will affect your results.) The terms in each document need to be assembled into query statements, surrounded by double quotes.

This could be done on Unix-like systems using something like:

java pitt.search.semanticvectors.CompareTerms -queryvectorfile termvectors.bin "`cat Doc1`" "`cat Doc2`"

Of course, if there are double quotes inside your documents, this will lead to escaping problems. Be prepared to do a certain amount of cutting, pasting, search and replace, and escaping of special characters using your favorite tools.
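One way to sidestep the quoting problems entirely is to do the sum-and-compare in your own code. Below is a toy sketch of the underlying idea only (the term vectors and vocabulary are made up, and this is not the SemanticVectors API): build each document's vector by summing its term vectors, then score the pair by cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy sketch, not the SemanticVectors API: builds a document vector by
 *  summing term vectors, then compares two such vectors by cosine similarity.
 *  The term vectors below are invented purely for illustration. */
public class SumAndCompare {

  static float[] sumTermVectors(String[] terms, Map<String, float[]> termVectors, int dim) {
    float[] doc = new float[dim];
    for (String term : terms) {
      float[] v = termVectors.get(term);
      if (v == null) continue;  // terms without a vector are simply skipped
      for (int i = 0; i < dim; i++) doc[i] += v[i];
    }
    return doc;
  }

  static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    Map<String, float[]> termVectors = new HashMap<>();
    termVectors.put("shepherds", new float[] {1f, 0f, 0f});
    termVectors.put("manger",    new float[] {0.9f, 0.1f, 0f});
    termVectors.put("census",    new float[] {0f, 0f, 1f});

    float[] doc1 = sumTermVectors("shepherds manger".split(" "), termVectors, 3);
    float[] doc2 = sumTermVectors("manger census".split(" "), termVectors, 3);
    System.out.printf("%.4f%n", cosine(doc1, doc2));
  }
}
```

As the article notes, any difference between this home-grown tokenization and the weighting, stemming, and case-normalization used when the index was built will affect your results.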

Comparing Terms and Documents Explicitly

This question was raised in the group discussions. Can you explicitly compare a term vector and a document vector? The answer is yes, but it needs a workaround. See CompareTerms for more details.

Programmatic / API-driven Search

For users wishing to incorporate document search into programmatic calls, the basic ideas are the same as above, but you have to translate them into API calls instead of command-line calls.

For example, if you want to search for documents related to "my search terms", you might write something like the following:

FlagConfig config = FlagConfig.getFlagConfig( ... appropriate command-line string arguments ... );
CloseableVectorStore queryVecReader = VectorStoreReader.openVectorStore(config.termvectorsfile(), config);
CloseableVectorStore resultsVecReader = VectorStoreReader.openVectorStore(config.docvectorsfile(), config);
LuceneUtils luceneUtils = new LuceneUtils(config.luceneindexpath());
VectorSearcher vecSearcher = new VectorSearcher.VectorSearcherCosine(
    queryVecReader, resultsVecReader, luceneUtils, config, new String[] {"my", "search", "terms"});
LinkedList<SearchResult> results = vecSearcher.getNearestNeighbors(maxResults);

for (SearchResult result: results) {
  System.out.println(String.format(
      "%f:%s",
      result.getScore(),
      result.getObjectVector().getObject().toString()));
}

In this case, this will print to STDOUT the paths of the resulting documents (extracted from the index field configured with the -docidfield flag).

(The -docidfield defaults to "path", which is also the default in Lucene indexes built by org.apache.lucene.demo.IndexFiles and pitt.search.lucene.IndexFilePositions, but this field name is not invariant and can easily be changed by other index-building tools.)

Comment by deswick....@googlemail.com, May 24, 2011

If you are using SV for the first time, here are the basic steps to compare two documents explicitly for their semantic association.

Example using Windows and Eclipse (Semantic Vectors 2.2 and Lucene 3.1.0; compatibility is important: http://code.google.com/p/semanticvectors/wiki/LuceneCompatibility):

Data directory: C:\workspace\SV\src\docsDir (contains file1.txt and file2.txt)
Index directory: C:\workspace\SV\src\indexDir

1) First index the data files through Lucene, using Eclipse to run \lucene-3.1.0-src\lucene-3.1.0\contrib\demo\src\java\org\apache\lucene\demo\IndexFiles.java with arguments: -index C:\workspace\SV\src\indexDir -docs C:\workspace\SV\src\docsDir

2) Make note of the exact path of the data files, and make sure that Lucene has indexed them with path information (optional step): search the documents by term, using Eclipse to run \lucene-3.1.0-src\lucene-3.1.0\contrib\demo\src\java\org\apache\lucene\demo\SearchFiles.java with argument: -index C:\workspace\SV\src\indexDir\ The Eclipse console will ask for a term query and then print the list of documents matching that term (along with EXACT path information).

3) Build the Semantic Vectors index from the Lucene index, using Eclipse to run src\pitt\search\semanticvectors\BuildIndex.java with argument: C:\workspace\SV\src\indexDir\

4) We have the exact path of the data files (from step 2) that we want to compare. Now run CompareTerms, using Eclipse to run src\pitt\search\semanticvectors\CompareTerms.java with arguments: -queryvectorfile docvectors.bin C:\workspace\SV\src\docsDir\file1.txt C:\workspace\SV\src\docsDir\file2.txt

It will show you output something like this: INFO: Outputting similarity of "C:\workspace\SV\src\docsDir\file1.txt" with "C:\workspace\SV\src\docsDir\file2.txt" ... 0.43729395

Detailed post: http://groups.google.com/group/semanticvectors/browse_thread/thread/e2c7d9bcf8bf33a0

Comment by deswick....@googlemail.com, May 24, 2011

Example Use case to compare 2 documents explicitly:

Example using Windows and Eclipse (Semantic Vectors 2.2 and Lucene 3.1.0; compatibility is important: http://code.google.com/p/semanticvectors/wiki/LuceneCompatibility):

Make sure the Lucene and Semantic Vectors jar files are referenced in the classpath of the Eclipse project.
Data directory: C:\workspace\SV\src\docsDir (contains file1.txt and file2.txt)
Index directory: C:\workspace\SV\src\indexDir

1) Run the Lucene indexer in Eclipse (org.apache.lucene.demo.IndexFiles) with arguments: -index \workspace\SV\src\indexDir -docs \workspace\SV\src\docsDir

2) Run the Semantic Vectors indexer (pitt.search.semanticvectors.BuildIndex) on the Lucene index: \workspace\SV\src\indexDir It will create termvectors.bin and docvectors.bin in C:\workspace\SV

3) Run the document comparison in Eclipse (pitt.search.semanticvectors.CompareTerms) with arguments: -queryvectorfile docvectors.bin \workspace\SV\src\docsDir\file1.txt \workspace\SV\src\docsDir\file2.txt

It will display the semantic similarity of the 2 documents like this: INFO: Outputting similarity of "\workspace\SV\src\docsDir\file1.txt" with "\workspace\SV\src\docsDir\file2.txt" ... 0.437294

If you encounter an error that no vector was found for the file names, you need to ascertain the absolute path with which Lucene has indexed the data files. This can be done by running org.apache.lucene.demo.SearchFiles with argument: -index \workspace\SV\src\indexDir The Eclipse console will ask for a query term and then print the list of documents matching that term, along with EXACT path information. Use that path information to compare the two documents in Semantic Vectors.

Detailed post: http://groups.google.com/group/semanticvectors/browse_thread/thread/e2c7d9bcf8bf33a0

Comment by visi...@gmail.com, Mar 19, 2014

I have a few questions about LSA and document comparison. LSA is very new to me, and up until now I was relying on plain old cosine similarity against a Lucene index.

Here's what I have done:
1. Built a Lucene index with my corpus, and used BuildIndex. Since BuildIndex uses random projection, the cosine similarity output for document comparison varies whenever I rerun BuildIndex. If I understand correctly, because of random projection one cannot guarantee getting the same cosine similarity for the same pair of documents. Am I right?
2. For my purposes, I need to rebuild indexes periodically and also ensure that cosine similarity is deterministic. So instead of BuildIndex, I chose to use LSA. Now whenever I compare the same two documents, I get exactly the same similarity (cosine?), even if I rebuild the LSA indexes over and over again.

But what I need to check with you is:
1. Whether using random projection (BuildIndex) or LSA indexes, is document comparison always done using cosine similarity? Is this correct?
2. Results of document comparison using BuildIndex indexes fall in the 0.0 to 1.0 range, but the same comparison using an LSA index gives me very small numbers in exponential notation, like 6.23749915238891E-11. Why is that the case? Do I have to format this output myself to fall between 0.0 and 1.0 somehow?

Note, my corpus is very small snippets of text like Twitter feeds, and I am evaluating your library to establish not just Cosine similarity, but semantic similarity between these.

Appreciate your great work on making this excellent library available.

Comment by project member dwidd...@gmail.com, Mar 20, 2014

Hi there - quick reply to the above comment.

- You should be able to use '-elementalmethod contenthash' to give deterministic pseudorandom elemental vectors as well. I should write some tests to make sure that this gives the same results every time, but it should do. We should probably make the default be to use deterministic vectors. In the meantime I've at least updated the documentation at https://code.google.com/p/semanticvectors/wiki/ElementalVector.
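As a toy illustration of why content-hash elemental vectors are deterministic (this sketch is not the SemanticVectors implementation; it just seeds a pseudorandom generator from the term's own content, so rebuilding always reproduces the same vector):

```java
import java.util.Random;

public class ContentHashVector {
  // Toy illustration: derive an elemental vector deterministically from the
  // term itself. The same term always yields the same seed, hence the same vector.
  static float[] elementalVector(String term, int dimension) {
    Random random = new Random(term.hashCode());  // seed from the term's content
    float[] vector = new float[dimension];
    for (int i = 0; i < dimension; i++) {
      vector[i] = random.nextFloat() * 2 - 1;  // values in [-1, 1)
    }
    return vector;
  }

  public static void main(String[] args) {
    float[] first = elementalVector("abraham", 5);
    float[] second = elementalVector("abraham", 5);
    // Identical across runs and rebuilds, unlike an unseeded random projection.
    System.out.println(java.util.Arrays.equals(first, second));  // true
  }
}
```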

- Both LSA and random projection use cosine similarity (for real vectors, at least). Similarities may be small and indeed negative, since cosines fall in the range [-1, 1]. If you need values in [0, 1], you might start by mapping all negative values to zero.
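As a sketch of the mapping suggested here (plain Java, not part of the SemanticVectors API), alongside a simple linear rescaling alternative:

```java
/** Minimal sketch: two common ways to map a cosine similarity in [-1, 1]
 *  into [0, 1]: clamping negatives to zero, or linear rescaling. */
public class SimilarityRescale {
  static double clampNegative(double cosine) {
    return Math.max(0.0, cosine);
  }
  static double rescale(double cosine) {
    return (cosine + 1.0) / 2.0;
  }
  public static void main(String[] args) {
    System.out.println(clampNegative(-0.25));  // 0.0
    System.out.println(rescale(-0.25));        // 0.375
  }
}
```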

- Are all your LSA similarities very small? That would be a problem. But if insignificant ones are small, that's a good thing.

Thanks for writing!

Comment by visi...@gmail.com, Mar 24, 2014

Great. Thanks for the quick reply, Dominic.

Using java pitt.search.semanticvectors.BuildIndex -elementalmethod contenthash gives me deterministic results. Perhaps I will be using this.
Note, I also chose to use java pitt.search.semanticvectors.BuildIndex -elementalmethod contenthash -termweight idf, which seems to better fit my comparison expectations.
A related question, if I may: I indexed two sample documents, sample1.txt and sample2.txt.
Using BuildIndex and -termweight idf gives me the terms below for my sample document:

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase sample1.txt
0.418782:random
0.337068:basic
0.331961:trainingcycles
0.280890:2
0.229819:pitt.search.semanticvectors.buildindex
0.199177:indexing
0.199177:yet
0.188963:option
0.188963:based
0.178748:so
0.178748:inside
0.178748:create
0.173641:cyclical
0.168534:incremental
0.163427:rri
0.163427:further
0.158320:doc1
0.158320:either
0.158320:lucene
0.148106:connections
Using LSA and -termweight idf also gives me terms, but I am curious about the weights. It looks like the weights use vectortype binary, although I think your documentation says only real vector types are supported for LSA?

java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -searchvectorfile termvectors.bin -matchcase sample1.txt
  1. 000000:10
  2. 000000:2
  3. 000000:2010
  4. 000000:240
  5. 000000:3
  6. 000000:43
  7. 000000:56
  8. 000000:able
  9. 000000:above
  10. 000000:advantage
  11. 000000:approach
  12. 000000:apr
  13. 000000:associations
  14. 000000:avoid
  15. 000000:back
  16. 000000:based
  17. 000000:basic
  18. 000000:bilingualmodels
  19. 000000:biomed
  20. 000000:both
Do these term weights have something to do with the exponentially small values for cosine similarity between documents when using LSA indexing?
Comment by project member dwidd...@gmail.com, Mar 24, 2014

I suspect that using only two documents is not going to give very meaningful results with LSA. I'm surprised you got as much variety as you did with the BuildIndex / Random Projection approach!

That said, there does seem to be something amiss here - I would rather expect lots of higher similarities, because the terms that appear only in sample1.txt should contribute heavily.

The numbers at the side - the 1., 2., 3., etc. - are those added by you? Also, I suspect that you're not really using -vectortype binary by accident; it may be that the small number of documents makes results look rather binary.

One thing you could run is "java pitt.search.semanticvectors.Search -queryvectorfile docvectors.bin -matchcase sample1.txt -searchtype printquery". This will tell you if the vector for sample1.txt has become zero for some reason.

