Semantic Vectors Search Options
Updated Jun 20, 2011 by sid....@gmail.com

As the SemanticVectors package has grown more sophisticated, the options for building and searching semantic vector indexes have grown more complex. All of these options are still exposed on the command line through Search.java, which is by now quite complex but still (we hope) fairly usable and maintainable.

The purpose of this Wiki page is to document some of the search options in a slightly friendlier fashion. However, if discrepancies gradually arise between this Wiki page and the code or javadoc in the svn repository, the svn repository should be regarded as more authoritative.

Basic Searching

Searching is performed using the command java pitt.search.semanticvectors.Search QUERYARGS, as documented in the InstallationInstructions. If none of the special command line arguments are given, the default behavior is to presume that the arguments given are all query terms to be looked up in a vector file called termvectors.bin. The query vector will be produced by adding up the vectors for all query terms, and the search will be performed using the cosine similarity measure.
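
The default behavior described above can be sketched as follows. This is a minimal illustration on toy vectors, assuming plain float arrays; the class and method names are hypothetical and are not part of the SemanticVectors API.

```java
// Illustrative sketch only: these names are hypothetical, not the SemanticVectors API.
final class AdditiveQuerySketch {

  /** Adds the term vectors component by component to form one query vector. */
  static float[] sum(float[][] termVectors) {
    float[] query = new float[termVectors[0].length];
    for (float[] v : termVectors) {
      for (int i = 0; i < query.length; i++) {
        query[i] += v[i];
      }
    }
    return query;
  }

  /** Cosine similarity: scalar product divided by the product of the vector lengths. */
  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  public static void main(String[] args) {
    // Toy 3-dimensional vectors standing in for entries of termvectors.bin.
    float[] abraham = {1f, 0f, 0f};
    float[] isaac = {0f, 1f, 0f};
    float[] query = sum(new float[][] {abraham, isaac});
    System.out.println("abraham: " + cosine(query, abraham));
    System.out.println("isaac:   " + cosine(query, isaac));
  }
}
```

A real search would compute this cosine against every vector in the search file and print the highest-scoring terms.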

Several other options are available: these fall broadly into the categories of file arguments (where to find the vectors and what formats to expect), search types (how to combine several terms into a single query expression), and query terms (which terms to look up and use as query terms).

The simplest way to find out more about these arguments is to run java pitt.search.semanticvectors.Search with no arguments, which writes a basic usage message to the console. All changes to the interface of Search.java should be reflected in the usage function (http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/Search.html#usage()).

Other Useful Tools and Options

See DocumentSearch, FilteredSearchResults, PermutationSearch, and ClusteringAndVisualization.

File Arguments

The Search program needs a file to look up vectors to form the query, and a file to search through vectors to find nearest neighbors. By default, these are the same file (termvectors.bin), but it's sometimes useful to have different files -- for example, to use term vectors to look up nearby documents, or to use terms from one language to look up neighbors from another (see BilingualModels).

To change the file from which the queries are built, use the -q option. To change the file from which search results are found, use the -s option.

See also VectorStoreFormats for a description of the formats that are supported for reading vectors from disk.

Search Type Arguments

The available search types are chosen using the -searchtype argument. The available search types are listed in an enumeration and documented at http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/Search.html

Most of these options correspond directly to implementations of the VectorSearcher class.

Note that the options (like all command line options at the moment) are case insensitive.

All of the examples below are generated from the default termvectors.bin file derived from the King James Bible corpus.

SUM

Default option - build a query by adding together (weighted) vectors for each of the query terms, and search using cosine similarity.

Example:

$ java pitt.search.semanticvectors.Search -searchtype sum abraham isaac
Opening query vector store from file: termvectors.bin
Dimensions = 200
Searching term vectors, searchtype SUM ... Search output follows ...
0.8739137:abraham
0.8739133:isaac
0.57702935:rebekah
0.5297739:bethuel
0.4821766:digged
0.4661227:gerar
...

SPARSESUM

Build a query as with the SUM option, but quantize to sparse vectors before taking the scalar product at search time.

Example:

$ java pitt.search.semanticvectors.Search -searchtype sparsesum abraham isaac
Opening query vector store from file: termvectors.bin
Dimensions = 200
Searching term vectors, searchtype SPARSESUM ... Search output follows ...
16.0:abraham
15.0:isaac
10.0:bethuel
10.0:rebekah
8.0:room
8.0:stopped
...

(Careful readers may note that these are scalar products, not cosine similarities, i.e., the scores are not normalized. This is another story; feel free to write to the group if you're interested.)
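
One way to see why the scores come out as whole numbers is the sketch below. The quantization rule used here (keep the k largest components as +1 and the k smallest as -1) is an assumption made for illustration, and every name is hypothetical; the actual rule in the package may differ.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch only: the quantization rule and all names are assumptions,
// not the SemanticVectors implementation.
final class SparseQuantizeSketch {

  /**
   * Keeps the k largest entries as +1 and the k smallest as -1; all others become 0.
   * Assumes 2 * k <= v.length.
   */
  static int[] quantize(float[] v, int k) {
    Integer[] order = new Integer[v.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Sort the indices by ascending component value.
    Arrays.sort(order, Comparator.comparingDouble(i -> v[i]));
    int[] sparse = new int[v.length];
    for (int j = 0; j < k; j++) {
      sparse[order[j]] = -1;               // one of the k smallest components
      sparse[order[v.length - 1 - j]] = 1; // one of the k largest components
    }
    return sparse;
  }

  /** Unnormalized scalar product, so scores are whole numbers rather than cosines. */
  static int scalarProduct(int[] a, int[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  public static void main(String[] args) {
    int[] query = quantize(new float[] {0.9f, -0.8f, 0.1f, 0.0f, 0.7f, -0.6f}, 2);
    int[] candidate = quantize(new float[] {0.5f, 0.4f, -0.3f, -0.9f, 0.8f, 0.0f}, 2);
    System.out.println(Arrays.toString(query));
    System.out.println(scalarProduct(query, candidate));
  }
}
```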

SUBSPACE

"Quantum disjunction" - get vectors for each query term, create a representation for the subspace spanned by these vectors, and score by measuring cosine similarity with this subspace.

Example:

$ java pitt.search.semanticvectors.Search -searchtype subspace abraham isaac
Opening query vector store from file: termvectors.bin
Dimensions = 200
Searching term vectors, searchtype SUBSPACE ... Search output follows ...
1.3770361:isaac
1.0000002:abraham
0.89358807:rebekah
0.73906654:gerar
0.7289439:digged
0.7102588:bethuel
...

Careful readers will note that there is something wrong with these scores: they should never go above 1. I have been unable to track down the source of this problem; the unit tests for the orthogonalization routines in VectorUtils all work just fine. Any help with this problem would be much appreciated.
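
To make the intended score concrete, here is a minimal sketch, assuming Gram-Schmidt orthogonalization and a score equal to the length of the unit-length candidate's projection onto the subspace. The class and method names are hypothetical and the code is not taken from VectorUtils. Under these assumptions the score cannot exceed 1 beyond rounding error.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are hypothetical, not the SemanticVectors API.
final class SubspaceScoreSketch {

  static double dot(double[] a, double[] b) {
    double total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  /** Gram-Schmidt: returns an orthonormal basis for the span of the input vectors. */
  static List<double[]> orthonormalize(double[][] vectors) {
    List<double[]> basis = new ArrayList<>();
    for (double[] v : vectors) {
      double[] residual = v.clone();
      for (double[] b : basis) {
        double coeff = dot(residual, b);
        for (int i = 0; i < residual.length; i++) {
          residual[i] -= coeff * b[i];
        }
      }
      double norm = Math.sqrt(dot(residual, residual));
      if (norm > 1e-12) { // skip vectors that already lie inside the span
        for (int i = 0; i < residual.length; i++) {
          residual[i] /= norm;
        }
        basis.add(residual);
      }
    }
    return basis;
  }

  /** Length of the projection of the normalized candidate onto the subspace. */
  static double score(List<double[]> basis, double[] candidate) {
    double candidateNorm = Math.sqrt(dot(candidate, candidate));
    double sumSquares = 0;
    for (double[] b : basis) {
      double c = dot(candidate, b) / candidateNorm;
      sumSquares += c * c;
    }
    return Math.sqrt(sumSquares);
  }

  public static void main(String[] args) {
    List<double[]> basis = orthonormalize(new double[][] {{1, 0, 0}, {1, 1, 0}});
    System.out.println(score(basis, new double[] {1, 1, 0})); // lies inside the subspace
    System.out.println(score(basis, new double[] {0, 0, 1})); // orthogonal to the subspace
  }
}
```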

MAXSIM

"Closest disjunction" - get vectors for each query term, score each candidate by measuring its distance to each query term, and take the minimum distance (equivalently, the maximum similarity, as the name suggests).

Example:

$ java pitt.search.semanticvectors.Search -searchtype maxsim abraham isaac
Opening query vector store from file: termvectors.bin
Dimensions = 200
Searching term vectors, searchtype MAXSIM ... Search output follows ...
1.0000002:isaac
1.0:abraham
0.6413876:sarah
0.64067477:rebekah
0.5391283:gerar
0.51310706:digged
...
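
The scoring rule can be sketched as taking the best cosine similarity over the query terms. This is a minimal illustration on toy vectors; the names are hypothetical and not the package's API.

```java
// Illustrative sketch only: names are hypothetical, not the SemanticVectors API.
final class MaxSimSketch {

  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  /** Score of a candidate: its highest cosine similarity with any single query vector. */
  static double maxSim(double[][] queryVectors, double[] candidate) {
    double best = Double.NEGATIVE_INFINITY;
    for (double[] q : queryVectors) {
      best = Math.max(best, cosine(q, candidate));
    }
    return best;
  }

  public static void main(String[] args) {
    double[][] query = {{1, 0}, {0, 1}};
    // A candidate identical to one query term scores as high as possible,
    // however far it is from the other query terms.
    System.out.println(maxSim(query, new double[] {1, 0}));
    System.out.println(maxSim(query, new double[] {1, 1}));
  }
}
```

Unlike SUM, a candidate does not need to be close to all of the query terms at once, which is why both query terms score about 1 in the output above.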

TENSOR

A product similarity that trains on ordered pairs of terms, takes a target query term, and searches for the term whose tensor product with the target term gives the largest similarity with the training tensor.

The queryterms should be a list of one or more tilde-separated training pairs, e.g., paris~france berlin~germany followed by a list of one or more search terms, e.g., london birmingham.

Example:

$ java pitt.search.semanticvectors.Search -searchtype tensor abraham~isaac jacob~joseph jesse
Opening query vector store from file: termvectors.bin
Dimensions = 200
Training pair: abraham~isaac
Training pair: jacob~joseph
Searching term vectors, searchtype TENSOR ... Search output follows ...
0.08903535:jacob
0.052856527:leah
0.052095436:rachel
0.051172566:bilhah
0.048837323:laban
0.04426618:speckled
...

In an ideal world, this product would be designed to find the item that stands in the same relation to jesse that isaac stands in with respect to abraham and joseph stands in with respect to jacob, so we might hope that the model picks out david (who was the most famous son of jesse in the Bible). As you can see, it doesn't behave so nicely, but seems to be just finding terms related to jacob.
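
For readers who want to see the arithmetic, here is a minimal sketch of one way such a score could be computed. It assumes that the training tensor is the sum of the outer products of the training pairs and that similarity is the normalized inner product of two tensors; both choices, and all of the names, are assumptions for illustration rather than the package's implementation.

```java
// Illustrative sketch only: the construction and all names are assumptions,
// not the SemanticVectors implementation.
final class TensorSimilaritySketch {

  /** Outer (tensor) product of a and b as a matrix with entries a[i] * b[j]. */
  static double[][] outer(double[] a, double[] b) {
    double[][] product = new double[a.length][b.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < b.length; j++) {
        product[i][j] = a[i] * b[j];
      }
    }
    return product;
  }

  /** Sums the outer products of the training pairs; pairs[p] holds {first, second}. */
  static double[][] train(double[][][] pairs) {
    int n = pairs[0][0].length;
    double[][] training = new double[n][n];
    for (double[][] pair : pairs) {
      double[][] product = outer(pair[0], pair[1]);
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
          training[i][j] += product[i][j];
        }
      }
    }
    return training;
  }

  /** Entry-by-entry inner product of two matrices. */
  static double inner(double[][] x, double[][] y) {
    double total = 0;
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j < x[i].length; j++) {
        total += x[i][j] * y[i][j];
      }
    }
    return total;
  }

  /** Cosine-style similarity between (target x candidate) and the training tensor. */
  static double score(double[][] training, double[] target, double[] candidate) {
    double[][] product = outer(target, candidate);
    return inner(product, training)
        / Math.sqrt(inner(product, product) * inner(training, training));
  }

  public static void main(String[] args) {
    double[][] training = train(new double[][][] {{{1, 0}, {0, 1}}});
    System.out.println(score(training, new double[] {1, 0}, new double[] {0, 1}));
    System.out.println(score(training, new double[] {1, 0}, new double[] {1, 0}));
  }
}
```

A search would evaluate this score for every candidate term and keep the largest values.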

CONVOLUTION

Similar to TENSOR: a product similarity that trains on ordered pairs of terms, takes a target query term, and searches for the term whose convolution product with the target term gives the largest similarity with the training convolution.

Example:

$ java pitt.search.semanticvectors.Search -searchtype convolution abraham~isaac jacob~joseph jesse
Opening query vector store from file: termvectors.bin
Dimensions = 200
Training pair: abraham~isaac
Training pair: jacob~joseph
Searching term vectors, searchtype CONVOLUTION ... Search output follows ...
0.30361828:faithfulness
0.28568172:reproached
0.2765901:lo
0.2596051:wondrous
0.24092554:governors
0.23591144:coat
...

Results look considerably less good than the results with tensor similarity.

It would of course be premature to dismiss these options as "not working"; it may just be that these examples are not what they are good for.
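
As a rough illustration of the convolution variant, the sketch below assumes circular convolution as the product, a training vector formed by summing the convolutions of the training pairs, and cosine similarity for scoring. These are assumptions, and the names are hypothetical; the package's own convolution product may be defined differently.

```java
// Illustrative sketch only: circular convolution and all names are assumptions,
// not the SemanticVectors implementation.
final class ConvolutionSimilaritySketch {

  /** Circular convolution: result[k] = sum over i of a[i] * b[(k - i) mod n]. */
  static double[] convolve(double[] a, double[] b) {
    int n = a.length;
    double[] result = new double[n];
    for (int k = 0; k < n; k++) {
      for (int i = 0; i < n; i++) {
        result[k] += a[i] * b[((k - i) % n + n) % n];
      }
    }
    return result;
  }

  /** Sums the convolutions of the training pairs; pairs[p] holds {first, second}. */
  static double[] train(double[][][] pairs) {
    double[] training = new double[pairs[0][0].length];
    for (double[][] pair : pairs) {
      double[] product = convolve(pair[0], pair[1]);
      for (int i = 0; i < training.length; i++) {
        training[i] += product[i];
      }
    }
    return training;
  }

  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  /** Similarity between (target convolved with candidate) and the training vector. */
  static double score(double[] training, double[] target, double[] candidate) {
    return cosine(convolve(target, candidate), training);
  }

  public static void main(String[] args) {
    double[] training = train(new double[][][] {{{1, 0, 0}, {1, 2, 3}}});
    System.out.println(score(training, new double[] {1, 0, 0}, new double[] {1, 2, 3}));
  }
}
```

Unlike the tensor product, the convolution of two n-dimensional vectors is again n-dimensional, so the training representation stays the same size as a term vector.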

PRINTQUERY

Build an additive query vector (as with SUM) and print out the query vector for debugging.

Comment by JKasm...@gmail.com, Dec 15, 2009
Comment by dominicw...@gmail.com, Aug 27, 2010

Hey, sorry to take so long to see these comments. The grand flags refactor did away with the explicit searchtype enum, so the links to the relevant VectorSearcher implementations are expressed (somewhat uglily) in the Javadoc at http://semanticvectors.googlecode.com/svn/trunk/doc/pitt/search/semanticvectors/Search.html.

Comment by project member sid....@gmail.com, Jun 20, 2011

Ehsan, fixed all the broken links.

