SearchOptions
Semantic Vectors Search Options
As the SemanticVectors package grows more sophisticated, the options for building and searching semantic vector indexes have grown more complex. The command line interface for all of these options still goes through Search.java, which is by now quite complex but still (we hope) fairly usable and maintainable. The purpose of this Wiki page is to document some of the search options in a slightly friendlier fashion. However, if any discrepancies arise between this Wiki page and the code / javadoc in the svn repository, the svn repository should be regarded as more authoritative.

Basic Searching

Searching is performed using the command java pitt.search.semanticvectors.Search QUERYARGS, as documented in the InstallationInstructions. If none of the special command line arguments are given, the default behavior is to treat all of the arguments as query terms to be looked up in a vector file called termvectors.bin. The query vector is produced by adding up the vectors for all query terms, and the search is performed using the cosine similarity measure.

Several other options are available. These fall broadly into the categories of file arguments (where to find the vectors and what formats to expect), search types (how to combine several terms into a single query expression), and query terms (which terms to look up and use as query terms).

The simplest way to find out more about these arguments is to run java pitt.search.semanticvectors.Search with no arguments, which writes a basic usage message to the console. All changes to the interface of Search.java should be reflected in the usage function (http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/Search.html#usage()).

Other Useful Tools and Options

See DocumentSearch, FilteredSearchResults, PermutationSearch, and ClusteringAndVisualization.
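To make the default behavior described under Basic Searching concrete, here is a minimal sketch of "add up the query term vectors, then score by cosine similarity". It is not the package's implementation: the class SumQuerySketch, its methods, and the three-dimensional toy vectors are all hypothetical, and skipping unknown terms is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: the class, method names and toy vectors are
// hypothetical and are not taken from the SemanticVectors code base.
final class SumQuerySketch {

  /** A tiny in-memory stand-in for a term vector store (made-up 3-d vectors). */
  static Map<String, double[]> toyStore() {
    Map<String, double[]> store = new LinkedHashMap<>();
    store.put("abraham", new double[] {1.0, 0.0, 0.0});
    store.put("isaac", new double[] {0.8, 0.6, 0.0});
    store.put("gerar", new double[] {0.0, 0.0, 1.0});
    return store;
  }

  /** Adds up the vectors of all query terms that have a vector in the store. */
  static double[] sumQuery(Map<String, double[]> store, int dimension, String... terms) {
    double[] query = new double[dimension];
    for (String term : terms) {
      double[] v = store.get(term);
      if (v == null) {
        continue; // assumption: terms without a vector are simply skipped
      }
      for (int i = 0; i < dimension; i++) {
        query[i] += v[i];
      }
    }
    return query;
  }

  /** Cosine similarity; returns 0 if either vector has zero length. */
  static double cosine(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / Math.sqrt(normA * normB);
  }

  public static void main(String[] args) {
    Map<String, double[]> store = toyStore();
    double[] query = sumQuery(store, 3, "abraham", "isaac");
    // Score every stored vector against the query (printed unsorted).
    for (Map.Entry<String, double[]> entry : store.entrySet()) {
      System.out.println(cosine(query, entry.getValue()) + ":" + entry.getKey());
    }
  }
}
```

Because both toy query vectors have unit length, each ends up equally similar to their sum, which mirrors the near-identical scores for abraham and isaac in the SUM example on this page.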
File Arguments

The Search program needs a file to look up vectors to form the query, and a file to search through vectors to find nearest neighbors. By default, these are the same file (termvectors.bin), but it's sometimes useful to have different files -- for example, to use term vectors to look up nearby documents, or to use terms from one language to look up neighbors from another (see BilingualModels). To change the file from which the queries are built, use the -q option. To change the file from which search results are found, use the -s option.

See also VectorStoreFormats for a description of the formats that are supported for reading vectors from disk.

Search Type Arguments

The search type is chosen using the -searchtype argument. The available search types are listed in an enumeration and documented at http://semanticvectors.googlecode.com/svn/javadoc/latest-stable/pitt/search/semanticvectors/Search.html. Most of these options correspond directly to implementations of the VectorSearcher class. Note that the options (like all command line options at the moment) are case insensitive.

All of the examples below are generated from the default termvectors.bin file derived from the King James Bible corpus.

SUM

Default option - build a query by adding together (weighted) vectors for each of the query terms, and search using cosine similarity.

Example:

    $ java pitt.search.semanticvectors.Search -searchtype sum abraham isaac
    Opening query vector store from file: termvectors.bin
    Dimensions = 200
    Searching term vectors, searchtype SUM ...
    Search output follows ...
    0.8739137:abraham
    0.8739133:isaac
    0.57702935:rebekah
    0.5297739:bethuel
    0.4821766:digged
    0.4661227:gerar
    ...

SPARSESUM

Build a query as with the SUM option, but quantize to sparse vectors before taking the scalar product at search time.
Example:

    $ java pitt.search.semanticvectors.Search -searchtype sparsesum abraham isaac
    Opening query vector store from file: termvectors.bin
    Dimensions = 200
    Searching term vectors, searchtype SPARSESUM ...
    Search output follows ...
    16.0:abraham
    15.0:isaac
    10.0:bethuel
    10.0:rebekah
    8.0:room
    8.0:stopped
    ...

(Careful readers may note that these are scalar products, not cosine similarities, i.e., the scores are not normalized. That is another story; feel free to write to the group if you're interested.)

SUBSPACE

"Quantum disjunction" - get vectors for each query term, create a representation for the subspace spanned by these vectors, and score by measuring cosine similarity with this subspace.

Example:

    $ java pitt.search.semanticvectors.Search -searchtype subspace abraham isaac
    Opening query vector store from file: termvectors.bin
    Dimensions = 200
    Searching term vectors, searchtype SUBSPACE ...
    Search output follows ...
    1.3770361:isaac
    1.0000002:abraham
    0.89358807:rebekah
    0.73906654:gerar
    0.7289439:digged
    0.7102588:bethuel
    ...

Careful readers will note that there is something wrong with these scores: they should never go above 1. I have been unable to track down the source of this problem; the unit tests for the orthogonalization code in VectorUtils all work just fine. Any help with this problem would be much appreciated.

MAXSIM

"Closest disjunction" - get vectors for each query term, measure the similarity of each candidate to every query term, and score the candidate by its closest query term (that is, the maximum similarity, or equivalently the minimum distance).

Example:

    $ java pitt.search.semanticvectors.Search -searchtype maxsim abraham isaac
    Opening query vector store from file: termvectors.bin
    Dimensions = 200
    Searching term vectors, searchtype MAXSIM ...
    Search output follows ...
    1.0000002:isaac
    1.0:abraham
    0.6413876:sarah
    0.64067477:rebekah
    0.5391283:gerar
    0.51310706:digged
    ...
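The two disjunction-style scores described above, SUBSPACE and MAXSIM, can be contrasted in a small sketch. This is an illustration under stated assumptions rather than the package's code: the class DisjunctionScoresSketch and its methods are hypothetical, the subspace is represented by a Gram-Schmidt orthonormal basis, and "cosine similarity with the subspace" is taken to mean the length of the projection of the unit-normalized candidate onto that subspace.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names and the choice of Gram-Schmidt are
// assumptions, not taken from the SemanticVectors code base.
final class DisjunctionScoresSketch {

  static double dot(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  static double norm(double[] a) {
    return Math.sqrt(dot(a, a));
  }

  static double cosine(double[] a, double[] b) {
    double denominator = norm(a) * norm(b);
    return denominator == 0.0 ? 0.0 : dot(a, b) / denominator;
  }

  /** MAXSIM-style score: best cosine similarity to any single query vector. */
  static double maxSim(double[] candidate, List<double[]> queryVectors) {
    double best = Double.NEGATIVE_INFINITY;
    for (double[] q : queryVectors) {
      best = Math.max(best, cosine(candidate, q));
    }
    return best;
  }

  /** Modified Gram-Schmidt: returns an orthonormal basis for the span of the inputs. */
  static List<double[]> orthonormalize(List<double[]> vectors) {
    List<double[]> basis = new ArrayList<>();
    for (double[] v : vectors) {
      double[] w = v.clone();
      for (double[] b : basis) {
        double projection = dot(w, b);
        for (int i = 0; i < w.length; i++) {
          w[i] -= projection * b[i];
        }
      }
      double length = norm(w);
      if (length < 1e-12) {
        continue; // vector is (numerically) dependent on the basis so far
      }
      for (int i = 0; i < w.length; i++) {
        w[i] /= length;
      }
      basis.add(w);
    }
    return basis;
  }

  /** SUBSPACE-style score: projection length of the unit-normalized candidate. */
  static double subspaceScore(double[] candidate, List<double[]> queryVectors) {
    double length = norm(candidate);
    if (length == 0.0) {
      return 0.0;
    }
    double sumOfSquares = 0.0;
    for (double[] b : orthonormalize(queryVectors)) {
      double component = dot(candidate, b) / length;
      sumOfSquares += component * component;
    }
    return Math.sqrt(sumOfSquares);
  }

  public static void main(String[] args) {
    List<double[]> query = new ArrayList<>();
    query.add(new double[] {1.0, 0.0, 0.0});
    query.add(new double[] {1.0, 1.0, 0.0});
    double[] candidate = {0.0, 1.0, 0.0};
    System.out.println("maxsim:   " + maxSim(candidate, query));
    System.out.println("subspace: " + subspaceScore(candidate, query));
  }
}
```

With an orthonormal basis, the projection of a unit vector cannot be longer than 1 (up to rounding), which is the property the SUBSPACE scores shown above violate. The sketch does not diagnose that problem; it only shows what the expected bound looks like on toy data.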
TENSOR

A product similarity that trains on ordered pairs of terms, takes a target query term, and searches for the term whose tensor product with the target term gives the largest similarity with the training tensor. The queryterms should be a list of one or more tilde-separated training pairs, e.g., paris~france berlin~germany, followed by a list of one or more search terms, e.g., london birmingham.

Example:

    $ java pitt.search.semanticvectors.Search -searchtype tensor abraham~isaac jacob~joseph jesse
    Opening query vector store from file: termvectors.bin
    Dimensions = 200
    Training pair: abraham~isaac
    Training pair: jacob~joseph
    Searching term vectors, searchtype TENSOR ...
    Search output follows ...
    0.08903535:jacob
    0.052856527:leah
    0.052095436:rachel
    0.051172566:bilhah
    0.048837323:laban
    0.04426618:speckled
    ...

In an ideal world, this product would be designed to find the item that stands in the same relation to jesse that isaac stands in with respect to abraham and joseph stands in with respect to jacob, so we might hope that the model picks out david (who was the most famous son of jesse in the Bible). As you can see, it doesn't behave so nicely, but seems to be just finding terms related to jacob.

CONVOLUTION

Similar to TENSOR: a product similarity that trains on ordered pairs of terms, takes a target query term, and searches for the term whose convolution product with the target term gives the largest similarity with the training convolution.

Example:

    $ java pitt.search.semanticvectors.Search -searchtype convolution abraham~isaac jacob~joseph jesse
    Opening query vector store from file: termvectors.bin
    Dimensions = 200
    Training pair: abraham~isaac
    Training pair: jacob~joseph
    Searching term vectors, searchtype CONVOLUTION ...
    Search output follows ...
    0.30361828:faithfulness
    0.28568172:reproached
    0.2765901:lo
    0.2596051:wondrous
    0.24092554:governors
    0.23591144:coat
    ...

Results look considerably less good than the results with tensor similarity.
It would of course be premature to dismiss these options as "not working"; it may just be that these examples are not what they are good for.

PRINTQUERY

Build an additive query vector (as with SUM) and print out the query vector for debugging.
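Returning to the TENSOR search type described earlier on this page, the idea of "the term whose tensor product with the target term gives the largest similarity with the training tensor" can be sketched as follows. This is a sketch under assumptions, not the package's implementation: the class TensorSimilaritySketch and its methods are hypothetical, the training tensor is assumed to be the sum of the outer products of the training pairs, and similarity is assumed to be the cosine under the Frobenius inner product.

```java
// Illustrative sketch only: names, the summed outer-product training tensor
// and the Frobenius-cosine score are assumptions made for this example.
final class TensorSimilaritySketch {

  /** Outer (tensor) product of two vectors: result[i][j] = a[i] * b[j]. */
  static double[][] outer(double[] a, double[] b) {
    double[][] result = new double[a.length][b.length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < b.length; j++) {
        result[i][j] = a[i] * b[j];
      }
    }
    return result;
  }

  /** Training tensor: sum of outer products of the ordered pairs firsts[k]~seconds[k]. */
  static double[][] train(double[][] firsts, double[][] seconds) {
    double[][] tensor = new double[firsts[0].length][seconds[0].length];
    for (int k = 0; k < firsts.length; k++) {
      double[][] pair = outer(firsts[k], seconds[k]);
      for (int i = 0; i < tensor.length; i++) {
        for (int j = 0; j < tensor[i].length; j++) {
          tensor[i][j] += pair[i][j];
        }
      }
    }
    return tensor;
  }

  /** Frobenius inner product: sum over all entries of x[i][j] * y[i][j]. */
  static double frobeniusInner(double[][] x, double[][] y) {
    double sum = 0.0;
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j < x[i].length; j++) {
        sum += x[i][j] * y[i][j];
      }
    }
    return sum;
  }

  /** Cosine between outer(target, candidate) and the training tensor. */
  static double score(double[][] training, double[] target, double[] candidate) {
    double[][] product = outer(target, candidate);
    double denominator =
        Math.sqrt(frobeniusInner(product, product) * frobeniusInner(training, training));
    return denominator == 0.0 ? 0.0 : frobeniusInner(product, training) / denominator;
  }

  public static void main(String[] args) {
    // One made-up training pair in two dimensions: first ~ second.
    double[][] training = train(new double[][] {{1.0, 0.0}}, new double[][] {{0.0, 1.0}});
    double[] target = {1.0, 0.0};
    System.out.println(score(training, target, new double[] {0.0, 1.0}));
    System.out.println(score(training, target, new double[] {1.0, 0.0}));
  }
}
```

In a search, every candidate term vector would be scored against the same training tensor and target, and the highest-scoring terms returned.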
Link http://semanticvectors.googlecode.com/svn/trunk/doc/pitt/search/semanticvectors/Search.SearchType.html is broken.
Can we get a working link for the search types? http://semanticvectors.googlecode.com/svn/trunk/doc/pitt/search/semanticvectors/Search.SearchType.html is still broken
Hey, sorry to take so long to see these comments. The grand flags refactor did away with the explicit searchtype enum, so the links to the relevant VectorSearcher implementations are expressed (somewhat uglily) in the Javadoc at http://semanticvectors.googlecode.com/svn/trunk/doc/pitt/search/semanticvectors/Search.html.
http://semanticvectors.googlecode.com/svn/trunk/doc/pitt/search/semanticvectors/Search.html is also broken
Ehsan, fixed all the broken links