My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
TextPairParameters  
Parameters for the new() method of the Text::Pair object
Updated Feb 5, 2010 by artfl.pr...@gmail.com

Introduction

There are three different modes of use for the Text::Pair object: index, server, and query mode. This page describes the options for initializing a Text::Pair object for each of these modes.

Parameters

data_directory

This directory contains the shingle index, bibliography and a directory called query_docs which contains uploaded documents sent as queries to the server. This directory must be readable and writeable by the www user.

document_queue

An reference to an array of hashes of the form {name => 'Gettysburg Address', file => '/path/to/gb.doc'} . Name can be any string and file is the full path to the file to be indexed.

shingle_size

default 3

How many words to include in each shingle.

stop_words

Words to be ignored and not used in shingles. This is represented internally as a hash, but is set by an array reference.

start_tags

A reference to an array of tags that indicate the start of document for processing purposes. If multiple tags are present, only one is required to open a document and begin accumulating shingles. If no start tags are submitted, the entire document will be processed.

omit_tags

A reference to an array of hashes of the form {start => '<div1>', end => '</div1>'} , containing opening and closing tags that describing sections of text to be skipped during indexing.

min_word_length

An integer specifying the minimum byte length for a word. Words shorter than this length will not be indexed.

normalize_case

default 1

0 or 1; 1 to flatten upper case characters to lower case when building shingles.

omit_numerals

default 1 0 or 1; 1 to remove numerals from shingles.

word_pattern

default [\&A-Za-z0-9\177-\377][\&A-Za-z0-9\177-\377\';]* A string, a regular expression defining a pattern that matches a word. This pattern controls word sementation.

replacement_patterns

A reference to an array of arrays regular expressions to be run on each word before shingling. Arrays look like [[qr/\&lt/o, ''], with the 0th element being the pattern to match and the 1st element being the replacement pattern. It can be useful to compile the patterns using qr for efficiency.

break_before_patterns, break_after_patterns

A reference to an arrays of patterns that should split a word into two, with the split occurring either before or after that match. The matching characters are retained in the word and hence in the shingle. Most often used to split words on apostrophes.

min_bilateral_overlap

default .5 Decimal from 0 to 1, the minimum fraction of distinct shingles from either match that must appear in both matches. For example, if set to .5, half of all the distinct shingles in the set of all shingles that appear in either match range must appear in both match ranges.

min_unilateral_overlap

default .5 Decimal from 0 to 1, the minimum fraction of disinct shingles from each match range that must appear in the other match. For example, if set to .3, then 30% of the distinct shingles in each match range must appear in the other match range.

max_gap

default 5

Integer, the maximum number of continous shingles in a match range that do not appear in the other match range. For example, if set to 5, then any match range in a pair may contain at most 5 running shingles that do not appear in the other match range.

min_pair_shingles

default 2

Integer, the minimum number of shingles that must be shared by both matches.

check_overmatch

default 1

0 or 1, 1 to check to see if the query document appears to be a near-duplicate of an existing document. If this is set 0, no check will be done, and duplicate query documents may take a very long time to process.

max_doc_percent_matches

default .50

Decimal 0 to 1, the fraction of shingles that can be shared between two documents before they are considered to be duplicates.

max_doc_shingles_matches

default 10000

Integer, the number of shingles that can be shared between two documents before they are considered to be duplicates.

min_doc_length_match_test

default 5000

Integer, the minimum document shingle length for considering a document to be a duplicate or partial duplicate of another document. It may be useful to submit shorter query documents that are wholly contained within other documents in the corpus, so duplicates are not reported for smaller documents.

num_shingle_buckets

default 1000000

Integer, the number of shingle buckets to be used in the shingle index. Shingles are looked up by loading into memory a range of indexed values for shingles sharing the same hash value, or "bucket".

==shingle_server_port Integer, the port to run the shingle server on.

Bloom Filter Parameters

bloom_capacity

default 150000000

Integer, the desired Bloom filter capacity. This should be set to the number of distinct shingles in the corpus.

bloom_error

default 0.001

Decimal, the desired error rate for the Bloom filter. Lower error rates will require a larger amount of memory for the Bloom filter.


Sign in to add a comment
Powered by Google Project Hosting