Usage  

Updated Apr 27, 2011 by takahik...@gmail.com

Input data format

Oluolu input files must contain the following three components:

  • user id - the id (or IP address) of the user who submitted the query
  • time - the time when the query was submitted
  • query string - the query string submitted by the user
Each line of the input data has three corresponding columns separated by tabs, in the order user id, query string, and time.

The following is a sample input.

  1234	yhao	2006-03-01 07:17:00
  1234	yahoo	2006-03-01 07:17:12
  6519	yhao	2006-03-01 08:10:00
  6519	yahoo	2006-03-01 08:10:12
  2534	yhao	2006-03-01 07:16:00
  2534	yahoo	2006-03-01 07:16:12
  43778	yahoo	2006-03-01 08:16:12
  43778	yahoo news	2006-03-01 08:16:12
  438	yahoo	2006-03-02 08:16:12
  438	yahoo news	2006-03-02 08:16:12

The first line means that user 1234 submitted the query 'yhao' at '2006-03-01 07:17:00'.
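
The format described above can be checked before running Oluolu with a small script. The following is a minimal sketch in Python, not part of Oluolu itself: the names `parse_line` and `TIME_FORMAT` are hypothetical, and the time format assumes the default pattern 'yyyy-MM-dd HH:mm:ss'.

```python
from datetime import datetime

# Python equivalent of the default pattern 'yyyy-MM-dd HH:mm:ss' (assumed here)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_line(line):
    """Split one input line into (user id, query string, time)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    user_id, query, time_text = fields
    # strptime raises ValueError when the time does not match the format
    return user_id, query, datetime.strptime(time_text, TIME_FORMAT)


# Parse the first line of the sample input
record = parse_line("1234\tyhao\t2006-03-01 07:17:00")
print(record)
```

A line with a missing column or a malformed time raises `ValueError`, so a file can be validated by calling `parse_line` on each of its lines.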

Spelling correction dictionary creation

bin/oluolu   spellcheck
             -input INPUT                 use INPUT  as input resource
             -output OUTPUT               use OUTPUT as output prefix
             [-timeRange TIME]            use TIME   as the threshold of the maximum time range in one session
             [-queriesPerPerson NUM]      use NUM    as the threshold of the maximum number of queries per person
             [-minimumSupport MIN]        use MIN    as the threshold of minimum frequency of the query strings
             [-likelihoodThredhold VALUE] use VALUE  as the threshold of log likelihood ratio
             [-lengthThredhold VALUE]     use VALUE  as the threshold of length difference
             [-inputLanguage VALUE]       use VALUE  as the language of input data
             [-help]                      show help
             [-version]                   show version

Related queries dictionary creation

bin/oluolu   context
             -input INPUT                 use INPUT  as input resource
             -output OUTPUT               use OUTPUT as output prefix
             [-timeRange TIME]            use TIME   as the threshold of the maximum time range in one session
             [-queriesPerPerson NUM]      use NUM    as the threshold of the maximum number of queries per person
             [-minimumSupport MIN]        use MIN    as the threshold of minimum frequency of the query strings
             [-ratioThredhold VALUE]      use VALUE  as the threshold of the frequency ratio
             [-help]                      show help
             [-version]                   show version

Further Configuration

Oluolu can be configured further through the XML configuration file (oluolu-site.xml) in the conf directory. The following settings are available in that file.

Input time pattern

The time pattern of the input is configurable by changing the value of "oluolu.date.format". The pattern must follow the date and time pattern syntax of Java's SimpleDateFormat class. The default value is 'yyyy-MM-dd HH:mm:ss'.

Number of reducers

The number of reducers can be changed by overriding the value of "oluolu.reduces". Increase this value when processing large data.
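
As an illustration, the two settings above might be overridden in conf/oluolu-site.xml as follows. This is only a sketch: it assumes the file uses the Hadoop-style `<configuration>`/`<property>` layout, which this page does not confirm, and the reducer count of 10 is an arbitrary example value.

```xml
<?xml version="1.0"?>
<!-- Assumed layout: Hadoop-style configuration file (not confirmed by this page) -->
<configuration>
  <property>
    <!-- Time pattern of the input, in SimpleDateFormat syntax (default shown) -->
    <name>oluolu.date.format</name>
    <value>yyyy-MM-dd HH:mm:ss</value>
  </property>
  <property>
    <!-- Number of reducers; 10 is an illustrative value only -->
    <name>oluolu.reduces</name>
    <value>10</value>
  </property>
</configuration>
```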

