Input data format

Oluolu input files need to have the following three components on each line: a user ID, a query string, and a time stamp.

The following is the sample input.

1234 yhao 2006-03-01 07:17:00
1234 yahoo 2006-03-01 07:17:12
6519 yhao 2006-03-01 08:10:00
6519 yahoo 2006-03-01 08:10:12
2534 yhao 2006-03-01 07:16:00
2534 yahoo 2006-03-01 07:16:12
43778 yahoo 2006-03-01 08:16:12
43778 yahoo news 2006-03-01 08:16:12
438 yahoo 2006-03-02 08:16:12
438 yahoo news 2006-03-02 08:16:12

The first line means user 1234 submitted query 'yhao' at '2006-03-01 07:17:00'.

Spelling correction dictionary creation

bin/oluolu spellcheck
-input INPUT use INPUT as input resource
-output OUTPUT use OUTPUT as output prefix
[-timeRange TIME] use TIME as the threshold of the maximum time range in one session
[-queriesPerPerson NUM] use NUM as the threshold of the maximum number of queries per person
[-minimumSupport MIN] use MIN as the threshold of minimum frequency of the query strings
[-likelihoodThredhold VALUE] use VALUE as the threshold of log likelihood ratio
[-lengthThredhold VALUE] use VALUE as the threshold of the length difference
[-inputLanguage VALUE] use VALUE as the language of input data
[-help] show help
[-version] show version
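
As a rough illustration of what -timeRange and -minimumSupport control, the following Java sketch pairs consecutive queries from the same user that fall within a time range, then keeps only the pairs seen often enough. It is not Oluolu's implementation: the names CandidatePairs, Entry and frequentPairs are hypothetical, the time unit is assumed to be seconds, and the log likelihood ratio and length difference checks are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CandidatePairs {
    // One log entry. The time is given in seconds since midnight to keep the sketch short.
    static final class Entry {
        final String user;
        final String query;
        final long seconds;

        Entry(String user, String query, long seconds) {
            this.user = user;
            this.query = query;
            this.seconds = seconds;
        }
    }

    // Counts "first -> second" pairs issued back to back by one user within
    // timeRange seconds, then drops pairs seen fewer than minimumSupport times.
    // Assumes the log is grouped by user and ordered by time within each user.
    static Map<String, Integer> frequentPairs(List<Entry> log, long timeRange, int minimumSupport) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 1; i < log.size(); i++) {
            Entry prev = log.get(i - 1);
            Entry cur = log.get(i);
            boolean sameSession = prev.user.equals(cur.user)
                    && cur.seconds - prev.seconds <= timeRange;
            if (sameSession && !prev.query.equals(cur.query)) {
                counts.merge(prev.query + " -> " + cur.query, 1, Integer::sum);
            }
        }
        counts.values().removeIf(n -> n < minimumSupport);
        return counts;
    }

    public static void main(String[] args) {
        // Entries taken from the sample input (2006-03-01 only).
        List<Entry> log = new ArrayList<>();
        log.add(new Entry("1234", "yhao", 26220));        // 07:17:00
        log.add(new Entry("1234", "yahoo", 26232));       // 07:17:12
        log.add(new Entry("6519", "yhao", 29400));        // 08:10:00
        log.add(new Entry("6519", "yahoo", 29412));       // 08:10:12
        log.add(new Entry("2534", "yhao", 26160));        // 07:16:00
        log.add(new Entry("2534", "yahoo", 26172));       // 07:16:12
        log.add(new Entry("43778", "yahoo", 29772));      // 08:16:12
        log.add(new Entry("43778", "yahoo news", 29772)); // 08:16:12
        System.out.println(frequentPairs(log, 60, 2));    // prints {yhao -> yahoo=3}
    }
}
```

With a time range of 60 seconds and a minimum support of 2, only the pair 'yhao' followed by 'yahoo' survives, because three users issued it; 'yahoo' followed by 'yahoo news' occurs once in this subset and is dropped.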

Related queries dictionary creation

bin/oluolu context
-input INPUT use INPUT as input resource
-output OUTPUT use OUTPUT as output prefix
[-timeRange TIME] use TIME as the threshold of the maximum time range in one session
[-queriesPerPerson NUM] use NUM as the threshold of the maximum number of queries per person
[-minimumSupport MIN] use MIN as the threshold of minimum frequency of the query strings
[-ratioThredhold VALUE] use VALUE as the threshold of the frequency ratio
[-help] show help
[-version] show version

Further Configuration

Oluolu can be configured through the configuration XML file (oluolu-site.xml) in the conf directory. The following settings are available in this file.

Input time pattern
The time pattern of the input is configurable by changing the value of "oluolu.date.format". The pattern should follow the date and time patterns of Java's SimpleDateFormat class. The default value is 'yyyy-MM-dd HH:mm:ss'.

Number of reducers
The number of reducers can be changed by overriding the value of "oluolu.reduces". This value should be adjusted when handling large data.
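
To check that a time pattern matches the input before running a job, a line can be parsed with java.text.SimpleDateFormat directly. The sketch below is only an illustration: the class QueryLogLine and its parse method are hypothetical, and the tab separator between the three fields is an assumption, since the field delimiter is not stated here.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class QueryLogLine {
    // Default pattern quoted above; a value set for oluolu.date.format would replace it.
    static final String DEFAULT_PATTERN = "yyyy-MM-dd HH:mm:ss";

    final String userId;
    final String query;
    final Date time;

    QueryLogLine(String userId, String query, Date time) {
        this.userId = userId;
        this.query = query;
        this.time = time;
    }

    // Splits one line into user ID, query string and time stamp.
    // Assumption: the fields are separated by tabs, so queries may contain spaces.
    static QueryLogLine parse(String line, String pattern) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        SimpleDateFormat format = new SimpleDateFormat(pattern);
        format.setLenient(false); // reject time stamps that do not match the pattern
        try {
            return new QueryLogLine(fields[0], fields[1], format.parse(fields[2]));
        } catch (ParseException e) {
            throw new IllegalArgumentException("time stamp does not match " + pattern + ": " + fields[2], e);
        }
    }

    public static void main(String[] args) {
        QueryLogLine entry = parse("43778\tyahoo news\t2006-03-01 08:16:12", DEFAULT_PATTERN);
        String time = new SimpleDateFormat(DEFAULT_PATTERN).format(entry.time);
        System.out.println(entry.userId + " | " + entry.query + " | " + time);
    }
}
```

If the logs use a different layout, for example '01/03/2006 08:16', the same check can be repeated with the pattern 'dd/MM/yyyy HH:mm' before that value is placed in the configuration file.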