Code license: Apache License 2.0
Project owners:
  dscul...@google.com, sculleyd
Project committers:
gm...@google.com

sofia-ml

The suite of fast incremental algorithms for machine learning (sofia-ml) can be used for training models for classification or ranking, using several different techniques. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets.

Supported learners include:

These learners can be configured for binary classification, ranking, and optimizing ROC area, through the use of several available sampling methods for stochastic gradient descent.

This implementation is very fast. For example, 100,000 Pegasos SVM training iterations can be performed on data from the CCAT task of the RCV1 benchmark data set (roughly 780,000 examples) in 0.1 CPU seconds on an ordinary 2.4GHz laptop, with no loss in classification performance compared with other SVM methods. On LETOR learning-to-rank benchmark tasks, 100,000 Pegasos SVM rank steps complete in 0.2 CPU seconds on an ordinary laptop.

The primary computational bottleneck is actually reading the data from disk, and sofia-ml reads and parses data substantially faster than the other SVM packages we tested. For example, sofia-ml can read and parse data nearly 10 times faster than the reference Pegasos implementation by Shalev-Shwartz, and nearly 3 times faster than svm_perf by Joachims.
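
For readers unfamiliar with the Pegasos SVM steps mentioned above, the following is a minimal Python sketch of the standard Pegasos stochastic sub-gradient update on sparse examples. It is not sofia-ml's code: the function name `pegasos_train`, the dict-based weight vector, and the omission of the bias term and of the optional projection step are all assumptions made for brevity. It also rescales every weight on each step, which a fast implementation would avoid.

```python
import random
from typing import Dict, List, Tuple

# (label in {-1, +1}, sparse feature map {feature_id: value})
Example = Tuple[float, Dict[int, float]]


def pegasos_train(examples: List[Example], lam: float, iterations: int,
                  seed: int = 0) -> Dict[int, float]:
    """Illustrative Pegasos loop: one randomly sampled example per step."""
    rng = random.Random(seed)
    w: Dict[int, float] = {}
    for t in range(1, iterations + 1):
        y, x = rng.choice(examples)      # stochastic sampling of one example
        eta = 1.0 / (lam * t)            # Pegasos learning rate 1 / (lambda * t)
        margin = y * sum(w.get(i, 0.0) * v for i, v in x.items())
        # Regularization step: w <- (1 - eta * lambda) * w = (1 - 1/t) * w.
        scale = 1.0 - 1.0 / t
        for i in w:
            w[i] *= scale
        # Hinge-loss step: only examples inside the margin move the weights.
        if margin < 1.0:
            for i, v in x.items():
                w[i] = w.get(i, 0.0) + eta * y * v
    return w
```

A prediction for a new example is then the dot product of `w` with its sparse features, and its sign gives the predicted class.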

This package provides a commandline utility for training models and using them to predict on new data, and also exposes an API for model training and prediction that can be used in new applications. The underlying libraries for data sets, weight vectors, and example vectors are also provided for researchers wishing to use these classes to implement other algorithms.




Quick Start

These quick-start instructions assume the use of the unix/linux commandline, with g++ installed. There are no external code dependencies.

Step 1 Check out the code:

> svn checkout http://sofia-ml.googlecode.com/svn/trunk/sofia-ml sofia-ml-read-only

Step 2 Compile the code:

> cd sofia-ml-read-only/src/
> make
> ls ../sofia-ml
# Executable should be in main sofia-ml-read-only directory.

# If the above did not succeed, run the unit tests to help locate the problem:
> make clean
> make all_test

Step 3 Test the code:

> cd ..
> ./sofia-ml
# This should display the set of commandline flags and descriptions.

# Train a model on the demo training data.
> ./sofia-ml --learner_type pegasos --loop_type stochastic --lambda 0.1 --iterations 100000 --dimensionality 150000 --training_file demo/demo.train --model_out demo/model
# This should display something like the following:
Reading training data from: demo/demo.train
Time to read training data: 0.056134
Time to complete training: 0.075364
Writing model to: demo/model
   Done.

# Test the model on the demo data.
> ./sofia-ml --model_in demo/model --test_file demo/demo.train --results_file demo/results.txt
# Should display the following:
Reading model from: demo/model
   Done.
Reading test data from: demo/demo.train
Time to read test data: 0.046729
Time to make test prediction results: 0.000844
Writing test results to: demo/results.txt
   Done.

# Examine a few results in the results file:
> head -5 demo/results.txt
# Format is: <prediction value>\t<label from test file>.  Each line in the results
# file corresponds to the same line (in order) in the test file.
1.02114	1
1.18046	1
-1.24609	-1
-1.12822	-1
-1.41046	-1
# Note that exact results may vary slightly because these algorithms train
# by randomly sampling one example at a time.

# Evaluate the results:
> perl eval.pl demo/results.txt
# Should display something like:

Results for demo/results.txt: 

Accuracy  0.9880  (using threshold 0.00) (988/1000)
Precision 0.9719  (using threshold 0.00) (311/320)
Recall    0.9904  (using threshold 0.00) (311/314)
ROC area: 0.999406 

Total of 1000 trials. 

# Note that this evaluation script has limited functionality.  For more
# options, we recommend using the perf software by Rich Caruana (developed for
# the KDD Cup 2004), available at: http://kodiak.cs.cornell.edu/kddcup/software.html
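
To illustrate how the threshold-based numbers above follow from the results file, here is a minimal Python sketch that computes accuracy, precision, and recall from `<prediction value>\t<label>` lines. It is not a port of eval.pl: the function name `evaluate_results` is hypothetical, and it assumes that a prediction strictly greater than the threshold counts as positive and that any label greater than 0 is a positive example. ROC area is not computed.

```python
from typing import Dict, Iterable


def evaluate_results(lines: Iterable[str], threshold: float = 0.0) -> Dict[str, float]:
    """Compute accuracy, precision and recall from "<prediction>\t<label>" lines."""
    tp = fp = tn = fn = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        prediction, label = (float(value) for value in line.split()[:2])
        predicted_positive = prediction > threshold  # assumption: ties are negative
        actual_positive = label > 0
        if predicted_positive and actual_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual_positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "trials": float(total),
    }
```

Passing an open file handle for demo/results.txt as `lines` would evaluate the demo predictions at the default threshold of 0.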



Data Format

This package uses the popular SVM-light sparse data format.

<class-label> <feature-id>:<feature-value> ... <feature-id>:<feature-value>\n
<class-label> qid:<optional-query-id> <feature-id>:<feature-value> ... <feature-id>:<feature-value>\n
<class-label> <feature-id>:<feature-value> ... <feature-id>:<feature-value># Optional comment or extra data, following the optional "#" symbol.\n

The feature ids are expected to be in ascending numerical order. The lowest allowable feature-id is 1 (0 is reserved internally for the bias term). Any feature not specified is assumed to have value 0, which allows for a sparse representation.

The class label for test data is required but not used; it's okay to put in a dummy placeholder value such as 0 for test data. For binary-class classification problems, the training labels should be 1 or -1. For ranking problems, the labels may be any numeric value, with higher values being judged as "more preferred".

Currently, the comment string is not used. However, it is available for use in other algorithms, and can also be useful to aid in bookkeeping of data files.

Examples:

# Class label is 1, feature 1 has value 1.2, feature 2 (not listed) has value 0,
# and feature 3 has value -0.5.
1 1:1.2 3:-0.5

# Class label is -1, belongs to qid 3, and all feature values are zero except
# for feature 5011 with value 1.2.
-1 qid:3 5011:1.2

# Class label is -1, feature 1 has value 7, feature 3 has value -0.5, and the
# comment string is "This example is especially interesting."
-1 1:7 3:-0.5#This example is especially interesting.
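
The format rules and examples above can be read with a few lines of code. The following Python sketch parses one line into its label, optional query id, sparse features, and comment. The function name `parse_svmlight_line` is hypothetical, and treating repeated or out-of-order feature ids as an error is an assumption; sofia-ml's own parser may behave differently.

```python
from typing import Dict, Optional, Tuple


def parse_svmlight_line(line: str) -> Tuple[float, Optional[str], Dict[int, float], str]:
    """Parse one SVM-light line into (label, qid, features, comment)."""
    # Everything after the first "#" is the optional comment string.
    body, _, comment = line.rstrip("\n").partition("#")
    tokens = body.split()
    if not tokens:
        raise ValueError("empty example line")
    label = float(tokens[0])
    qid: Optional[str] = None
    features: Dict[int, float] = {}
    last_id = 0
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"malformed token: {token!r}")
        if key == "qid":
            qid = value
            continue
        feature_id = int(key)
        # Ids start at 1 (0 is reserved for the bias term) and must ascend.
        if feature_id <= last_id:
            raise ValueError(f"feature id {feature_id} is out of order or below 1")
        features[feature_id] = float(value)
        last_id = feature_id
    return label, qid, features, comment
```

Features missing from the returned dict are implicitly 0, matching the sparse representation described above.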



Commandline Details

File Input and Output

--model_in file

--model_out file

--training_file file

--test_file file

--results_file file

Learning Options

--learner_type type

--loop_type type

--eta_type type

--no_bias_term

--dimensionality int

--iterations int

--lambda float

--passive_aggressive_c float

--passive_aggressive_lambda float

--perceptron_margin_size float

--hash_mask_bits int

Other Options

--random_seed int

--training_objective

References

If you use this source code for scientific research, please cite the following:

Additional reading and references:








