icsiboost
Open-source implementation of Boostexter (Boosting based classifier)
    
Featured download: icsiboost-r102-static.gz

Boosting is a meta-learning approach that combines an ensemble of weak classifiers into a strong classifier. Adaptive Boosting (AdaBoost) greedily searches for a linear combination of classifiers by increasing the weight of the examples that each successive classifier misclassifies. icsiboost implements AdaBoost over stumps (one-level decision trees) on discrete and continuous attributes (words and real values). See http://en.wikipedia.org/wiki/AdaBoost and the papers by Y. Freund and R. Schapire for more details. This approach is one of the simplest and most efficient ways to combine continuous and nominal values. Our implementation is designed to train on millions of examples with hundreds of features (or millions of sparse features) in reasonable time and memory.
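
To make the reweighting idea concrete, here is a minimal sketch of binary AdaBoost over threshold stumps on continuous features. It is a simplified textbook variant for illustration only, not icsiboost's actual code (which follows Boostexter and handles multiple classes and text features); all function names and the toy data are hypothetical.

```python
import math

def stump_predict(stump, row):
    """A stump tests one feature against one threshold and answers +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else -polarity

def best_stump(rows, labels, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best = float("inf"), None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: one below the minimum, then midpoints.
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for row, y, w in zip(rows, labels, weights)
                          if stump_predict(stump, row) != y)
                if err < best_err:
                    best_err, best = err, stump
    return best_err, best

def adaboost(rows, labels, iterations):
    """Return a list of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(iterations):
        err, stump = best_stump(rows, labels, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, row))
                   for row, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, row):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes, but three stumps do.
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(rows, labels, iterations=3)
predictions = [classify(ensemble, row) for row in rows]
```

On this toy set, each stump alone misclassifies at least two of the six examples, while the weighted vote of three stumps recovers every training label.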

Here is an excellent tutorial on Boosting: http://nips.cc/Conferences/2007/Program/event.php?ID=575

NEWS:

Get and compile (requires a PCRE release dated 01-December-2003 or later):

svn checkout http://icsiboost.googlecode.com/svn/trunk/ .
cd icsiboost
./configure CFLAGS=-O3
make

Program usage (revision 100):

USAGE: icsiboost [options] -S <stem>
  --version               print version info
  -S <stem>               defines model/data/names stem
  -n <iterations>         number of boosting iterations (also limits test time classifiers, if model is not packed)
  -E <smoothing>          set smoothing value (default=0.5)
  -V                      verbose mode
  -C                      classification mode -- reads examples from <stdin>
  -o                      long output in classification mode
  -N <text_expert>        choose a text expert between fgram, ngram and sgram
  -W <ngram_length>       specify window length of text expert
  --dryrun                only parse the names file and the data file to check for errors
  --cutoff <freq>         ignore nominal features occurring infrequently (shorten training time)
  --no-unk-ngrams         ignore ngrams that contain the "unk" token
  --jobs <threads>        number of threaded weak learners
  --do-not-pack-model     do not pack model (this is the default behavior)
  --pack-model            pack model (for boostexter compatibility)
  --output-weights        output training examples weights at each iteration
  --posteriors            output posterior probabilities instead of boosting scores
  --model <model>         save/load the model to/from this file instead of <stem>.shyp
  --resume                resume training from a previous model (can use another dataset for adaptation)
  --train <file>          bypass the <stem>.data filename to specify training examples
  --dev <file>            bypass the <stem>.dev filename to specify development examples
  --test <file>           bypass the <stem>.test filename to specify test examples
  --names <file>          use this column description file instead of <stem>.names
  --ignore <columns>      ignore a comma separated list of columns (synonym with "ignore" in names file)
  --ignore-regex <regex>  ignore columns that match a given regex
  --only <columns>        use only a comma separated list of columns (synonym with "ignore" in names file)
  --only-regex <regex>    use only columns that match a given regex
  --interruptible         save model after each iteration in case of failure/interruption
  --optimal-iterations    output the model at the iteration that minimizes dev error (or max fmeasure if specified)
  --max-fmeasure <class>  display maximum f-measure of specified class instead of error rate
  --fmeasure-beta <float> specify weight of recall compared to precision in f-measure
  --abstaining-stump      use abstain-on-absence text stump (experimental)
  --no-unknown-stump      use abstain-on-unknown continuous stump (experimental)
  --sequence              generate column __SEQUENCE_PREVIOUS from previous prediction at test time (experimental)
  --anti-prior            set initial weights to focus on classes with a lower prior (experimental)
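
As a reading aid for --max-fmeasure and --fmeasure-beta, the sketch below computes the conventional F-beta score, in which beta sets the weight of recall relative to precision. It assumes icsiboost follows this standard definition, which the usage text does not confirm; the function name and the default beta=1.0 are illustrative choices, not icsiboost's.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both terms vanish
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# With perfect precision but only 50% recall, a recall-heavy beta
# scores lower than a precision-heavy one.
recall_heavy = f_measure(1.0, 0.5, beta=2.0)
precision_heavy = f_measure(1.0, 0.5, beta=0.5)
```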

The input data uses a format similar to that of the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). To adapt a UCI dataset, remove blank lines and comments, and add a period at the end of each line. A dataset must include a <stem>.data file with training examples and a <stem>.names file describing the classes and features; it may also include <stem>.dev and <stem>.test files, which are used to compute error rates.
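
The following sketch writes a tiny <stem>.names and <stem>.data pair with every line terminated by a period and no blank lines or comments. The exact layout is an assumption based on the Boostexter/C4.5-style convention (classes on the first line of the names file, then one "name: type." line per column, with the label as the last field of each data line); the type keywords "continuous" and "text", the helper name, and the toy examples are illustrative and should be checked against the project's documentation.

```python
import os
import tempfile

def write_dataset(directory, stem, classes, features, examples):
    """Write <stem>.names and <stem>.data; every line ends with a period.

    features: list of (column_name, column_type) pairs (types are assumed).
    examples: list of (values, label) pairs, values in column order.
    Values must not contain commas or periods in this simple sketch.
    """
    names_path = os.path.join(directory, stem + ".names")
    data_path = os.path.join(directory, stem + ".data")
    with open(names_path, "w") as names:
        names.write(", ".join(classes) + ".\n")
        for name, kind in features:
            names.write("%s: %s.\n" % (name, kind))
    with open(data_path, "w") as data:
        for values, label in examples:
            fields = [str(v) for v in values] + [label]
            data.write(", ".join(fields) + ".\n")
    return names_path, data_path

workdir = tempfile.mkdtemp()
names_path, data_path = write_dataset(
    workdir, "toy",
    classes=["spam", "ham"],
    features=[("length", "continuous"), ("subject", "text")],
    examples=[([12, "cheap pills now"], "spam"),
              ([48, "meeting agenda for monday"], "ham")],
)
```

With these files in place, training would be started from that directory with the -S option from the usage above, for example `icsiboost -S toy -n <iterations>`.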

Currently, icsiboost is used at ICSI (http://www.icsi.berkeley.edu/) for sentence boundary detection and other speech understanding related classification tasks. If you find icsiboost useful, please send us an email describing your work.

icsiboost is still limited: see the MISSING features. Upcoming releases will focus on code cleanup, stabilization and usability. After that, the project will diverge from Boostexter by providing a different user interface (command line options, file formats...) and by implementing other approaches. In the long term, we may add scripting-language bindings (perl, python, ...) and a clean library interface.

NOTES:

MISSING: