Download and build the latest version
To install icsiboost from the latest Subversion commit, you need Subversion and common build tools on your system. Check out the source, configure it, build icsiboost, and optionally install it in a binary directory. Use --prefix=<dir> to install the program in <dir>/bin, and remember to add that directory to your PATH.
svn checkout http://icsiboost.googlecode.com/svn/trunk/icsiboost icsiboost
cd icsiboost/icsiboost
./configure --prefix=$HOME CFLAGS=-O3
make
make install
export PATH=$PATH:$HOME/bin
icsiboost has been reported to work on Linux and Mac OS X (on the latter, you will need to install PCRE from MacPorts, Fink, or its sources).
A simple example
Let's first download example files from the UCI repository using wget:
wget ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult/adult.*
The downloaded database is the adult database: a classification problem in which you have to determine the income of a person knowing various facts about him or her. The database contains a file describing the classes and features (adult.names), a file containing training examples (adult.data), and a file containing test examples (adult.test).
The names file: adult.names
>50K, <=50K.
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
...
sex: Female, Male.
capital-gain: continuous.
The first line of the .names file defines a comma-separated list of classes terminated by a period (here, >50K and <=50K are the income classes that we try to predict).
Each following line describes one feature column. The line consists of a feature name (age, workclass, ...) followed by a colon and a description of the values allowed for that feature. Features can be real-valued (continuous, as for age), space-separated words (text), or a set of nominal values (the values themselves are listed, as for sex). icsiboost implements decision stumps that partition the training examples into two groups: above or below a threshold for continuous values, and present or absent for text and nominal values.
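The format described above can be read with a few lines of Python. This is only a minimal sketch, not the parser icsiboost uses: the function name parse_names is hypothetical, and the sketch assumes a file that contains nothing but the class line and the feature lines (it does not handle comments or any other syntax the real parser may accept).

```python
def parse_names(text):
    """Parse the contents of a .names file into (classes, features).

    Hypothetical helper for illustration only; not part of icsiboost.
    """
    # Keep only non-empty lines; the first one lists the classes.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    classes = [c.strip() for c in lines[0].rstrip(".").split(",")]

    features = []
    for line in lines[1:]:
        # "name: description." -> split at the first colon.
        name, _, spec = line.partition(":")
        spec = spec.strip().rstrip(".").strip()
        if spec in ("continuous", "text"):
            kind, values = spec, []
        else:
            # Anything else is taken to be a list of nominal values.
            kind = "nominal"
            values = [v.strip() for v in spec.split(",")]
        features.append((name.strip(), kind, values))
    return classes, features


sample = """\
>50K, <=50K.
age: continuous.
sex: Female, Male.
"""
classes, features = parse_names(sample)
print(classes)
print(features)
```

The number of entries in features is the number of columns expected before the class label on each line of the training file.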
The training examples: adult.data
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K.
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K.
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K.
The training examples are given in a single file containing one instance per line. The feature columns are comma-separated and must appear in the same order, and follow the same constraints, as described in the .names file. The last column contains the true class of the example, followed by a period.
Some features can be unknown because of privacy concerns or other restrictions on training data collection. Such features take the value "?" (question mark) and receive special handling in the training and testing stages.
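As an illustration of the line format and of the "?" convention, here is a small Python sketch that splits one example into its feature values and class label. The function name parse_example and the shortened sample line are made up for this example, and the sketch assumes that no feature value contains a comma. It only marks unknown values as None; it does not reproduce whatever special handling icsiboost applies to them.

```python
def parse_example(line, n_features):
    """Split one training line into (feature values, class label).

    Hypothetical helper for illustration only; not part of icsiboost.
    """
    # Drop the final period, then split the comma-separated columns.
    fields = [f.strip() for f in line.strip().rstrip(".").split(",")]
    if len(fields) != n_features + 1:
        raise ValueError(
            f"expected {n_features + 1} columns, got {len(fields)}"
        )
    *values, label = fields
    # Represent unknown values ("?") as None so later code can skip them.
    values = [None if v == "?" else v for v in values]
    return values, label


# Made-up, shortened line with three features and one unknown value.
values, label = parse_example("39, ?, 77516, <=50K.", n_features=3)
print(values, label)
```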
Training a classifier
icsiboost -S adult -n 100
rnd 1: wh-err= 0.724256 th-err= 0.724256 dev= nan test= 0.236226 train= 0.240810
rnd 2: wh-err= 0.908697 th-err= 0.658130 dev= nan test= 0.196917 train= 0.199073
rnd 3: wh-err= 0.928621 th-err= 0.611153 dev= nan test= 0.157791 train= 0.158472
rnd 4: wh-err= 0.960223 th-err= 0.586843 dev= nan test= 0.155764 train= 0.157335
rnd 5: wh-err= 0.980548 th-err= 0.575428 dev= nan test= 0.151711 train= 0.152053
rnd 6: wh-err= 0.982552 th-err= 0.565388 dev= nan test= 0.151711 train= 0.152053
rnd 7: wh-err= 0.988624 th-err= 0.558956 dev= nan test= 0.151035 train= 0.151807
rnd 8: wh-err= 0.991000 th-err= 0.553925 dev= nan test= 0.149438 train= 0.149780
rnd 9: wh-err= 0.993974 th-err= 0.550587 dev= nan test= 0.146797 train= 0.148583
rnd 10: wh-err= 0.993322 th-err= 0.546911 dev= nan test= 0.146367 train= 0.148337
...
You may want to read the papers about AdaBoost before training a classifier to learn more about the whole process. When you invoke icsiboost, you have to provide a <stem> for the names and training files (adult for adult.names and adult.data) and the number of iterations to run. At each iteration, a new weak classifier is trained and added to the final decision. For each iteration, icsiboost outputs the iteration number (rnd), the weighted error (wh-err), which is the objective function minimized when selecting a classifier (Z() in the papers), the theoretical error (th-err, see the papers), and the test and train classification errors (the ratio of misclassified examples to the total number of examples). The classification error is the most interesting value, since you want it to be as low as possible. Adding more iterations reduces the training error further, but there is a risk of over-training, where the test error increases while the training error keeps decreasing. You should stop iterating before that happens.
The model resulting from training is written to the <stem>.shyp file. It contains all the information needed to rebuild the ensemble of weak classifiers during the testing stage.
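One way to spot the point where the test error stops improving is to parse the per-round log and look for the round with the lowest test error. The Python sketch below assumes the log lines have exactly the layout shown above; the names ROUND_RE, read_rounds and best_round are hypothetical helpers, not part of icsiboost. Choosing the number of rounds on the test set makes the reported test error optimistic, so a separate held-out set would be preferable when one is available.

```python
import re

# Matches lines such as:
# rnd 1: wh-err= 0.724256 th-err= 0.724256 dev= nan test= 0.236226 train= 0.240810
ROUND_RE = re.compile(r"rnd\s+(\d+):.*?test=\s*(\S+)\s+train=\s*(\S+)")


def read_rounds(lines):
    """Return a list of (round, test_error, train_error) tuples."""
    rounds = []
    for line in lines:
        m = ROUND_RE.search(line)
        if m:  # silently skip lines that are not per-round reports
            rounds.append(
                (int(m.group(1)), float(m.group(2)), float(m.group(3)))
            )
    return rounds


def best_round(rounds):
    """Return the tuple with the lowest test error (first one on ties)."""
    return min(rounds, key=lambda r: r[1])


# First three rounds copied from the training log above.
log = [
    "rnd 1: wh-err= 0.724256 th-err= 0.724256 dev= nan test= 0.236226 train= 0.240810",
    "rnd 2: wh-err= 0.908697 th-err= 0.658130 dev= nan test= 0.196917 train= 0.199073",
    "rnd 3: wh-err= 0.928621 th-err= 0.611153 dev= nan test= 0.157791 train= 0.158472",
]
rounds = read_rounds(log)
print(best_round(rounds))
```

With a saved log, the same functions can be applied to the lines of that file instead of the hard-coded list.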
Testing the performance of the classifier
icsiboost -S adult -C < adult.test
0 1 -0.321597856090 0.321597856090
0 1 -0.008942643370 0.008942643370
1 0 -0.047642775890 0.047642775890
1 0 0.229640678420 -0.229640678420
0 1 -0.313200595760 0.313200595760
0 1 -0.277185452430 0.277185452430
0 1 -0.169981706080 0.169981706080
1 0 0.036429657930 -0.036429657930
0 1 -0.263797352750 0.263797352750
0 1 -0.154846522240 0.154846522240
...
icsiboost loads a previously trained model and predicts the classes of the examples read from standard input. For each example, it first outputs one binary flag per class marking the true class (if available), then the prediction score for each class, in the order in which the classes are defined in the .names file. A positive score means that the classifier predicted that class.
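The output format described above is enough to compute an error rate by hand. The following Python sketch is one possible way to do it, under stated assumptions: each line holds one flag per class followed by one score per class, exactly one class is flagged as true, and the predicted class is taken to be the one with the highest score (which agrees with the "positive score" rule in the two-class output shown above). The function name error_rate is hypothetical.

```python
def error_rate(lines, n_classes):
    """Fraction of examples whose top-scoring class is not the true class."""
    errors = total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 2 * n_classes:
            continue  # not a "flags then scores" line
        flags = [int(f) for f in fields[:n_classes]]
        scores = [float(s) for s in fields[n_classes:]]
        if 1 not in flags:
            continue  # true class not available for this example
        truth = flags.index(1)
        # Assumption: the prediction is the class with the highest score.
        predicted = max(range(n_classes), key=scores.__getitem__)
        errors += predicted != truth
        total += 1
    return errors / total if total else float("nan")


# First four output lines copied from the classification run above.
sample = [
    "0 1 -0.321597856090 0.321597856090",
    "0 1 -0.008942643370 0.008942643370",
    "1 0 -0.047642775890 0.047642775890",
    "1 0 0.229640678420 -0.229640678420",
]
print(error_rate(sample, n_classes=2))
```

In the sample, only the third example is misclassified: its true class is the first one (>50K), but the second class receives the positive score.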
Getting the learning curve in gnuplot
You can plot the error curves of the training and test sets using gnuplot and the following commands:
icsiboost -n 100 -S adult | tee adult.iter
gnuplot
plot 'adult.iter' using 2:12 with lines title 'training error', '' using 2:10 with lines title 'test error'