Updated Feb 18, 2008 by benoit.favre
Labels: Featured
FileFormats  
Description of the file formats used by icsiboost.

The file formats are mostly compatible with BoosTexter.

Column description: <stem>.names

This file defines the classes, the column names, and the type of weak learners to generate for each column. Every line ends with a period. The first line contains a comma-separated list of class names. Each following line contains a column definition, which consists of a name, a colon, and a type. Column names consist of letters, underscores and digits (other characters can be used, but they may cause parse errors in the model file).

If the type is "text", the data is split into words (on spaces) and binary n-gram experts are generated (the type and length of the n-grams can be selected with the -N and -W command line options). If the type is "continuous", thresholding experts are generated. They are ternary because they distinguish three cases: the feature is above the threshold, below the threshold, or unknown. If the type is "ignore", the column is ignored (this is NOT compatible with BoosTexter). The type can also be a comma-separated list of nominal values. This type is deprecated and generates the same experts as the "text" type.

class1, class2, class3.
column1: text.
column2: text.
column3: text.
column4: continuous.
column5: continuous.
column6: continuous.
column7: continuous.
column8: ignore.
column9: ignore.
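
The layout described above can be read with a few lines of Python. This is only a minimal sketch, not the parser icsiboost uses: the function name `parse_names` is hypothetical, and the code assumes that every line ends with a period and that column names contain no colon.

```python
def parse_names(text):
    """Parse the contents of a <stem>.names file into (classes, columns)."""
    # Keep non-empty lines and drop the period that ends each one.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    lines = [ln[:-1] if ln.endswith(".") else ln for ln in lines]
    # First line: comma-separated list of class names.
    classes = [c.strip() for c in lines[0].split(",")]
    columns = []
    for ln in lines[1:]:
        # Column definition: a name, a colon, and a type.
        name, _, col_type = ln.partition(":")
        columns.append((name.strip(), col_type.strip()))
    return classes, columns


example = """class1, class2, class3.
column1: text.
column4: continuous.
column8: ignore.
"""
classes, columns = parse_names(example)
print(classes)
print(columns)
```

The type is kept as a plain string, so a deprecated comma-separated list of nominal values would be returned unchanged and could be treated like "text" by the caller.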

Example files: <stem>.{data,dev,test}

The example files contain one instance per line. Each line consists of columns separated by commas and ends with a period. Columns must appear in the same order as they are defined in the .names file. The last column is the true class (or nothing at test time). Words in "text" columns are separated by spaces. Unknown values are represented by a question mark. The .data file contains the training examples, the .dev file contains the development set, and the .test file contains the test set.

word1 word2 word3, word4, word5, 0.1, 0.2, 0.3, 0.4, garbage1, garbage2, class1.
word1 word3, ?, word6, 0.6, 0.2, ?, 0.4, garbage3, garbage4, class2.
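
As an illustration of the rules above, the following sketch turns one example line into a dictionary of features plus a label. The function name `parse_example` and the shortened column list are hypothetical. The code assumes that values never contain a comma, and it does not handle multiple labels on one line.

```python
def parse_example(line, columns):
    """Parse one example line given (name, type) pairs from the .names file."""
    line = line.strip()
    if line.endswith("."):
        line = line[:-1]  # drop the period that ends the line
    fields = [f.strip() for f in line.split(",")]
    features = {}
    for (name, col_type), raw in zip(columns, fields):
        if col_type == "ignore":
            continue  # ignored columns are skipped entirely
        if raw == "?":
            features[name] = None  # unknown value
        elif col_type == "continuous":
            features[name] = float(raw)
        else:
            features[name] = raw.split()  # text: words separated by spaces
    # Anything after the declared columns is taken as the true class.
    extra = fields[len(columns):]
    label = extra[0] if extra and extra[0] else None
    return features, label


columns = [("column1", "text"), ("column2", "text"),
           ("column3", "continuous"), ("column4", "ignore")]
features, label = parse_example("word1 word3, ?, 0.6, garbage3, class2.", columns)
print(features, label)
```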

Model file: <stem>.shyp

The model file contains the definitions of the weak classifiers along with their contributions to the final score. See the articles and the source code for more details. Below are two examples, a text classifier and a continuous classifier. The contribution to each class is a space-separated list of floating point values, in the same order as the class definition in the .names file.

   weight Text:SGRAM:column_name:token_value

contribution to each class if absent or unknown

contribution to each class if present


   weight Text:THRESHOLD:column_name:

contribution to each class if unknown

contribution to each class if below threshold

contribution to each class if above threshold

threshold_value
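
To make the two layouts concrete, here is a rough sketch of how the matching contribution vector could be selected for one example. It is not taken from the icsiboost source. The function names and the numbers are made up for illustration, the handling of a value exactly equal to the threshold is an assumption, and the sketch deliberately leaves out how the weight and the per-class contributions are combined into the final score. The text case only covers a single-word token, not longer n-grams.

```python
def text_contribution(tokens, token_value, if_absent_or_unknown, if_present):
    """Select the per-class contributions of a text classifier."""
    # tokens is None when the column value is unknown ("?").
    if tokens is None or token_value not in tokens:
        return if_absent_or_unknown
    return if_present


def threshold_contribution(value, threshold, if_unknown, if_below, if_above):
    """Select the per-class contributions of a continuous classifier."""
    if value is None:
        return if_unknown
    # Assumption: a value equal to the threshold counts as "above".
    return if_below if value < threshold else if_above


# Made-up contribution vectors for a two-class problem.
present = text_contribution(["word1", "word3"], "word3", [0.1, -0.1], [-0.4, 0.4])
unknown = threshold_contribution(None, 0.5, [0.0, 0.0], [0.2, -0.2], [-0.3, 0.3])
below = threshold_contribution(0.2, 0.5, [0.0, 0.0], [0.2, -0.2], [-0.3, 0.3])
print(present, unknown, below)
```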

The model can be packed. This means that identical classifiers are averaged and reweighted in order to reduce the number of classification steps at test time. This is BoosTexter's default behavior, but you should not pack your model if you want to analyze or use the individual training steps (an unpacked model preserves the order of the classifiers). The number on the first line of a .shyp file is the number of weak learners (i.e. iterations) to load at test time. It is not necessarily the actual number of weak learners in the file (it can be affected by the --optimal-iterations option), and it can be overridden with the -n option.

Classification mode, short output

The predictions for an example are output on one line. The line contains two groups of values, each with as many values as there are classes in the .names file and in the same order as in the .names file. The first group holds one binary flag per class, giving the reference activation of each class (if available). The second group holds the actual predictions. A prediction above zero means that the class should be output. In multi-label problems, several classes can be predicted at the same time.

0 1 -0.018425116509 0.018425116509
0 1 -0.004071426535 0.004071426535
1 0 -0.001862755005 0.001862755005
1 0 0.014314053147 -0.014314053147
0 1 -0.037851334231 0.037851334231
0 1 -0.015030095760 0.015030095760
0 1 -0.011415782398 0.011415782398
1 0 0.004168640079 -0.004168640079
0 1 -0.014499579535 0.014499579535
0 1 -0.007088537326 0.007088537326
1 0 0.001905840098 -0.001905840098
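
A line of short output can be split back into its two groups as sketched below. The function name `parse_short_output` is hypothetical. The code assumes that the reference flags are present, so that the line holds exactly twice as many values as there are classes.

```python
def parse_short_output(line):
    """Split a short-output line into reference flags, scores and decisions."""
    values = line.split()
    n = len(values) // 2  # number of classes
    reference = [int(v) for v in values[:n]]
    scores = [float(v) for v in values[n:]]
    # A prediction above zero means that the class should be output.
    predicted = [1 if s > 0 else 0 for s in scores]
    return reference, scores, predicted


# Third line of the listing above: the reference is the first class,
# but only the second class has a score above zero.
reference, scores, predicted = parse_short_output("1 0 -0.001862755005 0.001862755005")
print(reference, predicted, predicted == reference)
```

Comparing `predicted` with `reference` over all lines gives a simple error count for a test set.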

Classification mode, long output

All the features of the example are output along with their names. They are followed by the correct labels (correct label = ...) and then by the scores and decisions, one class per line. * marks a reference label and > marks a decision. When the decision is right, the line starts with *>.

age: 25
workclass: Private
fnlwgt: 226802
education: 11th
education-num: 7
marital-status: Never-married
occupation: Machine-op-inspct
relationship: Own-child
race: Black
sex: Male
capital-gain: 0
capital-loss: 0
hours-per-week: 40
native-country: United-States
correct label = <=50K
   -0.018425 : >50K
*>  0.018425 : <=50K


age: 38
workclass: Private
fnlwgt: 89814
education: HS-grad
education-num: 9
marital-status: Married-civ-spouse
occupation: Farming-fishing
relationship: Husband
race: White
sex: Male
capital-gain: 0
capital-loss: 0
hours-per-week: 50
native-country: United-States
correct label = <=50K
   -0.004071 : >50K
*>  0.004071 : <=50K
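
The score lines need some care when parsed, because a class name such as >50K itself starts with the decision marker. The sketch below reads the markers only from the part before the " : " separator. The function name `parse_score_line` is hypothetical, and the exact spacing of the output is inferred from the two listings above rather than from a specification.

```python
def parse_score_line(line):
    """Parse a long-output score line such as '*>  0.018425 : <=50K'."""
    # Split once on the separator so a '>' in the class name is left alone.
    left, _, label = line.partition(" : ")
    tokens = left.split()
    score = float(tokens[-1])        # the score is the last token on the left
    markers = "".join(tokens[:-1])   # '', '*', '>' or '*>'
    return {
        "label": label.strip(),
        "score": score,
        "is_reference": "*" in markers,
        "is_decision": ">" in markers,
    }


right = parse_score_line("*>  0.018425 : <=50K")
other = parse_score_line("   -0.018425 : >50K")
print(right)
print(other)
```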
