GitHub - IlyaLab/rf-ace: (backup fork since google code is going down)

#summary Manual pages.

*The manual pages have been written on the basis of RF-ACE version 0.5.5*

= Description =

RF-ACE is an efficient C++ implementation of a robust machine learning algorithm for uncovering multivariate associations from large and diverse data sets. RF-ACE natively handles numerical and categorical data with missing values, and potentially large quantities of noninformative features are handled gracefully utilizing artificial contrast features, bootstrapping, and p-value estimation.

= Installation =

Download the latest stable release from the [http://code.google.com/p/rf-ace/downloads/list download page], or check out the latest development version (to directory rf-ace/) by typing
{{{
svn checkout http://rf-ace.googlecode.com/svn/trunk/ rf-ace
}}}

Compiler makefiles for Linux (`Makefile`) and Visual Studio for Windows (`make.bat`) are provided in the package. In Linux, you can compile the program by typing 
{{{
make
}}}
or
{{{
make rf_ace
}}}

On Windows with Visual Studio, first open the Visual Studio terminal and execute `make.bat` by typing
{{{
make
}}}
Simple as that! If you feel lucky, check for compiled binaries at the [http://code.google.com/p/rf-ace/downloads/list download page]. 

= Supported data formats =
RF-ACE currently supports two file formats, Annotated Feature Matrix (AFM) and Attribute-Relation File Format (ARFF).

== Annotated Feature Matrix (AFM) ==

An Annotated Feature Matrix represents the data as a tab-delimited table, where both columns and rows contain headers describing the samples and features. Based on the headers, the AFM reader is able to discern the orientation of the matrix (features as rows or as columns). Namely, AFM feature headers must encode whether the feature is (`N`)umerical, (`C`)ategorical, (`O`)rdinal, or (`B`)inary, followed by a colon and the actual name of the feature, as follows:

 * `B:is_alive`
 * `N:age`
 * `C:tumor_grade`
 * `O:anatomic_organ_subdivision`

In fact any string, even including colons, spaces, and other special characters, encodes a valid feature name as long as it starts with the preamble `N:`/`C:`/`O:`/`B:`. Thus, the following is a valid feature header:

 * `N:GEXP:TP53:chr17:123:456`

Sample headers are not constrained, except that they must not start with the preambles `N:`/`C:`/`O:`/`B:`, which are reserved for the feature headers.
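
The header rules above can be sketched in a few lines of Python. This is an illustration only, not the RF-ACE reader itself: the function names `parse_feature_header` and `features_as_rows` are hypothetical, and the orientation check (all headers along one axis carry a preamble, those along the other do not) is an assumption about how the orientation could be discerned.

```python
# Hypothetical helpers illustrating the AFM header convention described above.
FEATURE_TYPES = {"N": "numerical", "C": "categorical", "O": "ordinal", "B": "binary"}


def parse_feature_header(header):
    """Return (type, name) if the header starts with N:/C:/O:/B:, else None."""
    if len(header) >= 2 and header[0] in FEATURE_TYPES and header[1] == ":":
        # Everything after the preamble is the name, further colons included.
        return FEATURE_TYPES[header[0]], header[2:]
    return None


def features_as_rows(row_headers, column_headers):
    """Guess the orientation (assumed rule): exactly one axis holds feature headers."""
    rows_ok = all(parse_feature_header(h) is not None for h in row_headers)
    cols_ok = all(parse_feature_header(h) is not None for h in column_headers)
    if rows_ok == cols_ok:
        raise ValueError("cannot determine orientation from headers")
    return rows_ok


print(parse_feature_header("N:GEXP:TP53:chr17:123:456"))
print(features_as_rows(["N:age", "B:is_alive"], ["sample_1", "sample_2"]))
```

Splitting only on the first colon is what lets a header such as `N:GEXP:TP53:chr17:123:456` keep its full name.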

== Attribute-Relation File Format (ARFF) ==

[http://www.cs.waikato.ac.nz/~ml/weka/arff.html ARFF specification].      

= Usage =
The following examples follow Linux syntax. Type 
{{{
bin/rf_ace --help
}}}
or 
{{{
bin/rf_ace -h
}}}
to bring up help:
{{{
REQUIRED ARGUMENTS:
 -I / --input        input feature file (AFM or ARFF)
 -i / --target       target, specified as integer or string that is to be matched with the content of input
 -O / --output       output association file

OPTIONAL ARGUMENTS:
 -n / --ntrees       number of trees per RF (default nsamples/nrealsamples)
 -m / --mtry         number of randomly drawn features per node split (default sqrt(nfeatures))
 -s / --nodesize     minimum number of train samples per node, affects tree depth (default max{5,nsamples/20})
 -p / --nperms       number of Random Forests (default 50)
 -t / --pthreshold   p-value threshold below which associations are listed (default 0.1)
 -g / --gbt          Enable (1 == YES) Gradient Boosting Trees, a subsequent filtering procedure (default 0 == NO)
}}} 

So all that is required is an input file (`-I/--input`), either of type `.arff` or `.afm`, and a target (`-i/--target`) to build the RF-ACE model upon. The target corresponds to a feature in the input file, and it can be identified either by an index corresponding to its order of appearance in the file, or by its name. Thus, if the target is `N:age` (we would be looking for features associated with age) on row `123` (0-based, omitting the header row), one executes RF-ACE by typing
{{{
bin/rf_ace --input featurematrix.afm --target 123 --output associations.tsv 
}}}
or with the short-hand notation equivalently as
{{{
bin/rf_ace -I featurematrix.afm -i 123 -O associations.tsv 
}}}
or by using the header "N:age" instead of the index by typing
{{{
bin/rf_ace -I featurematrix.afm -i N:age -O associations.tsv
}}}
If a provided (sub)string identifies multiple target candidates, RF-ACE is executed serially for all of them, and the results are concatenated in the specified output file.
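
The two ways of naming a target can be illustrated with a short Python sketch. It is not taken from the RF-ACE sources: `resolve_targets` is a hypothetical name, and treating any all-digit argument as a 0-based index and anything else as a plain substring match is an assumption based on the description above.

```python
def resolve_targets(target, feature_headers):
    """Return the 0-based indices of all features selected by `target`.

    Assumption: an all-digit argument is an index, anything else is matched
    as a substring against every feature header.
    """
    if target.isdigit():
        index = int(target)
        if index >= len(feature_headers):
            raise IndexError("target index %d is out of range" % index)
        return [index]
    matches = [i for i, header in enumerate(feature_headers) if target in header]
    if not matches:
        raise ValueError("no feature header contains %r" % target)
    return matches


headers = ["N:age", "B:is_alive", "N:age_at_diagnosis"]
print(resolve_targets("1", headers))      # a single index
print(resolve_targets("alive", headers))  # a substring with one match
print(resolve_targets("N:age", headers))  # a substring with several matches
```

With several matches, a caller would run the analysis once per returned index and append each result to the same output file.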

The above will execute RF-ACE with the default parameters; as the help documentation points out, most of the parameters are estimated dynamically based on the data dimensions and content, so running RF-ACE with no information about the algorithm itself is possible.
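
As a rough illustration of two of the data-dependent defaults, the formulas printed by `--help` for `--mtry` and `--nodesize` can be written out in Python. The function names are hypothetical and the rounding is an assumption: the help text does not say how the square root or the quotient is rounded, nor whether samples with a missing target value are counted, so this sketch simply rounds down.

```python
import math


def default_mtry(nfeatures):
    """sqrt(nfeatures), rounded down (the rounding is an assumption)."""
    return math.isqrt(nfeatures)


def default_nodesize(nsamples):
    """max{5, nsamples/20}, using integer division (the rounding is an assumption)."""
    return max(5, nsamples // 20)


print(default_mtry(48912))    # 221
print(default_nodesize(60))   # 5
print(default_nodesize(400))  # 20
```

For 48912 features the rounded-down square root is 221, which agrees with the `--mtry` value reported in the example run in the Output section.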

= Output = 
The following call (assuming that the substring `age` identifies exactly one feature, `N:age`)
{{{
bin/rf_ace -I featurematrix.afm -i age -O associations.tsv
}}}
produces the output
{{{


 ---------------------------------------------------------------
| RF-ACE -- efficient feature selection with heterogeneous data |
|                                                               |
|  Version:      RF-ACE v0.5.5, July 4th, 2011                  |
|  Project page: http://code.google.com/p/rf-ace                |
|  Contact:      timo.p.erkkila@tut.fi                          |
|                kari.torkkola@gmail.com                        |
|                                                               |
|              DEVELOPMENT VERSION, BUGS EXIST!                 |
 ---------------------------------------------------------------

Reading file 'featurematrix.afm'
File type is unknown -- defaulting to Annotated Feature Matrix (AFM)
AFM orientation: features as rows

RF-ACE parameter configuration:
  --input      = featurematrix.afm
  --nsamples   = 223 / 282 (20.922% missing)
  --nfeatures  = 48912
  --targetidx  = 123, header 'N:age'
  --ntrees     = 356
  --mtry       = 221
  --nodesize   = 12
  --nperms     = 50
  --pthresold  = 0.1
  --output     = associations.tsv

Growing 50 Random Forests (RFs), please wait...
  RF 1: 4880 nodes (avg. 13.7079 nodes / tree)
  RF 2: 4810 nodes (avg. 13.5112 nodes / tree)
  RF 3: 4856 nodes (avg. 13.6404 nodes / tree)
  RF 4: 4994 nodes (avg. 14.0281 nodes / tree)
  RF 5: 5036 nodes (avg. 14.1461 nodes / tree)
  RF 6: 5016 nodes (avg. 14.0899 nodes / tree)
  RF 7: 5132 nodes (avg. 14.4157 nodes / tree)
...
  RF 47: 4736 nodes (avg. 13.3034 nodes / tree)
  RF 48: 5234 nodes (avg. 14.7022 nodes / tree)
  RF 49: 4582 nodes (avg. 12.8708 nodes / tree)
  RF 50: 5210 nodes (avg. 14.6348 nodes / tree)
50 RFs, 17800 trees, and 247516 nodes generated in 102.91 seconds (2405.17 nodes per second)
Gradient Boosting Trees *DISABLED*

Association file created. Format:
TARGET   PREDICTOR   P-VALUE   IMPORTANCE   CORRELATION

Done.
}}}
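
The association file can then be post-processed with a few lines of Python. The sketch below assumes the file is tab-delimited with the five columns listed above, in that order; `read_associations` is a hypothetical helper, lines that do not parse (for example a header line, if one is written) are skipped, and the sample rows are invented purely for illustration.

```python
import csv

COLUMNS = ["TARGET", "PREDICTOR", "P-VALUE", "IMPORTANCE", "CORRELATION"]
NUMERIC = ["P-VALUE", "IMPORTANCE", "CORRELATION"]


def read_associations(lines, max_p=0.1):
    """Parse tab-delimited association lines and keep those with p <= max_p."""
    records = []
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # blank or malformed line
        record = dict(zip(COLUMNS, fields))
        try:
            for key in NUMERIC:
                record[key] = float(record[key])
        except ValueError:
            continue  # e.g. a header line
        if record["P-VALUE"] <= max_p:
            records.append(record)
    # Strongest associations (smallest p-values) first.
    return sorted(records, key=lambda r: r["P-VALUE"])


# Invented example rows, not real RF-ACE output.
sample = [
    "N:age\tN:x\t0.01\t2.5\t0.4",
    "N:age\tC:y\t0.5\t0.1\t-0.1",
    "N:age\tB:z\t0.001\t3.0\t0.6",
]
for row in read_associations(sample, max_p=0.05):
    print(row["PREDICTOR"], row["P-VALUE"])
```

To read a real file, pass an open file object instead of the list, for example `read_associations(open("associations.tsv"))`.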

If no associations are found, the program ends as follows:
{{{
No significant associations found, quitting...
}}}

= RF-ACE configuration =

Information will be added in the future.