Code license: Apache License 2.0
Latest version: RF-ACE v1.0.4 Mar 25th 2012 ( download )

RF-ACE is an efficient implementation of a robust machine learning algorithm for uncovering multivariate associations, building predictors, and predicting novel data, either with classification or regression tree ensembles, from large and diverse data sets. RF-ACE natively handles numerical and categorical data with missing values, and in feature selection potentially large quantities of noninformative features are handled gracefully utilizing artificial contrast features, bootstrapping, and p-value estimation. RF-ACE implements both Random Forest (RF) and Gradient Boosting Tree (GBT) algorithms, and is strongly related to ACE, originally outlined in http://jmlr.csail.mit.edu/papers/volume10/tuv09a/tuv09a.pdf.

RF-ACE highlights:
- Data can be provided in various formats
- Estimates default model parameters based on dimensions of input data
- Extensive support for customization
- Importance score is normalized and is thus comparable across parallel RF-ACE runs having different target features
- Useful in construction of "all-vs-all" association maps
- Importance score is further translated to a p-value based on empirical background model and t-test
- Implements GBT and RF for prediction
- Intuitive interface
- rf-ace-filter performs feature selection
- rf-ace-build-predictor builds predictor based on training data
- rf-ace-predict makes predictions with novel data
- ... and more to come!
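
The highlights above mention that importance scores are translated to p-values using an empirical background model and a t-test. As a rough illustration only, the sketch below computes Welch's t statistic between the importance scores of one real feature (collected over repeated runs) and the scores of the contrast features. The function names are hypothetical, the choice of Welch's variant is an assumption, and the final conversion of the statistic into a p-value (which needs the Student t distribution) is left out. This is not the RF-ACE implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sample mean of a non-empty vector.
double mean(const std::vector<double>& x) {
  assert(!x.empty());
  double s = 0.0;
  for (double v : x) s += v;
  return s / static_cast<double>(x.size());
}

// Unbiased sample variance; needs at least two values.
double variance(const std::vector<double>& x) {
  assert(x.size() >= 2);
  const double m = mean(x);
  double ss = 0.0;
  for (double v : x) ss += (v - m) * (v - m);
  return ss / static_cast<double>(x.size() - 1);
}

// Welch's t statistic (hypothetical helper): positive values mean the
// feature scores are on average larger than the contrast scores.
// The result is not finite if both samples have zero variance.
double welchT(const std::vector<double>& featureScores,
              const std::vector<double>& contrastScores) {
  const double se2 =
      variance(featureScores) / static_cast<double>(featureScores.size()) +
      variance(contrastScores) / static_cast<double>(contrastScores.size());
  return (mean(featureScores) - mean(contrastScores)) / std::sqrt(se2);
}
```

A large positive statistic would suggest that the feature is more important than the artificial contrasts, which is the idea behind the background model.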
The algorithm is implemented in C++ and has been tested under 32-bit and 64-bit Linux and Windows environments. Windows binaries as well as the sources are located at the download page. Makefiles for Linux, which also work in Cygwin on Windows, and for Visual Studio (make_win32.bat and make_win64.bat) are also provided.

Join the Google Groups mailing list to receive updates by e-mail.

Case study: Large-scale data exploration in The Cancer Genome Atlas

In a joint effort with Tampere University of Technology and Institute for Systems Biology, associations uncovered within The Cancer Genome Atlas using RF-ACE can be viewed at Regulome Explorer, an interactive web application developed for exploring associations across molecular features spanning the human genome. With the help of Techila and Golem, the CPU-intensive but embarrassingly parallel computation was distributed across a collection of ~1000 CPUs, cutting down computation time from years to days.

Disclaimer

The algorithm is currently under development. Although many of the intended features are working as expected, existence of bugs is guaranteed. If you encounter bugs or unintended behavior, please report them to timo.p.erkkila@tut.fi. For more information, please contact codefor@systemsbiology.org.

Related projects

In case you need a simple tool for merging multiple data tables into one big data set for feature selection, you should check out the new project "mergedata": http://mergedata.googlecode.com

Release history

RF-ACE v1.0.4 release -- March 25th 2012:
- Building either RF ( -R / --RF ) or GBT ( -G / --GBT ) predictor now possible
- Example: bin/rf-ace-build-predictor --RF ...
- Default is RF
- Optimized numerical splitter for numerical targets (~30% boost)
- Revised background sampling for statistical testing
- Complete re-implementation of the t-test
- Previous version was making some false assumptions
- Takes tied samples into account when testing for splitting
- Predictor builder prints out OOB error and TOTAL error
- Good for assessing how well the predictor generalizes to new data
- In practice
- With better-than-random predictors: OOB error < TOTAL error
- With random predictors: OOB error ~ TOTAL error
- With overfitting predictors: OOB error > TOTAL error
- Changes in user interface parameters
- NEW: seed ( -S / --seed ) for the random number generator (Mersenne Twister)
- By default seed == system clock + elapsed CPU cycles
- RF and GBT parameters share the same names
- mTry is specified as positive integer
- Usage examples are printed when help is invoked
- When all features are pruned, the log is updated and the program exits normally
- Lots of re-factoring of code
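
The notes above mention a user-settable seed for the Mersenne Twister generator and an out-of-bag (OOB) error estimate. The sketch below shows, under stated assumptions, how a fixed seed makes bootstrap sampling reproducible and how the out-of-bag samples of one tree can be identified. The struct and function names are hypothetical, std::mt19937 stands in for whatever generator RF-ACE uses internally, and the TOTAL error is not modeled because the notes do not define it.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Indices drawn for one tree (hypothetical structure).
struct BootstrapSample {
  std::vector<std::size_t> inBag;     // drawn with replacement
  std::vector<std::size_t> outOfBag;  // never drawn; usable for OOB error
};

// Draw nSamples indices with replacement from [0, nSamples).
BootstrapSample drawBootstrap(std::size_t nSamples, std::mt19937& rng) {
  assert(nSamples > 0);
  std::uniform_int_distribution<std::size_t> pick(0, nSamples - 1);
  std::vector<bool> seen(nSamples, false);
  BootstrapSample s;
  s.inBag.reserve(nSamples);
  for (std::size_t i = 0; i < nSamples; ++i) {
    const std::size_t idx = pick(rng);
    s.inBag.push_back(idx);
    seen[idx] = true;
  }
  // Every index that was never drawn is out-of-bag for this tree.
  for (std::size_t i = 0; i < nSamples; ++i) {
    if (!seen[i]) s.outOfBag.push_back(i);
  }
  return s;
}
```

Constructing two generators with the same seed, e.g. std::mt19937 rng(42), yields identical samples, which is what makes a run repeatable when the seed is given explicitly.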

RF-ACE v1.0 release -- February 14th 2012:
- Optimizations and default parameter tweaks: over 10x speed-up!
- Blacklists/whitelists are now working with index list inputs
- Three new programs:
- rf-ace-filter
- rf-ace-build-predictor
- rf-ace-predict
- Updated forest file (.sf) format
- Generalizes better to various feature naming etc. conventions
- Create and save predictor (.sf) with “rf-ace-build-predictor”
- Load forest predictor (.sf) with “rf-ace-predict” and make predictions with novel data
- Replaced the exhaustive search for a binary split with a categorical splitter by a greedy search
- Computational complexity is now linear as a function of the cardinality of the splitter!
- Better factoring of code with new namespaces
- Fixed a bug that resulted in incorrect interpretation of mathematical expressions by Visual Studio compiler
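
The v1.0 notes state that the exhaustive categorical split search was replaced by a greedy one with linear complexity in the number of categories. The notes do not describe the rule used, so the following is only one plausible single-pass scheme, assuming a numerical target: every category starts on the right, and each category is moved to the left only if that increases a variance-reduction fitness. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-category sum and count of the (numerical) target values.
struct CategoryStats {
  double sum;
  std::size_t n;
};

// Fitness to maximize: sumL^2/nL + sumR^2/nR, which is equivalent to
// minimizing the total squared error of the two child nodes.
double fitness(double sumL, std::size_t nL, double sumR, std::size_t nR) {
  double f = 0.0;
  if (nL > 0) f += sumL * sumL / static_cast<double>(nL);
  if (nR > 0) f += sumR * sumR / static_cast<double>(nR);
  return f;
}

// One pass over the categories: O(cardinality) fitness evaluations.
// Returns, per category, whether it ends up in the left child.
std::vector<bool> greedySplit(const std::vector<CategoryStats>& cats) {
  assert(!cats.empty());
  double sumL = 0.0, sumR = 0.0;
  std::size_t nL = 0, nR = 0;
  for (const CategoryStats& c : cats) {
    sumR += c.sum;
    nR += c.n;
  }
  std::vector<bool> goesLeft(cats.size(), false);
  double best = fitness(sumL, nL, sumR, nR);
  for (std::size_t k = 0; k < cats.size(); ++k) {
    const CategoryStats& c = cats[k];
    const double trial = fitness(sumL + c.sum, nL + c.n, sumR - c.sum, nR - c.n);
    if (trial > best) {  // keep the move only if the split improves
      sumL += c.sum; nL += c.n;
      sumR -= c.sum; nR -= c.n;
      goesLeft[k] = true;
      best = trial;
    }
  }
  return goesLeft;
}
```

Unlike an exhaustive search over all 2^(k-1) partitions, a greedy pass like this is not guaranteed to find the best split, which is the usual trade-off for the speed-up.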

RF-ACE v0.9.9 release -- February 2nd 2012:
- Binaries for x86 and x64 Windows XP available
- Updated forest output writer ( -F / --forest )
- The format for the forest is almost stable; once fully stable, I will implement a forest reader, meaning one can then save the built model for, say, prediction
- New whitelist/blacklist functionality with which to fine-tune predictor space
- -W / --whitelist and -B / --blacklist
- Fixed a seg. fault bug in the ARFF reader
- This was already fixed in the intermediate version 0.9.8b
- Updated log outputs ( -L / --log )
- Refactored namespaces and classes (namespace math, class statistics::RF_statistics)
- This is still far from finished
- Updated node counter
- At least works faster, but may still contain bugs (nothing dramatic, though)
- By default, features with fewer than 5 shared samples with the target will be removed
- Tune this manually with -X / --prune_features
- Fixed a small bug when percolating samples through the trees
- This affected calculation of importance scores in the presence of missing values
- Expect to see more associations in the outputs!
- Simplifications in t-test implementation, but no functional change
- Fixed one implicit type cast, which was causing seg. fault in 64bit Windows
- Extended manual page to include an explicit example of the AFM data format
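
The default pruning rule in the v0.9.9 notes (features with fewer than 5 samples shared with the target are removed) can be sketched as follows. This is an illustration, not RF-ACE code: missing values are assumed to be encoded as NaN, a "shared sample" is assumed to mean a sample where both the feature and the target are present, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Assumed encoding of a missing value in this sketch.
const double kMissing = std::numeric_limits<double>::quiet_NaN();

// Number of samples where both the feature and the target are present.
std::size_t sharedSamples(const std::vector<double>& feature,
                          const std::vector<double>& target) {
  assert(feature.size() == target.size());
  std::size_t count = 0;
  for (std::size_t i = 0; i < feature.size(); ++i) {
    if (!std::isnan(feature[i]) && !std::isnan(target[i])) ++count;
  }
  return count;
}

// Keep the feature only if it shares at least minShared samples with
// the target; 5 mirrors the default mentioned in the release notes.
bool keepFeature(const std::vector<double>& feature,
                 const std::vector<double>& target,
                 std::size_t minShared = 5) {
  return sharedSamples(feature, target) >= minShared;
}
```

The minShared argument plays the role of the threshold that the notes say can be tuned with -X / --prune_features.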

RF-ACE v0.9.8 release -- January 10th 2012:
- Possibility to predict novel measurements ( -T / --testdata )
- Possibility to turn feature selection with RFs off ( -N / --noFilter )
- Reduced redundancy in class interfaces
- Refactored main program
- Eliminated a bug while parsing ARFF data
- default p-value threshold changed from 0.1 to 0.05
- GBT forest print-outs ( -F / --forest ) updated to include the name of the target feature
- The log file ( -L / --log ) now includes:
- RF-ACE version
- mean and std importance score for real features
- mean and std importance score for contrast features
- mean number of nodes per tree
- nodes created per second

RF-ACE v0.9.7 release -- December 29th 2011:
- Several small updates, aiming to make prediction with novel data more straightforward, have been made.
- Data prediction with novel data has been turned OFF until all planned updates are in-place
- By applying a no-filter flag ( -N / --noFilter ), feature filtering with RFs can be turned OFF (GBT will be used with all available features)
- By providing a forest output file ( -F / --forest ) the GBT forest will be written to the file
- By providing a log output file ( -L / --log ) the log will be written to the file; log is currently empty
- It is now much easier to specify which outputs the user wants (associations, GBT predictor, predictions, log, etc.)

RF-ACE v0.9.6 release -- December 18th 2011:
- The fraction of sampled candidate contrast features for splitting is now tuned from 50% down to just 1%. This will guarantee significantly better trees being grown, while keeping the fraction of contrasts in the trees large enough so that the null distribution can reliably be constructed. You will notice that for significant associations the p-values are now much closer to zero, indicating increased ability to separate true signal from noise.
- Data and header delimiters can now be changed ( '\t' and ':', respectively, are the defaults )
- Started implementing a logging feature, with which quality of the analysis can easily be assessed.
- The default number of candidate features for splitting, mTry, is now 10% of the number of features, i.e. significantly more than it used to be ( sqrt(nFeatures) ). This also guarantees better trees, however:
- The algorithm now runs slower due to increased CPU load. I will concentrate on cutting down the computing time once I get the planned updates finished.
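
To make the change in the default mTry concrete, the sketch below compares the old rule, sqrt(nFeatures), with the new one, 10% of the number of features. The rounding (truncation toward zero, with a minimum of 1) is an assumption, since the notes do not say how fractions are handled, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// New default described in the v0.9.6 notes: 10% of the features.
// Truncation and the lower bound of 1 are assumptions of this sketch.
std::size_t mTryTenPercent(std::size_t nFeatures) {
  assert(nFeatures > 0);
  return std::max<std::size_t>(1, nFeatures / 10);
}

// Previous default: the square root of the number of features.
std::size_t mTrySqrt(std::size_t nFeatures) {
  assert(nFeatures > 0);
  const double root = std::floor(std::sqrt(static_cast<double>(nFeatures)));
  return std::max<std::size_t>(1, static_cast<std::size_t>(root));
}
```

For a data set with 10000 features both rules are easy to compare by hand: the square-root rule gives 100 candidates per split, while the 10% rule gives 1000, which is consistent with the increased CPU load mentioned above.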

RF-ACE v0.9.5 release -- November 14th 2011:
- Killed a bug that was severely degrading the prediction accuracy of the algorithm: the split decision was 100% biased towards the "left" leaf when the data point was NA. The fix now assigns the sample randomly to left or right according to the fraction of training samples in left and right, respectively
- consequently, killing that bug also improved the accuracy of identifying associations
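
The fix described above, sending a sample with a missing split value left or right at random in proportion to the training samples in each branch, can be sketched like this. It is a minimal illustration with hypothetical names; how RF-ACE stores the branch counts and which random generator it uses at this point are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

enum class Branch { Left, Right };

// Route a sample whose split feature is NA: go left with probability
// nLeftTrain / (nLeftTrain + nRightTrain), otherwise go right.
Branch routeMissing(std::size_t nLeftTrain, std::size_t nRightTrain,
                    std::mt19937& rng) {
  assert(nLeftTrain + nRightTrain > 0);
  const double pLeft = static_cast<double>(nLeftTrain) /
                       static_cast<double>(nLeftTrain + nRightTrain);
  std::bernoulli_distribution goLeft(pLeft);
  return goLeft(rng) ? Branch::Left : Branch::Right;
}
```

With 30 training samples on the left and 70 on the right, roughly 30% of the NA samples end up on the left, instead of 100% as in the buggy behavior.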

RF-ACE v0.9.3 release -- November 8th 2011:
The algorithm has reached a level where I'm fairly confident in saying it's nearly bug-free and contains almost all the features planned to be implemented. The latest update features some essential bug fixes:
- node splitting with numerical features sometimes resulted in under-indexing a vector
- categorical splitter was calculating the split fitness partly wrong
Both of these bugs, which are now eliminated, were severely decreasing the quality of the trees they occurred in, but as RF-ACE is a tree ensemble learning algorithm consisting of thousands of trees, the average performance wasn't affected much. One larger, still missing feature is support for multiple splitters (main splitter + surrogates) per tree junction, which is supposed to yield better performance with highly sparse data. Support for surrogates will be added in the near future.

RF-ACE v0.9.1 release -- November 6th 2011:
- Splitting with features is now implemented as it is in the original formulation
- if the user wants, (faster) split approximation can be turned ON
- Gray Code implementation for efficient split testing with categorical features
- Updated default parameters for the Random Forest
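
The v0.9.1 notes mention a Gray Code implementation for efficient split testing with categorical features. The general idea, sketched below with hypothetical names, is that consecutive Gray codes differ in exactly one bit, so moving from one candidate subset of categories to the next moves a single category between the branches and the branch statistics can be updated in constant time. The fitness function (for a numerical target) and the trick of pinning the last category to the right branch to skip mirror-image splits are assumptions of this sketch, not details taken from RF-ACE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-category sum and count of the (numerical) target values.
struct CategoryStats {
  double sum;
  std::size_t n;
};

// Exhaustive search over binary partitions in Gray code order.
// Returns a bitmask: bit k set means category k goes to the left branch.
std::uint32_t bestSplitGray(const std::vector<CategoryStats>& cats) {
  const std::size_t k = cats.size();
  assert(k >= 2 && k <= 31);
  double sumL = 0.0, sumR = 0.0;
  std::size_t nL = 0, nR = 0;
  for (const CategoryStats& c : cats) { sumR += c.sum; nR += c.n; }

  std::uint32_t mask = 0, bestMask = 0;
  double best = -1.0;  // the fitness below is never negative
  const std::uint32_t nSubsets = 1u << (k - 1);  // last category stays right
  for (std::uint32_t i = 1; i < nSubsets; ++i) {
    // The bit that flips between Gray codes i-1 and i is the lowest set bit of i.
    std::size_t bit = 0;
    while (((i >> bit) & 1u) == 0) ++bit;
    const CategoryStats& c = cats[bit];
    mask ^= (1u << bit);
    if (mask & (1u << bit)) {  // category moved right -> left
      sumL += c.sum; nL += c.n; sumR -= c.sum; nR -= c.n;
    } else {                   // category moved left -> right
      sumL -= c.sum; nL -= c.n; sumR += c.sum; nR += c.n;
    }
    if (nL == 0 || nR == 0) continue;  // skip degenerate splits
    const double f = sumL * sumL / static_cast<double>(nL) +
                     sumR * sumR / static_cast<double>(nR);
    if (f > best) { best = f; bestMask = mask; }
  }
  return bestMask;
}
```

The search is still exponential in the number of categories, so it is practical only for low-cardinality features; the Gray code ordering only removes the cost of recomputing the branch sums from scratch for every subset.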

RF-ACE v0.8.5 release -- October 5th 2011:
- Now RF-ACE runs exactly once, for a uniquely specified target
- Default test has been changed back to the t-test
- Default number of permutations is 20
- Improved handling of NaN's and negative importance scores in statistical testing
- Modified print-outs
- Lots of small tweaks

RF-ACE v0.8.0 release -- September 14th 2011:
- Restructured the logic of the main program
- Data prediction works better now
- Improved print-outs
- Updated help

RF-ACE v0.7.5 release -- August 29th 2011:
- GBT is now functional in data prediction, so yes:
- RF-ACE predicts with new data
- Unit testing is introduced, making development more organized
- Tons of small updates and bug fixes
- Started working on support for feature masks (for exclusion of features from analysis)

RF-ACE v0.5.5 release -- July 5th 2011:
Much has changed since the last version:
- Node class is now dynamic, making tree construction smoother and more memory-efficient
- GBT is now part of the main program of RF-ACE, albeit not fully functional yet
- As target one can now specify a string that will be grepped with feature headers
- if multiple feature headers match the string, multiple RF-ACE calls are made, and results concatenated to the specified output file
- Fixed an annoying bug that was preventing contrasts from ever entering the trees
- Lots of small tweaks
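
The target matching described in the v0.5.5 notes (a target string is matched against the feature headers, and every matching header becomes a target) can be illustrated as below. Plain substring matching is an assumption here, since the notes only say the string is "grepped", and the function name and the example header names in the usage note are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return the indices of all headers that contain the pattern.
// Each returned index would correspond to one separate run.
std::vector<std::size_t> matchTargets(const std::vector<std::string>& headers,
                                      const std::string& pattern) {
  std::vector<std::size_t> matches;
  for (std::size_t i = 0; i < headers.size(); ++i) {
    if (headers[i].find(pattern) != std::string::npos) {
      matches.push_back(i);
    }
  }
  return matches;
}
```

For example, with the made-up headers "geneA_expr", "geneB_expr" and "age", the pattern "expr" would select the first two features as targets, and the results of the two runs would be concatenated into the output file.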

RF-ACE v0.4.0 release -- July 1st 2011:
The next stable release of RF-ACE, version 0.4.0, is ready. Although functionally very similar to v0.3.5, most of the internal components have been revised and simplified, and naming conventions unified. There will be a few more such revisions, this time concentrating on making tree generation more dynamic and shifting splitting functions under the Node implementation.

June 29th 2011: a major revision of the internal structure of RF-ACE is now completed, and the code has been committed to the trunk. Also, a makefile for the Visual Studio command line compiler (cl) is provided. Some further updates will be made before the next stable release is announced. Stay tuned.

RF-ACE v0.3.5 release -- June 24th 2011:
Version 0.3.5 is out! Lots of small tweaks since the last version:
- fixed a bug that allowed the target feature to enter the list of predictors
- the target itself didn't exist anywhere in the trees
- sufficient nodesize is now estimated directly from the data
- the algorithm now adapts to larger sample sizes by tuning nodesize up
- as nodes grow bigger, trees become smaller and leave room for more permutations
- increased default permutation size from 20 to 50
- increases statistical power
- finished implementing ARFF support, it should be working now
- reformatted print-outs

RF-ACE v0.3.0 release -- June 21st 2011:
RF-ACE has now reached version 0.3.0 (check the source package and Win32 binary here). It can be considered a stable release of the algorithm. RF-ACE v0.3.0 does the following:
- accepts AFM (Annotated Feature Matrix) files as inputs
- identifies the type of the input automatically
- identifies the orientation of the input, if AFM, automatically
- handles numerical and categorical features
- handles missing values
- estimates ntrees and mtry based on data dimensions and the number of missing values
- constructs multiple Random Forests
- an optimized implementation of the original RF (less sorting involved)
- uncovers statistically significant associations using Mann-Whitney U-test
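
Version 0.3.0 used the Mann-Whitney U-test to call associations significant. As a reminder of what that test measures, the sketch below computes the U statistic for two samples, for instance the importance scores of a real feature and those of the contrast features. The function name is hypothetical, the quadratic-time pair counting is chosen for clarity rather than speed, and the conversion of U into a p-value is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mann-Whitney U statistic of x relative to y: the number of pairs
// (x_i, y_j) with x_i > y_j, counting ties as one half.
double mannWhitneyU(const std::vector<double>& x, const std::vector<double>& y) {
  assert(!x.empty() && !y.empty());
  double u = 0.0;
  for (double xi : x) {
    for (double yj : y) {
      if (xi > yj) {
        u += 1.0;
      } else if (xi == yj) {
        u += 0.5;
      }
    }
  }
  return u;
}
```

U ranges from 0 to x.size() * y.size(); values near the upper end indicate that the first sample tends to be larger than the second.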