Quickstart  
A simple guide to using MLSharp.
Updated Feb 7, 2010 by mbhoneyc...@gmail.com

Introduction

This is a very rough guide. It shows you how to:

  1. Download and set up MLSharp.
  2. Perform a sample classification task.
  3. Adjust classifier and output parameters.
  4. Apply filters to transform your data.

Downloading MLSharp

Download the latest release of MLSharp from the homepage. The package includes a command-line tool for running classification tasks. Extract the zip to a directory of your choosing. You are now ready to use MLSharp!

Running a classification task

The command-line tool allows you to load a dataset, apply transformations to the data, run classifiers, and generate reports. Few classifiers are built in so far, but if you're comfortable writing a little code, it's easy to add new ones by wrapping Weka classifiers.

The zip file you extracted also contains a sample data file, diabetes.arff, consisting of patient attributes and whether or not each patient has diabetes. Let's use MLSharp to run nested 10-fold cross-validation with a couple of classifiers: Random Forests and Support Vector Machines. NOTE: Typically we would use multiple instances of the same type of classifier in this type of test, with each instance configured differently.

To run the experiment, open a command prompt (type cmd.exe in the Windows Run dialog), navigate to the directory where you extracted MLSharp, and run the following command:

MLSharp.ConsoleRunner.exe -HarnessType NestedCrossValidation -input "diabetes.arff" -parallel -output "CsvResultWriter path:results.csv" -target class -classifier "LibSvmClassifierFactory" -classifier "RandomForestClassifierFactory"

Let's look at each part of this command:

MLSharp.ConsoleRunner.exe: The name of the MLSharp command-line experiment runner.
-HarnessType NestedCrossValidation: The type of experiment harness to use. Other valid types are "Simple" and "CrossValidation" (NOTE: In the current release, "Simple" is not functioning correctly).
-input "diabetes.arff": The name of the input file to process. This can be an ARFF file or an Excel spreadsheet with columns for attributes and rows for instances.
-parallel: Tells the harness to parallelize work where possible, which speeds up long-running experiments on multi-core machines. You can edit the MLSharp.ConsoleRunner.exe.config file to control how many threads will be used.
-output "CsvResultWriter path:results.csv": Classification results will be written to a CSV file named "results.csv".
-target class: The name of the attribute to attempt to classify.
-classifier "LibSvmClassifierFactory": Use a LibSVM classifier. (Note that MLSharp makes a distinction between a classifier that performs classifications and the factory that is responsible for training a classifier. When specifying a classifier, use the name of the factory.)
-classifier "RandomForestClassifierFactory": Use a random forest classifier.
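
To make the structure of the command concrete, here is a minimal Python sketch that assembles the same arguments as a list and renders them as a Windows command line. The option names and values are copied from the command above; the helper name build_command is hypothetical and not part of MLSharp, and the sketch only prints the command rather than running it.

```python
import subprocess


def build_command(input_file, target, classifiers,
                  harness="NestedCrossValidation",
                  output="CsvResultWriter path:results.csv"):
    """Return the argument list for the console runner (hypothetical helper)."""
    args = [
        "MLSharp.ConsoleRunner.exe",
        "-HarnessType", harness,
        "-input", input_file,
        "-parallel",
        "-output", output,
        "-target", target,
    ]
    # One -classifier option per classifier factory.
    for factory in classifiers:
        args += ["-classifier", factory]
    return args


command = build_command(
    "diabetes.arff", "class",
    ["LibSvmClassifierFactory", "RandomForestClassifierFactory"],
)
# list2cmdline applies Windows quoting rules: only arguments that contain
# whitespace (here, the -output value) are wrapped in double quotes.
print(subprocess.list2cmdline(command))
```

Keeping each option value as a single list element is what guarantees that a value containing a space, such as the -output value, reaches the runner as one argument. To actually launch the experiment from the MLSharp directory, the same list could be passed to subprocess.run(command).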

As the application is running, you will see text scrolling in different colors. This is log4net output and can be used to determine the current state of execution. This output is also logged to a file named Output.txt. Log4net configuration can be customized in the .config file. For more information, see the log4net documentation.

When the application finishes, you should have two new output files: Output.txt and results.csv. Output.txt is simply the log4net output. results.csv contains the classification results, with one row written per instance in the input dataset. The first column, "ID", is the ID for the instance (if any). The second column is the instance's actual value for the class attribute. The third column is the machine learner's prediction for this attribute. For classifiers that produce confidence values, the fourth column contains the unnormalized confidence score. Finally, for classifiers that are capable of "explaining" their output, the fifth column contains an explanation of the classification, such as the rule that was matched.
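
As a rough illustration of how the results file could be consumed, the following Python sketch counts how often the prediction matches the actual class, reading columns by position as described above. It assumes the file starts with a header row and that the column order is ID, actual value, prediction; the function name summarize_results is hypothetical, and the sample rows are made up for illustration, not real MLSharp output.

```python
import csv
import io


def summarize_results(csv_file):
    """Count correct predictions, assuming columns: ID, actual, predicted, ..."""
    reader = csv.reader(csv_file)
    next(reader, None)  # ASSUMPTION: the first row is a header row
    total = 0
    correct = 0
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        actual = row[1].strip()
        predicted = row[2].strip()
        total += 1
        if actual == predicted:
            correct += 1
    return correct, total


# Made-up rows for illustration only; these are not real MLSharp output.
sample = io.StringIO(
    "ID,Actual,Predicted,Confidence\n"
    "1,positive,positive,0.9\n"
    "2,negative,positive,0.6\n"
    "3,negative,negative,0.8\n"
)
correct, total = summarize_results(sample)
print(f"{correct} of {total} predictions correct")
```

To apply this to a real run, open the file with open("results.csv", newline="") and pass the file object to summarize_results. If a run with several classifiers writes more than one prediction per instance, the counts would need to be grouped accordingly.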

Setting classifier options

Most classifiers contain multiple settings that can be adjusted to control the classifier induction process. Let's look at another example of nested cross validation, this time using a few different Random Forest classifiers, each with different settings:

MLSharp.ConsoleRunner.exe -HarnessType NestedCrossValidation -input "diabetes.arff" -parallel -output "CsvResultWriter path:results.csv" -target class -classifier "RandomForestClassifierFactory NumTrees:5" -classifier "RandomForestClassifierFactory NumTrees:10" -classifier "RandomForestClassifierFactory NumTrees:20 MaxTreeDepth:2" 

Note that we are now specifying three RandomForestClassifierFactory instances. We are changing the "NumTrees" parameter for each instance, and the third instance also sets "MaxTreeDepth". To change a classifier setting in general, assign a value to the setting like so:

SettingName:NewValue

Note that you must put classifier factory settings within the quotation marks that include the classifier factory name. This is a limitation of the current command-line argument parsing that will be fixed in a future release.
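
The quoting rule can be sketched in a few lines of Python that join a factory name and its SettingName:NewValue pairs into a single -classifier value. The helper name classifier_spec is hypothetical; the factory and setting names are the ones used in the command above.

```python
def classifier_spec(factory, **settings):
    """Join a factory name and its settings into one -classifier value."""
    parts = [factory]
    # Each setting becomes SettingName:NewValue, separated by spaces.
    parts += [f"{name}:{value}" for name, value in settings.items()]
    return " ".join(parts)


specs = [
    classifier_spec("RandomForestClassifierFactory", NumTrees=5),
    classifier_spec("RandomForestClassifierFactory", NumTrees=10),
    classifier_spec("RandomForestClassifierFactory", NumTrees=20, MaxTreeDepth=2),
]
for spec in specs:
    # The factory name and all of its settings share one pair of quotes.
    print(f'-classifier "{spec}"')
```

Generating the values this way makes it straightforward to sweep a setting, for example by building one spec per NumTrees value in a loop.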

