Quickstart
A simple guide to using MLSharp.
Introduction
This is a very rough guide. It shows you how to download MLSharp, run a classification task, and set classifier options.
Downloading MLSharp
Download the latest release of MLSharp from the homepage. The package includes a command-line tool for running classification tasks. Extract the zip to a directory of your choosing. You are now ready to use MLSharp!

Running a classification task
The command-line tool allows you to load a dataset, apply transformations to the data, run classifiers, and generate reports. Few classifiers are built in at present, but if you're comfortable writing a little code, it's easy to add new ones by wrapping Weka classifiers.

The zip file you extracted also contains a sample data file consisting of patient attributes and whether or not the corresponding patient has diabetes. Let's use MLSharp to run Nested 10-fold Cross Validation with a couple of classifiers: Random Forests and Support Vector Machines.

NOTE: Typically we would use multiple instances of the same type of classifier in this type of test, with each instance configured differently.

To run the experiment, open a command prompt (cmd.exe at the Windows Run dialog), navigate to the directory where you extracted MLSharp, and run the following command:

MLSharp.ConsoleRunner.exe -HarnessType NestedCrossValidation -input "diabetes.arff" -parallel -output "CsvResultWriter path:results.csv" -target class -classifier "LibSvmClassifierFactory" -classifier "RandomForestClassifierFactory"

Let's look at each part of this command:
As the application runs, you will see text scrolling in different colors. This is log4net output and can be used to determine the current state of execution. The same output is also logged to a file named Output.txt. The log4net configuration can be customized in the .config file. For more information, see log4net.

When the application finishes, you should have two new output files: Output.txt and Results.csv. Output.txt is simply the log4net output. Results.csv contains the classification results, with one row written per instance in the input dataset. The first column, "ID", is the ID for the instance (if any). The second column is the instance's actual value for the class attribute. The third column is the machine learner's prediction for this attribute. For classifiers that produce confidence values, the fourth column contains the unnormalized confidence score. Finally, for classifiers that are capable of "explaining" their output, the fifth column contains an explanation of the classification, such as the rule that was matched.

Setting classifier options
Most classifiers have multiple settings that can be adjusted to control the classifier induction process. Let's look at another example of nested cross validation, this time using a few different Random Forest classifiers, each with different settings:

MLSharp.ConsoleRunner.exe -HarnessType NestedCrossValidation -input "diabetes.arff" -parallel -output "CsvResultWriter path:results.csv" -target class -classifier "RandomForestClassifierFactory NumTrees:5" -classifier "RandomForestClassifierFactory NumTrees:10" -classifier "RandomForestClassifierFactory NumTrees:20 MaxTreeDepth:2"

Note that we are now specifying three RandomForestClassifierFactory instances. Each instance uses a different value for the "NumTrees" parameter, and the third also sets "MaxTreeDepth".
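When several instances of the same classifier are compared, typing each -classifier argument by hand becomes error-prone. The following Python sketch assembles the same command from a list of settings. It is only an illustration: the helper names `classifier_spec` and `build_command` are hypothetical and not part of MLSharp, and the flags and values are copied from the example command above.

```python
import subprocess

def classifier_spec(factory, **settings):
    """Return 'FactoryName Setting:Value ...' as one string (hypothetical helper)."""
    parts = [factory] + [f"{name}:{value}" for name, value in settings.items()]
    return " ".join(parts)

def build_command(specs, exe="MLSharp.ConsoleRunner.exe"):
    """Assemble the argument list used in the example above (hypothetical helper)."""
    args = [
        exe,
        "-HarnessType", "NestedCrossValidation",
        "-input", "diabetes.arff",
        "-parallel",
        "-output", "CsvResultWriter path:results.csv",
        "-target", "class",
    ]
    for spec in specs:
        # Each spec is a single argument, so the factory name and its
        # settings end up inside the same pair of quotation marks.
        args += ["-classifier", spec]
    return args

specs = [
    classifier_spec("RandomForestClassifierFactory", NumTrees=5),
    classifier_spec("RandomForestClassifierFactory", NumTrees=10),
    classifier_spec("RandomForestClassifierFactory", NumTrees=20, MaxTreeDepth=2),
]
command = build_command(specs)

# list2cmdline applies Windows quoting rules: arguments containing a space
# are wrapped in double quotes.
print(subprocess.list2cmdline(command))

# To launch the run from the directory where MLSharp was extracted:
# subprocess.run(command, check=True)
```

Passing the arguments as a list lets Python handle the quoting, so a spec with several settings still reaches the tool as one argument.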
To change a classifier setting in general, assign a value to the setting like so:

SettingName:NewValue

Note that you must put classifier factory settings within the quotation marks that include the classifier factory name. This is a limitation of the current command-line argument parsing that will be fixed in a future release.
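Once a run has finished, Results.csv can be checked with a short script. The Python sketch below counts how often the prediction matches the actual class value. It relies on several assumptions not confirmed by this guide: that the columns appear in the order described earlier (ID, actual value, prediction), that the file starts with a header row, and that the file holds predictions from a single classifier. The function name `summarize_results`, the `has_header` parameter, and the sample rows are all made up for illustration.

```python
import csv
import io

def summarize_results(csv_text, has_header=True):
    """Return (correct, total) for rows laid out as ID, actual, predicted, ..."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        rows = rows[1:]  # assumption: the first row holds column names
    correct = 0
    total = 0
    for row in rows:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        actual = row[1].strip()
        predicted = row[2].strip()
        total += 1
        if actual == predicted:
            correct += 1
    return correct, total

# Made-up sample rows, not real MLSharp output.
sample = """ID,Actual,Predicted,Confidence
1,yes,yes,0.9
2,no,yes,0.6
3,no,no,0.8
"""

correct, total = summarize_results(sample)
print(f"{correct}/{total} correct")  # prints "2/3 correct"

# For a real file, read it first:
# with open("results.csv", newline="") as f:
#     correct, total = summarize_results(f.read())
```

If the real file has no header row, pass `has_header=False` so the first instance is not skipped.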