PhiloMine2: Functions and Parameters
This is taken from PhiloMine1 and needs to be updated and expanded. We will do this in the near future. For KNN classification we need to discuss:
o Counts
o Weighted Counts (inverse rank of nearness)
o Dot Product (sum of matching vectors)
o DP Squared
o DP Cubed

As noted in the Introduction, we currently have two flavors of PhiloMine. The main version, called PhiloMine, is primarily for comparisons of two sets of documents in a PhiloLogic database and can be used with any existing PhiloLogic database installation. The second variant, named Encyclomine, performs classification, prediction, and vector space searching on "chunks" within one or more documents and cannot simply be dropped into an existing PhiloLogic database. We are planning to merge these two flavors in a later release, in conjunction with an update release of PhiloLogic.

We have tested all PhiloMine functions on a number of large full-text databases, including the Frantext database (2,500 documents, 150 million words) and numerous smaller datasets. Encyclomine is currently being used for the Encyclopedie of Diderot and d'Alembert, a large collection of 19th century newspapers, and an experimental version of Frantext (built as div level chunks with additional metadata). Please consult our Demonstration and Samples for a discussion of sample results and applications.

What follows is the current list of functions and options found in both PhiloMine and the more experimental Encyclomine variant.

Functions
o Differential Relative Rates (DRR): a simple but useful report that generates tables showing words that are over-represented in document group one (c1) compared to document group two (c2). Rates are calculated as frequencies per 10,000 words, and words are selected by relative ratios of use (the thresholds are currently hard-wired).
o Information Gain (Weka): displays, as a list, the user-selected number of top features as measured by Information Gain. The features are displayed without reference to the two groups of documents; the list shows only those features that are most effective at distinguishing the two vectors. This is a statistically more coherent test than DRR above. Note that Information Gain is often used by other classification techniques (decision trees) to build branches.
o Classifier: Multinomial Naive Bayesian (MNB)
o Classifier: Weka Naive Bayes
o Classifier: Decision Tree (DT)
  + Generate Graphic Tree
o Classifier: SVMLight Multiclass:
  + SVMLight One-vs-All (OVA)
  + SVMLight One-vs-One (OVO)
o Classifier: SMO (Weka)
o Predict: Multinomial Naive Bayesian (MNB). Train on c1, predict on c2.
o Vector Space (frequency/normalized)

Options
o C: This tuning parameter is described by the SVMLight documentation as "trade-off between training error and margin". If you set it on the form, it is passed through to the command sent to SVMLight. Values need to differ by orders of magnitude to substantially affect performance. This parameter applies to both binary and multiclass runs with SVMLight.
o Multiclass Field: The name of the field that contains the class information, for example, 'genre'.
o Multiclass Values: All values for which you would like to train models in the multiclass SVMLight run. Enter the values separated by spaces, for example, 'poetry mystery news'.
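To illustrate how the C, Multiclass Field, and Multiclass Values options might fit together, here is a minimal Python sketch that builds one SVMLight training command per class value for a one-vs-all run. The `svm_learn` program and its `-c` flag come from SVMLight itself; the function name `build_ova_commands` and the training and model file names are hypothetical, and this is not the code PhiloMine actually runs.

```python
def build_ova_commands(values, c=None, train_prefix="train", model_prefix="model"):
    """Return one svm_learn argument list per class value (one-vs-all).

    values: space-separated class values, as entered in Multiclass Values.
    c: optional value for SVMLight's -c option; omitted when None so that
       SVMLight falls back to its own default.
    File names are hypothetical placeholders for illustration only.
    """
    commands = []
    for value in values.split():
        cmd = ["svm_learn"]
        if c is not None:
            cmd += ["-c", str(c)]
        # svm_learn takes the training file first, then the model file to write.
        cmd += [f"{train_prefix}_{value}.dat", f"{model_prefix}_{value}.dat"]
        commands.append(cmd)
    return commands


# Print the commands for the example values used above.
for cmd in build_ova_commands("poetry mystery news", c=0.01):
    print(" ".join(cmd))
```

Because C only has a substantial effect across orders of magnitude, a reasonable way to explore it is to repeat the run with values such as 0.01, 0.1, 1, and 10 rather than with closely spaced values.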
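The Differential Relative Rates report listed under Functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PhiloMine's implementation: it pools each document group before computing rates per 10,000 words, and the names `min_ratio` and `floor` are hypothetical stand-ins for the hard-wired selection thresholds, whose actual values are not given here.

```python
from collections import Counter


def rates_per_10k(documents):
    """Pool tokenized documents; return each word's frequency per 10,000 words."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: 10000 * n / total for word, n in counts.items()}


def differential_relative_rates(group1, group2, min_ratio=2.0, floor=0.5):
    """List words over-represented in group1 (c1) relative to group2 (c2).

    min_ratio and floor are assumptions for this sketch. floor replaces very
    small or zero c2 rates to avoid division by zero.
    """
    r1 = rates_per_10k(group1)
    r2 = rates_per_10k(group2)
    rows = []
    for word, rate1 in r1.items():
        rate2 = r2.get(word, 0.0)
        ratio = rate1 / max(rate2, floor)
        if ratio >= min_ratio:
            rows.append((word, rate1, rate2, ratio))
    # Most strongly over-represented words first.
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


c1 = [["liberty", "liberty", "the", "law"], ["the", "liberty"]]
c2 = [["the", "the", "grace", "law"], ["the", "grace"]]
for word, rate1, rate2, ratio in differential_relative_rates(c1, c2):
    print(word, rate1, rate2, ratio)
```

Swapping the two arguments gives the complementary table of words over-represented in c2 relative to c1.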