My favorites | Sign in
Project Logo
                
Search
for
Updated Jun 16, 2008 by MarkyMaypo57
FunctionsAndParameters  
PhiloMine2: Functions and Parameters

This is taken from PhiloMine1 and needs to be updated and expanded. We will do this in the near future.

For KNN classification we need to discuss

Counts Weighted Counts (inverse rank of nearness) Dot Product (sum of matching vectors) DP Squared DP Cubed

As noted in the Introduction, we currently have two flavors of PhiloMine, the main version is primarily for comparisons of two sets of documents in a PhiloLogic database. Called PhiloMine, this can be used with any existing PhiloLogic database installation. The second variant is named Encyclomine. This performs classification, prediction, and vector space searching on "chunks" within one or more documents and cannot be simply dropped into an existing PhiloLogic database. We are planning to merge these two flavors in a later release in conjunction with an update release of PhiloLogic.

We have tested all PhiloMine functions on a number of large full-text databases, including 2,500 (150 million words) Frantext database and numerous smaller datasets. Encyclomine is currently being used for the Encyclopedie of Diderot and d'Alembert, a large collection of 19th century newspapers, and an experimental version of Frantext (built as div level chunks with additional metadata). Please consult our Demonstration and Samples for a discussion of sample results and applications.

What follows is the current list of functions and options found in both PhiloMine and the more experimental Encyclomine variant. Functions

  • Descriptive
  • o Differential Relative Rates (DRR): a simple but useful report that generates tables showing words that are over-represented in document group one (c1) compared document group two (c2). These are calculated by frequency per 10000 in each document and selected by (now hard wired) relative ratios of use. o Information Gain (Weka): displays the user selected top features as measured by Information Gain as a list. These are displayed without reference to the two groups of documents, only those features which are most effective at distinguishing the two vectors. A statistically more coherent test than DRR above. Note that this is often used by other classification techniques (decision trees) to build branches.
  • Comparative
  • o Classifier: Multinominal Naive Bayesian (MNB) o Classifier: Weka Naive Bayes o Classifier: Decision Tree (DT) Generate Graphic Tree o Classifier: SVMLight
    Multiclass:
    + SVMLight One-vs-All (OVA) + SVMLight One-vs-One (OVO)
    o Classifier: SMO (Weka)
  • Predictive
  • o Predict: Multinominal Naive Bayesian (MNB). Train on c1 predict on c2.
  • Cluster/Similarity
  • o Vector Space (frequency/normalized)

Options

  • Balance Instances: randomly selects documents from the larger vector to create a balanced comparison dataset. We often use this several times to accumulate results comparing numerous documents with a desired characteristic (e.g. male authors) against smaller subsets (e.g. female authors) in large collections. Ideally, multiple runs will give a good sense of classification accuracy.
  • Limit Features (per number of documents): eliminates features that occur in more than a user selected number of documents and in few than a user selected number of documents. This is loosely based on Chen, Lee, and Hwang's discussion of "Conformity and Uniformity" in "A Hierarchical Neural Network Document Classifier with Linguistic Feature Selection", in Applied Intelligence, 23, 277-294, 2005.
  • Limit Features (per document frequency): an alternative feature limitation which filters out user selected high and low frequency words by document. Probably not as effective as per document feature limitation.
  • Random Falsification (shuffle instances): for comparative classifications, this shuffles the instance vectors in order to deliberately confuse the classifier. Given the large number of features and relatively limited number of instances, this is a good way to see if the classification accuracy is based on the data or your "greedy" classifiers. For example, a Weka SMO run comparing male/female authors gets 85.4% cross validated accuracy. When shuffled, same run gets 51.9%, cross validated. Given the power of many systems to find patterns, we believe this is an important control, along with N-fold cross validation, to make sure that classification criteria are having the impact that is detected. See A. Ruiz and P.E. L�pez-de-Teruel, "Random Falsifiability and Support Vector Machines".
  • Exclude Features: filter out particular features. This is to remove specific features from the classification task that may privilege classification. In the Encyclopedie tasks, strings that denote class of knowledge and authors should be removed, such as "archit architecture architect. Use reg-ex as normal with PhiloLogic searches. One should inspect the list of features generated by most of the classifiers to see things that should be removed. Also note that some of the classifiers will also list features that occur in one group and not another.
  • Use Features: base classification only on these selected features. We have not used this much to date, but it is an interesting experiment to see just how predictive a particular set of features may be. Further experimentation will be required.
  • Stem Words: currently implemented for Vector Space searching. This uses S�bastien Darribere-Pleyt's (perl mod) implementation of the Porter Stemmer for French. This will probably be extended as an option to all classification tasks. Behaves oddly for many pre-20th century forms and will not readily allow for link backs to the words in PhiloLogic. The English version could also be added. So far, we have not found huge classification differences using stemmed vs. unstemmed features and there is some internal discussion of whether form/tense is a useful classification criteria.
  • SVMLight
  • o C:This tuning parameter is described by the SVMLight documentation as "trade-off between training error and margin". If you set it on the form it will propagate to the command sent to SVMLight. Values should be orders of magnitude different to substantially affect performance. This parameter applies to binary and multiclass runs with SVMLight. o Multiclass Field: The name of the field that contains the class information, for example, 'genre'. o Multiclass Values: All values for which you would like to train models for the multiclass SVMLight run. Enter each value separated by spaces. For example, 'poetry mystery news'.

Sign in to add a comment
Hosted by Google Code