PMML  
Overview of PMML
Featured, Phase-Design
Updated Jun 3, 2011 by collinbe...@gmail.com

For a walkthrough of the examples that ship with Augustus, see the Augustus Model Primer.

The Predictive Model Markup Language (PMML) is a vendor-driven XML markup language for specifying statistical and data mining models. In other words, it is an XML language that lets analytic models be expressed in a platform- and application-independent fashion.

Without PMML, it is often the case that:

  • Models are deployed in proprietary formats
  • Models are application dependent
  • Models are system dependent
  • Models are architecture dependent
  • Models take a long time to deploy

PMML aims for portability and safe deployments.

PMML's approach to developing and deploying analytical applications is based upon a few key concepts:

  • View analytic models as first-class objects. With PMML, statistical and data mining models can be thought of as first-class objects described using XML. Applications or services can be thought of as producing PMML or consuming PMML. A PMML file contains enough information for an application to process and score a data stream with a statistical or data mining model using only the information in that file.
  • Provide an interface between model producers and model consumers. Broadly speaking, most analytic applications consist of a learning phase that creates a (PMML) model and a scoring phase that employs the (PMML) model to score a data stream or batch of records. The learning phase usually consists of the following sub-stages: exploratory data analysis, data preparation, event shaping, data modeling, and model validation. The scoring phase is typically simpler: a stream or batch of data is scored using a model. PMML is designed so that different systems and applications can be used for producing models (PMML Producers) and for consuming models (PMML Consumers).
  • View data as event-based. Many analytic applications can be naturally thought of as event-based. Event-based data presents itself as a stream of events that are transformed, integrated, or aggregated to produce the state vectors that are inputs to statistical or data mining models. The current version of PMML provides implicit support for event-based processing of data; future versions are expected to provide explicit support.
  • Support data preparation. Data preparation is often the most time-consuming part of the data mining process. PMML provides explicit support for many common data transformations and aggregations used when preparing data. Once encapsulated in this way, data preparation can more easily be reused and leveraged by different components and applications.
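The producer/consumer split described above can be illustrated with a minimal sketch of a consumer that scores records using only what the PMML file contains. This is not Augustus's API: the PMML document below is a hypothetical, heavily simplified regression model (the XML namespace and Header are omitted, and the field names and coefficients are invented for illustration), and `load_scorer` is a helper defined here, not a library function.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML document as a producer might emit it.
# Assumption: namespace and Header are omitted to keep the parsing short.
PMML_TEXT = """
<PMML version="4.0">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age" usageType="active"/>
      <MiningField name="income" usageType="active"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="age" coefficient="0.5"/>
      <NumericPredictor name="income" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_scorer(pmml_text):
    """Build a scoring function from the regression table in the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("NumericPredictor")
    }

    def score(record):
        # Linear regression: intercept plus coefficient * value for each input.
        return intercept + sum(
            coefficient * record[name]
            for name, coefficient in coefficients.items()
        )

    return score


# The consumer knows nothing about how the model was trained.
score = load_scorer(PMML_TEXT)
records = [{"age": 40.0, "income": 50.0}, {"age": 20.0, "income": 0.0}]
scores = [score(record) for record in records]
print(scores)  # [42.5, 20.0]
```

The consumer reads the intercept and coefficients from the file rather than hard-coding them, so a producer can replace the model without any change to the scoring code.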

PMML consists of the following components:

  1. Data Dictionary. The data dictionary defines the fields which are the inputs to models and specifies the type and value range for each field.
  2. Mining Schema. Each model contains one mining schema which lists the fields used in the model. These fields are a subset of the fields in the Data Dictionary. The mining schema contains information that is specific to a certain model, while the data dictionary contains data definitions which do not vary with the model. For example, the Mining Schema specifies the usage type of an attribute, which may be active (an input of the model), predicted (an output of the model), or supplementary (holding descriptive information and ignored by the model).
  3. Transformation Dictionary. The Transformation Dictionary defines derived fields. Derived fields may be defined by normalization, which maps continuous or discrete values to numbers; by discretization, which maps continuous values to discrete values; by value mapping, which maps discrete values to discrete values; or by aggregation, which summarizes or collects groups of values, for example by computing averages.
  4. Model Statistics. The Model Statistics component contains basic univariate statistics about the model, such as the minimum, maximum, mean, standard deviation, median, etc. of numerical attributes.
  5. Model Parameters. PMML also specifies the actual parameters defining the statistical and data mining models themselves. Models in PMML include regression models, clustering models, trees, neural networks, Bayesian models, association rules, and sequence models.
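As a rough illustration of how the five components listed above sit together in one document, the sketch below embeds a small, hypothetical PMML tree model and summarizes its parts with Python's standard XML parser. The element names follow the PMML schema, but the field names, statistics, and tree are invented, and the namespace and Header are omitted for brevity; `summarize` is a helper defined here.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document (assumption: namespace and Header omitted).
PMML_TEXT = """
<PMML version="4.0">
  <DataDictionary numberOfFields="3">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="customer_id" optype="categorical" dataType="string"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="age_band" optype="categorical" dataType="string">
      <Discretize field="age">
        <DiscretizeBin binValue="young">
          <Interval closure="openOpen" rightMargin="30"/>
        </DiscretizeBin>
        <DiscretizeBin binValue="older">
          <Interval closure="closedOpen" leftMargin="30"/>
        </DiscretizeBin>
      </Discretize>
    </DerivedField>
  </TransformationDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="age" usageType="active"/>
      <MiningField name="customer_id" usageType="supplementary"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <ModelStats>
      <UnivariateStats field="age">
        <NumericInfo minimum="18" maximum="90" mean="41.5"/>
      </UnivariateStats>
    </ModelStats>
    <Node score="low">
      <True/>
      <Node score="high">
        <SimplePredicate field="age" operator="lessThan" value="30"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
"""


def summarize(pmml_text):
    """Collect the field names declared in each PMML component."""
    root = ET.fromstring(pmml_text)
    model = root.find("TreeModel")

    # Group mining fields by usage type (active, predicted, supplementary).
    usage = {}
    for field in model.findall("./MiningSchema/MiningField"):
        usage.setdefault(field.get("usageType"), []).append(field.get("name"))

    return {
        "data_fields": [
            f.get("name") for f in root.findall("./DataDictionary/DataField")
        ],
        "derived_fields": [
            f.get("name")
            for f in root.findall("./TransformationDictionary/DerivedField")
        ],
        "usage": usage,
        "stats_fields": [
            s.get("field") for s in model.findall("./ModelStats/UnivariateStats")
        ],
    }


summary = summarize(PMML_TEXT)
for component, names in summary.items():
    print(component, names)
```

Every name in the Mining Schema also appears in the Data Dictionary, which matches the subset relationship described in item 2.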

The diagram below shows how the input fields of PMML models can be defined.

Data attributes are defined using the PMML Data Dictionary. The data attributes used in a model are listed in the PMML Mining Schema. In addition, derived attributes that serve as inputs to a model can be defined using the PMML Transformation Dictionary or PMML local transformations.
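To make the idea of derived attributes more concrete, here is a plain-Python sketch of the four kinds of derivation PMML describes: normalization, discretization, value mapping, and aggregation. These functions are illustrative stand-ins written for this page, not PMML's own evaluation rules or Augustus code. In particular, `normalize` assumes a simple two-point linear map, and all names and sample data are hypothetical.

```python
from collections import defaultdict


def normalize(value, orig_low, orig_high):
    """Normalization: map [orig_low, orig_high] linearly onto [0, 1]."""
    return (value - orig_low) / (orig_high - orig_low)


def discretize(value, bins):
    """Discretization: map a continuous value to a bin label.

    bins is a list of (lower_inclusive, upper_exclusive, label);
    None means the bin is unbounded on that side.
    """
    for low, high, label in bins:
        if (low is None or value >= low) and (high is None or value < high):
            return label
    return None  # no bin matched


def map_value(value, table, default=None):
    """Value mapping: map one discrete value to another."""
    return table.get(value, default)


def aggregate_average(events, group_key, value_key):
    """Aggregation: average value_key over events sharing group_key."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for event in events:
        entry = totals[event[group_key]]
        entry[0] += event[value_key]
        entry[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Hypothetical sample data.
AGE_BINS = [(None, 30, "young"), (30, None, "older")]
EVENTS = [
    {"customer": "a", "amount": 10.0},
    {"customer": "a", "amount": 30.0},
    {"customer": "b", "amount": 5.0},
]

print(normalize(45.0, 18.0, 90.0))                      # 0.375
print(discretize(25, AGE_BINS))                         # young
print(map_value("M", {"M": "male", "F": "female"}))     # male
print(aggregate_average(EVENTS, "customer", "amount"))  # {'a': 20.0, 'b': 5.0}
```

The aggregation step is also where a stream of events becomes the per-entity state vector that a model takes as input.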

References:

The Predictive Model Markup Language (PMML), http://www.dmg.org
