All things must change to something new, to something strange;
Henry Wadsworth Longfellow, "Kéramos"

An important problem in data mining is detecting changes in large data sets. Although a variety of change detection algorithms have been developed, in practice they can be hard to scale to large data sets because the data are heterogeneous.

This is an introduction and demonstration of using open source software and the Data Mining Group's Predictive Model Markup Language (PMML) standard to perform data analytics.

Specifically, we show how multiple Baseline models over segments can be used to detect anomalous behavior.

Baseline models are used for change detection.

Change involves comparison:

The most common case is to use, or derive parameters from, a Baseline Data sample. The question is: how well does the baseline sample represent 'Normal' (unchanged) behavior?
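As a rough illustration of deriving parameters from a baseline sample, the sketch below fits one Gaussian baseline (mean and standard deviation) per segment and scores new values with a z-score. This is an assumption-laden toy, not Augustus code: the segment names, the sample values, and the helper names `fit_baselines` and `z_score` are all hypothetical, and a z-score is only one of several possible test statistics.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_baselines(records):
    """Fit a (mean, std) baseline for each segment.

    `records` is an iterable of (segment, value) pairs taken from the
    baseline sample. Each segment needs at least two values.
    """
    by_segment = defaultdict(list)
    for segment, value in records:
        by_segment[segment].append(value)
    return {seg: (mean(vals), stdev(vals)) for seg, vals in by_segment.items()}


def z_score(baselines, segment, value):
    """Score a new value against the baseline of its own segment."""
    mu, sigma = baselines[segment]
    if sigma == 0:
        raise ValueError(f"segment {segment!r} has no variation in its baseline")
    return (value - mu) / sigma


# Hypothetical baseline sample: (segment, observed value).
baseline_sample = [
    ("weekday", 100.0), ("weekday", 110.0), ("weekday", 90.0),
    ("weekend", 20.0), ("weekend", 30.0), ("weekend", 10.0),
]
baselines = fit_baselines(baseline_sample)

# The same observation is judged differently in each segment.
weekday_z = z_score(baselines, "weekday", 60.0)  # -4.0: far below weekday norm
weekend_z = z_score(baselines, "weekend", 60.0)  # 4.0: far above weekend norm
```

In this toy data a single model pooled over both segments would have a mean of 60, so the value 60.0 would look perfectly normal; only the per-segment baselines flag it.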

Strategies

The common approach is to:

It is important to understand that models evolve.

Open Data often uses Augustus.

Augustus (Augustus Project Page) is an open source scoring engine for statistical and data mining models based on the Data Mining Group's PMML.

PMML allows models to be separate from code: modelers using Augustus are free to focus on the data, the problem domain, and evaluating models rather than embedding statistical code into software. It is straightforward to develop a model on one system using one application and deploy the model on another system using another application.
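To make the separation of model and code concrete, here is a minimal sketch in which the baseline parameters live in an XML document and generic Python code reads them. The fragment is modeled on PMML's BaselineModel element but is deliberately simplified (no enclosing PMML element, namespace, or MiningSchema), so it should be read as an assumption about the general shape rather than a schema-valid model. The function `load_gaussian_baseline` is hypothetical and is not part of Augustus.

```python
import xml.etree.ElementTree as ET

# Simplified, PMML-like model description (assumed shape, not schema-valid).
PMML_FRAGMENT = """
<BaselineModel modelName="weekday_volume" functionName="regression">
  <TestDistributions field="volume" testStatistic="zValue">
    <Baseline>
      <GaussianDistribution mean="100.0" variance="100.0"/>
    </Baseline>
  </TestDistributions>
</BaselineModel>
"""


def load_gaussian_baseline(pmml_text):
    """Read the field name and Gaussian parameters from the fragment."""
    root = ET.fromstring(pmml_text.strip())
    test = root.find("TestDistributions")
    gauss = test.find("Baseline/GaussianDistribution")
    return {
        "field": test.get("field"),
        "statistic": test.get("testStatistic"),
        "mean": float(gauss.get("mean")),
        "std": float(gauss.get("variance")) ** 0.5,
    }


model = load_gaussian_baseline(PMML_FRAGMENT)

# The scoring code never hard-codes the parameters; it only reads the model.
score = (60.0 - model["mean"]) / model["std"]
```

Changing the mean or variance in the XML changes the scores without touching the Python, which is the point of keeping the model in a separate document.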

Find out more about our use of Augustus and PMML here.

Case Studies and Examples

Open Data uses segments, multiple models and baseline models because we have learned from experience that most large systems are too complex to be described by a single analytic model. On this site, we give examples showing how segments and multiple models can be used with

What to do with your scores?

While raw scores and alerts are useful, people generally want to be intelligently informed about unanticipated results rather than handed a large Excel file. If an anomalous event requires human action, you may want email alerts or information about the event rather than just the score. This is handled by post-processing.
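A post-processing step of this kind might look like the following sketch, which filters scored events by a threshold and builds (but does not send) an email message summarizing them. The event fields, the threshold of 3.0, the addresses, and the function `summarize_alerts` are assumptions made for illustration; they do not describe the output format of Augustus.

```python
from email.message import EmailMessage


def summarize_alerts(scored_events, threshold=3.0):
    """Return a plain-text summary of events with |score| >= threshold.

    Returns None when nothing needs attention, so no email is produced.
    """
    flagged = [e for e in scored_events if abs(e["score"]) >= threshold]
    if not flagged:
        return None
    # Most extreme events first.
    flagged.sort(key=lambda e: abs(e["score"]), reverse=True)
    lines = [f"{len(flagged)} of {len(scored_events)} events need review:"]
    for e in flagged:
        lines.append(
            f"- segment={e['segment']} value={e['value']} score={e['score']:+.1f}"
        )
    return "\n".join(lines)


# Hypothetical scored events, e.g. parsed from a scoring engine's output.
scored_events = [
    {"segment": "weekday", "value": 60.0, "score": -4.0},
    {"segment": "weekday", "value": 105.0, "score": 0.5},
    {"segment": "weekend", "value": 60.0, "score": 4.0},
]

summary = summarize_alerts(scored_events)
message = None
if summary is not None:
    message = EmailMessage()
    message["Subject"] = "Anomalous events detected"
    message["From"] = "alerts@example.com"   # placeholder address
    message["To"] = "analyst@example.com"    # placeholder address
    message.set_content(summary)
    # Sending is left out; smtplib.SMTP(...).send_message(message) is one option.
```

The recipient sees only the events that crossed the threshold, with enough context (segment and value) to act on them, instead of the full table of scores.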

Augustus is written in Python and it is easy to run shell scripts, R, Python, etc on the scores to

Open Data has worked with clients on the usability of scores, and we use R and Python code to generate a dashboard for each scored model.