All things must change to something new, to something strange;
Henry Wadsworth Longfellow, "Kéramos"
An important problem in data mining is detecting changes in large data sets. Although a variety of change detection algorithms have been developed, in practice it can be difficult to scale them to large data sets because the data are heterogeneous.
This is an introduction and demonstration of using open source software and the Data Mining Group's Predictive Model Markup Language (PMML) standard to perform data analytics.
Specifically, we show how multiple Baseline models over segments can be used to detect anomalous behavior.
Baseline models are used for change detection.
Change involves comparison:
- From (or to) expected behavior based on analytic properties (Generalized Likelihood Ratio, GLR).
- Measured or empirically derived behavior from data (z-test, discrete distributions).
- A mix of analytic and empirical information (CUSUM, GLR).
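As an illustration of the third case, the following is a minimal sketch of a one-sided (upper) CUSUM statistic in its common textbook form. It is not taken from Augustus; the function name `cusum_upper` and the parameters `target_mean`, `slack`, and `threshold` are assumptions chosen for this example. The target mean would typically come from baseline data (the empirical part), while the slack and threshold are set from the size of shift one expects to detect (the analytic part).

```python
def cusum_upper(values, target_mean, slack, threshold):
    """Return the indices at which a one-sided upper CUSUM raises an alarm.

    The statistic accumulates deviations above target_mean + slack and is
    clipped at zero, so it only grows while values run persistently high.
    All parameter names are hypothetical, chosen for this sketch.
    """
    s = 0.0
    alarms = []
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            alarms.append(i)
            s = 0.0  # restart the accumulation after an alarm
    return alarms


# Three in-control observations followed by a sustained upward shift.
observations = [10, 10, 10, 12, 12, 12, 12]
alarms = cusum_upper(observations, target_mean=10, slack=0.5, threshold=4)
print(alarms)
```

No single shifted observation crosses the threshold on its own; the alarm comes from the accumulated evidence over several consecutive observations, which is what distinguishes CUSUM from a point-by-point test.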
The most common case is to use, or derive, model parameters from a baseline data sample. The question is then: how well does the baseline sample represent 'normal' (unchanged) behavior?
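A minimal sketch of this common case, assuming a Gaussian baseline: the mean and standard deviation are estimated from a baseline sample, and new observations are scored by their z-value against those parameters. The helper names `fit_baseline` and `z_score`, the sample values, and the cut-off of 3 are all assumptions made for illustration, not part of Augustus.

```python
import statistics


def fit_baseline(sample):
    """Estimate (mean, sample standard deviation) from a baseline sample."""
    return statistics.mean(sample), statistics.stdev(sample)


def z_score(value, mean, std):
    """Number of baseline standard deviations between value and the mean."""
    return (value - mean) / std


# Hypothetical baseline sample assumed to represent 'normal' behavior.
baseline = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 9.0]
mean, std = fit_baseline(baseline)

for observation in [10.5, 17.0]:
    z = z_score(observation, mean, std)
    flagged = abs(z) > 3.0  # illustrative cut-off, not a recommendation
    print(f"value={observation} z={z:.2f} flagged={flagged}")
```

The quality of the result depends entirely on the baseline sample: if that sample already contains changed behavior, the estimated parameters absorb it and later scores look unremarkable.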
Strategies
The common approach is to:
- Answer this question in concert with a segmentation strategy
- Begin with a coarse segmentation and identify the most important effects
- Understand the sources of scoring results through analysis
- Feed that understanding back into a refined segmentation and new models.
It is important to understand that models evolve: each round of analysis refines both the segmentation and the models.
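The value of starting with even a coarse segmentation can be sketched as follows. This is an illustrative example only, assuming Gaussian baselines and hypothetical record fields `segment` and `count`; it does not reproduce how Augustus builds or stores segmented models. One baseline is fitted per segment, and the same observation is then scored against its own segment and against a single pooled baseline.

```python
import statistics
from collections import defaultdict


def fit_segment_baselines(records, segment_key, value_key):
    """Fit one (mean, std) baseline per segment; segments need >= 2 values."""
    groups = defaultdict(list)
    for record in records:
        groups[record[segment_key]].append(record[value_key])
    return {
        segment: (statistics.mean(values), statistics.stdev(values))
        for segment, values in groups.items()
        if len(values) >= 2
    }


def score(record, baselines, segment_key, value_key):
    """z-value of a record against its segment baseline, or None if unknown."""
    baseline = baselines.get(record[segment_key])
    if baseline is None or baseline[1] == 0:
        return None
    mean, std = baseline
    return (record[value_key] - mean) / std


# Hypothetical baseline data with two very different segments.
weekday = [100, 110, 90, 105, 95]
weekend = [20, 25, 15, 22, 18]
records = [{"segment": "weekday", "count": c} for c in weekday]
records += [{"segment": "weekend", "count": c} for c in weekend]

baselines = fit_segment_baselines(records, "segment", "count")

# A count of 60 is unusual for either segment ...
segment_z = score({"segment": "weekday", "count": 60}, baselines, "segment", "count")

# ... but sits exactly at the mean of a single pooled baseline.
pooled = weekday + weekend
pooled_z = (60 - statistics.mean(pooled)) / statistics.stdev(pooled)

print(f"segmented z={segment_z:.2f}, pooled z={pooled_z:.2f}")
```

With this data the pooled baseline hides the anomaly completely, while the per-segment baseline makes it obvious. Refining the segmentation further follows the same pattern with a finer segment key.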
Open Data often uses Augustus.
Augustus (Augustus Project Page) is an open source scoring engine for statistical and data mining models based on the Data Mining Group's PMML.
PMML allows models to be separate from code: modelers using Augustus are free to focus on the data, the problem domain, and evaluating models rather than embedding statistical code into software. It is straightforward to develop a model on one system using one application and deploy the model on another system using another application.
Find out more about our use of Augustus and PMML here.
Case Studies and Examples
Open Data uses segments, multiple models and baseline models because we have learned from experience that most large systems are too complex to be described by a single analytic model. On this site, we give examples showing how segments and multiple models can be used with
- Transactional Processing (Payment) Systems
- The Augustus Auto Example
- Coming Soon: Demonstration & Tutorial Running on Amazon's EC2.
What to do with your scores?
While raw scores and alerts are useful, people generally want to be intelligently informed about unanticipated results rather than handed a large Excel file. If an anomalous event requires human action, you may want email alerts or information about the event rather than just the score. This is covered by post-processing.
Augustus is written in Python, and it is easy to run shell scripts, R, Python, etc. on the scores to:
- Notify based on a threshold
- Select certain interesting values from the results
- Re-normalize the scoring results or perform an additional transformation
- Restructure the data for use with other applications
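A minimal post-processing sketch covering three of these steps (thresholding, selecting fields, and restructuring) is shown below. The row layout with `segment`, `value`, and `score` keys is a hypothetical stand-in, not the actual Augustus output format, and the functions `select_alerts`, `format_alert`, and `to_csv` are names invented for this example. Sending the alert text by email or another channel is left out.

```python
import csv
import io


def select_alerts(rows, threshold):
    """Keep only rows whose absolute score reaches the threshold."""
    return [row for row in rows if abs(row["score"]) >= threshold]


def format_alert(row):
    """Build a short human-readable message for one flagged row."""
    return "ALERT segment={segment} value={value} score={score:.2f}".format(**row)


def to_csv(rows, fields):
    """Restructure rows as CSV text, keeping only the selected fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical scoring results; real output would be parsed into this shape.
scored = [
    {"segment": "weekday", "value": 102, "score": 0.25},
    {"segment": "weekend", "value": 60, "score": 10.5},
]

alerts = select_alerts(scored, threshold=3.0)
for row in alerts:
    print(format_alert(row))
print(to_csv(alerts, fields=["segment", "score"]))
```

Keeping each step as a small function makes it straightforward to chain them in a script that runs after scoring, or to replace one of them with an R or shell step.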
Open Data has worked with clients on the usability of scores, and we use R and Python code to generate a dashboard for each scored model.