incremental-statistics


A statistical library with an incremental computational model for time series analysis, discrete event simulation, and event stream processing.

How would I use this for traditional statistical inference?

Most statistical inference is based on what we call a static computational model. A static computation calculates the value of a statistic once, from a complete collection of discrete input data.

In our library, you can invoke a static computation in one of two ways:

  • Calculate the mean of an array of doubles using an extension method on the array type named Calculate, which takes a Statistic and returns a double.

```
[Test]
public void StaticCalculation()
{
    new[] { 10.0, 6, 8, 4, 2 }.Calculate(new Mean()).isEqualTo(6);
}
```

Notes: * If you are wondering about the chained isEqualTo - that is just a generic extension method wrapping a traditional NUnit assert.

(You can control the scope and visibility of extension methods if you don't want this behaviour)

  • Some statistics also implement the StaticStatistic interface, which supports double Calculate(double[] values).

```
[Test]
public void test_population_variance()
{
    double[] values = { -1.0d, 3.1d, 4.0d, -2.1d, 22d, 11.7d, 3d, 14d };
    Variance v1 = Variance.CreateWithoutBiasCorrection();
    Assert.That(v1.Calculate(values), Is.EqualTo(populationVariance(values)).Within(1E-14));
}
```

Notes: * The preferred method of static calculation is the first example - the Calculate extension method on the double array that takes an instance of an accumulating statistic. To eliminate duplication, statistics avoid implementing the StaticStatistic interface whenever possible. However, the algorithms backing some accumulating statistics make minor accuracy-for-speed trade-offs that are unnecessary for static calculations; in these cases the statistic also implements the StaticStatistic interface using the standard calculation algorithm, which avoids any accuracy trade-offs.
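For the population variance, the "standard calculation algorithm" referred to above is typically the classic two-pass method: compute the mean first, then average the squared deviations. A minimal sketch in Python (illustrative only, not the library's C#; the function name is ours):

```python
def population_variance(values):
    """Two-pass 'standard' algorithm: numerically stable, but it needs
    the whole sample up front, so it only suits static calculation."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

# same sample as the test above
values = [-1.0, 3.1, 4.0, -2.1, 22.0, 11.7, 3.0, 14.0]
print(population_variance(values))
```

A single-pass accumulating algorithm trades a little of this numerical stability for the ability to update on each event, which is why the static path keeps the two-pass version.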

How would I use this for incremental statistical inference, such as working with time series, discrete event simulation, and event stream processing?

Incremental computation means that the value of a statistic is updated each time a new data event is received. Incremental computation uses different calculation algorithms than static computation because it takes advantage of cached prior data and statistic values, which drastically improves the performance of updating a statistic's value on each new event. Incremental computational models are important for statistical inference in time series, discrete event simulation, and event stream processing. Accumulated and rolling calculations are specialized high-performance incremental methods of calculation.
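The core of the accumulated model is an O(1) update rule that folds each new event into cached state instead of re-summing the whole sample. A sketch in Python (illustrative, not the library's C#), using the standard incremental mean update:

```python
class AccumulatedMean:
    """Accumulating mean: each call folds one new event into cached
    state (count and running mean) in constant time."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def calculate(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental update rule
        return self.mean

m = AccumulatedMean()
print(m.calculate(1))  # 1.0
print(m.calculate(7))  # 4.0
```

The same events are used in the accumulating C# test that follows, so the expected values match.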

  • Accumulated calculation

    incrementally update the mean as new data events are received. calculates the mean over the entire sample of events.

```
[Test]
public void accumulating_calculation()
{
    var mean = new Mean();
    mean.Calculate(1).isEqualTo(1);
    mean.Calculate(7).isEqualTo(4);
}
```

  • Rolling calculation

maintain a queue of length n and incrementally update the mean as new data events are received. calculates the mean of the past n events.

```
[Test]
public void RollingCalculations()
{
    var mean = new Mean(4);
    mean.Calculate(1).isEqualTo(double.NaN);
    mean.Calculate(7).isEqualTo(double.NaN);
    mean.Calculate(6).isEqualTo(double.NaN);
    mean.Calculate(4).isEqualTo(4.5);
    Assert.That(mean.Calculate(5), Is.EqualTo(5.5).Within(.001));
}
```

Notes: * Notice that rolling statistics return Not a Number until the queue is primed, i.e. until there is enough data to satisfy the window size specified in the constructor of the statistic. Priming is the process of filling a rolling statistic's backing queue with enough data to make calculations based on the window size. The calculation algorithms also differ between priming and rolling calculation, so the implementation of Calculate for RollingStatistics delegates to prime() or calculate(). * Notice the constructor calls to Mean in the accumulated and rolling cases. Mean itself is actually just a wrapper: AccumulatedMean and RollingMean are two separate classes, but you instantiate the accumulating type if you use Mean's no-args constructor and the rolling type if you specify an integer window size (sometimes also called lookback length, or lag) in the constructor.
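The priming behaviour above can be sketched as follows, in Python rather than the library's C#. The class and method names are ours; the NaN-until-primed contract and the test values match the rolling C# test above:

```python
from collections import deque

class RollingMean:
    """Rolling mean with priming: returns NaN until the window of
    size n is full, then maintains a running sum over a queue so
    each update stays O(1)."""
    def __init__(self, n):
        self.n = n
        self.window = deque()
        self.total = 0.0

    def calculate(self, x):
        self.window.append(x)
        self.total += x
        if len(self.window) < self.n:     # still priming
            return float("nan")
        if len(self.window) > self.n:     # rolling: evict oldest event
            self.total -= self.window.popleft()
        return self.total / self.n

m = RollingMean(4)
for x in (1, 7, 6):
    m.calculate(x)     # NaN while priming
print(m.calculate(4))  # 4.5
print(m.calculate(5))  # 5.5
```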

What is the background of this project and what other statistics libraries are out there that support the incremental computation model?

This math library evolved to compose trading models for simulation and real-time trading. As a result, you will find not only the expected core rank and moment statistics, but also a number of finance-specific calculations, as well as a number of regression and dependence statistics that are on the backlog.

In the world of trading, the event-driven model is ubiquitous. While it is easy to find statistics libraries with static computational models, it is uncommon to find statistics libraries that support incremental computational models. The best libraries of the accumulating variety that we encountered during our research are the Boost accumulators for C++ and Apache Commons Math for Java. The only examples of rolling statistics that we have encountered are the zoo library for the R language, and tslib for C++ http://repo.or.cz/w/tslib.git and R http://repo.or.cz/w/fts.git.

There may be some commercial libraries and tools that we are unaware of (for example, there is likely something for Matlab). We preferred to find something that was open source, and also something that was not wedded to a specific tool, such as Matlab or Mathematica.

Statistics can be parametric (assume something about the underlying distribution) http://en.wikipedia.org/wiki/Parametric_statistics or non-parametric (distribution-free) http://en.wikipedia.org/wiki/Non-parametric_statistics. When we first started out, we found that the design in Apache Commons Math was the closest to what we had in mind for the static and accumulated cases, and it had the richest coverage of parametric and non-parametric statistics - specifically moment and rank statistics. We initially began developing a variety of extensions to Apache Commons Math, including the rolling cases. This served as a good prototype, but we wanted a different API that was designed more around our "incremental calculation" use cases and open to a more functional style of composing what might be called "higher order statistics."

More API show and tell please.

The API for this math library is a mix of OO and functional styles, with a trend toward a more functional style. Statistic is the core interface in the library. Statistic supports only a single double-in-double-out Calculate method, which allows the code to be used as a combinator library with composable higher order statistics. A Statistic.Calculate method can be composed in statistical expression trees together with other double-in-double-out pure functions. HigherOrderStatistics are an example of functional style in the API. One simple example of a HigherOrderStatistic is the ScaledStatistic, which takes a statistic and a scalar function in the constructor and implements the Calculate method by delegating to the underlying statistic and applying the scalar function to the result.

```
public class ScaledStatistic : Statistic
{
    private readonly Statistic statistic;
    private readonly Func<double, double> scalar;

    public ScaledStatistic(Statistic statistic, Func<double,double> scalar)
    {
        this.statistic = statistic;
        this.scalar = scalar;
    }

    public double Calculate(double d)
    {
        return scalar(statistic.Calculate(d));
    }
}

```

Rolling and accumulating statistics are stateful (a performance optimization that is at the core of the library), so the Calculate method is not a double-in-double-out pure function. This stateful computation imposes a limitation on using this code as a combinator library. Pure functions can be composed into lazily-evaluated monadic expression trees in such a way that functions are evaluated or not evaluated based on whether the result of evaluating that function is necessary, and whether the results from previous calculations in the sequence yield valid input into subsequent calculations in the sequence. The stateful Statistic.Calculate method, on the other hand, must be evaluated with each new discrete event, otherwise it will be left in an invalid state. To make this more clear, consider guard clauses for NaN expressed as higher order functions that take a double and a function that takes and returns a double, and return doubles.

  • filtering pure functions for NaN

```
public static Func<double, Func<double, double>, double> not_a_number =
    (input, function) => Double.IsNaN(input) ? Double.NaN : function(input);
```

Notes: * This filter can wrap around every pure function and halt the evaluation of an expression tree.

  • filtering functions with effects for NaN

```
public static Func<double, Func<double, double>, Func<double, double>, Func<double, double, double>, double> higher_order_with_validation =
    (input, function1, function2, function3) =>
    {
        if (Double.IsNaN(input)) return Double.NaN;
        var result1 = function1(input);
        var result2 = function2(input);
        if (Double.IsNaN(result1) || Double.IsNaN(result2)) return Double.NaN;
        return function3(result1, result2);
    };
```

Notes: * Here we are combining three functions, but we cannot chain sequences of expressions using our previous not_a_number filter because the functions we are passing in are closures of sorts - pointers to the Calculate() method on our stateful Statistic objects. As a result, we are forced to evaluate both functions before we can apply our NaN guard clauses to the output of the first two functions, which serves as the input to the third function.

Another example of using the library in a functional style is the CombinedStatistic, which takes two statistics and a function of two doubles in the constructor, and implements the Calculate method by applying the function to the results of evaluating the Calculate method on the two statistics. As you can see from the comment, our backlog contains a task to extend the CombinedStatistic concept to support a full combinator model for building arbitrarily large and complex statistical expression trees.

```
//TODO extend to support combinator model to build arbitrary expression trees
public class CombinedStatistic : Statistic
{
    private readonly Func<double, double> combinedCalculate;

    public CombinedStatistic(Statistic leftOperand, Statistic rightOperand, Func<double,double,double> function)
    {
        combinedCalculate = guards.higher_order_with_validation_generator(leftOperand.Calculate,
                                                                            rightOperand.Calculate,
                                                                            function);
    }

    public double Calculate(double value)
    {
        return combinedCalculate(value);
    }
}

```
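The evaluation discipline behind CombinedStatistic can be sketched in Python (illustrative only; the names are ours, not the library's C# API). The key point from the discussion above is that both stateful Calculate methods must run on every event so that neither cache goes stale, and the NaN guard can only inspect their results afterwards:

```python
class AccumulatedMean:
    """Minimal stateful statistic for the sketch."""
    def __init__(self):
        self.n, self.mean = 0, 0.0

    def calculate(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self.mean

def combined_statistic(stat1, stat2, combine):
    """Combine two stateful statistics with a two-argument function,
    guarding NaN on the input and on both intermediate results."""
    def calculate(x):
        if x != x:                    # NaN input short-circuits
            return float("nan")
        r1 = stat1.calculate(x)       # must evaluate BOTH statistics,
        r2 = stat2.calculate(x)       # even if one result turns out NaN
        if r1 != r1 or r2 != r2:
            return float("nan")
        return combine(r1, r2)
    return calculate

# hypothetical usage: difference between two running means
diff = combined_statistic(AccumulatedMean(), AccumulatedMean(),
                          lambda a, b: a - b)
print(diff(3.0))  # 0.0
```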

What kind of DataStructuresAndAlgorithms are backing this library?

What about concurrency, is this thing thread safe?

The library is not thread safe. The concurrency model is intentionally the responsibility of the client code. We avoid any internal locking, and we advocate that client code favor isolation and multi-process designs over shared memory and locking.

What is on the backlog for this project?

  • clean up inheritance fiasco in the rolling moments
  • implement incremental regression algorithms
  • more dependence statistics
  • more variability statistics
  • optimization library?
  • rewrite in F#?

Project Information

Labels:
Statistics Accumulating Rolling Math Functional Combinator Incremental Inference Finance Trading