incremental-statistics - DataStructuresAndAlgorithms.wiki


Data Structures

All calculations operate on doubles, and we primarily use double arrays and generic collections of doubles. However, there is one custom data structure, used exclusively as the backing queue for rolling calculations. The IndexableFixedLengthCompoundQueueOfDoubles is an optimized, indexable, fixed-length queue designed to solve two problems for event-driven calculations on rolling windows: avoiding the churn of a dynamically resizing queue, and providing indexed access to the window's contents. The "compound" in the name comes from the fact that enqueue and dequeue are folded into a single Enqueue method that enqueues the new element, dequeues the oldest, and returns the dequeued element. A normal queue resizes dynamically as elements are enqueued and dequeued, and a queue that serves as the data-event cache backing a rolling calculation is used in a specific pattern: every data event causes both an enqueue and a dequeue at the same time. This usage pattern can cause performance issues for normal queues if it leads to dynamic sizing and resizing. Because of the invariant that this queue remains fixed-size, and because its only operation is the compound enqueue-dequeue-and-return, we can implement it very efficiently on a fixed-length array with a fast indexing scheme that wraps indexes around the array, overwriting elements as they are dequeued.
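A minimal sketch of the idea, assuming a circular-buffer implementation (only the class name and the compound Enqueue come from this page; the other members are illustrative):

```
public class IndexableFixedLengthCompoundQueueOfDoubles
{
    private readonly double[] buffer; // fixed-length backing store
    private int head;                 // index of the oldest element, the next slot to overwrite

    public IndexableFixedLengthCompoundQueueOfDoubles(int length)
    {
        buffer = new double[length];
        for (int i = 0; i < length; i++)
            buffer[i] = double.NaN;   // placeholders until the window fills
    }

    // The compound operation: overwrite the oldest element with the new one,
    // advance the wrapped head index, and return the element that was dequeued.
    public double Enqueue(double value)
    {
        double dequeued = buffer[head];
        buffer[head] = value;
        head = (head + 1) % buffer.Length;
        return dequeued;
    }

    // Indexed access relative to the logical front of the queue (0 = oldest).
    public double this[int i]
    {
        get { return buffer[(head + i) % buffer.Length]; }
    }
}
```

In this sketch the buffer starts out NaN-filled, so the compound Enqueue returns NaN until the window has filled once, mirroring the NaN results in the rolling tests below.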

DiscreteEvent and DateSensitiveStatistics

There is also a discrete event type, DiscreteDouble, which is just a tuple containing a DateTime and a double, used for date-sensitive statistics. For example, consider CompoundAnnualReturn, which implements the DateSensitiveStatistic interface:

```
public class CompoundAnnualReturn : DateSensitiveStatistic
{
    private DiscreteDouble first = new DiscreteDouble(DateTime.MinValue, double.NaN);

    public DiscreteDouble Calculate(DiscreteDouble value)
    {
        if (double.IsNaN(first.Value))
        {
            first = value;
            return new DiscreteDouble(value.DateTime, 0);
        }
        double years = (value.DateTime - first.DateTime).Days / 365.0;
        double compoundAnnualReturn = Math.Pow(value.Value / first.Value, 1.0 / years) - 1;
        return new DiscreteDouble(value.DateTime, compoundAnnualReturn);
    }
}
```
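A hypothetical usage test in the same NUnit style as the rolling examples below (it assumes only the DiscreteDouble constructor and Value property shown in the snippet above):

```
[Test]
public void CompoundAnnualReturnExample()
{
    var cagr = new CompoundAnnualReturn();

    // The first event establishes the baseline and yields a return of 0.
    var baseline = cagr.Calculate(new DiscreteDouble(new DateTime(2020, 1, 1), 100d));
    Assert.That(baseline.Value, Is.EqualTo(0d));

    // 365 days later years == 1.0, so the compound annual return
    // equals the simple return: 110/100 - 1 = 10%.
    var oneYearLater = cagr.Calculate(new DiscreteDouble(new DateTime(2020, 12, 31), 110d));
    Assert.That(oneYearLater.Value, Is.EqualTo(0.1d).Within(1e-12));
}
```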

Specialized implementations for incremental statistics

  • Descriptions of the less straightforward calculation algorithms are in the header of the file for that type. Some of these descriptions were brought over with the code we ported from Apache Commons Math.
  • Moment statistics (FirstMoment, SecondMoment, and their wrappers Mean, Variance, etc.) all require special algorithms for the incremental calculation of moments; see the sketch at the end of this page. http://en.wikipedia.org/wiki/Moment_(mathematics)
  • Order statistics use the very important partition selection algorithm. http://en.wikipedia.org/wiki/Order_statistics http://en.wikipedia.org/wiki/Ranking
  • We have experimented with a couple of different pivot strategies for partitioning: one takes the left, right, and middle elements and picks the median of the three; another picks a random index between left and right. We have not attempted the median-of-medians algorithm yet, and will only do so if the performance of our current implementations proves insufficient. A selection sketch appears at the end of this page. http://en.wikipedia.org/wiki/Selection_algorithm http://www.cs.mcgill.ca/~cs251/OldCourses/1997/
  • Another performance optimization when selecting the value at the start of the nth percentile of a list is choosing intelligently whether to select from the "front or back" of the list. Since order statistics provide not just the median but percentiles above and below 50%, our percentile algorithms gain efficiency by deciding whether to use partitioning or reverse partitioning: select the kth smallest element for percentiles below 50%, and the kth largest for percentiles above 50%.
  • We have two different cases for incremental rank statistics:

  • In the first case you specify the quantile in the constructor and make repeated calls to Calculate() for the same quantile (for example, the median). This is NthPercentile:

```
[Test]
public void RollingCalculation()
{
    var nthPercentile = new NthPercentile(.5, 4);
    Assert.IsNaN(nthPercentile.Calculate(8d));
    Assert.IsNaN(nthPercentile.Calculate(2d));
    Assert.IsNaN(nthPercentile.Calculate(6d));
    Assert.That(nthPercentile.Calculate(4d), Is.EqualTo(5d));
    Assert.That(nthPercentile.Calculate(5d), Is.EqualTo(4.5d));
}
```
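The first three calls return NaN because the four-element window is not yet full. Once it is, each call returns the median of the current window: {2, 4, 6, 8} interpolates to 5, and after the 8 drops out of the window, {2, 4, 5, 6} interpolates to 4.5.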

  • In the second case you make incremental updates without specifying a quantile in the constructor: repeated calls to Calculate(), then a call to getResult() with a quantile argument. This allows you to use a single instance, with a single backing data structure, for a model that requires multiple percentiles of the same data. This is RollingPercentileWithDynamicQuantile:

```
[Test]
public void RollingWithDynamicQuantileCalculation()
{
    var rollingPercentile = new RollingPercentileWithDynamicQuantile(4);
    Assert.IsNaN(rollingPercentile.Calculate(6.11));
    Assert.IsNaN(rollingPercentile.Calculate(3.96));
    Assert.IsNaN(rollingPercentile.Calculate(1.23));
    rollingPercentile.Calculate(1.1);
    Assert.That(rollingPercentile.getResult(.75d), Is.EqualTo(5.5725));
    rollingPercentile.Calculate(1.09);
    Assert.That(rollingPercentile.getResult(1d), Is.EqualTo(3.96));
    rollingPercentile.Calculate(.22);
    Assert.That(rollingPercentile.getResult(.01d), Is.EqualTo(.22));
}
```
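Again the first three calls return NaN while the window fills. The expected values follow the percentile estimate used by Apache Commons Math: quantile p in a window of n elements sits at position p·(n + 1), interpolated between neighboring order statistics and clamped to the window's minimum and maximum. For getResult(.75d) the window is {1.1, 1.23, 3.96, 6.11} and the position is .75 · 5 = 3.75, giving 3.96 + .75 · (6.11 - 3.96) = 5.5725; getResult(1d) and getResult(.01d) clamp to the maximum and minimum respectively.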
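For the moment statistics mentioned above, here is a minimal sketch of the standard one-pass update for the first and second moments (the textbook recurrence, not the library's actual FirstMoment/SecondMoment source):

```
public class IncrementalMeanAndVariance
{
    private long n;       // count of values seen so far
    private double mean;  // first moment: the running mean
    private double m2;    // sum of squared deviations from the running mean

    public void Increment(double x)
    {
        n++;
        double delta = x - mean;
        mean += delta / n;        // first moment update
        m2 += delta * (x - mean); // second moment update uses the new mean
    }

    public double Mean { get { return n > 0 ? mean : double.NaN; } }
    public double Variance { get { return n > 1 ? m2 / (n - 1) : double.NaN; } }
}
```

This recurrence avoids the catastrophic cancellation of the naive sum-of-squares formula, which is why incremental moment statistics need special algorithms in the first place.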
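And for the order statistics, a sketch of partition-based selection (quickselect) using the median-of-three pivot strategy described above; the actual implementation differs in detail:

```
// Returns the kth smallest element of a (k is 0-based); mutates a in place.
static double SelectKthSmallest(double[] a, int k)
{
    int left = 0, right = a.Length - 1;
    while (left < right)
    {
        // Median-of-three pivot: one of the two strategies described above.
        double pivot = MedianOfThree(a[left], a[(left + right) / 2], a[right]);
        int i = left, j = right;
        while (i <= j)
        {
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j)
            {
                double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
        }
        // Narrow to the partition that contains index k; unlike quicksort,
        // only one side is ever visited.
        if (k <= j) right = j;
        else if (k >= i) left = i;
        else return a[k];
    }
    return a[left];
}

static double MedianOfThree(double x, double y, double z)
{
    return Math.Max(Math.Min(x, y), Math.Min(Math.Max(x, y), z));
}
```

Selecting the kth largest is the mirror image (the kth largest is the (n - 1 - k)th smallest), which is what the front-or-back optimization exploits for percentiles above 50%.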