My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
SAX  
Walk through steps of Symbolic Aggregate approXimation.
Featured
Updated Aug 25, 2010 by sen...@gmail.com

Description

Symbolic Aggregate approXimation (SAX) transforms original time-series data into symbolic strings.

Symbolic Aggregate approXimation was proposed by Lin et al. and extends PAA-based approach inheriting algorithm simplicity and low computational complexity while providing satisfiable sensitivity and selectivity in range-query processing. Moreover, the use of a symbolic representation opens the door to the existing wealth of data-structures and string-manipulation algorithms in computer science such as hashing, regular expression pattern matching, suffix trees etc.

SAX transforms a time-series X of length n into the string of arbitrary length , where typically, using an alphabet A of size a < 2. The SAX algorithm consist of two steps: during the first step it transforms the original time-series into a PAA representation and this intermediate representation gets converted into a string during the second step. Use of PAA at the first step brings the advantage of a simple and efficient dimensionality reduction while providing the important lower bounding property as shown in the previous section. The second step, actual conversion of PAA coefficients into letters, is also computationally efficient and the contractive property of symbolic distance was proven by Lin et al.

Discretization of the PAA representation of a time-series into SAX is implemented in a way which produces symbols corresponding to the time-series features with equal probability. The extensive and rigorous analysis of various time-series datasets available to the original authors has shown that time-series that are normalized by the zero mean and unit of energy follow the Normal distribution law. By using Gaussian distribution properties, it's easy to pick a equal-sized areas under the Normal curve using lookup tables for the cut lines coordinates, slicing the under-the-Gaussian-curve area. The x coordinates of these lines called "breakpoints" in the SAX algorithm context. The list of breakpoints such that and , divides the area under N(0,1) into a equal areas. By assigning a corresponding alphabet symbol to each interval , the conversion of the vector of PAA coefficients into the string implemented as follows:

SAX introduces new metrics for measuring distance between strings by extending Euclidean and PAA distances. The function returning the minimal distance between two string representations of original time series and is defined as

where the dist function is implemented by using the lookup table for the particular set of the breakpoints (alphabet size) as shown in Table below, and where the singular value for each cell (r,c) is computed as

The lookup table for 4-letters alphabet

a b c d
a 0 0 0.67 1.34
b 0 0 0 0.67
c 0.67 0 0 0
d 1.34 0.67 0 0

As shown by Li et al, this SAX distance metrics lower-bounds the PAA distance, i.e.

The SAX lower bound was examined by Ding et al in great detail and found to be superior in precision to the spectral decomposition methods on bursty (non-periodic) data sets.

SAX primer

1.0 Timeseries data

Timeseries1: (2.02, 2.33, 2.99, 6.85, 9.20, 8.80, 7.50, 6.00, 5.85, 3.85, 4.85, 3.85, 2.22, 1.45, 1.34) 
Timeseries2: (0.50, 1.29, 2.58, 3.83, 3.25, 4.25, 3.83, 5.63, 6.44, 6.25, 8.75, 8.83, 3.25, 0.75, 0.72) 
The Euclidean distance between ts1 and ts2 D(ts1, ts2) = 11.42126

2.0 PAA transform into 9 points

2.1 Normalization "pre-processing"

Note that before transforming timeseries with PAA we Z-normalize them first

Timeseries1, Z-normalized: (-0.9796808, -0.8622706, -0.6123005, 0.8496459, 1.739691, 1.588194, 1.095829, 0.5277147, 0.4709033, -0.2865819, 0.0921607, -0.2865819, -0.9039323, -1.195564, -1.237226)
Timeseries2, Z-normalized: (-1.289433, -0.9992189, -0.5253246, -0.06612478, -0.2791935, 0.08816637, -0.06612478, 0.595123, 0.8926845, 0.8228861, 1.741286, 1.770675, -0.2791935, -1.197593, -1.208614)
The Euclidean distance between normalized ts1 and ts2 D(ts1', ts2') = 4.170542 

2.2 PAA of normalized timeseries into 9 points

Timeseries1, 9-points PAA: (-0.9327168, -0.3699053, 1.383673, 1.391248, 0.6299752, 0.01641218, -0.05933634, -0.8387886, -1.220561)
Timeseries2, 9-points PAA: (-1.173347, -0.5282635, -0.1939660, 0.02644991, 0.5223858, 0.8508055, 1.753041, -0.05289982, -1.204206)
The Euclidean distance between PAA-transformed ts1 and ts2 D(ts1'', ts2'') = 3.007489
   i.e. it is less than distance between normalized timeseries.

3.0 Normalized & PAA transformed timeseries into string

We will use the 4 symbols alphabet {a,b,c,d} as in the table above. The cut lines for this alphabet shown as the thin red lines on the plot below

The 4 symbols alphabet cuts { -0.67, 0, 0.67 }
SAX transform of ts1 into string through 9-points PAA: "abddccbaa"
SAX transform of ts2 into string through 9-points PAA: "abbccddba"
SAX distance: 0 + 0 + 0.67 + 0 + 0 + 0 + 0.67 + 0 + 0 = 1.34 
Orange color depicts symbols distance between which is counted - they are not "adjacent" to each other...  

Comment by eralda_d...@yahoo.com, Mar 27, 2010

i have read about this transformation of time series and i'm very interesting. actualy i'm working on it...but if you can provide more info about the new searches about this kind of trensform i'll be glad. sincerly,alda


Sign in to add a comment
Powered by Google Project Hosting