A Query Adaptive Data Structure for Efficient Indexing of Time Series Databases

Presented by Stavros Papadopoulos

Time Series and Similarity Search

Time Series: A sequence (ordered collection) of real values, X = x1, x2, ..., xn, where n can be very large

Similarity: Given a query sequence q, a database S of N sequences S1, S2, ..., SN, a distance measure D and a tolerance threshold ε, two time series x and y are regarded as similar within range ε when D(x, y) ≤ ε

Time Series and Similarity Search

Whole matching: Given a collection of N data sequences of real numbers S1, S2, …, SN and a query sequence Q, we want to find those data sequences that are within distance ε from Q. The data and query sequences must have the same length

Subsequence matching: Given N data sequences S1, S2, …, SN of arbitrary lengths, a query sequence Q and a tolerance ε, we want to identify the data sequences Si that contain matching subsequences. We report those data sequences, along with the offsets within them at which the subsequences best match the query sequence
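The two problem definitions above can be sketched as a plain sequential scan. This is only an illustrative baseline, assuming Euclidean distance as the measure D; the function names are hypothetical and not taken from the presentation.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def whole_matching(database, query, eps):
    """Indices of the data sequences within distance eps of the query."""
    return [i for i, s in enumerate(database) if euclidean(s, query) <= eps]


def subsequence_matching(database, query, eps):
    """(sequence index, best offset, distance) for every data sequence that
    contains at least one subsequence within distance eps of the query."""
    m = len(query)
    results = []
    for i, s in enumerate(database):
        best = None
        # Slide a window of the query's length over the data sequence.
        for offset in range(len(s) - m + 1):
            d = euclidean(s[offset:offset + m], query)
            if d <= eps and (best is None or d < best[1]):
                best = (offset, d)
        if best is not None:
            results.append((i, best[0], best[1]))
    return results
```

Both functions touch every sequence, which is exactly the cost that an index is meant to avoid.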

Dimensionality Reduction

Sequentially scanning the database to find similar sequences is slow, so an indexing method is required to reduce the query time

Every time series of length n is regarded as a point in the n-dimensional space

Indexing structures are used: R-Trees and their variants (the R*-tree is the most commonly used data structure)

R-trees scale poorly as the dimensionality increases

To search time series databases efficiently, each sequence is represented as a multidimensional vector and its dimensionality is reduced to a degree at which index structures can be applied efficiently

The following dimensionality reduction techniques are known: Discrete Fourier Transform (DFT), Discrete Wavelet Transform (DWT), Piecewise Aggregate Approximation (PAA), Singular Value Decomposition (SVD), Chebyshev Polynomials (Cheb), Piecewise Linear Approximation (PLA), Adaptive Piecewise Constant Approximation (APCA), Symbolic Aggregate Approximation (SAX)

Dimensionality Reduction using DFT

Let X=(x1,x2,…,xn) be a time series

We take the Fourier transform: $\mathrm{FFT}(X) = (X_1, X_2, \ldots, X_n)$

We keep only the first fc coefficients


Motivation behind using DFT:

In most applications the data does not exhibit rapid changes (e.g. stock data)

Therefore, the energy of a time series vector is concentrated in the lower frequencies

The first DFT coefficients contain this information for the lower frequencies

The reconstruction error when using only a few coefficients is small
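A minimal sketch of DFT-based reduction, assuming a naive O(n²) transform with 1/√n normalisation so that Euclidean distances are preserved by the full transform (Parseval). The helper names are hypothetical; a real system would use an FFT library.

```python
import cmath
import math


def dft(x):
    """Naive DFT, scaled by 1/sqrt(n) so that the transform preserves distances."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def reduce_dft(x, fc):
    """Keep only the first fc coefficients of the transform."""
    return dft(x)[:fc]


def coefficient_distance(a, b):
    """Euclidean distance between two vectors of complex coefficients."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


# Two slowly varying series of length 32, reduced to fc = 4 coefficients each.
x = [math.sin(2 * math.pi * t / 32) for t in range(32)]
y = [math.sin(2 * math.pi * t / 32 + 0.3) for t in range(32)]
true_d = euclidean(x, y)
reduced_d = coefficient_distance(reduce_dft(x, 4), reduce_dft(y, 4))
```

Because dropping coefficients only removes non-negative terms from the sum, `reduced_d` can never exceed `true_d`.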

Dimensionality Reduction Techniques

[Figure: three panels showing a time series X and its approximation X' under DFT, DWT and SVD; horizontal axis from 0 to 140]

Dimensionality Reduction Techniques

[Figure: three panels showing a time series X and its approximation X' under Cheb, PLA and PAA; horizontal axis from 0 to 140]

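Dimensionality Reduction Techniques

[Figure: two panels showing a time series X and its approximation X' under APCA and SAX; horizontal axis from 0 to 140. The SAX panel encodes the series as the symbol string aaaaaabbbbbccccccbbccccdddddddd over the alphabet a, b, c, d]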


Basically, dimensionality reduction is a technique for approximating the original sequence of size n by another sequence of much smaller length

Any dimensionality reduction technique potentially suffers from two problems:

False alarms: Occur when objects that appear to be close in the index space are actually distant. False alarms are removed in a post-processing stage.

False dismissals: Occur when qualifying objects are missed because they appear distant in the index space. They cannot be tolerated by the system.

The aforementioned techniques guarantee no false dismissals, because the distance they compute in the reduced space never exceeds the true distance

False alarms occur because every transform of an n-dimensional point to a point in a space of reduced dimensionality only approximates the true location of the transformed point.
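The filter-and-refine idea behind these two definitions can be sketched with PAA, one of the techniques listed earlier. The sketch assumes Euclidean distance and a sequence length divisible by the number of segments; the function names are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa(x, segments):
    """Mean of each of `segments` equal-width frames (len(x) must be divisible)."""
    w = len(x) // segments
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(segments)]


def paa_distance(px, py, n):
    """Lower bound on the Euclidean distance of the original length-n sequences."""
    return math.sqrt(n / len(px)) * euclidean(px, py)


def range_query(database, query, eps, segments):
    """Filter in the reduced space, then refine against the original sequences."""
    n = len(query)
    pq = paa(query, segments)
    # Filter step: anything farther than eps in the reduced space is safely pruned.
    candidates = [i for i, s in enumerate(database)
                  if paa_distance(paa(s, segments), pq, n) <= eps]
    # Refine step (post-processing): compute the true distance for each candidate.
    answers = [i for i in candidates if euclidean(database[i], query) <= eps]
    false_alarms = [i for i in candidates if i not in answers]
    return answers, false_alarms


database = [[0, 0, 0, 0], [1, -1, 1, -1], [0.1, 0.1, 0.1, 0.1], [5, 5, 5, 5]]
answers, false_alarms = range_query(database, [0, 0, 0, 0], eps=0.5, segments=2)
```

In this toy dataset the oscillating sequence averages out to the same PAA vector as the query, so it survives the filter step and is only rejected when the original sequence is fetched.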

Evaluating the different techniques

CPU time is highly dependent on the implementation.

A more accurate and objective measure of the effectiveness of a dimensionality reduction technique is the pruning power

Subsequence similarity search: FRM

Allows similarity matching between time series of variable size

Predefines a window length ω

It divides every time series of length n in the database into n-ω+1 sliding windows

It indexes all these subsequences with pointers that point to the original sequence they belong to

It divides the query sequence into ⌊m/ω⌋ disjoint windows, where m is the query length

For every query window, it searches the index for similar subsequences

The candidate set is then examined further in order to remove any false alarms

Subsequence similarity search: FRM

Disadvantage: When an MBR (minimum bounding rectangle) is found, we include all the points in this MBR in the candidate set

Subsequence similarity search: DUAL MATCH

It is the dual of FRM

Predefines a window length ω

It divides every time series of length n in the database into ⌊n/ω⌋ disjoint windows

It indexes all these subsequences with pointers that point to the original sequence they belong to

It divides the query sequence into m-ω+1 sliding windows, where m is the query length

For every query window, it searches the index for similar subsequences

The candidate set is then examined further in order to remove any false alarms

The authors’ experiments show that DUAL MATCH outperforms FRM
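The difference between the two schemes comes down to which side is cut into sliding windows and which into disjoint windows. A small sketch of the two window types follows; the function names are hypothetical, and a sequence of length n yields n-ω+1 sliding windows and ⌊n/ω⌋ disjoint ones.

```python
def sliding_windows(seq, w):
    """All len(seq) - w + 1 overlapping windows, each paired with its offset."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def disjoint_windows(seq, w):
    """The len(seq) // w non-overlapping windows, each paired with its offset."""
    return [(i, seq[i:i + w]) for i in range(0, len(seq) - w + 1, w)]


data = list(range(10))   # a data sequence of length 10
query = list(range(8))   # a query sequence of length 8
w = 4

# FRM: sliding windows over the data are indexed, disjoint windows over the query are searched.
frm_indexed, frm_probes = sliding_windows(data, w), disjoint_windows(query, w)

# DUAL MATCH: disjoint windows over the data are indexed, sliding windows over the query are searched.
dual_indexed, dual_probes = disjoint_windows(data, w), sliding_windows(query, w)
```

For this toy input FRM stores 7 index entries per data sequence and DUAL MATCH only 2, which illustrates why the dual scheme needs a much smaller index.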

Dimensionality reduction and query time

Two factors that affect similarity search queries:

Index search: The larger the preselected dimensionality, the larger the search time in the R*-tree

False alarms: The smaller the preselected dimensionality, the lower the accuracy and the larger the number of false alarms. Note that a false alarm is expensive, since we have to fetch the ‘false’ time series from the database and sequentially compare it with the query sequence.

Question

Is there an optimal dimensionality and, if there is, how do we select it?

The authors’ point of view

The problem of finding the optimal trade-off between accuracy and index search time has not been addressed yet

The authors who proposed the various dimensionality reduction techniques conduct experiments and test the pruning power and CPU time when using different dimensionalities

They find the optimal dimensionality empirically, by counting the number of page accesses after they have issued all the queries

But what do we do when we have an application that supports online queries, or when we do not have all the queries in advance?

A dynamic solution must be found!

A naïve solution

We keep multiple R*-trees with different dimensionalities, e.g. 2, 4, 6, …, 16

Each one indexes the whole database

We adopt a heuristic function to count the page accesses

We define proper thresholds

We start by using the R*-tree with the lowest dimensionality, e.g. the 2-dimensional one

As queries arrive, if the thresholds are exceeded, we change the tree we use, e.g. from the 2-dimensional to the 4-dimensional
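The switching logic of the naïve solution might look roughly as follows. The presentation does not specify the heuristic function or the thresholds, so the single page-access budget used here, and all names, are assumptions made purely for illustration.

```python
class NaiveMultiIndex:
    """Chooses which of several pre-built R*-trees to query, by dimensionality."""

    def __init__(self, dims=(2, 4, 6, 8, 10, 12, 14, 16), threshold=100):
        self.dims = list(dims)      # one tree is assumed to exist per dimensionality
        self.level = 0              # start with the lowest-dimensional tree
        self.threshold = threshold  # hypothetical budget of page accesses
        self.page_accesses = 0      # heuristic counter, reset after every switch

    @property
    def current_dim(self):
        return self.dims[self.level]

    def record_query(self, pages_read):
        """Account for one query's page accesses and switch trees if needed."""
        self.page_accesses += pages_read
        if self.page_accesses > self.threshold and self.level < len(self.dims) - 1:
            self.level += 1
            self.page_accesses = 0
        return self.current_dim


index = NaiveMultiIndex(dims=(2, 4, 6), threshold=10)
history = [index.record_query(p) for p in (6, 6, 11, 100)]
```

Note that the switch is global: one counter decides the dimensionality used for the whole database.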

A naïve solution (cont.)

Advantages

We can dynamically adjust the dimensionality of our index as queries arrive

We don’t have to decide the dimensionality in advance

It can work well with online queries

Disadvantages

Increased space

Suppose that only a fraction of the database is affected by the queries and results in false alarms; the whole tree will then unnecessarily upgrade its dimensionality

The unaffected fraction of the database will then be indexed with a tree of higher dimensionality than is actually needed

Convention for the purposes of our discussion

‘Simple’ time series: A time series that concentrates its energy in its low frequencies

‘Complicated’ time series: A time series that concentrates its energy in its high frequencies
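One way to make this convention concrete is to measure the share of a series' energy that falls in its lowest frequencies. The cutoff k and the function names below are assumptions for illustration only; the presentation does not fix a numeric criterion.

```python
import cmath
import math


def energy_spectrum(x):
    """Squared magnitude of every DFT coefficient (naive O(n^2) transform)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))) ** 2
            for f in range(n)]


def low_frequency_share(x, k):
    """Fraction of the total energy at frequencies 0..k.

    For a real-valued series, bin n - f mirrors bin f, so both count as frequency f.
    """
    spectrum = energy_spectrum(x)
    n = len(x)
    total = sum(spectrum)
    if total == 0:
        return 1.0
    low = sum(e for f, e in enumerate(spectrum) if min(f, n - f) <= k)
    return low / total


slow = [math.sin(2 * math.pi * t / 64) for t in range(64)]  # one slow cycle
fast = [(-1) ** t for t in range(64)]                       # flips sign at every step
slow_share = low_frequency_share(slow, 2)
fast_share = low_frequency_share(fast, 2)
```

Under this measure the slow sine would count as ‘simple’ (share close to 1) and the alternating series as ‘complicated’ (share close to 0).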

Observations

In a dataset, there could exist simple time series as well as complicated time series

A single time series can contain simple subsequences as well as complicated subsequences, e.g. there might be some ‘white’ noise in a time series

The query could be simple or complicated

The query defines where we will have the false alarms, since the query is the one that falls in an MBR and causes false alarms there

Conclusions

It is necessary for our solution to be query adaptive, since we search the index according to the queries and the queries are the ones that cause the false alarms

Since not all the indexed subsequences are of the same complexity, our structure must support different dimensionalities for different portions of the database

This means that our structure should support MBRs of different dimensionalities

Proposed solution

We will use a modification of the R*-tree

We will also use the DUAL MATCH technique for similarity matching

We do not have any knowledge about our dataset, nor about the nature of the queries

The queries can be offline or online

Note that for online queries, most of the time, applications just use an active window, which can be regarded as a sliding window

The structure is adjusted over time, as new queries arrive

The structure

The structure starts as a 2-dimensional R*-tree

Every fc-dimensional point has a pointer to its original time series as well as to its complete transformation vector

With every node/MBR we associate a variable dim that holds the dimensionality of that node

[Figure: a 2-dimensional R*-tree over the points a to m, plotted on x and y axes from 0 to 10. The Root holds the entries E1 and E2, below which the leaf entries E3 to E7 hold the points; each entry is a Minimum Bounding Rectangle (MBR). A window query is drawn over the points, and every node is labelled dim=2]

The structure

We search the index based on the dimensionality of every node

We keep a heuristic function for the false alarms in every leaf MBR and we define a threshold

If the threshold is exceeded, we increase the dimensionality of the corresponding leaf MBR

This is achieved by retrieving more coefficients from the transformation vector that each point's pointer refers to
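The per-leaf bookkeeping described above can be sketched as follows. The false-alarm counter, the threshold value, the step of 2 and the class names are all assumptions; the presentation leaves the heuristic function open. Coefficients are treated as plain real numbers for simplicity.

```python
class Point:
    """An indexed subsequence: a pointer to its series plus its full transform."""

    def __init__(self, series_id, coefficients):
        self.series_id = series_id        # stands in for the pointer to the original series
        self.coefficients = coefficients  # complete transformation vector

    def project(self, dim):
        return self.coefficients[:dim]


class LeafMBR:
    """A leaf MBR that carries its own dimensionality `dim`."""

    def __init__(self, points, dim=2, max_dim=16, step=2, threshold=3):
        self.points = points
        self.dim = dim
        self.max_dim = max_dim
        self.step = step            # hypothetical: how many dimensions to add per upgrade
        self.threshold = threshold  # hypothetical: tolerated false alarms before upgrading
        self.false_alarms = 0
        self._rebuild()

    def _rebuild(self):
        """Recompute the bounding rectangle in the current dimensionality."""
        projected = [p.project(self.dim) for p in self.points]
        self.low = [min(c) for c in zip(*projected)]
        self.high = [max(c) for c in zip(*projected)]

    def record_false_alarms(self, count):
        """Count false alarms caused in this leaf; upgrade it past the threshold."""
        self.false_alarms += count
        if self.false_alarms > self.threshold and self.dim < self.max_dim:
            # Fetch more coefficients from each point's transformation vector.
            self.dim = min(self.dim + self.step, self.max_dim)
            self.false_alarms = 0
            self._rebuild()
        return self.dim


leaf = LeafMBR([Point(0, [1, 2, 3, 4, 5, 6]), Point(1, [2, 1, 0, 9, 5, 6])], max_dim=6)
```

Only the leaf that keeps producing false alarms pays for the extra dimensions; the other leaves are untouched.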

[Figure: the same R*-tree and window query shown in two steps. One node is now labelled dim=4 after its dimensionality was increased, while the remaining nodes are still labelled dim=2]

The structure

At some point, all (or most of) the leaf MBRs of a particular internal MBR may have increased their dimensionality

At that point we have to increase the dimensionality of the internal MBR as well; this is not trivial (it will be discussed later)

This can propagate up to the root
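The propagation can be pictured with a rough sketch that tracks only the dim variable. The presentation does not say how an internal MBR is actually rebuilt, nor which share of upgraded children should trigger it, so the fraction parameter and the choice of the smallest upgraded dimensionality are assumptions.

```python
class Node:
    """A tree node reduced to the fields needed here: dim, children and parent."""

    def __init__(self, dim=2, children=None):
        self.dim = dim
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self


def propagate_upgrade(node, fraction=0.5):
    """Walk up from an upgraded node and raise each ancestor's dim while at least
    `fraction` of that ancestor's children have a higher dim than the ancestor."""
    parent = node.parent
    while parent is not None:
        higher = [c for c in parent.children if c.dim > parent.dim]
        if len(higher) / len(parent.children) < fraction:
            break
        # Assumption: adopt the smallest dimensionality among the upgraded children.
        parent.dim = min(c.dim for c in higher)
        parent = parent.parent


def build_tree():
    leaf_a, leaf_b, leaf_c = Node(), Node(), Node()
    left, right = Node(children=[leaf_a, leaf_b]), Node(children=[leaf_c])
    root = Node(children=[left, right])
    return root, left, leaf_a


root, left, leaf_a = build_tree()
leaf_a.dim = 4
propagate_upgrade(leaf_a, fraction=0.5)  # half of the children suffice here
```

With a stricter setting such as fraction=1.0 the same upgrade would stop at the leaf, which shows how strongly the outcome depends on this still undecided parameter.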


The structure

At some point we will end up with an R*-tree which has different dimensionalities for different subtrees

Benefit

The subsequences that were creating the false alarms in the beginning have now increased their dimensionality and no longer cause false alarms, at the cost of an increased index search time for them

The subsequences that didn’t cause false alarms in the beginning haven’t changed their dimensionality, and therefore the index search time for these subsequences has remained unchanged

The structure

Worst case

There are two possibilities:

The dataset subsequences are all complicated

The query sequences are all complicated

In the above cases the whole R*-tree may end up with a uniform dimensionality over its MBRs

It may even reach the maximum dimensionality (i.e. 16)

Even in the worst case, our contribution is that the optimal dimensionality of the R*-tree is determined dynamically

Concerns

How do we merge subtrees when the dimensionality of their child MBRs has increased?

Updates: insertions and deletions

What percentage of the child MBRs must be upgraded before the dimensionality of their parent MBR is upgraded?

What will the heuristic function of false alarms be?

How about increasing the dimensionality from 2 to 6 instead of increasing it from 2 to 4?

There should be special handling for PLA, SAX, PAA and APCA

We should be careful with merges and splits according to the fan-out, since when we increase the dimensionality we increase the space as well

How do we decide if we want to decrease the dimensionality?

If we prove that we outperform the original R*-tree for all the dimensionality reduction techniques, then we have a good contribution (the experimental section can be large)

Thank You!

For those who are interested, I have an extended bibliography on the issues covered
