a query adaptive data structure for efficient indexing of time series databases presented by stavros...
TRANSCRIPT
A Query Adaptive Data Structure for Efficient Indexing of Time Series
Databases
Presented by Stavros Papadopoulos
Time Series and Similarity Search
Time SeriesA sequence (ordered collection)
of real values. X=x1,x2,..,xn where n can be very large
SimilarityGiven a query sequence q, a database
S of N sequences S1, S2, ..., SN, a distance measure D and a tolerancethreshold ε, two time series are regarded as similar within range ε when D(x, y) ≤ ε
Time Series and Similarity Search
Whole matchingGiven a collection of N data sequences of real numbers S1, S2 , …, SN and a query sequence Q, we want to find those data sequences that are within distance ε from Q. The data and query sequences must have the same length
Subsequence matchingGiven N data sequences S1, S2,…, SN of arbitrary lengths, a query sequence Q and a tolerance ε, we want to identify the data sequences Si that contain matching subsequences. Report those data sequences, along with the correct offsets within the data sequences that best match the query sequence
Dimensionality Reduction
Instead of sequential scanning the database in order to find similarity, an indexing method is required to reduce the query time
Every time series of length n is regarded as a point in the n-dimensional space
Indexing structures are used: R-Trees and their variants (the R*-tree is the most commonly used data structure)
R-trees scale badly with the increase of dimensionality
To efficiently search time series databases each sequence is represented as a multidimensional vector and dimensionality is reduced to a degree that index structures can be applied efficiently
Dimensionality Reduction
There are the following known dimensionality reduction techniques: Fourier Transform (DFT) Wavelet Transform (DWT) Piecewise Aggregate Approximation (PAA) Singular Value Decomposition (SVD) Chebyshev Polynomials (Cheb) Piecewise Linear Approximation (PLA) Adaptive Piecewise Constant Approximation (APCA) Symbolic Aggregate Approximation (SAX)
Dimensionality Reduction using DFT
Let X=(x1,x2,…,xn) be a time series
We take the Fourier transform
We keep only the first fc coefficients
),...,,()( 21 nXXXXXFFT
Dimensionality Reduction using DFT
Motivation behind using DFT: In most applications the data does not exhibit
rapid changes (e.g. stock data) Therefore, the energy of a time series vector
is concentrated in the lower frequencies. The first DFT coefficients contain this
information for the lower frequencies The reconstruction error using only a few
coefficients is not big
Dimensionality Reduction using DFT
Dimensionality ReductionTechniques
0 20 40 60 80 100 120 140
0
1
2
3
X
X'
4
5
6
7
8
9
0 20 40 60 80 100 120 140
X
X'
0 20 40 60 80 100 120 140
X
X'
DFT DWT SVD
Dimensionality Reduction Techniques
0 20 40 60 80 100 120 140
X
X'
0 20 40 60 80 100 120
140
X
X'
0 20 40 60 80 100 120 140
X
X'
Cheb PLA PAA
0 20 40 60 80 100 120 140
X
X'
aaaaaabbbbbccccccbbccccdddddddd
dcba
0 20 40 60 80 100 120 140
X
X
Dimensionality Reduction Techniques
APCA SAX
Dimensionality Reduction
Basically, dimensionality reduction is a technique for approximating the original sequence of size n by another sequence of much smaller length
Any dimensionality reduction technique potentially suffers from two problems:
False alarmsOccur when objects that appear to be close in the index space are actually distant. False alarms are removed in a post-processing stage.
False dismissalsOccur when qualifying objects are missed because they appear distant in the index space. They cannot be tolerated by the system.
The aforementioned techniques guarantee no false dismissals
False alarms occur because every transform of a n-dimensional point to a point in a space of reduced dimensionality approximates the true location of the transformed point.
CPU time is highly dependent on the implementation.
A more accurate and objective measure of the effectiveness of a dimensionality reduction technique is the pruning power
Evaluating the different techniques
Subsequence similarity searchFRM
Allows similarity matching between time series of variable size
Predefines a window length ω It divides every time series in the database into n-ω
sliding windows It indexes all these subsequences with pointers that point
to the original sequence they belong to It divides the query sequence into n/ω disjoint windows Then they search for all the query subsequences the
similar subsequences in the index The candidate set is furthermore examined in order to
remove any false alarms
Subsequence similarity searchFRM
Disadvantage:When a MBR is found, we include all the points in this MBR in the candidate set
Subsequence similarity searchDUAL MATCH
It is the dual of FRM Predefines a window length ω It divides every time series in the database into n/ω
disjoint windows It indexes all these subsequences with pointers that point
to the original sequence they belong to It divides the query sequence into n-ω sliding windows Then they search for all the query subsequences the
similar subsequences in the index The candidate set is furthermore examined in order to
remove any false alarms
It is shown by the authors’ experiments that DUAL MATCH outperforms FRM
Dimensionality reduction and query time
Two factors that affect similarity search queries:
Index searchThe larger the preselected dimensionality, the larger the search time in the R*-tree
False alarmsThe smaller the preselected dimensionality, the lower the accuracy and the larger the number of false alarms. Note that a false alarm is expensive since we have to fetch the ‘false’ time series from the database and sequentially compare it with the query sequence.
Question ?
Is there an optimal dimensionality and if there is how do we select it?
The authors’ point of view The problem of finding the optimal combination between
accuracy and index search time has not been addressed yet
The authors that proposed the various dimensionality reduction techniques conduct experiments and test the pruning power and CPU time when using different dimensionalities
They empirically find the optimal dimensionality counting the number of page accesses, after they issue all the queries
But what do we do when we have an application that supports online queries or when we do not have all the queries in advance?
A dynamic solution must be found!!
A naïve solution We keep multiple R*-trees with different dimensionalities, e.g. 2,
4, 6, …, 16
Each one indexes the whole database
We adopt a heuristic function to count the page accesses
We define proper thresholds
We start by using the R*-tree with the lowest dimensionality, e.g. 2-dimensonal
As queries arrive, if the thresholds are exceeded, we change the tree we use, e.g. from the 2-dimensional to the 4-dimensional
A naïve solution (cont.) Advantages
We can dynamically adjust the dimensionality of our index as queries arrive
We don’t have to decide the dimensionality in advance It can work well with online queries
Disadvantages Increased space Suppose that there is only a fraction of the database that
is affected by the queries and result in false alarms, then the whole tree will unnecessarily upgrade its dimensionality
The unaffected fraction of the database will then be indexed with a tree of higher dimensionality than it is actually needed
Convention for the purposes of our discussion
‘Simple’ time seriesA time series that concentrates its energy in its low frequencies
‘Complicated’ time seriesA time series that concentrates its energy in its high frequencies
Observations In a dataset, there could exist simple time series as well as
complicated times series
A single time series can contain simple subsequences as well as complicated subsequences, e.g. there might be some ‘white’ noise in a time series
The query could be simple or complicated
The query defines where we will have the false alarms, since the query is the one that falls in an MBR and causes false alarms there
Conclusions It is necessary for our solution to be query
adaptive, since we search the index according to the queries and the queries are the ones that cause the alarms
Since not all the indexed subsequences are of same complexity, our structure must support different dimensionalities for different portions of the database
This means that our structure should support MBRs of different dimensionalities
Proposed solution We will use a modification of the R*-tree We will also use the DUAL MATCH technique
for similarity matching We do not have any knowledge about our
dataset, not about the nature of the queries The queries can be offline or online Note that for online queries, most of the time,
applications just use an active window, which can be regarded as a sliding window
The structure is being adjusted throughout time, as new queries arrive
The structure The structure starts as a 2-
dimensional R*-tree Every fc-dimensional point has
a pointer to its original time series as well as to its complete transformation vector
Along with every node/MBR, we associate a variable dim that holds the dimensionality of the current node
20 4 6 8 10
2
4
6
8
10
x axis
y axis
b
c
aE3
a b c d e
E1 E2
E3 E4 E5
Root
E1 E2
E3E4
f g h
E5
d
e f
g h
i j
k
l
m
l m
E7
i j k
E6
E6 E7
Minimum Bounding Rectangle (MBR)
window query
dim=2
dim=2
dim=2
dim=2 dim=2 dim=2 dim=2
dim=2
The structure We search the index based on the dimensionality of every node We keep a heuristic function for the false alarms in every leaf MBR
and we define a threshold If the threshold is exceeded, we increase the dimensionality of the
corresponding leaf MBR This is achieved by retrieving more coefficients from the
transformation vector pointed by the pointer
20 4 6 8 10
2
4
6
8
10
x axis
y axis
bc
aE3
a b c d e
E1 E2
E3 E4 E5
Root
E1 E2
E3 E4f g h
E5
d
e f
g h
i j
kl
m
l mE7
i j kE6
E6 E7
Minimum Bounding Rectangle (MBR)
window query
dim=4
dim=2dim=2
dim=2 dim=2 dim=2 dim=2
dim=2
dim=2
20 4 6 8 10
2
4
6
8
10
x axis
y axis
bc
aE3
a b c d e
E1 E2
E3 E4 E5
Root
E1 E2
E3 E4f g h
E5
d
e f
g h
i j
kl
m
l mE7
i j kE6
E6 E7
Minimum Bounding Rectangle (MBR)
window query
dim=4
The structure At some point the dimensionality of all (or most of) the leaf MBRs of
a particular internal MBR may increase their dimensionality At that point we have to increase the dimensionality of the internal
MBR as well this is not trivial (it will be discussed later) This can propagate up to the root
dim=2
dim=2
dim=4 dim=6 dim=2 dim=2
dim=2
The structure At some point we will result in a R*-tree which
has different dimensionality for different subtrees
Benefit The subsequences that were creating the false
alarms in the beginning have now increased their dimensionality and they do not cause alarms any more the index search time has increased
The subsequences that didn’t cause alarms in the beginning haven’t changed their dimensionality and therefore the index search time for these subsequences has remained unchanged
The structure
Worst case There are two possibilities:
The dataset subsequences are all complicated The query sequences are all complicated
In the above cases the whole R*-tree may have a uniform dimensionality over its MBRs
It may even reach the maximum dimensionality (i.e.16 )
Even in the worst case we have the contribution that we can dynamically define the optimal dimensionality for our R*-tree
Concerns How do we merge subtrees when the dimensionality of their children MBRs has
increased?
Updates. Insertions, Deletions
What will the percentage of the children MBRs that should decide when to upgrade the dimensionality of their parent MBR be?
What will the heuristic function of false alarms be?
How about increasing the dimensionality from 2 to 6 instead of increasing it from 2 to 4?
There should be a special handling for PLA, SAX, PAA and APCA
We should be careful with merges and splits according to the fan out, since when we increase the dimensionality we increase the space as well
How do we decide if we want to decrease the dimensionality?
If we prove that we outperform the original R*-tree for all the dimensionality reduction techniques then we have a good contribution (the experimental section can be large)
Thank You !
For those that are interested, I have an extended bibliography for the issues covered