=Paper=
{{Paper
|id=Vol-2399/paper04
|storemode=property
|title=Effective and Efficient Variable-Length Data Series Analytics
|pdfUrl=https://ceur-ws.org/Vol-2399/paper04.pdf
|volume=Vol-2399
|authors=Michele Linardi
|dblpUrl=https://dblp.org/rec/conf/vldb/Linardi19
}}
==Effective and Efficient Variable-Length Data Series Analytics==
ceur-ws.org/Vol-2399/paper04.pdf
Effective and Efficient
Variable-Length Data Series Analytics
Michele Linardi
Supervised by: Themis Palpanas
LIPADE, Université de Paris
michele.linardi@parisdescartes.fr
ABSTRACT require that this length is chosen at index construction [10,
In the last twenty years, data series similarity search has 24, 1, 25, 28, 29, 26, 21, 11]. The same observation holds
emerged as a fundamental operation at the core of several for techniques proposed to discover motifs [12] and discords
analysis tasks and applications related to data series collec- (i.e., anomalous subsequences) [27]: they all assume a fixed
tions. Many solutions to di↵erent mining problems work by sequence length, which has to be predefined.
means of similarity search. In this regard, all the proposed Evidently, this is a constraint that penalizes the flexibility
solutions require the prior knowledge of the series length on needed by analysts, who often times need to analyze pat-
which similarity search is performed. In several cases, the terns of slightly di↵erent lengths (within a given data series
choice of the length is critical and sensibly influences the collection) [7, 8, 5, 17, 16]. For example, in the SENTINEL-
quality of the expected outcome. Unfortunately, the obvi- 2 mission data, oceanographers are interested in searching
ous brute-force solution, which provides an outcome for all for similar coral bleaching patterns2 of di↵erent lengths; at
lengths within a given range is computationally untenable. Airbus3 engineers need to perform similarity search queries
In this Ph.D. work, we present the first solutions that inher- for patterns of variable length when studying aircraft take-
ently support scalable and variable-length similarity search o↵s and landings [19]; and in neuroscience, analysts need to
in data series, applied to sequence/subsequences matching, search in Electroencephalogram (EEG) recordings for Cyclic
motif and discord discovery problems. The experimental re- Alternating Patterns (CAP) of di↵erent lengths (duration),
sults show that our approaches are up to orders of magnitude in order to get insights about brain activity during sleep [22].
faster than the alternatives. They also demonstrate that we In our work, we focus on three core problems that are
can remove the unrealistic constraint of performing analyt- based on similarity search: subsequence matching, and mo-
ics using a predefined length, leading to more intuitive and tif and discord discovery, organized under the ULISSE and
actionable results, which would have otherwise been missed. MAD methods:
1. ULISSE (ULtra compact Index for variable-length Sim-
ilarity SEarch in data series) is the first indexing technique
1. INTRODUCTION that supports variable-length subsequence matching for non
Data series (i.e., ordered sequences of points) are one Z-normalized and Z-normalized data series [15, 13, 14].
of the most common data types1 , present in almost ev- 2. MAD (Motif and Discord discovery framework) im-
ery scientific and social domain (such as meteorology, as- plements two novel algorithms for variable-length motif and
tronomy, chemistry, medicine, neuroscience, finance, agri- discord discovery in large data series [17, 4, 16].
culture, entomology, sociology, smart cities, marketing, op-
eration health monitoring, human action recognition and 2. VARIABLE-LENGTH ANALYTICS
others) [20].
In this section, we describe our proposed approaches to
Once the data series have been collected, the domain ex-
the aforementioned problems. In the next part we describe
perts face the arduous tasks of processing and analyzing
the notions and the elements used in our solutions.
them [30, 6] in order to gain insights, e.g., by identifying sim-
Preliminaries. Let a data series D = d1 ,...,d|D| be a se-
ilar patterns, and performing classification, or clustering. A
quence of numbers di 2 R, where i 2 N represents the posi-
core operation that is part of all these analysis tasks is sim-
tion in D. We denote the length, or size of the data series
ilarity search, which has attracted lots of attention because
D with |D|. The subsequence Ds,` =ds ,...,ds+` 1 of length
of its importance [2]. Nevertheless, all existing scalable and
`, is a contiguous subset of ` points of D starting at o↵set s,
index-based similarity search techniques are restricted in
where 1 s |D| and 1 ` |D|. A subsequence is itself
that they only support queries of a fixed length, and they
a data series. A data series collection, C, is a set of data
1
If the dimension that imposes the ordering of the sequence series. We say that a data series D is Z-normalized, denoted
is time then we talk about time series. Though, a series Dn , when its mean µ is 0 and its standard deviation is
can also be defined over other measures (e.g., angle in radial 1. Z-normalization is an essential operation in several ap-
profiles in astronomy, mass in mass spectroscopy in physics, plications, because it allows similarity search irrespective of
position in genome sequences in biology, etc.). We use the shifting and scaling [5]. The Piecewise Aggregate Approx-
terms data series, time series, and sequence interchangeably.
imation (PAA) of a data series D, P AA(D) = {p1 , ..., pw },
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los 2
Angeles, California. Copyright (C) 2019 for this paper by its authors. http://www.esa.int/Our_Activities/Observing_the_Earth
3
Copying permitted for private and academic purposes. http://www.airbus.com/
represents D in a w-dimensional space by means of w real- γ 140
CPU + disk I/O (Secs )
Avg Exact Query Time
0% 20%
200 120 query answering disk i\o
40% 60%
valued segments of length s, where the value of each segment
Cumulative Query
150 80% 100% 100 query answering cpu
Time (hours)
80
is the mean of the corresponding values of D [9]. We denote 100 60
Indexing (disk i\o + cpu time)
the first k dimensions of P AA(D), (k w), as P AA(D)1,..,k . 50
40
20
The iSAX representation of a data series D, denoted by 0
0
iSAX(D, w, |alphabet|), is the representation of P AA(D) by 160 192 224 256
(a) (b)
w discrete coefficients, drawn from an alphabet of cardinal- Query length γ = (% of (lmax - lmin))
ity |alphabet| [24]. The main idea of the iSAX represen-
tation, is that the real-value space may be segmented by Figure 1: Query answering time performance, vary-
|alphabet| 1 breakpoints in |alphabet| regions, which are ing on non Z-normalized data series. (a) ULISSE
labeled by discrete symbols (e.g., with |alphabet| = 4 the average query time (CPU + disk I/O). (b) ULISSE
available labels may be {00, 01, 10, 11}). average query disk I/O time. (b) Comparison of
ULISSE to other techniques (cumulative indexing
2.1 Subsequence Matching + query answering time).
The subsequence matching problem is defined as follows: Envelope uEN V represents the principal building block of
Given a data series collection C = {D1 , ..., DC }, a se- the ULISSE Index. In details, ULISSE is a tree structure,
ries length range [`min , `max ], a query data series Q, where where each internal node stores the Envelope uEN V repre-
`min |Q| `max , and k 2 N, we want to find the senting all the sequences in the subtree rooted at that node.
i
set R = {Do,` |Di 2 C ^ ` = |Q| ^ (` + o 1) |Di |}, Leaf nodes contain several Envelopes, which by construction
0
where |R| = k. We require that 8Do,` i
2 R @Doi 0 ,`0 s.t. have the same iSAX(L). On the contrary, their iSAX(U )
0 varies, since it get updated with every new insertion in the
dist(Doi 0 ,`0 , Q) < dist(Do,`
i
, Q), where `0 = |Q|, (`0 +o0 1)
node. Each Envelope in leaf nodes point the the represented
i0 i0
|D | and D 2 C. We informally call R, the k nearest neigh- sequences in the original data series collection.
bors set of Q. Given two generic series of the same length, Approximate Subsequence Matching. Subsequence
namely D and D0 the function dist(D, D0 ) can be Euclidean matching performed on ULISSE index relies on the
Distance or Dynamic Time Warping. mindistU LiSSE () lower bounding function to prune the
Variable Length Subsequences. In a data series, when search space. This allows to navigate the tree in order, vis-
we consider contiguous and overlapping subsequences of dif- iting first the most promising nodes. As soon as a leaf node
ferent lengths within the range [`min , `max ],we expect the is discovered, we can load the raw data series pointed by
outcome as a bunch of similar series, whose di↵erences are the Envelopes in the leaf. Each time we compute the true
a↵ected by the misalignment and the di↵erent number of Euclidean or DTW distance between the series in a leaf, the
points. Given a data series D, and a subsequence length best-so-far distance (bsf) is updated, along with a vector
range [`min , `max ], we define the master series as the sub- containing the k best matches, where k refers to the k near-
sequences of the form Di,min(|D| i+1,`max ) , for each i such est neighbors. Since priority is given to the most promising
that 1 i |D| (`min 1), where 1 `min `max |D|. nodes, we can terminate our visit, when at the end of a leaf
We observe that for any master series of the form Di,`0 , we visit the k bsf’s have not improved.
have that P AA(Di,`0 )1,..,k = P AA(Di,`00 )1,..,k holds for each Exact Subsequence Matching. Note that the approxi-
`00 such that `00 `min , `00 `0 `max and `0 , `00 %k = 0. mate search described above may not visit leaves that con-
Therefore, by computing only the P AA of the master se- tain answers better than the approximate answers already
ries in D, we are able to represent the P AA prefix of any sub- identified, and therefore, it will fail to produce exact, correct
sequence of D. When we zero-align the P AA summaries of results.
the master series, we compute the minimum and maximum The exact nearest neighbor search algorithm we propose
P AA values (over all the subsequences) for each segment: finds the k sequences with the absolute smallest distances
this forms what we call an Envelope. (When the length of a to the query. In this case, the search algorithm may visit
master series is not a multiple of the P AA segment length, several leaves: the process stops after it has either visited,
we compute the P AA coefficients of the longest prefix that is or pruned (when the lower bounding distance to the node is
multiple of a segment.) We call containment area the space greater than the bsf) all the nodes of the index, guaranteeing
in between the segments that define the Envelope. the correctness of the results.
PAA Envelope. We formalize the concept of the Envelope, Experiments. To evaluate ULISSE, we used synthetic and
introducing a new series representation. We denote by L and real data (but in the interest of space we only report results
U the P AA coefficients, which delimit the lower and upper with the synthetic data). We record the average CPU time,
parts, respectively, of a containment area. Furthermore, we disk I/O (time to fetch data from disk (Total time - CPU
introduce a parameter , which permits to select the number time)), for 100 queries, extracted from the datasets with the
of master series we represent by the Envelope. We refer to it addition of Gaussian noise. We compare ULISSE with UCR
using the following signature: paaEN V[D,`min ,`max ,a, ,s] = suite [5] the non index-based state-of-the-art technique for
[L, U ]. It delimits the containment area generated by the answering similarity search queries. Concerning the com-
P AA coefficients of the master series. petitor indexing techniques, the state-of-the-art is the Com-
Indexing the Envelopes. Given a paaEN V , we pact Multi Resolution Index [7] CMRI.
can translate its P AA extremes into the correspond- In Figure 1, we present results for subsequence matching
ing iSAX representation: uEN VpaaEN V[D,`min ,`max ,a, ,s] = queries on ULISSE when we vary , ranging from 0 to its
[iSAX(L), iSAX(U )], where iSAX(L) (iSAX(U )) is the maximum value in this dataset, i.e., `max `min . In Fig-
vector of the minimum (maximum) P AA coefficients of all ure 1, we report the results concerning non Z-normalized se-
the segments corresponding to the subsequences of D. The ries. We observe that grouping contiguous and overlapping
subsequences under the same summarization (Envelope) by lengths. Hence, given a data series D, we compute the Ma-
increasing , a↵ects positively the performance of index con- trix profile using the smallest subsequence length, namely
struction, as well as query answering. `min , within a specified input range [`min , `max ]. The key
The latter may seem counterintuitive, since inserting more idea of our approach is to minimize the work that needs
master series into a single Envelope is likely to generate large to be done for succeeding subsequence lengths (`min + 1,
containment areas, which are not tight representations of the `min + 2, . . ., `max ).
data series. On the other hand, it leads to an overall number Matrix Profile Computation. We start the computation
of Envelope that is several orders of magnitude smaller than of the Matrix profile, considering all the contiguous subse-
the one for = 0%, where only a single master series is quences of length `min , computing for each one the Distance
represented by each Envelope. profile in O(|D|) time. This latter is a vector that contains
the z-normalized Euclidean distances between a fixed sub-
2.2 Motif and Discord Discovery sequence and all the other in D (excluding trivial matches).
Motif and Discord are data mining primitives that rep- Lower Bound Subsequences of Di↵erent Length. We
resent frequent and rare (anomalous) patterns, respectively. moreover introduce a new lower bounding distance [17],
Given a data series D, they are defined as follows: which lower bounds (is always smaller than) the true Eu-
clidean distances between subsequences longer than `min .
• Data series motif: Da,` and Db,` is a motif pair i↵
We initially compute this lower bound using the true Eu-
dist(Da,` , Db,` ) dist(Di,` , Dj,` ) 8i, j 2 [1, 2, ..., |D|
clidean distances computation of subsequences with length
` + 1], where a 6= b and i 6= j, and dist is a function
`min . For the larger lengths, we update the lower bound,
that computes the z-normalized Euclidean distance be-
considering only the variation generated by the trailing
tween the input subsequences.
points in the longer subsequences. This measure enjoys an
• Data series discord: We call the k subsequences of important property: if we rank the subsequences according
D, with the k largest distances to their mth Nearest to this measure, the same rank will be preserved along all the
Neighbor (according the Euclidean distance), the Top- lower bound updates for the subsequences of greater length.
k mth discords. We exploit this property, in order to prune computations.
Pruning the Search Space. Once we compute motif and
Variable length motif and discord discovery. We pro- discords, with length greater than `min , instead of comput-
vide solutions to the following problems: ing from scratch each distance profile, we update the true
• Variable-Length Motif Discovery: Given a data series distances (in constant time) of the subsequences that have
D and a subsequence length-range [`min , ..., `max ], we the p smallest lower bounding distances (computed in the
want to find the data series motif pairs of all lengths previous step). These distances form what we call partial
in [`min , ..., `max ], occurring in D. distance profile. In each partial distance profile, we also up-
date the lower bound. After this operation, we may have two
• Variable-Length Top-k mth Discord Discovery: Given cases: if in a new computed distance profile the minimum
a data series D, a subsequence length-range true distance (minDist) is shorter than the maximum lower
[`min , ..., `max ] and the parameters a, b 2 N+ we want bound (maxLB ), we know that no distances among those
to enumerate the Top-k mth discords for each k 2 not computed can be smaller than minDist. In this case, a
{1, .., a} and each m 2 {1, .., b}, and for all lengths partial distance profile becomes a valid distance profile. On
in [`min , ..., `max ], occurring in D. the other hand, when maxLB is smaller than minDist, this
latter is not guaranteed to be the nearest neighbor distance.
Fixed length motif and discord discovery. The state-
For discord discovery, we need to test this condition for the
of-the art algorithm for fixed length motif and discord dis-
m smallest true distances in the partial distance profile. In
covery [3] requires the user to define the length of the de-
this case a valid (partial) distance profile must contain the
sired motif or discord. This mining operation is supported
true mth best match distances, which are smaller than, or
by computation of the Matrix profile, which is a meta data
equal to maxLB.
series storing the z-normalized Euclidean distance between
Exact Motif and Discord Discovery. Once the partial
each subsequence and its nearest neighbor. The Matrix pro-
distance profiles are computed, we pick the absolute smallest
file does not only derive the motif, but also ranks and filters
lower bounding value from all the non-valid distance profiles,
out the other pairs, giving also a convenient and graphical
namely minLBAbs (if any). Therefore, the global minimum
representation of their occurrences and proximity. Unfortu-
(true) distance of all the valid (partial) distance profiles,
nately, this technique comes with an important shortcoming:
which is smaller than minLBAbs is guaranteed to be the
it does not provide an e↵ective solution for trying several dif-
distance between the motif pair subsequences. Symmetri-
ferent motif lengths. Therefore, the analyst is forced to run
cally, we consider the valid (partial) distance profiles to find
the algorithm using all possible lengths in a range of interest,
the true mth best match distances, which are the greatest
and rank the various motifs discovered, picking eventually
nearest neighbor distances that are larger than maxLBAbs.
the patterns that contain the desired insight. Clearly, this
This latter is the largest lower bounding distance of the non-
possibility is not optimal for at least two reasons: the scal-
valid distance profiles.
ability, since finding motif of one fixed length takes O(|D|2 )
In the motif discovery task, if no nearest neighbor dis-
time, and also because it does not provide an e↵ective way
tance is smaller than minLBAbs, we recompute only the
to compare motifs of di↵erent lengths.
distance profiles that have the maxLB distance smaller than
MAD Framework. Our framework for Variable Length
the smallest true distance computed.
Motif and Discord Discovery (MAD) works by applying an
On the other hand, for discord discovery, if no true near-
incremental computing strategy, which aims to prune unnec-
est neighbor distances are found we need to iterate the non
essary distance computations for larger motif and discord
ECG ASTRO Z-normalized and non Z-normalized sequences [15, 13, 14],
30
25 while MAD is the first framework that implements variable-
20
Time (Hours)
15 length motif and discord discovery [17, 4, 16].
10
5
0
100 150 200 400 600 100 150 200 400 600 References
Subsequence length range Subsequence length range [1] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. isax 2.0:
Time out after 24h Indexing and mining one billion time series. In ICDM 2010,
2010.
Figure 2: Time over motif length ranges (default [2] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim.
`min =1024, data series length= 0.5M. The lernaean hydra of data series similarity search: An experi-
mental evaluation of the state of the art. PVLDB, 12(2), 2018.
[3] C. M. Y. et al. Matrix profile I: all pairs similarity joins for
DAD MAD (Discord discovery) GrammarViz Time out after 48h time series: A unifying view that includes motifs, discords and
60
50 60 shapelets. In ICDM, 2016.
Time (Hours)
40 50
30
40 [4] M. L. et al. Matrix profile goes mad: Variable-length motif and
30 discord discovery in data series. In Under Submission 2019.
20 20
10 10 [5] T. R. et al. Searching and mining trillions of time series subse-
0 0
1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 1 3 5 7 10 quences under dynamic time warping. In SIGKDD, 2012.
(a) EMG ASTRO (b) EMG ASTRO
k (Top-k 1st discords) [6] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. Pro-
m (Top-1 mth discords)
Dataset Dataset gressive similarity search on time series data. In BigVis, in
conjunction with EDBT/ICDT, 2019.
Figure 3: (a) T op 1 mth discords discovery, and (b) [7] S. Kadiyala and N. Shiri. A compact multi-resolution index for
T op k 1st discords discovery time performance. variable length queries in time series databases. KAIS, 2008.
[8] T. Kahveci and A. Singh. Variable length queries for time series
valid (partial) distance profiles, which contain the maxLB data. In ICDEF, 2001.
distance greater than the largest mth best match distance. [9] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimen-
sionality reduction for fast similarity search in large time series
We keep extracting in this manner the motif and the dis- databases. KAIS, 3, 2000.
cord subsequences of each length, until `max . [10] E. J. Keogh, T. Palpanas, V. B. Zordan, D. Gunopulos, and
Motif Discovery Experimental Evaluation. To bench- M. Cardle. Indexing large human-motion databases. In VLDB,
2004.
mark the MAD framework, we used several di↵erent [11] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas.
real datasets. Concerning the motif discovery problem, Coconut: A scalable bottom-up approach for building data series
the competitors we considered are: QUICKMOTIF [12], indexes. In PVLDB, 2018.
[12] Y. Li, L. H. U, M. L. Yiu, and Z. Gong. Quick-motif: An efficient
STOMP [3], and MOEN [18]. We report in Figure 2 a sam- and scalable framework for exact motif discovery ICDE, 2015.
ple of the experiments we conducted (detailed experimen- [13] M. Linardi and T. Palpanas. Scalable data series subsequence
tal results on several datasets are reported elsewhere [17]). matching with ulisse. In Under Submission 2019.
[14] M. Linardi and T. Palpanas. ULISSE: ULtra compact Index
Here, we show the results of MAD, which finds motifs in for Variable-Length Similarity SEarch in Data Series. In ICDE
di↵erent real datasets. In the plots, we report the total 2018.
execution time varying motif length ranges. From this ex- [15] M. Linardi and T. Palpanas. Scalable, variable-length simi-
periment, we observe that VALMOD maintains a good and larity search in data series: The ULISSE approach. PVLDB,
11(13):2236–2248, 2018.
stable performance across datasets and parameter settings, [16] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. VALMOD:
quickly producing results, even in cases where the competi- A suite for easy and exact detection of variable length motifs in
tors do not terminate within a reasonable amount of time. data series. In SIGMOD Conference 2018.
[17] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. Matrix profile
Discord Discovery Experimental Evaluation. We X: Valmod - scalable discovery of variable-length motifs in data
identify two state-of-the-art competitors to compare to our series. In SIGMOD, 2018.
approach, the Motif And Discord (MAD) framework. The [18] A. Mueen and N. Chavoshi. Enumeration of time series motifs
of all lengths. Knowl. Inf. Syst., 2015.
first one, DAD (Disk aware discord discovery) [27], imple- [19] A. G. H. of Operational Intelligence Department Airbus. Per-
ments an algorithm suitable to enumerate the fixed-length sonal communication., 2017.
T op 1 mth discords. The second approach, Grammar- [20] T. Palpanas. Data series management: The road to big sequence
analytics. SIGMOD Rec., 2015.
Viz [23], is the most recent technique, which discovers Top-k [21] B. Peng, P. Fatourou, and T. Palpanas. Paris: The next destina-
1st discords. In Figure 3.(a) we report the results of T op 1 tion for fast data series indexing and query answering. In IEEE
mth discord discovery, varying m. We note that MAD grace- Big Data, 2018.
[22] A. Rosa, L. Parrino, and M. Terzano. Automatic detection of
fully scales over the number of discords to enumerate and cyclic alternating pattern (cap) sequences in sleep: preliminary
is up to one order of magnitude faster than DAD. In Fig- results. Clinical Neurophysiology, 1999.
ure 3.(b), we show the result of T op k 1st discords dis- [23] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedi-
covery. Once again, MAD scales better over the number hardjo, C. Chen, and S. Frankenstein. Time series anomaly dis-
covery with grammar-based compression. In EDBT, 2015.
of discovered discords, as its execution time remains almost [24] J. Shieh and E. J. Keogh. isax: indexing and mining terabyte
constant. A di↵erent trend is observed for GrammarViz, sized time series. In KDD, 2008.
whose performance significantly deteriorates as k increases. [25] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-
adaptive and dynamic segmentation index for whole matching
on time series. PVLDB, 2013.
3. CONCLUSIONS [26] D. E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas.
Dpisax: Massively distributed partitioned isax. In ICDM, 2017.
Even though much e↵ort has been dedicated for develop- [27] D. Yankov, E. J. Keogh, and U. Rebbapragada. Disk aware
ping techniques for data series analytics, existing solutions discord discovery: finding unusual time series in terabyte sized
datasets. Knowl. Inf. Syst., 2008.
for subsequence matching, motif and discord discovery are [28] K. Zoumpatianos, S. Idreos, and T. Palpanas. RINSE: interac-
limited to fixed length queries/results. In this Ph.D. work, tive data series exploration with ADS+. PVLDB, 2015.
we propose the first scalable solutions to the variable-length [29] K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive
data series index. VLDB J., 2016.
version of these problems: ULISSE is the first index that [30] K. Zoumpatianos and T. Palpanas. Data series management:
supports variable-length subsequence matching over both Fulfilling the need for big sequence analytics. In ICDE, 2018.