=Paper= {{Paper |id=Vol-2399/paper04 |storemode=property |title=Effective and Efficient Variable-Length Data Series Analytics |pdfUrl=https://ceur-ws.org/Vol-2399/paper04.pdf |volume=Vol-2399 |authors=Michele Linardi |dblpUrl=https://dblp.org/rec/conf/vldb/Linardi19 }} ==Effective and Efficient Variable-Length Data Series Analytics== https://ceur-ws.org/Vol-2399/paper04.pdf
                                                                                                              ceur-ws.org/Vol-2399/paper04.pdf




                                  Effective and Efficient
                          Variable-Length Data Series Analytics

                                                       Michele Linardi
                                               Supervised by: Themis Palpanas
                                                       LIPADE, Université de Paris
                                               michele.linardi@parisdescartes.fr

ABSTRACT                                                                 require that this length is chosen at index construction [10,
In the last twenty years, data series similarity search has              24, 1, 25, 28, 29, 26, 21, 11]. The same observation holds
emerged as a fundamental operation at the core of several                for techniques proposed to discover motifs [12] and discords
analysis tasks and applications related to data series collec-           (i.e., anomalous subsequences) [27]: they all assume a fixed
tions. Many solutions to di↵erent mining problems work by                sequence length, which has to be predefined.
means of similarity search. In this regard, all the proposed                Evidently, this is a constraint that penalizes the flexibility
solutions require the prior knowledge of the series length on            needed by analysts, who often times need to analyze pat-
which similarity search is performed. In several cases, the              terns of slightly di↵erent lengths (within a given data series
choice of the length is critical and sensibly influences the             collection) [7, 8, 5, 17, 16]. For example, in the SENTINEL-
quality of the expected outcome. Unfortunately, the obvi-                2 mission data, oceanographers are interested in searching
ous brute-force solution, which provides an outcome for all              for similar coral bleaching patterns2 of di↵erent lengths; at
lengths within a given range is computationally untenable.               Airbus3 engineers need to perform similarity search queries
In this Ph.D. work, we present the first solutions that inher-           for patterns of variable length when studying aircraft take-
ently support scalable and variable-length similarity search             o↵s and landings [19]; and in neuroscience, analysts need to
in data series, applied to sequence/subsequences matching,               search in Electroencephalogram (EEG) recordings for Cyclic
motif and discord discovery problems. The experimental re-               Alternating Patterns (CAP) of di↵erent lengths (duration),
sults show that our approaches are up to orders of magnitude             in order to get insights about brain activity during sleep [22].
faster than the alternatives. They also demonstrate that we                 In our work, we focus on three core problems that are
can remove the unrealistic constraint of performing analyt-              based on similarity search: subsequence matching, and mo-
ics using a predefined length, leading to more intuitive and             tif and discord discovery, organized under the ULISSE and
actionable results, which would have otherwise been missed.              MAD methods:
                                                                            1. ULISSE (ULtra compact Index for variable-length Sim-
                                                                         ilarity SEarch in data series) is the first indexing technique
1.    INTRODUCTION                                                       that supports variable-length subsequence matching for non
   Data series (i.e., ordered sequences of points) are one               Z-normalized and Z-normalized data series [15, 13, 14].
of the most common data types1 , present in almost ev-                      2. MAD (Motif and Discord discovery framework) im-
ery scientific and social domain (such as meteorology, as-               plements two novel algorithms for variable-length motif and
tronomy, chemistry, medicine, neuroscience, finance, agri-               discord discovery in large data series [17, 4, 16].
culture, entomology, sociology, smart cities, marketing, op-
eration health monitoring, human action recognition and                  2. VARIABLE-LENGTH ANALYTICS
others) [20].
                                                                            In this section, we describe our proposed approaches to
   Once the data series have been collected, the domain ex-
                                                                         the aforementioned problems. In the next part we describe
perts face the arduous tasks of processing and analyzing
                                                                         the notions and the elements used in our solutions.
them [30, 6] in order to gain insights, e.g., by identifying sim-
                                                                         Preliminaries. Let a data series D = d1 ,...,d|D| be a se-
ilar patterns, and performing classification, or clustering. A
                                                                         quence of numbers di 2 R, where i 2 N represents the posi-
core operation that is part of all these analysis tasks is sim-
                                                                         tion in D. We denote the length, or size of the data series
ilarity search, which has attracted lots of attention because
                                                                         D with |D|. The subsequence Ds,` =ds ,...,ds+` 1 of length
of its importance [2]. Nevertheless, all existing scalable and
                                                                         `, is a contiguous subset of ` points of D starting at o↵set s,
index-based similarity search techniques are restricted in
                                                                         where 1  s  |D| and 1  `  |D|. A subsequence is itself
that they only support queries of a fixed length, and they
                                                                         a data series. A data series collection, C, is a set of data
1
  If the dimension that imposes the ordering of the sequence             series. We say that a data series D is Z-normalized, denoted
is time then we talk about time series. Though, a series                 Dn , when its mean µ is 0 and its standard deviation is
can also be defined over other measures (e.g., angle in radial           1. Z-normalization is an essential operation in several ap-
profiles in astronomy, mass in mass spectroscopy in physics,             plications, because it allows similarity search irrespective of
position in genome sequences in biology, etc.). We use the               shifting and scaling [5]. The Piecewise Aggregate Approx-
terms data series, time series, and sequence interchangeably.
                                                                         imation (PAA) of a data series D, P AA(D) = {p1 , ..., pw },
Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los        2
Angeles, California. Copyright (C) 2019 for this paper by its authors.       http://www.esa.int/Our_Activities/Observing_the_Earth
                                                                         3
Copying permitted for private and academic purposes.                         http://www.airbus.com/
represents D in a w-dimensional space by means of w real-                                                     γ                                       140




                                                                        CPU + disk I/O (Secs )
                                                                        Avg Exact Query Time
                                                                                                                   0%       20%
                                                                                                 200                                                  120            query answering disk i\o
                                                                                                                   40%      60%
valued segments of length s, where the value of each segment




                                                                                                                                   Cumulative Query
                                                                                                 150               80%      100%                      100            query answering cpu




                                                                                                                                     Time (hours)
                                                                                                                                                       80
is the mean of the corresponding values of D [9]. We denote                                      100                                                   60
                                                                                                                                                                     Indexing (disk i\o + cpu time)
the first k dimensions of P AA(D), (k  w), as P AA(D)1,..,k .                                    50
                                                                                                                                                       40
                                                                                                                                                       20
   The iSAX representation of a data series D, denoted by                                                                                               0
                                                                                                   0
iSAX(D, w, |alphabet|), is the representation of P AA(D) by                                            160   192      224   256
                                                                        (a)                                                                 (b)
w discrete coefficients, drawn from an alphabet of cardinal-                                                 Query length                                   γ = (% of (lmax - lmin))
ity |alphabet| [24]. The main idea of the iSAX represen-
tation, is that the real-value space may be segmented by                Figure 1: Query answering time performance, vary-
|alphabet| 1 breakpoints in |alphabet| regions, which are               ing on non Z-normalized data series. (a) ULISSE
labeled by discrete symbols (e.g., with |alphabet| = 4 the              average query time (CPU + disk I/O). (b) ULISSE
available labels may be {00, 01, 10, 11}).                              average query disk I/O time. (b) Comparison of
                                                                        ULISSE to other techniques (cumulative indexing
2.1    Subsequence Matching                                             + query answering time).
   The subsequence matching problem is defined as follows:              Envelope uEN V represents the principal building block of
   Given a data series collection C = {D1 , ..., DC }, a se-            the ULISSE Index. In details, ULISSE is a tree structure,
ries length range [`min , `max ], a query data series Q, where          where each internal node stores the Envelope uEN V repre-
`min  |Q|  `max , and k 2 N, we want to find the                      senting all the sequences in the subtree rooted at that node.
                    i
set R = {Do,`         |Di 2 C ^ ` = |Q| ^ (` + o 1)  |Di |},           Leaf nodes contain several Envelopes, which by construction
                                                            0
where |R| = k. We require that 8Do,`          i
                                                  2 R @Doi 0 ,`0 s.t.   have the same iSAX(L). On the contrary, their iSAX(U )
         0                                                              varies, since it get updated with every new insertion in the
dist(Doi 0 ,`0 , Q) < dist(Do,`
                             i
                                , Q), where `0 = |Q|, (`0 +o0 1) 
                                                                        node. Each Envelope in leaf nodes point the the represented
    i0           i0
|D | and D 2 C. We informally call R, the k nearest neigh-              sequences in the original data series collection.
bors set of Q. Given two generic series of the same length,             Approximate Subsequence Matching. Subsequence
namely D and D0 the function dist(D, D0 ) can be Euclidean              matching performed on ULISSE index relies on the
Distance or Dynamic Time Warping.                                       mindistU LiSSE () lower bounding function to prune the
Variable Length Subsequences. In a data series, when                    search space. This allows to navigate the tree in order, vis-
we consider contiguous and overlapping subsequences of dif-             iting first the most promising nodes. As soon as a leaf node
ferent lengths within the range [`min , `max ],we expect the            is discovered, we can load the raw data series pointed by
outcome as a bunch of similar series, whose di↵erences are              the Envelopes in the leaf. Each time we compute the true
a↵ected by the misalignment and the di↵erent number of                  Euclidean or DTW distance between the series in a leaf, the
points. Given a data series D, and a subsequence length                 best-so-far distance (bsf) is updated, along with a vector
range [`min , `max ], we define the master series as the sub-           containing the k best matches, where k refers to the k near-
sequences of the form Di,min(|D| i+1,`max ) , for each i such           est neighbors. Since priority is given to the most promising
that 1  i  |D| (`min 1), where 1  `min  `max  |D|.                 nodes, we can terminate our visit, when at the end of a leaf
   We observe that for any master series of the form Di,`0 , we         visit the k bsf’s have not improved.
have that P AA(Di,`0 )1,..,k = P AA(Di,`00 )1,..,k holds for each       Exact Subsequence Matching. Note that the approxi-
`00 such that `00 `min , `00  `0  `max and `0 , `00 %k = 0.           mate search described above may not visit leaves that con-
   Therefore, by computing only the P AA of the master se-              tain answers better than the approximate answers already
ries in D, we are able to represent the P AA prefix of any sub-         identified, and therefore, it will fail to produce exact, correct
sequence of D. When we zero-align the P AA summaries of                 results.
the master series, we compute the minimum and maximum                      The exact nearest neighbor search algorithm we propose
P AA values (over all the subsequences) for each segment:               finds the k sequences with the absolute smallest distances
this forms what we call an Envelope. (When the length of a              to the query. In this case, the search algorithm may visit
master series is not a multiple of the P AA segment length,             several leaves: the process stops after it has either visited,
we compute the P AA coefficients of the longest prefix that is          or pruned (when the lower bounding distance to the node is
multiple of a segment.) We call containment area the space              greater than the bsf) all the nodes of the index, guaranteeing
in between the segments that define the Envelope.                       the correctness of the results.
PAA Envelope. We formalize the concept of the Envelope,                 Experiments. To evaluate ULISSE, we used synthetic and
introducing a new series representation. We denote by L and             real data (but in the interest of space we only report results
U the P AA coefficients, which delimit the lower and upper              with the synthetic data). We record the average CPU time,
parts, respectively, of a containment area. Furthermore, we             disk I/O (time to fetch data from disk (Total time - CPU
introduce a parameter , which permits to select the number              time)), for 100 queries, extracted from the datasets with the
of master series we represent by the Envelope. We refer to it           addition of Gaussian noise. We compare ULISSE with UCR
using the following signature: paaEN V[D,`min ,`max ,a, ,s] =           suite [5] the non index-based state-of-the-art technique for
[L, U ]. It delimits the containment area generated by the              answering similarity search queries. Concerning the com-
P AA coefficients of the master series.                                 petitor indexing techniques, the state-of-the-art is the Com-
Indexing the Envelopes.                 Given a paaEN V , we            pact Multi Resolution Index [7] CMRI.
can translate its P AA extremes into the correspond-                       In Figure 1, we present results for subsequence matching
ing iSAX representation: uEN VpaaEN V[D,`min ,`max ,a, ,s] =            queries on ULISSE when we vary , ranging from 0 to its
[iSAX(L), iSAX(U )], where iSAX(L) (iSAX(U )) is the                    maximum value in this dataset, i.e., `max `min . In Fig-
vector of the minimum (maximum) P AA coefficients of all                ure 1, we report the results concerning non Z-normalized se-
the segments corresponding to the subsequences of D. The                ries. We observe that grouping contiguous and overlapping
subsequences under the same summarization (Envelope) by                lengths. Hence, given a data series D, we compute the Ma-
increasing , a↵ects positively the performance of index con-           trix profile using the smallest subsequence length, namely
struction, as well as query answering.                                 `min , within a specified input range [`min , `max ]. The key
   The latter may seem counterintuitive, since inserting more          idea of our approach is to minimize the work that needs
master series into a single Envelope is likely to generate large       to be done for succeeding subsequence lengths (`min + 1,
containment areas, which are not tight representations of the          `min + 2, . . ., `max ).
data series. On the other hand, it leads to an overall number          Matrix Profile Computation. We start the computation
of Envelope that is several orders of magnitude smaller than           of the Matrix profile, considering all the contiguous subse-
the one for     = 0%, where only a single master series is             quences of length `min , computing for each one the Distance
represented by each Envelope.                                          profile in O(|D|) time. This latter is a vector that contains
                                                                       the z-normalized Euclidean distances between a fixed sub-
2.2    Motif and Discord Discovery                                     sequence and all the other in D (excluding trivial matches).
  Motif and Discord are data mining primitives that rep-               Lower Bound Subsequences of Di↵erent Length. We
resent frequent and rare (anomalous) patterns, respectively.           moreover introduce a new lower bounding distance [17],
Given a data series D, they are defined as follows:                    which lower bounds (is always smaller than) the true Eu-
                                                                       clidean distances between subsequences longer than `min .
   • Data series motif: Da,` and Db,` is a motif pair i↵
                                                                       We initially compute this lower bound using the true Eu-
     dist(Da,` , Db,` )  dist(Di,` , Dj,` ) 8i, j 2 [1, 2, ..., |D|
                                                                       clidean distances computation of subsequences with length
     ` + 1], where a 6= b and i 6= j, and dist is a function
                                                                       `min . For the larger lengths, we update the lower bound,
     that computes the z-normalized Euclidean distance be-
                                                                       considering only the variation generated by the trailing
     tween the input subsequences.
                                                                       points in the longer subsequences. This measure enjoys an
   • Data series discord: We call the k subsequences of                important property: if we rank the subsequences according
     D, with the k largest distances to their mth Nearest              to this measure, the same rank will be preserved along all the
     Neighbor (according the Euclidean distance), the Top-             lower bound updates for the subsequences of greater length.
     k mth discords.                                                   We exploit this property, in order to prune computations.
                                                                       Pruning the Search Space. Once we compute motif and
Variable length motif and discord discovery. We pro-                   discords, with length greater than `min , instead of comput-
vide solutions to the following problems:                              ing from scratch each distance profile, we update the true
   • Variable-Length Motif Discovery: Given a data series              distances (in constant time) of the subsequences that have
     D and a subsequence length-range [`min , ..., `max ], we          the p smallest lower bounding distances (computed in the
     want to find the data series motif pairs of all lengths           previous step). These distances form what we call partial
     in [`min , ..., `max ], occurring in D.                           distance profile. In each partial distance profile, we also up-
                                                                       date the lower bound. After this operation, we may have two
   • Variable-Length Top-k mth Discord Discovery: Given                cases: if in a new computed distance profile the minimum
     a data series D, a subsequence length-range                       true distance (minDist) is shorter than the maximum lower
     [`min , ..., `max ] and the parameters a, b 2 N+ we want          bound (maxLB ), we know that no distances among those
     to enumerate the Top-k mth discords for each k 2                  not computed can be smaller than minDist. In this case, a
     {1, .., a} and each m 2 {1, .., b}, and for all lengths           partial distance profile becomes a valid distance profile. On
     in [`min , ..., `max ], occurring in D.                           the other hand, when maxLB is smaller than minDist, this
                                                                       latter is not guaranteed to be the nearest neighbor distance.
Fixed length motif and discord discovery. The state-
                                                                       For discord discovery, we need to test this condition for the
of-the art algorithm for fixed length motif and discord dis-
                                                                       m smallest true distances in the partial distance profile. In
covery [3] requires the user to define the length of the de-
                                                                       this case a valid (partial) distance profile must contain the
sired motif or discord. This mining operation is supported
                                                                       true mth best match distances, which are smaller than, or
by computation of the Matrix profile, which is a meta data
                                                                       equal to maxLB.
series storing the z-normalized Euclidean distance between
                                                                       Exact Motif and Discord Discovery. Once the partial
each subsequence and its nearest neighbor. The Matrix pro-
                                                                       distance profiles are computed, we pick the absolute smallest
file does not only derive the motif, but also ranks and filters
                                                                       lower bounding value from all the non-valid distance profiles,
out the other pairs, giving also a convenient and graphical
                                                                       namely minLBAbs (if any). Therefore, the global minimum
representation of their occurrences and proximity. Unfortu-
                                                                       (true) distance of all the valid (partial) distance profiles,
nately, this technique comes with an important shortcoming:
                                                                       which is smaller than minLBAbs is guaranteed to be the
it does not provide an e↵ective solution for trying several dif-
                                                                       distance between the motif pair subsequences. Symmetri-
ferent motif lengths. Therefore, the analyst is forced to run
                                                                       cally, we consider the valid (partial) distance profiles to find
the algorithm using all possible lengths in a range of interest,
                                                                       the true mth best match distances, which are the greatest
and rank the various motifs discovered, picking eventually
                                                                       nearest neighbor distances that are larger than maxLBAbs.
the patterns that contain the desired insight. Clearly, this
                                                                       This latter is the largest lower bounding distance of the non-
possibility is not optimal for at least two reasons: the scal-
                                                                       valid distance profiles.
ability, since finding motif of one fixed length takes O(|D|2 )
                                                                          In the motif discovery task, if no nearest neighbor dis-
time, and also because it does not provide an e↵ective way
                                                                       tance is smaller than minLBAbs, we recompute only the
to compare motifs of di↵erent lengths.
                                                                       distance profiles that have the maxLB distance smaller than
MAD Framework. Our framework for Variable Length
                                                                       the smallest true distance computed.
Motif and Discord Discovery (MAD) works by applying an
                                                                          On the other hand, for discord discovery, if no true near-
incremental computing strategy, which aims to prune unnec-
                                                                       est neighbor distances are found we need to iterate the non
essary distance computations for larger motif and discord
               ECG                                                              ASTRO                                                      Z-normalized and non Z-normalized sequences [15, 13, 14],
               30
               25                                                                                                                          while MAD is the first framework that implements variable-
               20
Time (Hours)




               15                                                                                                                          length motif and discord discovery [17, 4, 16].
               10
                5
                0
                                   100     150      200        400      600          100      150       200        400        600          References
                                         Subsequence length range                            Subsequence length range                       [1] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. isax 2.0:
                                                                                                            Time out after 24h                  Indexing and mining one billion time series. In ICDM 2010,
                                                                                                                                                2010.
    Figure 2: Time over motif length ranges (default                                                                                        [2] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim.
    `min =1024, data series length= 0.5M.                                                                                                       The lernaean hydra of data series similarity search: An experi-
                                                                                                                                                mental evaluation of the state of the art. PVLDB, 12(2), 2018.
                                                                                                                                            [3] C. M. Y. et al. Matrix profile I: all pairs similarity joins for
                                             DAD        MAD (Discord discovery)            GrammarViz          Time out after 48h               time series: A unifying view that includes motifs, discords and
                              60
                              50                                                     60                                                         shapelets. In ICDM, 2016.
               Time (Hours)




                              40                                                     50
                              30
                                                                                     40                                                     [4] M. L. et al. Matrix profile goes mad: Variable-length motif and
                                                                                     30                                                         discord discovery in data series. In Under Submission 2019.
                              20                                                     20
                              10                                                     10                                                     [5] T. R. et al. Searching and mining trillions of time series subse-
                               0                                                      0
                                     1 3 5 7 10 1 3 5 7 10   1 3 5 7 10 1 3 5 7 10         1 3 5 7 10 1 3 5 7 10   1 3 5 7 10 1 3 5 7 10        quences under dynamic time warping. In SIGKDD, 2012.
                              (a)           EMG                    ASTRO             (b)          EMG                   ASTRO
                                                                                                    k (Top-k 1st discords)                  [6] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. Pro-
                                              m (Top-1 mth discords)
                                                     Dataset                                               Dataset                              gressive similarity search on time series data. In BigVis, in
                                                                                                                                                conjunction with EDBT/ICDT, 2019.
    Figure 3: (a) T op 1 mth discords discovery, and (b)                                                                                    [7] S. Kadiyala and N. Shiri. A compact multi-resolution index for
    T op k 1st discords discovery time performance.                                                                                             variable length queries in time series databases. KAIS, 2008.
                                                                                                                                            [8] T. Kahveci and A. Singh. Variable length queries for time series
    valid (partial) distance profiles, which contain the maxLB                                                                                  data. In ICDEF, 2001.
    distance greater than the largest mth best match distance.                                                                              [9] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimen-
                                                                                                                                                sionality reduction for fast similarity search in large time series
       We keep extracting in this manner the motif and the dis-                                                                                 databases. KAIS, 3, 2000.
    cord subsequences of each length, until `max .                                                                                         [10] E. J. Keogh, T. Palpanas, V. B. Zordan, D. Gunopulos, and
    Motif Discovery Experimental Evaluation. To bench-                                                                                          M. Cardle. Indexing large human-motion databases. In VLDB,
                                                                                                                                                2004.
    mark the MAD framework, we used several di↵erent                                                                                       [11] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas.
    real datasets. Concerning the motif discovery problem,                                                                                      Coconut: A scalable bottom-up approach for building data series
    the competitors we considered are: QUICKMOTIF [12],                                                                                         indexes. In PVLDB, 2018.
                                                                                                                                           [12] Y. Li, L. H. U, M. L. Yiu, and Z. Gong. Quick-motif: An efficient
    STOMP [3], and MOEN [18]. We report in Figure 2 a sam-                                                                                      and scalable framework for exact motif discovery ICDE, 2015.
    ple of the experiments we conducted (detailed experimen-                                                                               [13] M. Linardi and T. Palpanas. Scalable data series subsequence
    tal results on several datasets are reported elsewhere [17]).                                                                               matching with ulisse. In Under Submission 2019.
                                                                                                                                           [14] M. Linardi and T. Palpanas. ULISSE: ULtra compact Index
    Here, we show the results of MAD, which finds motifs in                                                                                     for Variable-Length Similarity SEarch in Data Series. In ICDE
    di↵erent real datasets. In the plots, we report the total                                                                                   2018.
    execution time varying motif length ranges. From this ex-                                                                              [15] M. Linardi and T. Palpanas. Scalable, variable-length simi-
    periment, we observe that VALMOD maintains a good and                                                                                       larity search in data series: The ULISSE approach. PVLDB,
                                                                                                                                                11(13):2236–2248, 2018.
    stable performance across datasets and parameter settings,                                                                             [16] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. VALMOD:
    quickly producing results, even in cases where the competi-                                                                                 A suite for easy and exact detection of variable length motifs in
    tors do not terminate within a reasonable amount of time.                                                                                   data series. In SIGMOD Conference 2018.
                                                                                                                                           [17] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. Matrix profile
    Discord Discovery Experimental Evaluation. We                                                                                               X: Valmod - scalable discovery of variable-length motifs in data
    identify two state-of-the-art competitors to compare to our                                                                                 series. In SIGMOD, 2018.
    approach, the Motif And Discord (MAD) framework. The                                                                                   [18] A. Mueen and N. Chavoshi. Enumeration of time series motifs
                                                                                                                                                of all lengths. Knowl. Inf. Syst., 2015.
    first one, DAD (Disk aware discord discovery) [27], imple-                                                                             [19] A. G. H. of Operational Intelligence Department Airbus. Per-
    ments an algorithm suitable to enumerate the fixed-length                                                                                   sonal communication., 2017.
    T op 1 mth discords. The second approach, Grammar-                                                                                     [20] T. Palpanas. Data series management: The road to big sequence
                                                                                                                                                analytics. SIGMOD Rec., 2015.
    Viz [23], is the most recent technique, which discovers Top-k                                                                          [21] B. Peng, P. Fatourou, and T. Palpanas. Paris: The next destina-
    1st discords. In Figure 3.(a) we report the results of T op 1                                                                               tion for fast data series indexing and query answering. In IEEE
    mth discord discovery, varying m. We note that MAD grace-                                                                                   Big Data, 2018.
                                                                                                                                           [22] A. Rosa, L. Parrino, and M. Terzano. Automatic detection of
    fully scales over the number of discords to enumerate and                                                                                   cyclic alternating pattern (cap) sequences in sleep: preliminary
    is up to one order of magnitude faster than DAD. In Fig-                                                                                    results. Clinical Neurophysiology, 1999.
    ure 3.(b), we show the result of T op k 1st discords dis-                                                                              [23] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedi-
    covery. Once again, MAD scales better over the number                                                                                       hardjo, C. Chen, and S. Frankenstein. Time series anomaly dis-
                                                                                                                                                covery with grammar-based compression. In EDBT, 2015.
    of discovered discords, as its execution time remains almost                                                                           [24] J. Shieh and E. J. Keogh. isax: indexing and mining terabyte
    constant. A di↵erent trend is observed for GrammarViz,                                                                                      sized time series. In KDD, 2008.
    whose performance significantly deteriorates as k increases.                                                                           [25] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-
                                                                                                                                                adaptive and dynamic segmentation index for whole matching
                                                                                                                                                on time series. PVLDB, 2013.
    3.                         CONCLUSIONS                                                                                                 [26] D. E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas.
                                                                                                                                                Dpisax: Massively distributed partitioned isax. In ICDM, 2017.
       Even though much e↵ort has been dedicated for develop-                                                                              [27] D. Yankov, E. J. Keogh, and U. Rebbapragada. Disk aware
    ping techniques for data series analytics, existing solutions                                                                               discord discovery: finding unusual time series in terabyte sized
                                                                                                                                                datasets. Knowl. Inf. Syst., 2008.
    for subsequence matching, motif and discord discovery are                                                                              [28] K. Zoumpatianos, S. Idreos, and T. Palpanas. RINSE: interac-
    limited to fixed length queries/results. In this Ph.D. work,                                                                                tive data series exploration with ADS+. PVLDB, 2015.
    we propose the first scalable solutions to the variable-length                                                                         [29] K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive
                                                                                                                                                data series index. VLDB J., 2016.
    version of these problems: ULISSE is the first index that                                                                              [30] K. Zoumpatianos and T. Palpanas. Data series management:
    supports variable-length subsequence matching over both                                                                                     Fulfilling the need for big sequence analytics. In ICDE, 2018.