ceur-ws.org/Vol-2399/paper11.pdf

Truly Scalable Data Series Similarity Search

Karima Echihabi
Supervised by: Prof. Themis Palpanas and Prof. Houda Benbrahim
IRDA, Rabat IT Center, ENSIAS, Mohammed V Univ.
karima.echihabi@gmail.com

ABSTRACT

Data series are widely used in numerous domains. Analyzing this data is important for a variety of real-world applications and has been extensively studied over the past 25 years. At the core of the analysis task lies a classic algorithm called similarity search. A number of approaches have been proposed to support similarity search over massive data series collections. The results of two comprehensive data series experimental evaluations form the foundations of our future work, which will lead to the development of a novel index that can efficiently support both exact and approximate data series similarity search, as well as progressive query answering with bound guarantees. Based on the insights gained from the exhaustive study of the related work, the new index will surpass the state-of-the-art approximate and exact techniques both in performance and accuracy.

1. INTRODUCTION

A data series is a sequence of ordered real values¹. Data series are omnipresent in various domains, from science and engineering to business and medicine [35]. The proliferation of IoT technologies is also heavily contributing to the explosive growth of data series collections, to the order of terabytes [38]. A common subroutine of data series analytical tasks is a classic algorithm called similarity search; therefore, the research community has extensively studied the development of efficient similarity search algorithms for data series. A similarity search algorithm returns the set of candidate data series in a collection that are similar to a given query series. This problem is often reduced to nearest neighbor search, where data series are represented as points in multidimensional space and their (dis)similarity is evaluated using a distance function.

Although a data series can be represented as a vector in high-dimensional space, conventional vector-based approaches are not adapted, for two reasons: (a) they cannot scale to thousands of dimensions; and (b) they do not exploit the correlation among dimensions that is typical for data series.

Similarity search methods can return either exact or approximate answers. Exact methods are costly, while approximate methods offer better efficiency at the expense of losing some accuracy. We call the approximate methods that do not provide any guarantees on the results ng-approximate, and those that provide guarantees on the approximation error δ-ε-approximate, where ε is the approximation error and δ the probability that ε will not be exceeded.

A plethora of similarity search methods have been published by the community, including techniques designed for generic vectors [24, 23, 7, 14, 46, 20, 25, 44, 5, 36] and those specific to data series [3, 41, 27, 42, 47, 34, 33, 40, 15, 9, 43, 45, 10, 49, 30].

In our work, we focus on the problem of whole matching similarity search in collections with a very large number of data series, i.e., similarity search that produces approximate or exact results by calculating distances on the whole (not a sub-) sequence. This is a very popular problem that lies at the core of several other algorithms, and is important for many applications in various domains in the real world [17].

This work aims to propose a novel index that will support progressive query answering with probabilistic guarantees. We describe the related work, succinctly report the results of our extensive experimental evaluation of exact methods [17], and give a glimpse of some very interesting results from an ongoing experimental study focused on approximate methods [18]. Finally, we present our future work directions.
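Concretely, the reduction to nearest neighbor search described above amounts to comparing the query against every candidate series under a distance function. The following minimal brute-force sketch (our own illustration, using euclidean distance; all function and variable names are ours, not from the paper) makes the problem statement precise:

```python
import numpy as np

def knn_search(collection, query, k=1):
    """Brute-force whole-matching k-NN search under euclidean distance.

    collection: (N, n) array of N data series, each of length n
    query:      (n,) array
    Returns the indices of the k most similar series, closest first.
    """
    # Whole matching: distances are computed on entire series, not subsequences.
    dists = np.sqrt(((collection - query) ** 2).sum(axis=1))
    return np.argsort(dists)[:k]

# Tiny example: the second series matches the query exactly.
data = np.array([[0.0, 1.0, 2.0, 3.0],
                 [1.0, 2.0, 3.0, 4.0],
                 [9.0, 9.0, 9.0, 9.0]])
query = np.array([1.0, 2.0, 3.0, 4.0])
print(knn_search(data, query, k=2))  # → [1 0]
```

Exact answers computed this way require touching every series in the collection, which is precisely the cost that the summarization and indexing techniques discussed next are designed to avoid.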
¹The order attribute can be angle, mass, frequency, position, or time [39]. When the order is based on time, the sequence is called a time series. The terms data series, time series, and sequence are used interchangeably.

(Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.)

2. RELATED WORK

Similarity search involves finding the data series in a collection that are similar to a query according to some definition of sameness. A common abstraction is to consider the query and candidate data series as points in a metric space and to evaluate their sameness (or difference) using the euclidean distance.

To develop efficient similarity search algorithms on massive datasets, two major costs need to be optimized: the cost of accessing data on disk (I/O cost) and the cost of comparing the query to candidates (CPU cost). Typically, the first cost is reduced by using summarization techniques that map the high-dimensional data to a lower-dimensional space, while the second cost is optimized with sophisticated data structures and search algorithms. Several similarity search methods have been proposed in the literature, supporting either exact search [23, 7, 46, 20, 27, 42], approximate search [24, 25, 44, 5, 36], or both [14, 43, 45, 10, 49, 47, 30, 34]. In what follows, we provide a succinct description of the state-of-the-art similarity search methods and the summarization techniques on which they are based.

1. Summarization Techniques. The Discrete Fourier Transform (DFT) [19] decomposes a data series into frequency coefficients, a subset of which represents a summarization of the data series.

The Discrete Haar Wavelet Transform (DHWT) [12] transforms a data series using Haar wavelets into a hierarchical representation.

Random projections map the raw high-dimensional data into a lower-dimensional space using a random matrix, while preserving pairwise distances within a distortion threshold [26].

The Piecewise Aggregate Approximation (PAA) [28] and Adaptive Piecewise Constant Approximation (APCA) [11] techniques divide a data series into segments (of equal and arbitrary length, respectively) and approximate each segment with the mean of the points that belong to it. The Extended Adaptive Piecewise Constant Approximation (EAPCA) [45] method enhances APCA by approximating each segment with the standard deviation in addition to the mean.

Symbolic Aggregate Approximation (SAX) [32] first approximates a data series using PAA, then discretizes the PAA values into a compact binary representation.

Symbolic Fourier Approximation (SFA) [43] first transforms a data series into DFT coefficients, which are then approximated using a succinct symbolic representation.

Optimized Product Quantization (OPQ) [21] applies a linear transformation to the data series to decorrelate it, then applies a product quantizer [25] to the result.

2. Exact Similarity Search Methods. Below, we briefly describe algorithms that produce exact results.

The R*-tree [7] is a height-balanced spatial index that organizes data into a hierarchy of nested overlapping rectangles. Search returns all entries in leaves whose rectangle contains the query.

The M-tree [14] is a multidimensional index for metric spaces that partitions the data using hyper-spheres based on their relative distances. The search algorithm uses the triangle inequality to prune data.

The VA+file [20] is an improvement of the VA-file [46]. It creates a filter file containing summarizations of the raw data. Search uses the filter file to prune candidates based on a lower bounding distance. For efficiency reasons, we modified the VA+file to use the DFT transform instead of the Karhunen–Loève transform (KLT).

Stepwise [27] represents the data in a multi-level hierarchical representation using DHWT summarizations. Search transforms a query into DHWT and filters out candidates based on upper and lower bounding distances.

The SFA method [43] builds a trie with SFA summarizations of the data. During query answering, a lower bounding distance is used to prune out candidates.

The UCR Suite [42] is an optimized sequential scan algorithm for exact matching that we consider as a baseline for performance comparisons.

The DSTree [45] index dynamically segments data using EAPCA. During search, it uses a lower bounding distance to prune the search space.

iSAX2+ [10] is a bulk-loading index based on SAX. During search, a query is represented with SAX symbols, and the candidates contained in non-pruned leaves are further refined using the euclidean distance. Pruning uses a lower bounding distance.

ADS+ [49] is the first adaptive data series index based on SAX. It starts with a minimal tree structure containing only summarizations, and then adds the raw data to the leaves during query answering. It supports a number of search options, in particular SIMS, a skip-sequential algorithm.

In addition to exact search, the M-tree, SFA trie, DSTree, iSAX2+ and ADS+ also support ng-approximate search.

3. Approximate Similarity Search Methods. We now present the three most popular approximate methods.

SRS [44] belongs to the family of LSH methods, inspired by the randomized algorithm in [24]. It uses random projections to build a scalable index and supports approximate search with theoretical guarantees.

IMI [21, 5] is an inverted index based on OPQ. A query is answered by returning all points associated with the corresponding entries in the inverted index. Returned results are ng-approximate.

HNSW [36] is an in-memory neighborhood graph exploiting the Voronoi Diagram and the Delaunay Triangulation. A greedy search returns the best candidates with high empirical accuracy, but no formal theoretical guarantees.

3. PROPOSED WORK

We propose a novel data series index that can answer progressive similarity search queries with strong probability guarantees. In addition to this unique functionality, it also leverages the strengths and addresses the weaknesses of the state-of-the-art approximate and exact techniques.

Completed Work. We outline the main contributions in [17]: (i) we provide a formal problem definition for data series similarity search, unifying conflicting terminology from different research communities; (ii) we present a survey of the state-of-the-art data series similarity search techniques (Table 1 summarizes each technique according to our definitions); and (iii) we conduct an extensive experimental evaluation of the efficiency of data series exact similarity search.

The study assessed ten state-of-the-art methods under the same experimental framework. To guard against implementation bias, we used a large number of comparison criteria, including implementation-independent ones, and we reimplemented from scratch four methods that were not available in C/C++. Our implementations largely outperform the original ones both in time and space, thus enriching the landscape of data series similarity search methods.

In an effort to make our results reproducible and support future research in the area, we share with the community a public archive containing all source codes, datasets, queries, results, and plots [1].

Based on the results of our study, we draw up recommendations to help users decide the best approach for their problem. Figure 1 shows a decision matrix given a typical hardware and query workload. The VA+file is particularly well-suited for long series in-memory, while for shorter series, the DSTree is the best contender on disk and iSAX2+ is the winner in-memory.
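Many of the methods above (the iSAX family, DSTree, VA+file, Stepwise, and the SFA trie) share the same pruning pattern: compare the query to cheap summarizations using a distance that lower-bounds the true euclidean distance, and fetch only the raw series that survive. The sketch below (our own illustration, not code from any of the surveyed systems; `paa` and `paa_lower_bound` are our names) shows PAA and its standard lower-bounding distance:

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: mean of each of w equal-length
    segments (assumes w divides the series length)."""
    return series.reshape(w, -1).mean(axis=1)

def paa_lower_bound(q_paa, c_paa, n, w):
    """Lower bound on the euclidean distance between two raw series of
    length n, computed only from their w-dimensional PAA summaries."""
    return np.sqrt((n / w) * ((q_paa - c_paa) ** 2).sum())

rng = np.random.default_rng(0)
n, w = 256, 8                       # series length, number of segments
q, c = rng.standard_normal(n), rng.standard_normal(n)

true_dist = np.sqrt(((q - c) ** 2).sum())
lb = paa_lower_bound(paa(q, w), paa(c, w), n, w)
assert lb <= true_dist              # never overestimates the true distance
```

Because the summary distance never exceeds the true distance, discarding a candidate whose lower bound is above the current best-so-far can cause no false dismissals, which is why these indexes return exact answers despite comparing summaries.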
Table 1: Similarity search methods [17]

                  Matching Accuracy                         Matching Type   Representation   Implementation
  Indexes
  ADS+ [49]       exact [49], ng-appr. [49]                 whole           iSAX             C
  DSTree [45]     exact [45], ng-appr. [45]                 whole           EAPCA            Java (original), C (new)
  iSAX2+ [10]     exact [10], ng-appr. [10]                 whole           iSAX             C# (original), C (new)
  M-tree [14]     exact [14], ε-appr. [13], δ-ε-appr. [13]  whole           raw              C++
  R*-tree [7]     exact                                     whole           PAA              C++
  SFA trie [43]   exact [43], ng-appr. [43]                 whole           SFA              Java (original), C (new)
  VA+file [20]    exact                                     whole           DFT              MATLAB (original), C (new)
  Other
  UCR Suite [42]  exact                                     whole           raw              C
  MASS [48]       exact                                     subseq.         DFT              C
  Stepwise [27]   exact                                     whole           DHWT             C

Figure 1: Recommendations [17] (indexing and answering 10K exact synthetic queries on a hard-drive machine). The decision matrix is organized by dataset size and series length: in-memory long series → VA+file; disk-resident long series → VA+file or DSTree, depending on dataset size and length; in-memory short series → iSAX2+ or DSTree, depending on dataset size; disk-resident short series → DSTree.

We present an elaborate discussion based on the deep insights gained about the data series similarity search problem. The main points are the following:

1. Unexpected results confirmed the importance of careful parameter tuning, hardware setup, implementation framework, and workload selection. In particular, Stepwise and ADS+ performed below our expectations, while our optimized implementations of the DSTree and the VA+file helped bring them back into the spotlight. Moreover, our carefully crafted experiments identified optimal parameters that were different from the ones published in the original papers. Another important finding was that, unlike what was originally believed [29], the tightness of the lower bound of a given method does not alone predict its performance. In fact, it is of paramount importance to consider other factors, such as the hardware platform and the clustering quality of an index.

2. A better understanding of current approaches helped pinpoint interesting avenues for improvement, in particular identifying the methods that would most benefit from modern hardware. For instance: (i) the DSTree is a very good candidate for parallelization, as its index building is over 85% CPU cost; (ii) the performance of Stepwise can be significantly improved with a redesign of the physical storage and the use of modern hardware, as the total cost of query answering is 50%-98% CPU; (iii) ADS+ can be enhanced with the use of asynchronous I/O to overcome the expensive random I/O incurred with each skip.

3. Although index building with iSAX-based indexes is much faster than with the DSTree, the latter achieves a better clustering, as it adapts to the data distribution.

4. Choosing between an index scan and a serial scan is not a trivial decision. In fact, access path selection is an optimization problem that depends on a variety of factors, including hardware, query pruning ratio, data characteristics, the accuracy of a summarization, and the efficacy of the clustering provided by an index.

Work In Progress. Our current work involves an experimental study that evaluates data series approximate similarity search, both in-memory and on-disk [18]. It differs from other experimental studies, which focused on the efficiency of exact search [17], the accuracy of dimensionality reduction techniques and similarity measures for classification tasks [29, 16, 6], or in-memory data [31, 4].

In this study, the extended exact techniques compete in memory and outperform on disk the state-of-the-art approximate techniques from the vector indexing community, both in accuracy and efficiency. Figure 2 summarizes the results for 100-NN queries on an in-memory 25GB data collection extracted from the Sift dataset [2]. We observe that our modifications to iSAX2+ enable it to outperform the popular ng-approximate search approaches HNSW and IMI (Faiss implementation) and the state-of-the-art LSH method SRS, in terms of combined indexing and query answering costs.

Figure 2: Efficiency vs. accuracy of approximate search [18] (dataset = Sift25GB, data series length = 128). Panels: (a) 100-NN (ng) and (b) 100-NN (δ-ε); x-axis: MAP (0.03 to 1.00); y-axis: combined indexing and 100-query answering time, in minutes (3 to 300).
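The accuracy axis in Figure 2 is MAP (Mean Average Precision). As a minimal illustration, the sketch below computes MAP for k-NN result lists against exact ground truth (our own code and function names; the exact evaluation protocol of [18] may differ in its details):

```python
def average_precision(retrieved, relevant):
    """AP of one ranked result list against the set of true (exact) neighbors."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(all_retrieved, all_relevant):
    """MAP: the mean of per-query average precisions."""
    return sum(average_precision(r, g)
               for r, g in zip(all_retrieved, all_relevant)) / len(all_retrieved)

# Two 3-NN queries: the first is answered perfectly, the second finds
# only one of its three true neighbors, at rank 2.
print(round(mean_average_precision(
    [[1, 2, 3], [9, 4, 8]],
    [{1, 2, 3}, {4, 5, 6}]), 4))  # → 0.5833
```

MAP rewards approximate methods both for finding the true neighbors and for ranking them early, which makes it a natural accuracy measure for comparing k-NN result lists.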
We measure accuracy using MAP, a popular metric in the information retrieval literature [37, 8].

Our results show that some strikingly simple modifications to existing exact methods enable them to answer δ-ε-approximate queries and have excellent empirical performance. In fact, we ran extensive experiments on large synthetic and real datasets, including the two largest publicly available real datasets.

Future Work. Inspired by the insights gained from our two experimental studies on the inner workings of the different indexing approaches and the effectiveness of their design choices, the key future direction for our work is the design and development of a novel data series index that will outperform the state-of-the-art approximate and exact techniques. The new index will also support progressive query answering [22] with probability guarantees, so as to further enable interactive exploration tasks on very large data series collections.

4. CONCLUSIONS

The goal of this thesis is to develop a new index that will support approximate and exact search, and progressive query answering with probability guarantees. The first step in this direction was to thoroughly assess the state-of-the-art [17, 18]. We describe the lessons learned from two extensive experimental evaluations and outline our future work.

References

[1] Lernaean Hydra Archive. http://www.mi.parisdescartes.fr/~themisp/dsseval/, 2018.
[2] TEXMEX Datasets for Approximate Nearest Neighbor Search. http://corpus-texmex.irisa.fr/, 2018.
[3] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. pages 69–84, 1993.
[4] M. Aumüller, E. Bernhardsson, and A. Faithfull. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In SISAP, pages 34–49, 2017.
[5] A. Babenko and V. Lempitsky. The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1247–1260, June 2015.
[6] A. J. Bagnall, J. Lines, A. Bostrom, J. Large, and E. J. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov., 31(3):606–660, 2017.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In SIGMOD, pages 322–331. ACM, 1990.
[8] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In SIGIR, pages 33–40. ACM, 2000.
[9] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. iSAX 2.0: Indexing and Mining One Billion Time Series. In ICDM, pages 58–67. IEEE Computer Society, 2010.
[10] A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. J. Keogh. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowl. Inf. Syst., 39(1):123–151, 2014.
[11] K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM TODS, 27(2):188–228, June 2002.
[12] K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In ICDE, pages 126–133, 1999.
[13] P. Ciaccia and M. Patella. PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces. In ICDE, pages 244–255, 2000.
[14] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In VLDB, pages 426–435, 1997.
[15] R. Cole, D. E. Shasha, and X. Zhao. Fast window correlations over uncooperative time series. In SIGKDD, pages 743–749, 2005.
[16] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, 1(2):1542–1552, 2008.
[17] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art. PVLDB, 12(2):112–127, 2018.
[18] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Return of the Lernaean Hydra: An Experimental Evaluation of Data Series Approximate Similarity Search. Under submission, 2019.
[19] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419–429. ACM, 1994.
[20] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. El Abbadi. Vector Approximation Based Indexing for Non-uniform High Dimensional Data Sets. In CIKM, pages 202–209, 2000.
[21] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, Apr. 2014.
[22] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. Progressive similarity search on time series data. In BigVis, in conjunction with EDBT/ICDT, 2019.
[23] A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD, pages 47–57, 1984.
[24] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
[25] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, Jan. 2011.
[26] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In CONM, volume 26 of Contemporary Mathematics, pages 189–206. 1984.
[27] S. Kashyap and P. Karras. Scalable kNN search on vertically stored time series. In KDD, pages 1334–1342. ACM, 2011.
[28] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3(3):263–286, 2001.
[29] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min. Knowl. Discov., 7(4):349–371, Oct. 2003.
[30] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: A scalable bottom-up approach for building data series indexes. PVLDB, 11(6):677–690, 2018.
[31] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin. Approximate Nearest Neighbor Search on High Dimensional Data - Experiments, Analyses, and Improvement (v1.0). CoRR, abs/1610.02455, 2016.
[32] J. Lin, E. J. Keogh, S. Lonardi, and B. Y. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In SIGMOD, pages 2–11, 2003.
[33] M. Linardi and T. Palpanas. Scalable, variable-length similarity search in data series: The ULISSE approach. PVLDB, 11(13):2236–2248, 2018.
[34] M. Linardi and T. Palpanas. ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. In ICDE, pages 1356–1359, 2018.
[35] M. Linardi, Y. Zhu, T. Palpanas, and E. Keogh. Matrix Profile X: VALMOD - scalable discovery of variable-length motifs in data series. In SIGMOD, pages 1053–1066, 2018.
[36] Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR, abs/1603.09320, 2016.
[37] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[38] T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Record, 44(2):47–52, 2015.
[39] T. Palpanas. Big sequence management: A glimpse of the past, the present, and the future. In SOFSEM, volume 9587 of LNCS, pages 63–80, 2016.
[40] B. Peng, T. Palpanas, and P. Fatourou. ParIS: The Next Destination for Fast Data Series Indexing and Query Answering. In IEEE BigData, pages 791–800, 2018.
[41] D. Rafiei. On similarity-based queries for time series data. In ICDE, pages 410–417, 1999.
[42] T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A. Batista, M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In KDD, pages 262–270, 2012.
[43] P. Schäfer and M. Högqvist. SFA: A symbolic fourier approximation and index for similarity search in high dimensional datasets. In EDBT, pages 516–527, 2012.
[44] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: Solving c-approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index. PVLDB, 8(1):1–12, 2014.
[45] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB, 6(10):793–804, 2013.
[46] R. Weber, H.-J. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In VLDB, pages 194–205, 1998.
[47] D.-E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. pages 1135–1140, 2017.
[48] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, Z. Zimmerman, D. F. Silva, A. Mueen, and E. Keogh. Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. DAMI, pages 1–41, 2017.
[49] K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive data series index. VLDB J., 25(6):843–866, 2016.