ceur-ws.org/Vol-2399/paper11.pdf

Truly Scalable Data Series Similarity Search

Karima Echihabi
Supervised by: Prof. Themis Palpanas and Prof. Houda Benbrahim
IRDA, Rabat IT Center, ENSIAS, Mohammed V Univ.
karima.echihabi@gmail.com

ABSTRACT

Data series are widely used in numerous domains. Analyzing this data is important for a variety of real-world applications and has been extensively studied over the past 25 years. At the core of the analysis task lies a classic algorithm called similarity search. A number of approaches have been proposed to support similarity search over massive data series collections. The results of two comprehensive data series experimental evaluations form the foundations of our future work, which will lead to the development of a novel index that can efficiently support both exact and approximate data series similarity search, as well as progressive query answering with bound guarantees. Based on the insights gained from the exhaustive study of the related work, the new index will surpass the state-of-the-art approximate and exact techniques both in performance and accuracy.

1. INTRODUCTION

A data series is a sequence of ordered real values¹. Data series are omnipresent in various domains, from science and engineering to business and medicine [35]. The proliferation of IoT technologies is also heavily contributing to the explosive growth of data series collections, to the order of terabytes [38]. A common subroutine of data series analytical tasks is a classic algorithm called similarity search; therefore, the research community has extensively studied the development of efficient similarity search algorithms for data series. A similarity search algorithm returns the set of candidate data series in a collection that are similar to a given query series. This problem is often reduced to nearest neighbor search, where data series are represented as points in multidimensional space and their (dis)similarity is evaluated using a distance function.

Although a data series can be represented as a vector in high-dimensional space, conventional vector-based approaches are not adapted, for two reasons: (a) they cannot scale to thousands of dimensions; and (b) they do not exploit the correlation among dimensions that is typical for data series.

Similarity search methods can return either exact or approximate answers. Exact methods are costly, while approximate methods offer better efficiency at the expense of losing some accuracy. We call the approximate methods that do not provide any guarantees on the results ng-approximate, and those that provide guarantees on the approximation error δ-ε-approximate, where ε is the approximation error and δ the probability that ε will not be exceeded.

A plethora of similarity search methods have been published by the community, including techniques designed for generic vectors [24, 23, 7, 14, 46, 20, 25, 44, 5, 36] and those specific to data series [3, 41, 27, 42, 47, 34, 33, 40, 15, 9, 43, 45, 10, 49, 30].

In our work, we focus on the problem of whole matching similarity search in collections with a very large number of data series, i.e., similarity search that produces approximate or exact results by calculating distances on the whole (not a sub-) sequence. This is a very popular problem that lies at the core of several other algorithms, and is important for many applications in various domains in the real world [17].

This work aims to propose a novel index that will support progressive query answering with probabilistic guarantees. We describe the related work, succinctly report the results of our extensive experimental evaluation of exact methods [17], and give a glimpse of some very interesting results from an ongoing experimental study focused on approximate methods [18]. Finally, we present our future work directions.
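Concretely, the reduction to nearest neighbor search described above amounts to comparing the query against every candidate series under a distance function. The following minimal brute-force sketch (our own illustration, using euclidean distance; all function and variable names are ours, not from the paper) makes the problem statement precise:

```python
import numpy as np

def knn_search(collection, query, k=1):
    """Brute-force whole-matching k-NN search under euclidean distance.

    collection: (N, n) array of N data series, each of length n
    query:      (n,) array
    Returns the indices of the k most similar series, closest first.
    """
    # Whole matching: distances are computed on entire series, not subsequences.
    dists = np.sqrt(((collection - query) ** 2).sum(axis=1))
    return np.argsort(dists)[:k]

# Tiny example: the second series matches the query exactly.
data = np.array([[0.0, 1.0, 2.0, 3.0],
                 [1.0, 2.0, 3.0, 4.0],
                 [9.0, 9.0, 9.0, 9.0]])
query = np.array([1.0, 2.0, 3.0, 4.0])
print(knn_search(data, query, k=2))  # → [1 0]
```

Exact answers computed this way require touching every series in the collection, which is precisely the cost that the summarization and indexing techniques discussed next are designed to avoid.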
¹The order attribute can be angle, mass, frequency, position, or time [39]. When the order is based on time, the sequence is called a time series. The terms data series, time series, and sequence are used interchangeably.

(Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.)

2. RELATED WORK

Similarity search involves finding the data series in a collection that are similar to a query according to some definition of sameness. A common abstraction is to consider the query and candidate data series as points in a metric space and to evaluate their sameness (or difference) using the euclidean distance.

To develop efficient similarity search algorithms on massive datasets, two major costs need to be optimized: the cost of accessing data on disk (I/O cost) and the cost of comparing the query to candidates (CPU cost). Typically, the first cost is reduced by using summarization techniques that map the high-dimensional data to a lower-dimensional space, while the second cost is optimized with sophisticated data structures and search algorithms. Several similarity search methods have been proposed in the literature, supporting either exact search [23, 7, 46, 20, 27, 42], approximate search [24, 25, 44, 5, 36], or both [14, 43, 45, 10, 49, 47, 30, 34]. In what follows, we provide a succinct description of the state-of-the-art similarity search methods and the summarization techniques on which they are based.

1. Summarization Techniques. The Discrete Fourier Transform (DFT) [19] decomposes a data series into frequency coefficients, a subset of which represents a summarization of the data series.

The Discrete Haar Wavelet Transform (DHWT) [12] transforms a data series using Haar wavelets into a hierarchical representation.

Random projections map the raw high-dimensional data into a lower-dimensional space using a random matrix, while preserving pairwise distances within a distortion threshold [26].

The Piecewise Aggregate Approximation (PAA) [28] and Adaptive Piecewise Constant Approximation (APCA) [11] techniques divide a data series into segments (of equal and arbitrary length, respectively) and approximate each segment with the mean of the points that belong to it. The Extended Adaptive Piecewise Constant Approximation (EAPCA) [45] method enhances APCA by approximating each segment with the standard deviation in addition to the mean.

Symbolic Aggregate Approximation (SAX) [32] first approximates a data series using PAA, then discretizes the PAA values into a compact binary representation.

Symbolic Fourier Approximation (SFA) [43] first transforms a data series into DFT coefficients, which are then approximated using a succinct symbolic representation.

Optimized Product Quantization (OPQ) [21] applies a linear transformation to the data series to decorrelate it, then applies a product quantizer [25] to the result.

2. Exact Similarity Search Methods. Below, we briefly describe algorithms that produce exact results.

The R*-tree [7] is a height-balanced spatial index that organizes data into a hierarchy of nested overlapping rectangles. Search returns all entries in leaves whose rectangle contains the query.

The M-tree [14] is a multidimensional index for metric spaces that partitions the data using hyper-spheres based on their relative distances. The search algorithm uses the triangle inequality to prune data.

The VA+file [20] is an improvement of the VA-file [46]. It creates a filter file containing summarizations of the raw data. Search uses the filter file to prune candidates based on a lower bounding distance. For efficiency reasons, we modified the VA+file to use the DFT transform instead of the Karhunen–Loève transform (KLT).

Stepwise [27] represents the data in a multi-level hierarchical representation using DHWT summarizations. Search transforms a query into DHWT and filters out candidates based on upper and lower bounding distances.

The SFA method [43] builds a trie with SFA summarizations of the data. During query answering, a lower bounding distance is used to prune out candidates.

The UCR Suite [42] is an optimized sequential scan algorithm for exact matching that we consider as a baseline for performance comparisons.

The DSTree [45] index dynamically segments data using EAPCA. During search, it uses a lower bounding distance to prune the search space.

iSAX2+ [10] is a bulk-loading index based on SAX. During search, a query is represented with SAX symbols, and the candidates contained in non-pruned leaves are further refined using the euclidean distance. Pruning uses a lower bounding distance.

ADS+ [49] is the first adaptive data series index based on SAX. It starts with a minimal tree structure containing only summarizations, and then adds the raw data to the leaves during query answering. It supports a number of search options, in particular SIMS, a skip-sequential algorithm.

In addition to exact search, the M-tree, SFA trie, DSTree, iSAX2+ and ADS+ also support ng-approximate search.

3. Approximate Similarity Search Methods. We now present the three most popular approximate methods.

SRS [44] belongs to the family of LSH methods, inspired by the randomized algorithm in [24]. It uses random projections to build a scalable index and supports approximate search with theoretical guarantees.

IMI [21, 5] is an inverted index based on OPQ. A query is answered by returning all points associated with the corresponding entries in the inverted index. Returned results are ng-approximate.

HNSW [36] is an in-memory neighborhood graph exploiting the Voronoi Diagram and the Delaunay Triangulation. A greedy search returns the best candidates with high empirical accuracy, but no formal theoretical guarantees.

3. PROPOSED WORK

We propose a novel data series index that can answer progressive similarity search queries with strong probability guarantees. In addition to this unique functionality, it also leverages the strengths and addresses the weaknesses of the state-of-the-art approximate and exact techniques.

Completed Work. We outline the main contributions in [17]: (i) we provide a formal problem definition for data series similarity search, unifying conflicting terminology from different research communities; (ii) we present a survey of the state-of-the-art data series similarity search techniques (Table 1 summarizes each technique according to our definitions); and (iii) we conduct an extensive experimental evaluation of the efficiency of data series exact similarity search.

The study assessed ten state-of-the-art methods under the same experimental framework. To guard against implementation bias, we used a large number of comparison criteria, including implementation-independent ones, and we reimplemented from scratch four methods that were not available in C/C++. Our implementations largely outperform the original ones both in time and space, thus enriching the landscape of data series similarity search methods.

In an effort to make our results reproducible and support future research in the area, we share with the community a public archive containing all source codes, datasets, queries, results, and plots [1].

Based on the results of our study, we draw up recommendations to help users decide the best approach for their problem. Figure 1 shows a decision matrix given a typical hardware and query workload. The VA+file is particularly well-suited for long series in-memory, while for shorter series, the DSTree is the best contender on disk and iSAX2+ is the winner in-memory.
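Many of the methods above (the iSAX family, DSTree, VA+file, Stepwise, and the SFA trie) share the same pruning pattern: compare the query to cheap summarizations using a distance that lower-bounds the true euclidean distance, and fetch only the raw series that survive. The sketch below (our own illustration, not code from any of the surveyed systems; `paa` and `paa_lower_bound` are our names) shows PAA and its standard lower-bounding distance:

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: mean of each of w equal-length
    segments (assumes w divides the series length)."""
    return series.reshape(w, -1).mean(axis=1)

def paa_lower_bound(q_paa, c_paa, n, w):
    """Lower bound on the euclidean distance between two raw series of
    length n, computed only from their w-dimensional PAA summaries."""
    return np.sqrt((n / w) * ((q_paa - c_paa) ** 2).sum())

rng = np.random.default_rng(0)
n, w = 256, 8                       # series length, number of segments
q, c = rng.standard_normal(n), rng.standard_normal(n)

true_dist = np.sqrt(((q - c) ** 2).sum())
lb = paa_lower_bound(paa(q, w), paa(c, w), n, w)
assert lb <= true_dist              # never overestimates the true distance
```

Because the summary distance never exceeds the true distance, discarding a candidate whose lower bound is above the current best-so-far can cause no false dismissals, which is why these indexes return exact answers despite comparing summaries.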
Table 1: Similarity search methods [17]

                  Matching Accuracy                         Matching Type   Representation   Implementation
  Indexes
  ADS+ [49]       exact [49], ng-appr. [49]                 whole           iSAX             C
  DSTree [45]     exact [45], ng-appr. [45]                 whole           EAPCA            Java (original), C (new)
  iSAX2+ [10]     exact [10], ng-appr. [10]                 whole           iSAX             C# (original), C (new)
  M-tree [14]     exact [14], ε-appr. [13], δ-ε-appr. [13]  whole           raw              C++
  R*-tree [7]     exact                                     whole           PAA              C++
  SFA trie [43]   exact [43], ng-appr. [43]                 whole           SFA              Java (original), C (new)
  VA+file [20]    exact                                     whole           DFT              MATLAB (original), C (new)
  Other
  UCR Suite [42]  exact                                     whole           raw              C
  MASS [48]       exact                                     subseq.         DFT              C
  Stepwise [27]   exact                                     whole           DHWT             C

Figure 1: Recommendations [17] (indexing and answering 10K exact synthetic queries on a hard-drive machine). The decision matrix is organized by dataset size and series length: in-memory long series → VA+file; disk-resident long series → VA+file or DSTree, depending on dataset size and length; in-memory short series → iSAX2+ or DSTree, depending on dataset size; disk-resident short series → DSTree.

We present an elaborate discussion based on the deep insights gained about the data series similarity search problem. The main points are the following:

1. Unexpected results confirmed the importance of careful parameter tuning, hardware setup, implementation framework, and workload selection. In particular, Stepwise and ADS+ performed below our expectations, while our optimized implementations of the DSTree and the VA+file helped bring them back into the spotlight. Moreover, our carefully crafted experiments identified optimal parameters that were different from the ones published in the original papers. Another important finding was that, unlike what was originally believed [29], the tightness of the lower bound of a given method does not alone predict its performance. In fact, it is of paramount importance to consider other factors, such as the hardware platform and the clustering quality of an index.

2. A better understanding of current approaches helped pinpoint interesting avenues for improvement, in particular identifying the methods that would most benefit from modern hardware. For instance: (i) the DSTree is a very good candidate for parallelization, as its index building is over 85% CPU cost; (ii) the performance of Stepwise can be significantly improved with a redesign of the physical storage and the use of modern hardware, as the total cost of query answering is 50%-98% CPU; (iii) ADS+ can be enhanced with the use of asynchronous I/O to overcome the expensive random I/O incurred with each skip.

3. Although index building with iSAX-based indexes is much faster than with the DSTree, the latter achieves a better clustering, as it adapts to the data distribution.

4. Choosing between an index scan and a serial scan is not a trivial decision. In fact, access path selection is an optimization problem that depends on a variety of factors, including hardware, query pruning ratio, data characteristics, the accuracy of a summarization, and the efficacy of the clustering provided by an index.

Work In Progress. Our current work involves an experimental study that evaluates data series approximate similarity search, both in-memory and on-disk [18]. It differs from other experimental studies, which focused on the efficiency of exact search [17], the accuracy of dimensionality reduction techniques and similarity measures for classification tasks [29, 16, 6], or in-memory data [31, 4].

In this study, the extended exact techniques compete in memory and outperform on disk the state-of-the-art approximate techniques from the vector indexing community, both in accuracy and efficiency. Figure 2 summarizes the results for 100-NN queries on an in-memory 25GB data collection extracted from the Sift dataset [2]. We observe that our modifications to iSAX2+ enable it to outperform the popular ng-approximate search approaches HNSW and IMI (Faiss implementation) and the state-of-the-art LSH method SRS, in terms of combined indexing and query answering costs.

Figure 2: Efficiency vs. accuracy of approximate search [18] (dataset = Sift25GB, data series length = 128). Panels: (a) 100-NN (ng) and (b) 100-NN (δ-ε); x-axis: MAP (0.03 to 1.00); y-axis: combined indexing and 100-query answering time, in minutes (3 to 300).
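The accuracy axis in Figure 2 is MAP (Mean Average Precision). As a minimal illustration, the sketch below computes MAP for k-NN result lists against exact ground truth (our own code and function names; the exact evaluation protocol of [18] may differ in its details):

```python
def average_precision(retrieved, relevant):
    """AP of one ranked result list against the set of true (exact) neighbors."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(all_retrieved, all_relevant):
    """MAP: the mean of per-query average precisions."""
    return sum(average_precision(r, g)
               for r, g in zip(all_retrieved, all_relevant)) / len(all_retrieved)

# Two 3-NN queries: the first is answered perfectly, the second finds
# only one of its three true neighbors, at rank 2.
print(round(mean_average_precision(
    [[1, 2, 3], [9, 4, 8]],
    [{1, 2, 3}, {4, 5, 6}]), 4))  # → 0.5833
```

MAP rewards approximate methods both for finding the true neighbors and for ranking them early, which makes it a natural accuracy measure for comparing k-NN result lists.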
We measure accuracy using MAP, a popular metric in the information retrieval literature [37, 8].

Our results show that some strikingly simple modifications to existing exact methods enable them to answer δ-ε-approximate queries and have excellent empirical performance. In fact, we ran extensive experiments on large synthetic and real datasets, including the two largest publicly available real datasets.

Future Work. Inspired by the insights gained from our two experimental studies on the inner workings of the different indexing approaches and the effectiveness of their design choices, the key future direction for our work is the design and development of a novel data series index that will outperform the state-of-the-art approximate and exact techniques. The new index will also support progressive query answering [22] with probability guarantees, so as to further enable interactive exploration tasks on very large data series collections.

4. CONCLUSIONS

The goal of this thesis is to develop a new index that will support approximate and exact search, and progressive query answering with probability guarantees. The first step in this direction was to thoroughly assess the state-of-the-art [17, 18]. We describe the lessons learned from two extensive experimental evaluations and outline our future work.

References

[1] Lernaean Hydra Archive. http://www.mi.parisdescartes.fr/~themisp/dsseval/, 2018.
[2] TEXMEX Datasets for Approximate Nearest Neighbor Search. http://corpus-texmex.irisa.fr/, 2018.
[3] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. pages 69–84, 1993.
[4] M. Aumüller, E. Bernhardsson, and A. Faithfull. Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. In SISAP, pages 34–49, 2017.
[5] A. Babenko and V. Lempitsky. The inverted multi-index. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(6):1247–1260, June 2015.
[6] A. J. Bagnall, J. Lines, A. Bostrom, J. Large, and E. J. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov., 31(3):606–660, 2017.
[7] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In SIGMOD, pages 322–331. ACM, 1990.
[8] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In SIGIR, pages 33–40. ACM, 2000.
[9] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. iSAX 2.0: Indexing and Mining One Billion Time Series. In ICDM, pages 58–67. IEEE Computer Society, 2010.
[10] A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. J. Keogh. Beyond one billion time series: indexing and mining very large time series collections with iSAX2+. Knowl. Inf. Syst., 39(1):123–151, 2014.
[11] K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM TODS, 27(2):188–228, June 2002.
[12] K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In ICDE, pages 126–133, 1999.
[13] P. Ciaccia and M. Patella. PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces. In ICDE, pages 244–255, 2000.
[14] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In VLDB, pages 426–435, 1997.
[15] R. Cole, D. E. Shasha, and X. Zhao. Fast window correlations over uncooperative time series. In SIGKDD, pages 743–749, 2005.
[16] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB, 1(2):1542–1552, 2008.
[17] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art. PVLDB, 12(2):112–127, 2018.
[18] K. Echihabi, K. Zoumpatianos, T. Palpanas, and H. Benbrahim. The Return of the Lernaean Hydra: An Experimental Evaluation of Data Series Approximate Similarity Search. Under submission, 2019.
[19] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419–429. ACM, 1994.
[20] H. Ferhatosmanoglu, E. Tuncel, D. Agrawal, and A. El Abbadi. Vector Approximation Based Indexing for Non-uniform High Dimensional Data Sets. In CIKM, pages 202–209, 2000.
[21] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, Apr. 2014.
[22] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. Progressive similarity search on time series data. In BigVis, in conjunction with EDBT/ICDT, 2019.
[23] A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD, pages 47–57, 1984.
[24] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
[25] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, Jan. 2011.
[26] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In CONM, volume 26 of Contemporary Mathematics, pages 189–206. 1984.
[27] S. Kashyap and P. Karras. Scalable kNN search on vertically stored time series. In KDD, pages 1334–1342. ACM, 2011.
[28] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3(3):263–286, 2001.
[29] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and empirical demonstration. Data Min. Knowl. Discov., 7(4):349–371, Oct. 2003.
[30] H. Kondylakis, N. Dayan, K. Zoumpatianos, and T. Palpanas. Coconut: A scalable bottom-up approach for building data series indexes. PVLDB, 11(6):677–690, 2018.
[31] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin. Approximate Nearest Neighbor Search on High Dimensional Data - Experiments, Analyses, and Improvement (v1.0). CoRR, abs/1610.02455, 2016.
[32] J. Lin, E. J. Keogh, S. Lonardi, and B. Y. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In SIGMOD, pages 2–11, 2003.
[33] M. Linardi and T. Palpanas. Scalable, variable-length similarity search in data series: The ULISSE approach. PVLDB, 11(13):2236–2248, 2018.
[34] M. Linardi and T. Palpanas. ULISSE: ULtra compact Index for Variable-Length Similarity SEarch in Data Series. In ICDE, pages 1356–1359, 2018.
[35] M. Linardi, Y. Zhu, T. Palpanas, and E. Keogh. Matrix Profile X: VALMOD - scalable discovery of variable-length motifs in data series. In SIGMOD, pages 1053–1066, 2018.
[36] Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR, abs/1603.09320, 2016.
[37] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[38] T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Record, 44(2):47–52, 2015.
[39] T. Palpanas. Big sequence management: A glimpse of the past, the present, and the future. In SOFSEM, volume 9587 of LNCS, pages 63–80, 2016.
[40] B. Peng, T. Palpanas, and P. Fatourou. ParIS: The Next Destination for Fast Data Series Indexing and Query Answering. In IEEE BigData, pages 791–800, 2018.
[41] D. Rafiei. On similarity-based queries for time series data. In ICDE, pages 410–417, 1999.
[42] T. Rakthanmanon, B. J. L. Campana, A. Mueen, G. E. A. P. A. Batista, M. B. Westover, Q. Zhu, J. Zakaria, and E. J. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In KDD, pages 262–270, 2012.
[43] P. Schäfer and M. Högqvist. SFA: A symbolic fourier approximation and index for similarity search in high dimensional datasets. In EDBT, pages 516–527, 2012.
[44] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: Solving c-approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index. PVLDB, 8(1):1–12, 2014.
[45] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A data-adaptive and dynamic segmentation index for whole matching on time series. PVLDB, 6(10):793–804, 2013.
[46] R. Weber, H.-J. Schek, and S. Blott. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. In VLDB, pages 194–205, 1998.
[47] D.-E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. pages 1135–1140, 2017.
[48] C.-C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, Z. Zimmerman, D. F. Silva, A. Mueen, and E. Keogh. Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. DAMI, pages 1–41, 2017.
[49] K. Zoumpatianos, S. Idreos, and T. Palpanas. ADS: the adaptive data series index. VLDB J., 25(6):843–866, 2016.