Progressive Similarity Search on Time Series Data [Vision Paper] Anna Gogolou1 Theophanis Tsandilas1 Themis Palpanas3 Anastasia Bezerianos2 1 Inria, Univ. Paris-Sud & CNRS, and Univ. Paris-Saclay, France 2 Univ. Paris-Sud & CNRS, Inria and Univ. Paris-Saclay, France 3 Univ. Paris-Descartes, France 1 {first.last}@inria.fr 2 {first.last}@lri.fr 3 {first}@mi.parisdescartes.fr ABSTRACT corresponding indexing techniques and algorithms [10], in Time series data are increasing at a dramatic rate, yet their order to address the scalability challenges. analysis remains highly relevant in a wide range of human ac- Nevertheless, we observe that time series similarity can tivities. Due to their volume, existing systems dealing with be domain- and visualization-dependent [2, 14], and in many time series data cannot guarantee interactive response times, situations, analysts depend on time-consuming manual anal- even for fundamental tasks such as similarity search. There- ysis processes. For example, neuroscientists manually in- fore, in this paper, we present our vision to develop analytic spect the EEG data of their patients, using visual analysis approaches that support exploration and decision making tools, so as to identify patterns of interest [15, 14]. In such by providing progressive results, before the final and exact cases, it is of paramount importance to have techniques that ones have been computed. We demonstrate through exper- can operate within interactive response times [20], in order iments that providing first approximate and then progres- to enable analysts to complete their tasks easily and quickly. sive answers is useful (and necessary) for similarity search In the past years, several visual analysis tools have com- queries on very large time series data. Our findings indi- bined visualizations with advanced data management and cate that there is a gap between the time the most similar analytics techniques (e.g., [22, 16]), albeit not targeted to answer is found and the time when the search algorithm time series similarity search. The focus of the data manage- terminates, resulting in inflated waiting times without any ment community is on the scalability issues related to the improvement. We present preliminary ideas on computing processing and analysis of very large datasets. However, the probabilistic estimates of the final results that could help state-of-the-art methods on time series processing are still users decide when to stop the search process, i.e., decid- far from achieving interactive response times [10]. ing when improvement in the final answer is unlikely, thus To allow for interactive response times when users ana- eliminating waiting time. Finally, we discuss two additional lyze large time series collections, progressive and iterative challenges: how to compute efficiently these probabilistic visual analytics approaches need to be considered [1, 28, estimates, and how to communicate them to users. 26]. These approaches provide progressive answers to users’ requests [12, 24, 19], sometimes based on algorithms that return quick approximate answers [5, 8, 11]. They support Keywords exploration and decision making by providing progressive time series, progressive similarity search; progressive visual results, before the final and exact ones have been computed. analytics; progressive error Most of these techniques consider approximations of aggre- gate queries on relational databases, with the exception of Ciaccia and Patella [5], who examine progressive similar- 1. INTRODUCTION ity search in multi-dimensional metric spaces. Nevertheless, Time series (TS) are sequences of value measurements none of these works has considered time series data, which derived from a wide range of human activities or natural have the additional characteristic of being high-dimensional, processes, such as temperatures per hour, blood oxygen sat- i.e., they are hundreds to thousands of points long. uration per day, or electroencephalography (EEG) signals. These sequences are becoming ubiquitous in modern life, Contributions. We demonstrate the importance of pro- requiring their analysis that is becoming increasingly chal- viding progressive whole-matching [10] similarity search re- lenging given their sizes [21]. sults on large time series collections. Our preliminary ex- Time series analysis involves tasks such as pattern match- periments show that there is a gap between the time the 1st ing, anomaly detection, frequent pattern identification, and Nearest Neighbour (1-NN) is found and the time when the time series clustering or classification. These tasks rely on search algorithm terminates. In other words, users often the notion of time series similarity. The data-mining com- wait without any improvement in their answers. We fur- munity has proposed several techniques, including many sim- ther show that high-quality approximate answers are found ilarity measures (or distance measure algorithms), for calcu- very early, e.g., in less than one second, so they can support lating the distance between two time series [9], as well as highly interactive visual analysis tasks. For our benchmarks we utilize the state-of-the-art Adaptive Data Series (ADS) index [29], which is among the fastest techniques for answer- c 2019 Copyright held by the author(s). Published in the Workshop Proceedings ing k-Nearest Neighbor (k-NN) queries on time series data. of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on CEUR-WS.org. Finally, we lay out our vision for developing progressive ap- proaches for time series exploration and analytics tasks. We discuss promising directions (and our ongoing work) on how Table 1: Experimental datasets Name Description Cardinality TS Length to estimate probabilistic distance bounds, and how to help users evaluate the quality of progressive results. seismic seismic records 100M 256 SALD MRI data 200M 128 deep1B image descriptors 267M 96 2. STATE OF THE ART Similarity Measures: Ding et al. [9] discuss measures Patella [5] studied progressive similarity search queries over that compute similarity between time series. Euclidean Dis- multi-dimensional spaces and proposed a probabilistic ap- tance (ED) is the most popular, performing point-by-point proach for computing the uncertainty of partial similarity value comparison between two time series. ED can be com- search results. However, their approach does not scale to bined with data normalization, often z-normalization, con- the dataset sizes and number of dimensions that we target. sidering as similar patterns that may vary in amplitude or We focus on very large collections (i.e., in the order of value offset. Based on their analysis, Ding et al. [9] con- TBs) of data series (where the dimensionality of each series cluded that there is no superior measure. In our work, we is in the order of hundreds to thousands), and how we can focus on ED because [9, 10]: (i) it is an effective and the develop approaches to support progressive visual analysis in most commonly used measure in the visualization and data- a fully interactive system. Our ultimate goal is to study mining literature; (ii) it leads to efficient solutions for large how users decide to terminate a search that is progressive datasets. (We plan to examine other measures, e.g., Dy- in nature (and thus reduce waiting times) when they are namic Time Warping (DTW) in our future work.) provided with approximate answers and information about their uncertainty. In particular, we are interested in the Similarity Search and Interactive Querying: The data- quality of approximate answers and how to communicate to base community has optimized similarity search methods by users when no improvement is expected to be obtained even using index structures [6, 27, 4, 29], or by directly optimizing if the search algorithm is still running. sequential scans [23]. Recently, Echihabi et al. [10] compared these methods in terms of efficiency under a single, unified experimental framework. Her work indicates that there is 3. PRELIMINARY OBSERVATIONS no single best method that outperforms all the rest. In our We first investigate whether progressive time series simi- work, we use the state-of-the-art ADS index [29], which pro- larity search in large datasets is feasible. We examine how vides high-quality approximate answers almost immediately, early we can provide approximate answers, and how good and then updates that converge fast to the exact answer. these answers are compared to the exact answers. To this The human-computer interaction community focuses on end, we conducted similarity search experiments on three the interactive visual exploration and querying of time series. real datasets with the state-of-the-art ADS index [29], which In particular, they are interested in how to form interactive can quickly provide good initial approximate answers and similarity search queries. Existing querying approaches on can potentially support progressive similarity search within top of line chart visualizations [25] rely either on the inter- interactive time thresholds. active selection of part of an existing time series [3], or on Scope. We examine an important class of queries, i.e., ap- sketching of patterns to search for [7, 18]. Although we do proximate and exact 1-NN whole-matching1 similarity search not study mechanisms of querying in this paper, this line queries [10]. (We expect that similar results will hold for of work is orthogonal to our approach, that considers ap- k-NN and r-range queries, as well as subsequence match- proximate and progressive results from these queries when ing [17]; we will cover these cases in our future work.) interactive search-times are not possible. Environment. We ran all experiments on a Dell T630 Progressive Visual Analytics: A recent research direc- Rack Server with two Intel Xeon E5-2643 v4 3.4Ghz CPUs, tion studies the problem of how we can support interactive, 512GB of RAM, and 3.6TB (2 x 1.8TB) HDD in RAID0. real-time visual analytics when back-end computations can- The search algorithm is a single-core implementation. not be performed instantaneously, as is the case of our work. To this effect we can use progressive and iterative methods Datasets. We tested real datasets that have also been used in order to produce fast, but approximate, computational in previous studies [29, 10]. They include a different number results and visualizations, that are refined over time with of series and a different number of points (Table 1) but have increasing precision. Fekete and Primet [11] provide a sum- the same overall dataset size of 100GB. The IRIS seismic mary of the features of a progressive system, where here we dataset2 consists of seismic instrument recordings from sev- focus on how to provide: (i) progressively improved answers; eral stations worldwide and contains 100 million series with (ii) feedback about the state and costs of the computation; a length of 256 points. The neuroscience dataset, SALD3 , and (iii) guarantees of time and error bounds for progressive consists of MRI data and contains 200 million series of 128 and final results. We address these features in Sec. 3 and 4. points. The image processing dataset, deep1B4 , consists of The state-of-the-art in big data exploration takes advan- vectors extracted from the last layers of a convolutional neu- tage of the power of distributed systems, indexing, and sam- ral network and contains 267 million series of size 96. pling methods, and different works utilize one or more of 1 Whole-matching refers to the situation where the query and these methods in order to provide progressive results for dif- all candidate answers (series) in the dataset have the same ferent kinds of queries and data. Moritz et al. [19] used an length. 2 existing algorithm [8] which exploits sampling methods and http://ds.iris.edu/data/access/ 3 approximate query processing for incremental, approximate http://fcon 1000.projects.nitrc.org/indi/retro/sald.html 4 aggregate database queries. On the other hand, Ciaccia and http://sites.skoltech.ru/compvision/noimi/ Query Table 2: Summary of experimental results 18 q1 1-NN Time (sec) Total Time (sec) Euclidean Distance Dataset Avg Min Max Avg Min Max q50 seismic 8.5 0.017 48.5 92 21.3 111 17 q53 SALD 0.4 0.003 5.2 49 0.24 183 q57 deep1B 0.2 0.001 2.8 76 0.05 189 16 q65 average Queries. All our query workloads include 100 query series. 15 We generated the query datasets by extracting random data series from the raw data. For the deep1B dataset, we used a real query workload that came with the original dataset. 14 0.001 0.01 0.1 1 10 100 Measures. For each similarity query, we recorded its overall completion time, the time for each approximate answer, as Time (sec) in log-scale well as the time passed until the algorithm finds the exact Figure 1: Examples of progressive search for 5 answer to the query, i.e., the 1-NN. For each approximate queries (seismic dataset). Right-most points in each and exact answer, we also recorded its Euclidean distance curve represent the true 1-NN, while intermediate from the query. points represent early approximate results. Thick grey line represents average trend over 100 queries. Results. Table 2 summarizes our results. For each dataset, we present the average, minimum, and maximum time (in viding early progressive answers to users could drastically seconds) for the 1-NN query answering algorithm to first en- reduce waiting times. The challenge is how to help users counter the 1-NN answer (marked as 1-NN Time in Table 2), to assess the quality of such progressive answers and decide and the corresponding times for the same algorithm to fin- whether to trust these answers, or wait for a better one. ish execution (marked as Total Time). We observe that the total time waiting for a single query to finish can be long, e.g., up to three minutes, which is beyond acceptable thresh- 4. VISION AND CHALLENGES olds for interactive data-analysis tasks [11]. Moreover, these Previous work on progressive or optimistic visual analyt- delays are orders of magnitude longer than the actual time ics [12, 24, 19] has focused on the estimation of aggregated needed to find the best answer (i.e., first encounter of 1- functions, such as the mean, over random or carefully se- NN). This means that for most queries, the greatest cost lected samples of the data. In this case, providing feed- is not locating the 1-NN, but rather confirming that there back to users about the uncertainty of progressive results is no better answer: this is why the query answering algo- can rely on common statistical methods, e.g., confidence in- rithm finishes execution long after having retrieved the 1-NN tervals [12, 19], or coarse-grain visualizations of aggregated value. This finding is consistent with results by Ciaccia and data [13]. However, such approaches cannot apply to our Patella [5], who report that most of the time spent in an problem. Although k-NN distances converge over time (see exact NN search is ”wasted time, during which no improve- Fig. 1), providing bounds for their error requires a different ment is obtained.” set of statistical tools. The time needed to locate the 1-NN was especially fast for Our vision is to develop methods for progressive time se- the SALD and deep1B datasets, where average times were ries similarity search, coupled with appropriate bounds on below 1 sec. However, times varied greatly on the seismic the errors of the intermediate results. We are inspired by the dataset, ranging from a few milliseconds to 48.5 sec. For 28% probabilistic approach of Ciaccia and Patella [5] on approx- of the queries, the delay was greater than 10 sec, which is imate search in multi-dimensional spaces. According to this considered as a limit for keeping a user’s attention focused approach, a dataset is considered as a random sample (or on a dialog [11]. We expect that such delays will further instance) drawn from a large multidimensional space. If the increase in larger datasets, and for k-NN exact search. In distribution of k-NN distances in this space is known for any these cases, providing early approximate answers will also given query, then we can derive an estimate of the k-NN dis- be crucial. tance for the sample. Unfortunately, such distributions are Fig. 1 presents the progressive results for five example unknown, so the challenge is how to approximate them from queries on the seismic data. For these queries, the time to a given dataset. Ciaccia and Patella [5] use this framework locate the 1-NN (right-most point in each curve) is relatively to determine a mix of probability and distance error bounds long (> 20 sec), while approximate answers (intermediate as stopping conditions for an approximate similarity search. points in each curve) appear with various frequencies and Our goal instead is to provide live estimates of probabilistic trends. For example, for queries q1 and q65, results con- distance bounds to users, and let users decide whether to verge quickly (in the order of hundreds of milliseconds), and trust the current results and stop, wait for a better answer, then only slightly improve. For other queries, such as q57, or keep the process executing in the background while they convergence is more progressive. In all cases, however, the continue with a new search. first approximate answer (left-most point in each curve) be- We briefly discuss promising measures that could help comes available very early (< 50 msec), and approximate users evaluate the progressive results of a similarity search. results very close to the final answer appear within interac- Let T be a space of time series data. For a given query Q ∈ tion times (< 1 sec). T , we define the cumulative distribution function FQ (x) = Overall, our results indicate that (i) supporting interactive P r{dQ ≤ x} that gives the probability that its distance similarity search over large datasets is feasible, and (ii) pro- dQ from a random series in T is lower than or equal to x. From this function, we can derive the probability distribu- [3] P. Buono and A. L. Simeone. Interactive shape specification for tion of k-NN distances dkQ , and from this, we can estimate pattern search in time series. In AVI, 2008. [4] A. Camerra, T. Palpanas, J. Shieh, and E. Keogh. isax 2.0: the expected k-NN distance and assess the uncertainty of Indexing and mining one billion time series. In ICDM, 2010. this value. Furthermore, we can infer probabilistic distance [5] P. Ciaccia and M. Patella. Pac nearest neighbor queries: bounds for dkQ . For example, we can estimate a distance Approximate and controlled search in high-dimensional and metric spaces. In ICDE, 2000. bound d+ such that P r{dkQ > d+ } ≤ 5%. The FQ function [6] P. Ciaccia, M. Patella, and P. Zezula. A cost model for can be also used to derive a probability distribution for the similarity queries in metric spaces. In PODS, 1998. number k of answers whose distance is better than a given [7] M. Correll and M. Gleicher. The semantics of sketch: distance dQ . Investigating additional probabilistic measures Flexibility in visual query systems for time series data. In VAST, 2016. about expected CPU and I/O costs [6] can also be useful. [8] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and We note that the above directions pose two main chal- C. Wang. Sample + seek: Approximating aggregates with lenges. First, the FQ (x) function can be only approximated distribution precision guarantee. In SIGMOD, 2016. based on a specific instance of the space, i.e., the given time [9] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and mining of time series data: series dataset. Ciaccia and Patella [5] argue that for high- Experimental comparison of representations and distance dimensional spaces, this function is close to the cumulative measures. Proc. VLDB Endow., 1(2), 2008. distribution function F (x) of all pairwise distances, and can [10] K. Echihabi, K. Zoumpatianos, T. Palpanas, and be approximated from a sample of distances that are ran- H. Benbrahim. The lernaean hydra of data series similarity search: An experimental evaluation of the state of the art. The domly drawn from a given dataset. Unfortunately, reliably VLDB Journal, 12(2), 2018. estimating the probability distribution of k-NN distances re- [11] J.-D. Fekete and R. Primet. Progressive analytics: A quires large samples of distances. Our early tests have shown computation paradigm for exploratory data analysis. CoRR, that their precomputation can be prohibitively expensive for abs/1607.05162, 2016. [12] D. Fisher, S. M. Drucker, and A. C. König. Exploratory large datasets, such as the ones in Table 1, or larger. We visualization involving incremental, approximate database are currently working on solutions that reduce such compu- queries and uncertainty. IEEE CG&A, 32, 2012. tation costs. [13] M. Glueck, A. Khan, and D. J. Wigdor. Dive in!: Enabling The second challenge is how to communicate to users progressive loading for real-time navigation of data visualizations. In CHI, 2014. probabilistic distance bounds and errors, given that similar- [14] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos. ity distance values do not have a clear interpretation. Our Comparing similarity perception in time series visualizations. goal is to integrate these distance bounds and errors into IEEE TVCG, 25, 2018. the visual representation of a time series, by either using [15] J. Jing, J. Dauwels, T. Rakthanmanon, E. Keogh, S. Cash, and M. Westover. Rapid annotation of interictal epileptiform the query itself, or its progressive k-NN answers. Finally, discharges via template matching under dynamic time warping. we are interested in experimentally assessing how users per- Journal of Neuroscience Methods, 274, 2016. ceive and understand such probabilistic measures. We plan [16] T. Kraska. Northstar: An interactive data science system. PVLDB, 11(12):2150–2164, 2018. to measure whether, and to what extent, the visualization of [17] M. Linardi and T. Palpanas. Scalable, variable-length these probabilistic measures helps them to effectively com- similarity search in data series: The ULISSE approach. plete their visual analysis tasks. PVLDB, 11(13), 2018. [18] M. Mannino and A. Abouzied. Expressive time series querying with hand-drawn scale-free sketches. In CHI, 2018. 5. CONCLUSIONS [19] D. Moritz, D. Fisher, B. Ding, and C. Wang. Trust, but verify: Optimistic visualizations of approximate queries for exploring In this work, we presented preliminary experiments that big data. In CHI, 2017. demonstrate the usefulness (and need) of progressive answer- [20] J. Nielsen. Response times: The 3 important limits. ing of similarity search queries on very large time series data. https://www.nngroup.com/articles/ response-times-3-important-limits/. Our findings show that the greatest cost is not locating the [21] T. Palpanas. Data series management: The road to big 1-NN, but rather waiting for the algorithm to confirm that sequence analytics. SIGMOD Record, 44(2), 2015. there is no better answer and finish execution. [22] S. Rahman, M. Aliakbarpour, H. Kong, E. Blais, We argue that providing progressive answers and esti- K. Karahalios, A. G. Parameswaran, and R. Rubinfeld. I’ve seen ”enough”: Incrementally improving visualizations to mates of probabilistic distance bounds to users, and letting support rapid decision making. PVLDB, 10(11):1262–1273, them decide when to stop the search process, are important 2017. research questions. This would eliminate wasted time and [23] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, reduce user waiting times, in cases where improvement in B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic the final answer is not possible. We have identified two main time warping. In KDD, 2012. challenges: (i) how to compute efficiently distance probabil- [24] C. D. Stolper, A. Perer, and D. Gotz. Progressive visual ity distributions for large data series collections; and (ii) how analytics: User-driven visual exploration of in-progress analytics. IEEE TVCG, 20, 2014. to communicate to users probabilistic distance bounds and [25] E. R. Tufte. The Visual Display of Quantitative Information. errors. Given the increasing popularity of data series anal- 1986. ysis tasks, these research directions are both relevant and [26] C. Turkay, E. Kaya, S. Balcisoy, and H. Hauser. Designing important, offering exciting research opportunities. progressive and interactive analytics processes for high-dimensional data analysis. IEEE TVCG, 23(1), 2017. [27] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A 6. REFERENCES data-adaptive and dynamic segmentation index for whole matching on time series. Proc. VLDB Endow., 6(10), 2013. [1] S. K. Badam, N. Elmqvist, and J.-D. Fekete. Steering the [28] E. Zgraggen, A. Galakatos, A. Crotty, J.-D. Fekete, and craft: Ui elements and visualizations for supporting progressive T. Kraska. How progressive visualizations affect exploratory visual analytics. Comput. Graph. Forum, 36(3), 2017. analysis. IEEE TVCG, 23, 2017. [2] G. E. Batista, E. J. Keogh, O. M. Tataw, and V. M. Souza. [29] K. Zoumpatianos, S. Idreos, and T. Palpanas. Ads: The Cid: An efficient complexity-invariant distance for time series. adaptive data series index. The VLDB Journal, 25(6), 2016. Data Min. Knowl. Discov., 28(3), 2014.