Progressive Similarity Search on Time Series Data
                                                                   [Vision Paper]
        Anna Gogolou1 Theophanis Tsandilas1 Themis Palpanas3 Anastasia Bezerianos2
                                    1
                                        Inria, Univ. Paris-Sud & CNRS, and Univ. Paris-Saclay, France
                                    2
                                         Univ. Paris-Sud & CNRS, Inria and Univ. Paris-Saclay, France
                                                         3
                                                           Univ. Paris-Descartes, France
                      1
                          {first.last}@inria.fr            2
                                                               {first.last}@lri.fr    3
                                                                                          {first}@mi.parisdescartes.fr


ABSTRACT                                                                         corresponding indexing techniques and algorithms [10], in
Time series data are increasing at a dramatic rate, yet their                    order to address the scalability challenges.
analysis remains highly relevant in a wide range of human ac-                       Nevertheless, we observe that time series similarity can
tivities. Due to their volume, existing systems dealing with                     be domain- and visualization-dependent [2, 14], and in many
time series data cannot guarantee interactive response times,                    situations, analysts depend on time-consuming manual anal-
even for fundamental tasks such as similarity search. There-                     ysis processes. For example, neuroscientists manually in-
fore, in this paper, we present our vision to develop analytic                   spect the EEG data of their patients, using visual analysis
approaches that support exploration and decision making                          tools, so as to identify patterns of interest [15, 14]. In such
by providing progressive results, before the final and exact                     cases, it is of paramount importance to have techniques that
ones have been computed. We demonstrate through exper-                           can operate within interactive response times [20], in order
iments that providing first approximate and then progres-                        to enable analysts to complete their tasks easily and quickly.
sive answers is useful (and necessary) for similarity search                        In the past years, several visual analysis tools have com-
queries on very large time series data. Our findings indi-                       bined visualizations with advanced data management and
cate that there is a gap between the time the most similar                       analytics techniques (e.g., [22, 16]), albeit not targeted to
answer is found and the time when the search algorithm                           time series similarity search. The focus of the data manage-
terminates, resulting in inflated waiting times without any                      ment community is on the scalability issues related to the
improvement. We present preliminary ideas on computing                           processing and analysis of very large datasets. However, the
probabilistic estimates of the final results that could help                     state-of-the-art methods on time series processing are still
users decide when to stop the search process, i.e., decid-                       far from achieving interactive response times [10].
ing when improvement in the final answer is unlikely, thus                          To allow for interactive response times when users ana-
eliminating waiting time. Finally, we discuss two additional                     lyze large time series collections, progressive and iterative
challenges: how to compute efficiently these probabilistic                       visual analytics approaches need to be considered [1, 28,
estimates, and how to communicate them to users.                                 26]. These approaches provide progressive answers to users’
                                                                                 requests [12, 24, 19], sometimes based on algorithms that
                                                                                 return quick approximate answers [5, 8, 11]. They support
Keywords                                                                         exploration and decision making by providing progressive
time series, progressive similarity search; progressive visual                   results, before the final and exact ones have been computed.
analytics; progressive error                                                     Most of these techniques consider approximations of aggre-
                                                                                 gate queries on relational databases, with the exception of
                                                                                 Ciaccia and Patella [5], who examine progressive similar-
1.    INTRODUCTION                                                               ity search in multi-dimensional metric spaces. Nevertheless,
   Time series (TS) are sequences of value measurements                          none of these works has considered time series data, which
derived from a wide range of human activities or natural                         have the additional characteristic of being high-dimensional,
processes, such as temperatures per hour, blood oxygen sat-                      i.e., they are hundreds to thousands of points long.
uration per day, or electroencephalography (EEG) signals.
These sequences are becoming ubiquitous in modern life,                          Contributions. We demonstrate the importance of pro-
requiring their analysis that is becoming increasingly chal-                     viding progressive whole-matching [10] similarity search re-
lenging given their sizes [21].                                                  sults on large time series collections. Our preliminary ex-
   Time series analysis involves tasks such as pattern match-                    periments show that there is a gap between the time the 1st
ing, anomaly detection, frequent pattern identification, and                     Nearest Neighbour (1-NN) is found and the time when the
time series clustering or classification. These tasks rely on                    search algorithm terminates. In other words, users often
the notion of time series similarity. The data-mining com-                       wait without any improvement in their answers. We fur-
munity has proposed several techniques, including many sim-                      ther show that high-quality approximate answers are found
ilarity measures (or distance measure algorithms), for calcu-                    very early, e.g., in less than one second, so they can support
lating the distance between two time series [9], as well as                      highly interactive visual analysis tasks. For our benchmarks
                                                                                 we utilize the state-of-the-art Adaptive Data Series (ADS)
                                                                                 index [29], which is among the fastest techniques for answer-
 c 2019 Copyright held by the author(s). Published in the Workshop Proceedings
                                                                                 ing k-Nearest Neighbor (k-NN) queries on time series data.
of the EDBT/ICDT 2019 Joint Conference (March 26, 2019, Lisbon, Portugal) on
CEUR-WS.org.                                                                     Finally, we lay out our vision for developing progressive ap-
proaches for time series exploration and analytics tasks. We
discuss promising directions (and our ongoing work) on how                      Table 1: Experimental datasets
                                                                      Name         Description    Cardinality TS Length
to estimate probabilistic distance bounds, and how to help
users evaluate the quality of progressive results.                   seismic     seismic records    100M         256
                                                                     SALD           MRI data        200M         128
                                                                     deep1B     image descriptors   267M          96
2.   STATE OF THE ART
   Similarity Measures: Ding et al. [9] discuss measures            Patella [5] studied progressive similarity search queries over
that compute similarity between time series. Euclidean Dis-         multi-dimensional spaces and proposed a probabilistic ap-
tance (ED) is the most popular, performing point-by-point           proach for computing the uncertainty of partial similarity
value comparison between two time series. ED can be com-            search results. However, their approach does not scale to
bined with data normalization, often z-normalization, con-          the dataset sizes and number of dimensions that we target.
sidering as similar patterns that may vary in amplitude or             We focus on very large collections (i.e., in the order of
value offset. Based on their analysis, Ding et al. [9] con-         TBs) of data series (where the dimensionality of each series
cluded that there is no superior measure. In our work, we           is in the order of hundreds to thousands), and how we can
focus on ED because [9, 10]: (i) it is an effective and the         develop approaches to support progressive visual analysis in
most commonly used measure in the visualization and data-           a fully interactive system. Our ultimate goal is to study
mining literature; (ii) it leads to efficient solutions for large   how users decide to terminate a search that is progressive
datasets. (We plan to examine other measures, e.g., Dy-             in nature (and thus reduce waiting times) when they are
namic Time Warping (DTW) in our future work.)                       provided with approximate answers and information about
                                                                    their uncertainty. In particular, we are interested in the
Similarity Search and Interactive Querying: The data-               quality of approximate answers and how to communicate to
base community has optimized similarity search methods by           users when no improvement is expected to be obtained even
using index structures [6, 27, 4, 29], or by directly optimizing    if the search algorithm is still running.
sequential scans [23]. Recently, Echihabi et al. [10] compared
these methods in terms of efficiency under a single, unified
experimental framework. Her work indicates that there is
                                                                    3.   PRELIMINARY OBSERVATIONS
no single best method that outperforms all the rest. In our            We first investigate whether progressive time series simi-
work, we use the state-of-the-art ADS index [29], which pro-        larity search in large datasets is feasible. We examine how
vides high-quality approximate answers almost immediately,          early we can provide approximate answers, and how good
and then updates that converge fast to the exact answer.            these answers are compared to the exact answers. To this
  The human-computer interaction community focuses on               end, we conducted similarity search experiments on three
the interactive visual exploration and querying of time series.     real datasets with the state-of-the-art ADS index [29], which
In particular, they are interested in how to form interactive       can quickly provide good initial approximate answers and
similarity search queries. Existing querying approaches on          can potentially support progressive similarity search within
top of line chart visualizations [25] rely either on the inter-     interactive time thresholds.
active selection of part of an existing time series [3], or on      Scope. We examine an important class of queries, i.e., ap-
sketching of patterns to search for [7, 18]. Although we do         proximate and exact 1-NN whole-matching1 similarity search
not study mechanisms of querying in this paper, this line           queries [10]. (We expect that similar results will hold for
of work is orthogonal to our approach, that considers ap-           k-NN and r-range queries, as well as subsequence match-
proximate and progressive results from these queries when           ing [17]; we will cover these cases in our future work.)
interactive search-times are not possible.
                                                                    Environment. We ran all experiments on a Dell T630
Progressive Visual Analytics: A recent research direc-              Rack Server with two Intel Xeon E5-2643 v4 3.4Ghz CPUs,
tion studies the problem of how we can support interactive,         512GB of RAM, and 3.6TB (2 x 1.8TB) HDD in RAID0.
real-time visual analytics when back-end computations can-          The search algorithm is a single-core implementation.
not be performed instantaneously, as is the case of our work.
To this effect we can use progressive and iterative methods         Datasets. We tested real datasets that have also been used
in order to produce fast, but approximate, computational            in previous studies [29, 10]. They include a different number
results and visualizations, that are refined over time with         of series and a different number of points (Table 1) but have
increasing precision. Fekete and Primet [11] provide a sum-         the same overall dataset size of 100GB. The IRIS seismic
mary of the features of a progressive system, where here we         dataset2 consists of seismic instrument recordings from sev-
focus on how to provide: (i) progressively improved answers;        eral stations worldwide and contains 100 million series with
(ii) feedback about the state and costs of the computation;         a length of 256 points. The neuroscience dataset, SALD3 ,
and (iii) guarantees of time and error bounds for progressive       consists of MRI data and contains 200 million series of 128
and final results. We address these features in Sec. 3 and 4.       points. The image processing dataset, deep1B4 , consists of
   The state-of-the-art in big data exploration takes advan-        vectors extracted from the last layers of a convolutional neu-
tage of the power of distributed systems, indexing, and sam-        ral network and contains 267 million series of size 96.
pling methods, and different works utilize one or more of           1
                                                                      Whole-matching refers to the situation where the query and
these methods in order to provide progressive results for dif-      all candidate answers (series) in the dataset have the same
ferent kinds of queries and data. Moritz et al. [19] used an        length.
                                                                    2
existing algorithm [8] which exploits sampling methods and            http://ds.iris.edu/data/access/
                                                                    3
approximate query processing for incremental, approximate             http://fcon 1000.projects.nitrc.org/indi/retro/sald.html
                                                                    4
aggregate database queries. On the other hand, Ciaccia and            http://sites.skoltech.ru/compvision/noimi/
                                                                                                                                            Query
     Table 2: Summary of experimental results                                           18                                                     q1
              1-NN Time (sec)  Total Time (sec)


                                                                   Euclidean Distance
    Dataset Avg Min Max Avg Min Max                                                                                                            q50
    seismic  8.5 0.017 48.5    92    21.3   111                                         17                                                     q53
    SALD     0.4 0.003     5.2 49    0.24 183
                                                                                                                                               q57
    deep1B   0.2 0.001     2.8 76    0.05 189
                                                                                        16                                                     q65

                                                                                                                                               average
Queries. All our query workloads include 100 query series.                              15
We generated the query datasets by extracting random data
series from the raw data. For the deep1B dataset, we used
a real query workload that came with the original dataset.                              14
                                                                                             0.001   0.01      0.1      1      10     100
Measures. For each similarity query, we recorded its overall
completion time, the time for each approximate answer, as                                                   Time (sec) in log-scale
well as the time passed until the algorithm finds the exact        Figure 1: Examples of progressive search for 5
answer to the query, i.e., the 1-NN. For each approximate          queries (seismic dataset). Right-most points in each
and exact answer, we also recorded its Euclidean distance          curve represent the true 1-NN, while intermediate
from the query.                                                    points represent early approximate results. Thick
                                                                   grey line represents average trend over 100 queries.
Results. Table 2 summarizes our results. For each dataset,
we present the average, minimum, and maximum time (in              viding early progressive answers to users could drastically
seconds) for the 1-NN query answering algorithm to first en-       reduce waiting times. The challenge is how to help users
counter the 1-NN answer (marked as 1-NN Time in Table 2),          to assess the quality of such progressive answers and decide
and the corresponding times for the same algorithm to fin-         whether to trust these answers, or wait for a better one.
ish execution (marked as Total Time). We observe that the
total time waiting for a single query to finish can be long,
e.g., up to three minutes, which is beyond acceptable thresh-      4.                    VISION AND CHALLENGES
olds for interactive data-analysis tasks [11]. Moreover, these        Previous work on progressive or optimistic visual analyt-
delays are orders of magnitude longer than the actual time         ics [12, 24, 19] has focused on the estimation of aggregated
needed to find the best answer (i.e., first encounter of 1-        functions, such as the mean, over random or carefully se-
NN). This means that for most queries, the greatest cost           lected samples of the data. In this case, providing feed-
is not locating the 1-NN, but rather confirming that there         back to users about the uncertainty of progressive results
is no better answer: this is why the query answering algo-         can rely on common statistical methods, e.g., confidence in-
rithm finishes execution long after having retrieved the 1-NN      tervals [12, 19], or coarse-grain visualizations of aggregated
value. This finding is consistent with results by Ciaccia and      data [13]. However, such approaches cannot apply to our
Patella [5], who report that most of the time spent in an          problem. Although k-NN distances converge over time (see
exact NN search is ”wasted time, during which no improve-          Fig. 1), providing bounds for their error requires a different
ment is obtained.”                                                 set of statistical tools.
   The time needed to locate the 1-NN was especially fast for         Our vision is to develop methods for progressive time se-
the SALD and deep1B datasets, where average times were             ries similarity search, coupled with appropriate bounds on
below 1 sec. However, times varied greatly on the seismic          the errors of the intermediate results. We are inspired by the
dataset, ranging from a few milliseconds to 48.5 sec. For 28%      probabilistic approach of Ciaccia and Patella [5] on approx-
of the queries, the delay was greater than 10 sec, which is        imate search in multi-dimensional spaces. According to this
considered as a limit for keeping a user’s attention focused       approach, a dataset is considered as a random sample (or
on a dialog [11]. We expect that such delays will further          instance) drawn from a large multidimensional space. If the
increase in larger datasets, and for k-NN exact search. In         distribution of k-NN distances in this space is known for any
these cases, providing early approximate answers will also         given query, then we can derive an estimate of the k-NN dis-
be crucial.                                                        tance for the sample. Unfortunately, such distributions are
   Fig. 1 presents the progressive results for five example        unknown, so the challenge is how to approximate them from
queries on the seismic data. For these queries, the time to        a given dataset. Ciaccia and Patella [5] use this framework
locate the 1-NN (right-most point in each curve) is relatively     to determine a mix of probability and distance error bounds
long (> 20 sec), while approximate answers (intermediate           as stopping conditions for an approximate similarity search.
points in each curve) appear with various frequencies and          Our goal instead is to provide live estimates of probabilistic
trends. For example, for queries q1 and q65, results con-          distance bounds to users, and let users decide whether to
verge quickly (in the order of hundreds of milliseconds), and      trust the current results and stop, wait for a better answer,
then only slightly improve. For other queries, such as q57,        or keep the process executing in the background while they
convergence is more progressive. In all cases, however, the        continue with a new search.
first approximate answer (left-most point in each curve) be-          We briefly discuss promising measures that could help
comes available very early (< 50 msec), and approximate            users evaluate the progressive results of a similarity search.
results very close to the final answer appear within interac-      Let T be a space of time series data. For a given query Q ∈
tion times (< 1 sec).                                              T , we define the cumulative distribution function FQ (x) =
   Overall, our results indicate that (i) supporting interactive   P r{dQ ≤ x} that gives the probability that its distance
similarity search over large datasets is feasible, and (ii) pro-   dQ from a random series in T is lower than or equal to x.
From this function, we can derive the probability distribu-              [3] P. Buono and A. L. Simeone. Interactive shape specification for
tion of k-NN distances dkQ , and from this, we can estimate                  pattern search in time series. In AVI, 2008.
                                                                         [4] A. Camerra, T. Palpanas, J. Shieh, and E. Keogh. isax 2.0:
the expected k-NN distance and assess the uncertainty of                     Indexing and mining one billion time series. In ICDM, 2010.
this value. Furthermore, we can infer probabilistic distance             [5] P. Ciaccia and M. Patella. Pac nearest neighbor queries:
bounds for dkQ . For example, we can estimate a distance                     Approximate and controlled search in high-dimensional and
                                                                             metric spaces. In ICDE, 2000.
bound d+ such that P r{dkQ > d+ } ≤ 5%. The FQ function
                                                                         [6] P. Ciaccia, M. Patella, and P. Zezula. A cost model for
can be also used to derive a probability distribution for the                similarity queries in metric spaces. In PODS, 1998.
number k of answers whose distance is better than a given                [7] M. Correll and M. Gleicher. The semantics of sketch:
distance dQ . Investigating additional probabilistic measures                Flexibility in visual query systems for time series data. In
                                                                             VAST, 2016.
about expected CPU and I/O costs [6] can also be useful.
                                                                         [8] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and
   We note that the above directions pose two main chal-                     C. Wang. Sample + seek: Approximating aggregates with
lenges. First, the FQ (x) function can be only approximated                  distribution precision guarantee. In SIGMOD, 2016.
based on a specific instance of the space, i.e., the given time          [9] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and
                                                                             E. Keogh. Querying and mining of time series data:
series dataset. Ciaccia and Patella [5] argue that for high-                 Experimental comparison of representations and distance
dimensional spaces, this function is close to the cumulative                 measures. Proc. VLDB Endow., 1(2), 2008.
distribution function F (x) of all pairwise distances, and can          [10] K. Echihabi, K. Zoumpatianos, T. Palpanas, and
be approximated from a sample of distances that are ran-                     H. Benbrahim. The lernaean hydra of data series similarity
                                                                             search: An experimental evaluation of the state of the art. The
domly drawn from a given dataset. Unfortunately, reliably                    VLDB Journal, 12(2), 2018.
estimating the probability distribution of k-NN distances re-           [11] J.-D. Fekete and R. Primet. Progressive analytics: A
quires large samples of distances. Our early tests have shown                computation paradigm for exploratory data analysis. CoRR,
that their precomputation can be prohibitively expensive for                 abs/1607.05162, 2016.
                                                                        [12] D. Fisher, S. M. Drucker, and A. C. König. Exploratory
large datasets, such as the ones in Table 1, or larger. We                   visualization involving incremental, approximate database
are currently working on solutions that reduce such compu-                   queries and uncertainty. IEEE CG&A, 32, 2012.
tation costs.                                                           [13] M. Glueck, A. Khan, and D. J. Wigdor. Dive in!: Enabling
   The second challenge is how to communicate to users                       progressive loading for real-time navigation of data
                                                                             visualizations. In CHI, 2014.
probabilistic distance bounds and errors, given that similar-           [14] A. Gogolou, T. Tsandilas, T. Palpanas, and A. Bezerianos.
ity distance values do not have a clear interpretation. Our                  Comparing similarity perception in time series visualizations.
goal is to integrate these distance bounds and errors into                   IEEE TVCG, 25, 2018.
the visual representation of a time series, by either using             [15] J. Jing, J. Dauwels, T. Rakthanmanon, E. Keogh, S. Cash, and
                                                                             M. Westover. Rapid annotation of interictal epileptiform
the query itself, or its progressive k-NN answers. Finally,                  discharges via template matching under dynamic time warping.
we are interested in experimentally assessing how users per-                 Journal of Neuroscience Methods, 274, 2016.
ceive and understand such probabilistic measures. We plan               [16] T. Kraska. Northstar: An interactive data science system.
                                                                             PVLDB, 11(12):2150–2164, 2018.
to measure whether, and to what extent, the visualization of
                                                                        [17] M. Linardi and T. Palpanas. Scalable, variable-length
these probabilistic measures helps them to effectively com-                  similarity search in data series: The ULISSE approach.
plete their visual analysis tasks.                                           PVLDB, 11(13), 2018.
                                                                        [18] M. Mannino and A. Abouzied. Expressive time series querying
                                                                             with hand-drawn scale-free sketches. In CHI, 2018.
5.   CONCLUSIONS                                                        [19] D. Moritz, D. Fisher, B. Ding, and C. Wang. Trust, but verify:
                                                                             Optimistic visualizations of approximate queries for exploring
   In this work, we presented preliminary experiments that                   big data. In CHI, 2017.
demonstrate the usefulness (and need) of progressive answer-            [20] J. Nielsen. Response times: The 3 important limits.
ing of similarity search queries on very large time series data.             https://www.nngroup.com/articles/
                                                                             response-times-3-important-limits/.
Our findings show that the greatest cost is not locating the
                                                                        [21] T. Palpanas. Data series management: The road to big
1-NN, but rather waiting for the algorithm to confirm that                   sequence analytics. SIGMOD Record, 44(2), 2015.
there is no better answer and finish execution.                         [22] S. Rahman, M. Aliakbarpour, H. Kong, E. Blais,
   We argue that providing progressive answers and esti-                     K. Karahalios, A. G. Parameswaran, and R. Rubinfeld. I’ve
                                                                             seen ”enough”: Incrementally improving visualizations to
mates of probabilistic distance bounds to users, and letting                 support rapid decision making. PVLDB, 10(11):1262–1273,
them decide when to stop the search process, are important                   2017.
research questions. This would eliminate wasted time and                [23] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista,
reduce user waiting times, in cases where improvement in                     B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and
                                                                             mining trillions of time series subsequences under dynamic
the final answer is not possible. We have identified two main                time warping. In KDD, 2012.
challenges: (i) how to compute efficiently distance probabil-           [24] C. D. Stolper, A. Perer, and D. Gotz. Progressive visual
ity distributions for large data series collections; and (ii) how            analytics: User-driven visual exploration of in-progress
                                                                             analytics. IEEE TVCG, 20, 2014.
to communicate to users probabilistic distance bounds and
                                                                        [25] E. R. Tufte. The Visual Display of Quantitative Information.
errors. Given the increasing popularity of data series anal-                 1986.
ysis tasks, these research directions are both relevant and             [26] C. Turkay, E. Kaya, S. Balcisoy, and H. Hauser. Designing
important, offering exciting research opportunities.                         progressive and interactive analytics processes for
                                                                             high-dimensional data analysis. IEEE TVCG, 23(1), 2017.
                                                                        [27] Y. Wang, P. Wang, J. Pei, W. Wang, and S. Huang. A
6.   REFERENCES                                                              data-adaptive and dynamic segmentation index for whole
                                                                             matching on time series. Proc. VLDB Endow., 6(10), 2013.
 [1] S. K. Badam, N. Elmqvist, and J.-D. Fekete. Steering the           [28] E. Zgraggen, A. Galakatos, A. Crotty, J.-D. Fekete, and
     craft: Ui elements and visualizations for supporting progressive        T. Kraska. How progressive visualizations affect exploratory
     visual analytics. Comput. Graph. Forum, 36(3), 2017.                    analysis. IEEE TVCG, 23, 2017.
 [2] G. E. Batista, E. J. Keogh, O. M. Tataw, and V. M. Souza.          [29] K. Zoumpatianos, S. Idreos, and T. Palpanas. Ads: The
     Cid: An efficient complexity-invariant distance for time series.        adaptive data series index. The VLDB Journal, 25(6), 2016.
     Data Min. Knowl. Discov., 28(3), 2014.