=Paper= {{Paper |id=Vol-2083/paper-14 |storemode=property |title=Map-Based Visual Exploration of Geolocated Time Series |pdfUrl=https://ceur-ws.org/Vol-2083/paper-14.pdf |volume=Vol-2083 |authors=Georgios Chatzigeorgakidis,Dimitrios Skoutas,Kostas Patroumpas,Spiros Athanasiou,Spiros Skiadopoulos |dblpUrl=https://dblp.org/rec/conf/edbt/Chatzigeorgakidis18 }} ==Map-Based Visual Exploration of Geolocated Time Series== https://ceur-ws.org/Vol-2083/paper-14.pdf
        Map-Based Visual Exploration of Geolocated Time Series
      Georgios Chatzigeorgakidis                                     Dimitrios Skoutas                                 Kostas Patroumpas
            University of Peloponnese                                 Athena R.C.                                       Athena R.C.
                     Greece                                             Greece                                             Greece
             chgeorgakidis@uop.gr                           dskoutas@imis.athena-innovation.                   kpatro@imis.athena-innovation.gr
                                                                          gr

                                         Spiros Athanasiou                                    Spiros Skiadopoulos
                                         Athena R.C.                                         University of Peloponnese
                                           Greece                                                     Greece
                               spathan@imis.athena-innovation.gr                                  spiros@uop.gr

ABSTRACT                                                                                 can also be found in other domains, such as in geomarketing or
The amount and significance of time series that are associated                           mobile advertisement, where geolocated time series may repre-
with specific locations, such as visitor check-ins at various places                     sent the number of visitors or the revenue generated at a certain
or sensor readings, have increased in many domains over the last                         location across time. Extracting insights, trends and patterns can
years. Although several works exist for time series visualization                        be significantly facilitated by map-based visualizations of sum-
and visual analytics in general, there is a lack of efficient tech-                      marized time series data. For example, such visualizations can
niques for geolocated time series in particular. In this work, we                        reveal which type of consumption patterns are most frequently
present an approach that relies on a hybrid spatial-time series                          observed among consumers in a certain area or what the spatial
index to allow for interactive map-based visual exploration and                          distribution of sales for a certain product looks like.
summarization of geolocated time series data. In particular, we                             However, time series is an inherently complex data type, and
use the BTSR-tree index, which extends the R-tree by maintain-                           such datasets can reach extremely large volumes, both horizon-
ing bounds for the time series indexed at each node. We describe                         tally (i.e., very long series across time) and vertically (i.e., time
the structure of this index and show how it can be directly ex-                          series generated by countless sources). Consequently, manage-
ploited to produce map-based visualizations of geolocated time                           ment, analysis and exploration of such Big Time Series Data is a
series at different zoom levels efficiently. We empirically validate                     task of excessive complexity, requiring efficient algorithms. In
our approach using two real-world datasets, as well as a synthetic                       particular, visual exploration of geolocated time series needs to
one that is used to test the scalability of our method.                                  process the required information in real time, while the user
                                                                                         interacts with the map. Whenever the user zooms in or scrolls
KEYWORDS                                                                                 the map, visual analytics and aggregates should be computed
                                                                                         on-the-fly, e.g., identifying the predominant patterns in the time
time series visualization, geolocated time series, visual explo-
                                                                                         series and their spatial distribution within the map area.
ration
                                                                                            Consider the example illustrated in Figure 1. When the user
                                                                                         zooms the map into the red rectangle, the visualization tool
1     INTRODUCTION                                                                       should readily identify and present the two patterns (shown
Time series are generated and stored at a vastly increasing rate                         in blue and green color) appearing therein. To avoid cluttering
in many industrial and research applications, including the Web                          the map, when the spatial distribution is very dense or the map
and the Internet of Things, public utilities, finance, astronomy, bi-                    area is too large, it is meaningful to display only aggregate infor-
ology, and many more. A significant portion concerns geolocated                          mation, e.g., the number of time series per identified pattern and
time series, i.e., those generated at, or otherwise associated with                      their respective centroids; individual time series may be identified
specific locations. While indexing, mining and exploring time                            only at a greater zoom level.
series data has attracted a lot of interest from the database and                           Such fast computation and retrieval for large datasets of geolo-
data mining communities [4, 12, 25, 28], studying of geolocated                          cated time series can be enabled by indexing. Several approaches
time series is still largely overlooked.                                                 have been proposed that efficiently index large amounts of plain
   Geolocated time series can be found in various domains and                            time series data, either relying on Discrete Wavelet Transform
applications. A typical example is encountered in the DAIAD                              like [5] to reduce dimensionality of time series, or the family
project1 , where time series are used to represent water consump-                        of indices based on Symbolic Aggregate Approximation (SAX)
tion measured by smart meters installed in urban households.                             over the time series [3, 4, 28, 31]. However, all aforementioned
Analyzing such time series can provide valuable insights regard-                         techniques index the data solely on the time series domain, not
ing trends and patterns of consumer behavior in daily life. Such                         taking the spatial dimension into account. If the analyzed time
results can then be used to forecast and balance water demand,                           series are inherently associated with a spatial attribute (e.g., loca-
as well as to plan and prioritize interventions that can guide                           tions of smart meters), such indexing does not efficiently support
consumers towards more sensible water use. Similar use cases                             queries and visualizations that do not simply apply to the time
                                                                                         series domain, but also involve spatial filters. As in the example
1 http://daiad.eu/
                                                                                         of Figure 1, a user may need to explore time series similar to
                                                                                         a specific pattern, but also having locations within the actual
© 2018 Copyright held by the owner/author(s). Published in the Workshop
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
                                                                                         map area. For such mixed requests, it is inefficient to evaluate
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted        each predicate separately, e.g., using first the time series index to
under the terms of the Creative Commons license CC-by-nc-nd 4.0.




                                                                                    92
                                                                               • Thanks to robust index support, we introduce a method
                                                                                 that can cluster bundles according to time series similarity,
                                                                                 calculate their support and identify their spatial extent,
                                                                                 thus enabling their interactive map-based exploration.
                                                                               • We exemplify the proposed visualization method with two
                                                                                 use cases based on real-world datasets.
                                                                               • We also empirically evaluate the performance of index
                                                                                 traversal, confirming its low execution cost against a large
                                                                                 synthetic dataset of geolocated time series.
                                                                              The remainder of this paper is organized as follows. Section 2
                                                                           reviews related work. Section 3 outlines basic concepts and for-
                                                                           mulates the problem. Section 4 describes the BTSR-tree index and
                                                                           introduces our approach on summarization of geolocated time
                                                                           series for their efficient visual exploration. Section 5 presents
Figure 1: Exploring time series patterns in a spatial region.              indicative use cases with map visualizations and also reports
                                                                           performance results. Finally, Section 6 concludes the paper and
retrieve a candidate set of time series similar to the pattern, and        outline future research directions.
then applying a spatial filter against these candidates to retrieve
only those qualifying within the specified query area.                     2   RELATED WORK
    To address this shortcoming, we have recently proposed an              As our approach suggests visual exploration of time series and is
extension to the R-tree spatial index [16], offering efficient sup-        based on indexing of such data, next we briefly survey both of
port to similarity search on geolocated time series. The idea              these research topics.
behind our BTSR-tree hybrid index [7] is to combine both spatial              Visual Exploration of Time Series. In constrast to declarative
proximity and time series similarity. To that end, in addition to          visualization specifications suggested in [29], a recent tutorial
the standard Minimum Bounding Rectangle (MBR) denoting the                 [21] advocates the use of example-based methods in exploration
spatial bound of its contents, each node is augmented with a               of large relational, textual, and graph datasets. Such a query-
Minimum Bounding Time Series (MBTS), i.e., a band with upper               by-example approach has been applied in [13] so as to explore
and lower bounds that encloses all time series contained in its            relevance feedback for retrieval from time series databases. In-
subtree. Maintaining both kinds of bounds per node enables prun-           stead of returning the top matching time series, this technique
ing the search space simultaneously in the spatial and the time            incorporates diversity into the results, which are presented to
series domains while traversing the index. To increase pruning             the user for feedback and refined in several rounds.
effectiveness, time series indexed in a given node are further                RINSE [32] is a Recursive Interactive Series Explorer specif-
distinguished into bundles on the basis of their similarity, hence         ically designed for exploration of data series. Built on top of
achieving tighter bounds in the MBTS of these bundles.                     ADS+ [31], a special adaptive index structure for data series, it
    In this paper, we take advantage of this novel BTSR-tree in-           can progressively build parts of the index on demand at query
dex and propose an interactive method for map-based visual                 time, concerning only those chunks of the data involved in users’
exploration and summarization over large datasets of geolocated            queries. In terms of visualization, users can get those series qual-
time series. Intuitively, when the user interacts with the map,            ifying to range or nearest-neighbor queries interactively drawn
i.e., either zooms in/out or moves the visible area, this technique        on screen, as well as monitor various statistics regarding the
can identify the most significant patterns characterizing the time         index footprint (e.g., RAM and disk usage) as it gets updated.
series located in this area. These patterns are abstracted by the             In contrast, ATLAS [6] is a visual analytics tool specifically
MBTSs of the bundles contained in the nodes of the BTSR-tree in-           geared towards interactivity when ad hoc filters, arbitrary aggre-
dex within this area, summarizing the various time series therein.         gations, and trend exploration are applied against massive time
In addition, the corresponding MBRs and the number of geolo-               series data. This client-server architecture employs a column
cated time series per pattern can be depicted on the map.                  store as its backend equipped with indexing, and preemptively
    For providing prompt visualizations of summaries over geolo-           caches data that may be required in queries so as to reduce latency
cated time series data and minimizing latency when drawing the             when panning, scrolling, and zooming over time series.
relevant graphic elements, we need early access to both spatial               Recently, the ONEX paradigm [22] concerns online explo-
and time series information while traversing the index. For this           ration of time series. It first constructs compact similarity groups
purpose, we adapt the BTSR-tree index so as to also include ag-            over time series for specific lengths based on Euclidean distance,
gregates per node, i.e., the number of time series pertaining to           and then can efficiently support exploration of these groups with
each bundle. Subsequently, we introduce a new traversal algo-              the Dynamic Time-Warping (DTW) method over their represen-
rithm for efficient retrieval of a given number of bundles that            tatives of different lengths and alignments.
are the most representative in the map area. To the best of our               Besides, smoothing can be applied to streaming time series to
knowledge, this is the first work that considers visual exploration        remove noise in visualizations while preserving large-scale devi-
and summarization of geolocated time series.                               ations [27]. To highlight important phenomena without harming
    In summary, our main contributions are as follows:                     representation quality from oversmoothing, this approach intro-
    • We propose an adapted variant of the BTSR-tree index                 duces quantitative metrics involving variance of first differences
      as well as a novel algorithm for its traversal in order to           and kurtosis to automatically calibrate smoothing parameters.
      quickly retrieve summaries (bundles) of geolocated time                 ForeCache [2] leverages two prefetching mechanisms to facil-
      series within a given area.                                          itate exploration of large geospatial, multidimentional and time




                                                                      93
series data stored in a DBMS. By predicting the user’s behav-                       A time series is a time-ordered sequence of values T = {v 1 , . . . ,
ior, it fetches the necessary data as the user interacts with the               vw }, where vi is the value at the i-th time point and w is the
application.                                                                    length of the series. In particular, we deal with time series that
   None of the aforementioned methods and systems provides                      are additionally characterized by a location, denoted by T .loc.
map-based visual exploration of geolocated time series, as is the               Assuming a 2-dimensional space, we further use the notation
goal of our work in this paper.                                                 T .loc x , T .locy to refer to the (x,y) coordinates of T ’s location.
                                                                                    In the spatial domain, the distance between two geolocated
   Indexing of Time Series. Earlier approaches towards indexing                 time series T and T ′ of equal length w is calculated using the
of time series data were based on leveraging multi-resolution                   Euclidean distance of their respective locations. Furthermore,
representations. For instance, the Discrete Wavelet Transform                   we normalize this distance with maxDistsp , i.e., the maximum
[15] is used in [5] to gradually reduce the dimensionality of                   spatial distance of any pair of objects in the dataset, to obtain a
time series data via the Haar wavelet [17] and generate an index                measure in the interval [0, 1]. Thus:
using the coefficients of the transformed sequences. In [26], it                                       q
is further observed that, other than orthonormal wavelets, bi-                                            (T .loc x − T ′ .loc x ) 2 + (T .locy − T ′ .locy ) 2
orthonormal ones can also be used for efficient similarity search                  distsp (T ,T ′ ) =
                                                                                                                            maxDistsp
over wavelet-indexed time series data, demonstrating several
                                                                                                                                                               (1)
such wavelets that outperform the Haar wavelet in terms of
precision and performance. In addition, an alternative approach
                                                                                   In the time series domain, similarly to other prior works (e.g.,
regarding k-nearest neighbor search over time series data is in-
                                                                                [28]), we also apply the Euclidean distance to measure the simi-
troduced in [18]. The proposed method accesses the coefficients
                                                                                larity of a pair of objects. In future work, we plan to make use of
of Haar-wavelet-transformed time series through a sequential
                                                                                more complex distance measures [24]. More specifically, we cal-
scan over step-wise increasing resolutions.
                                                                                culate the distance between two time series T and T ′ as follows:
   State-of-the-art approaches for time series indexing comprise                                                    v
                                                                                                                    tw
methods based on the Symbolic Aggregate Approximation (SAX)                                                           X
representation [20]. This is derived from the Piecewise Aggregate                                                        (vi − vi′ ) 2
                                                                                                                            i=1
Approximation (PAA) representation of a time series [19, 30], by                                    distt s (T ,T ′ ) =                                       (2)
quantizing the segments of its PAA representation on the y-axis.                                                           maxDistt s
The first attempt to leverage the potential of the SAX representa-              where maxDistt s denotes the maximum distance of any pair of
tion was presented in [28], introducing the indexable Symbolic                  objects in the dataset and is used for normalization, as above.
Aggregate Approximation (iSAX), capable of a multi-resolution                      To index and summarize time series, we use the notion of Mini-
representation for time series. The iSAX index was further ex-                  mum Bounding Time Series (MBTS). An MBTS is a summarization
tended to iSAX 2.0 [3] by enabling bulk loading of time series                  of a set of time series T , defined by a pair of upper and lower
data. Its next version is the iSAX2+ index [4], which handles                   time series bounds that contain them. Formally, given a set of
better the expensive I/O operations caused by the aggressive                    time series T , its MBTS consists of an upper bounding time series
node splitting while building the index. Finally, the ADS+ in-                  Tup and a lower bounding time series Tlo , constructed by selecting
dex [31] is another extension of iSAX, which overcomes the still                the maximum (for Tup ) and minimum (for Tlo ) of values at each
significantly expensive index build time by adaptively building                 time point among all time series of the set as follows:
the index while processing the workload of queries issued by
the user. A comprehensive overview of the time series indexing                                   Tup = {max T [0], . . . , max T [w − 1]}
approaches based on the SAX representation is presented in [23].                                          T ∈T               T ∈T
                                                                                                                                                              (3)
   Unfortunately, none of the abovementioned access methods                                      Tlo = { min T [0], . . . , min T [w − 1]}
                                                                                                          T ∈T              T ∈T
can inherently support geolocated time series, i.e., time series
                                                                                  We can now formally introduce our problem. Given a set of
inextricably associated with a location. To the best of our knowl-
                                                                                geolocated time series and an area, our goal is to produce a
edge, the only index in the literature that supports such time
                                                                                summary for visualization comprising the following two parts:
series is the BTSR-tree index [7]. This hybrid index follows a
similar rationale set by spatio-textual indices [8, 9, 11, 14] that                  • Time series summary: A collection of MBTSs (bundles),
have been proposed to speed up evaluation of queries combin-                           summarizing the time series located within the given area.
ing location-based predicates with keyword search. Essentially,                      • Spatial summary: A set of MBRs, each one associated with
this paradigm implies combining a spatial index structure (e.g.,                       an object counter for each identified bundle.
R-tree, Quadtree, Space-Filling Curve) with a textual index (e.g.,                 The bundles provide a summarization of the time series that
inverted file, signature file). Depending on their structure, these             are contained within their MBTSs. Figure 2 depicts an example
variants can be characterized either as spatial-first or textual-first          of two time series bundles for two different sets of time series.
indices [10]. In a similar spirit, our BTSR-tree is a spatial-first             Regarding the spatial summary, for each MBR associated with
index based on the R-tree that can additionally abstract similarity             a certain bundle, the counter denotes the number of time series
of time series instead of a textual one. As a result, it can offer anal-        contained in it.
ogous improvements when searching against geolocated time
series data, as we discuss in more detail in Section 4.1.                       4     APPROACH
                                                                                We propose a visualization method for geolocated time series
3    PROBLEM DEFINITION                                                         that draws on a map the time series and spatial summaries for
                                                                                the current visible area. Using this process, a user can select the
Next, we introduce notation and definitions used in our approach,               bundle of her preference and the proper spatial summary will
and we formally define the problem addressed in this paper.                     appear on the map after acquiring the necessary MBRs from




                                                                           94
                                                                             the bundles of each node are calculated bottom-up. To construct
                                                                             the time series bundles within each node, we rely on k-means
                                                                             clustering. The objects contained in each node are clustered ac-
                                                                             cording to their Euclidean distance on the time series domain.
                                                                             The example in Figure 2 depicts the bundles (the two bands with
                                                                             a thick outline) obtained for a set of time series (shown as thin
                                                                             polylines) when the number of clusters is set to k = 2.
                                                                                For inner nodes, the bundles are constructed bottom-up. First,
                                                                             in each leaf node, the contained time series are clustered into
                                                                             k bundles. Then, the MBTS of each bundle is computed and
                                                                             stored in the node. As a next step, each parent node receives
                                                                             all the MBTSs of its children and computes its own k bundles
                                                                             and respective set of MBTS by clustering them. The process
                                                                             continues upwards, until reaching the root of the tree. Optionally,
                                                                             Piecewise Aggregate Approximation [19, 30] can be applied over
Figure 2: Example illustrating the resulting bundles for                     the time series. As detailed in [7], this allows a trade off between
two sets of time series.                                                     the number of bundles per node and the MBTS resolution, thus
                                                                             permitting a larger number (> k) of bundles in nodes at higher
                                                                             levels in the tree hierarchy.
the BTSR-tree index. Whenever the user zooms in/out or moves                    In addition, to support the required functionality of our visual-
around the map, the BTSR-tree is traversed, and the correspond-              ization method, we further extend here the information stored in
ing bundles, MBRs and object counts are obtained to drive the                each node with the count of geolocated time series that are fully
visualization. In each case, the rectangle corresponding to the              contained within each bundle. This is also done bottom-up, while
visible part of the map is used to feed a traversal algorithm that           the index is traversed to calculate the bundles. At each leaf node,
efficiently gathers the results. In the following, we describe this          after the clustering, we propagate the number of members of
process in detail, after providing some necessary background                 each cluster to its parent, which, in turn calculates its clusters and
information on the BTSR-tree index.                                          aggregates the counts it has received for each bundle’s members.
                                                                             This procedure continues up to the root of the tree.
4.1    The BTSR-tree Index
To efficiently generate real-time visualizations of geolocated time
series data, we need early access to both spatial and time series re-        4.2    Summary Construction for Map-Based
lated information while traversing the index, in order to maintain                  Visualization
low latency levels when drawing the required graphic elements.
                                                                             We now present our summarization approach for producing map-
However, none of the approaches presented in Section 2 supports
                                                                             based visualizations of geolocated time series. The process is
geolocated time series indexing. To the best of our knowledge,
                                                                             outlined in Algorithm 1. It takes as input the query rectangle
the recently proposed BTSR-tree index [7] is the only one that
                                                                             (q), i.e., the area of the map for which the visualization is pro-
provides the desired functionality.
                                                                             duced, and the number of bundles k to be generated. The process
   The BTSR-tree is based on the R-tree [16] for the spatial
                                                                             comprises three distinct steps. Initially, the BTSR-tree index is
indexing part. The R-tree organizes a hierarchy of nested d-
                                                                             traversed to obtain the MBRs contained in the query rectangle,
dimensional rectangles. Each node corresponds to a disk page
                                                                             along with their bundles and the number of objects per bundle
and represents the MBR of its children or, for leaf nodes, the
                                                                             (Line 1). Next, k-means clustering is applied using the average
MBR of its contained geometries. The number of entries per node
                                                                             time series per bundle as centroids (Line 2). Finally, the new
(excluding the root) is between a lower bound m and a maxi-
                                                                             bundles are calculated and the proper MBRs and correspond-
mum capacity M. Query execution in R-trees starts from the root.
                                                                             ing object counts are assigned to each bundle (Line 3). Next, we
MBRs in any visited node are tested for intersection against the
                                                                             describe each step in more detail.
search region. Qualifying entries are recursively visited until the
                                                                                 Step 1: BTSR-tree Traversal. During this step, the BTSR-tree
leaf level or until no further overlaps are found. Several paths
                                                                             index is traversed, with the target being the fast provision of a
may be probed, as multiple sibling entries could overlap with
                                                                             predefined number k of geolocated time series bundles contained
the search region. The BTSR-tree extends the information stored
                                                                             within the given area q, along with the MBRs where these bundles
within each node of the R-Tree with bundles of MBTSs. This al-
                                                                             can be found and the total number of geolocated time series that
lows to efficiently prune the search space when evaluating hybrid
                                                                             reside within each MBR. All required information is stored within
queries combining time series similarity with spatial proximity.
                                                                             the nodes of the BTSR-tree, thus, when a node that is contained
   As in the standard R-tree, each node of the BTSR-tree has at
                                                                             within the query rectangle is found, the relevant information
least m and at most M entries and stores the MBRs of its chil-
                                                                             is retrieved and added to the intermediate results, without any
dren. Additionally, for each child, a node stores a pre-specified
                                                                             need to continue searching in its sub-tree. The output of this step
number of time series bundles, each consisted of an MBTS that
                                                                             is passed to the next step of the procedure.
encloses all the time series indexed in its subtree. Each bundle is
                                                                                 In more detail, the traversal is performed as follows. After
calculated using Equation 3. Construction and maintenance of
                                                                             initializing a queue with the root’s children (Line 7), we loop
the BTSR-tree follow the procedures of the R-tree for data inser-
                                                                             over it (Line 8) until it’s empty. For each inner node’s child N ′ ,
tion, deletion and node splitting. Objects (i.e., geolocated time
                                                                             we check whether its MBR is contained within the given query
series) are inserted into leaf nodes, and any resulting changes
                                                                             rectangle q (Lines 11–12). If so, its MBR, time series bundles and
are propagated upwards. Once the nodes have been populated,




                                                                        95
the number of objects per bundle are added to the intermediate               Algorithm 1: Summarization of Geolocated Time Series
results (Line 13) as a tuple with the following components:                   Input: The query rectangle q; the number of bundles to be
                                                                                     generated k
             ⟨mbr , {(mbts 1 ,cnt 1 ), ..., (mbtsk ,cntk )}⟩                  Output: A list R containing tuples of bundles, MBRs and
Each such tuple indicates the MBR of a node (mbr ), consisting                        object counts
of the coordinates of the lower left and upper right point, as              1 R ← IndexTraversal (q)                          // Step 1
well as k pairs denoting the bundles of the node along with the             2 R c ← BundlesClusterinд(R,k )                   // Step 2
corresponding number of objects per bundle. If the MBR is not               3 R f ← BundlesCalculation(R c )                  // Step 3
contained in the query rectangle, we check whether it overlaps              4 return R f
with it and if so, we add the child node to the queue (Line 15). If
not, this MBR is located outside the query rectangle, and thus              5 Procedure IndexTraversal (q)
we can skip searching this child. Once no more nodes are left to            6    R←∅
search, the intermediate results are finally returned (Line 16).            7    Q ← Root .дetChildren()
   Step 2: Bundles Clustering. The traversal algorithm returns tu-          8    while Q , ∅ do
ples, each containing the bundles residing in the query rectangle,          9       N ← Q.дetN ext ()
the corresponding nodes’ MBRs and the number of objects per                10       if N is not leaf then
bundle. During step 2, k-means clustering is executed on the               11           foreach N ′ ∈ N .дetChildren() do
average time series of each bundle.                                        12               if q.contains (N ′ .mbr ) then
   Line 2 of Algorithm 1 calls the clustering procedure. Initially,        13                   R ← R ∪ {⟨N ′ .mbr , {N ′ .mbts}, {N ′ .cnt }⟩}
for each tuple (Line 20), we loop over its bundles (Line 21) and
                                                                           14                  else if q.overlaps (N ′ .mbr ) then
generate a new tuple per bundle of the following format:
                                                                           15                      Q ← Q ∪ N ′ .дetChildren()
                       ⟨Tavд ,mbts,cnt,mbr ⟩
                                                                           16     return R
This new tuple contains an average time series, the bundle itself
(mbts), the number cnt of objects enclosed in this bundle, and the         17 Procedure BundlesClusterinд(R,k )
MBR (mbr ) this bundle belongs to (Line 23). The average time              18    Rc ← ∅
series Tavд is calculated by averaging the upper and lower bound
                                                                           19    C←∅
of each bundle (Line 22), i.e., average value at each time point.
                                                                           20    foreach t ∈ R do
The resulting collection of tuples (Line 24) is fed to the k-means
                                                                           21       foreach b ∈ t .mbts do
algorithm in order to return the required number k of bundles
                                                                           22          Tavд ← avд(b.up,b.lo)
to be created. This clustering generates a clustered collection of
tuples using the calculated average time series (Line 25). These           23          t ′ ← ⟨Tavд ,b,t .cnt (b),t .mbr ⟩
results are then forwarded to step 3 (Line 26).                            24          C ← C ∪ {t ′ }
    Step 3: Bundles Calculation and MBR Assignment. During step 3,         25     Rc ← kmeans (C.avд,k )
the clustered tuples received from step 2 are used to calculate the
                                                                           26     return Rc
final bundles, corresponding MBRs and total number of objects
per MBR are assigned to each bundle. The final bundles are calcu-          27 Procedure BundlesCalculation(Rc )
lated in a similar manner to the MBTS bundles during BTSR-tree             28    Rf ← ∅
construction. More specifically, at each time point, we obtain the
                                                                           29    foreach Cl ∈ Rc .clusters do
maximum and minimum value among the corresponding upper
                                                                           30       B←∅
and lower bounds for the bundles of each cluster (see Section
                                                                           31       M←∅
4.1). In case two MBRs that belong to the same final bundle are
the same, their number of objects is aggregated. The final result          32       foreach t ∈ Cl do
is then forwarded to the visualization layer.                              33          B ← updateMBT S (B,t .mbts)
    Line 3 of Algorithm 1 calls the corresponding procedure. For           34          M ← updateMBRs (M,t .mbr ,t .cnt )
each cluster of tuples received from step 2 (Line 29), we loop             35         R f ← R f ∪ {⟨{B}, {M }⟩}
over its members (Line 32) and we use each tuple’s bundle and              36     return R f
MBR to update the upper and lower bounds and the collection of
MBRs that are contained in the final bundle (Lines 33–34). Once
the bounds and the corresponding list of MBRs for the current
bundle have been calculated, we issue an aggregated tuple to the
final result (Line 35). This tuple has the following components:
                                                                           5    EXPERIMENTAL EVALUATION
             ⟨mbts ′ , {(mbr 1 ,cnt 1 ), ..., (mbr n ,cntn )}⟩             In this section, we first describe our experimental setup, followed
where mbts ′ is a resulting bundle, along with the MBRs associated         by indicative examples of map-based visualizations of real-world
with it. Note that the number n of MBRs (as well as their shape)           geolocated time series, as well as scalability results using a syn-
may be varying per bundle, reflecting the spatial distribution             thetic dataset containing 4 million time series. The experiments
of the respective pattern. Each MBR is accompanied with the                were conducted on a Dell PowerEdge M910 with 4 Intel Xeon
corresponding number of objects (i.e., raw time series) therein.           E7-4830 CPUs, each containing 8 cores clocked at 2.13GHz, 256
The final result with all such tuples is then returned in order to         GB RAM and a total storage space of 900 GB. Finally, we assume
generate the visualization (Line 36).                                      that the index fits in memory.




                                                                      96
5.1 Experimental Setup                                                      The bundles are listed on the left of the map, using confidence
   5.1.1 Datasets. We use two real-world datasets selected from             bands to indicate their upper and lower bounds. The average
different application domains and with diverse characteristics. In          time series of each bundle is also depicted. A user can scroll this
addition, we generated a synthetic dataset to test the scalability          list and select the bundle of their preference. Once a bundle is
of our method. Table 1 lists a summary of the characteristics of            selected, the contents of the map are updated accordingly with
each dataset.                                                               the respective MBRs and aggregates.
                                                                                Figure 3 shows an example of such visualization using the
          Table 1: Datasets used in the experiments.                        water dataset. The depicted area is in the center of Alicante, in
                         Area    Number of         Length w of              the most densely populated zone of the city. In this example,
          Dataset                                                           Bundle 4 is selected (indicated with a green colored frame) and
                        (km2 )   time series     each time series
          Water          114         822               168                  the relevant MBRs are shown on the map (using red colored
          Taxi          2,500      417,960             168                  frames). This indicates that inside each depicted MBR there ex-
          Synthetic      114      4,000,000            168                  ists a specific number of geolocated time series that have been
                                                                            clustered to the chosen bundle. As mentioned, each geolocated
                                                                            time series in this dataset represents hourly water consumption
   DAIAD Water Consumption (Water). Courtesy of the DAIAD
                                                                            of a household across one week. Different consumption behav-
project, we acquired a geolocated time series dataset of hourly
                                                                            iors have been grouped together and a daily pattern for each
water consumption for 822 households in Alicante, Spain from
                                                                            bundle can be noticed which is due to the Circadian rhythmic
1/1/2015 to 20/1/2017. In order to get a more representative
                                                                            way that people consume water [1]. The rather large number of
dataset for our tests, we calculated the average weekly time
                                                                            geolocated time series in the bundle, considering the zoom level
series per household, which is the average consumption value
                                                                            and the extent of the MBRs, intuitively suggests that neighboring
per hour of the week. Thus, the length of each resulting time
                                                                            families tend to have similar water consumption behavior.
series is 24 × 7 = 168 values across the week.
                                                                                Figure 4 illustrates another example, this time using the taxi
   NYC taxi dropoffs (Taxi). This dataset contains time series
                                                                            dataset in New York City. This dataset is significantly larger, and
extracted from yellow taxi rides in New York City during 2015.
                                                                            the zoom level selected in this example is lower (a larger geo-
The original data2 provide pick-up and drop-off locations, as well
                                                                            graphic area is visible), hence the MBRs contain a larger number
as corresponding timestamps for each ride. For each month, we
                                                                            of time series. In this figure, we choose Bundle 1, which repre-
generated time series by applying a uniform spatial grid over the
                                                                            sents the rather quieter taxi dropoff zones in Manhattan, as the
entire city (cell side was 200 meters) and counting all drop-offs
                                                                            total number of dropoffs there is rarely over 60 during any hour
therein for each day of the week at the time granularity of one
                                                                            of the week. In this example, there is also a clear daily routine in
hour. Thus, we obtained the number of drop-offs for 24 × 7 time
                                                                            all bundles, with the dropoffs reaching a local maximum twice
intervals in every cell, which essentially captures the weekly
                                                                            per day, suggesting the rush hours in New York City, when peo-
fluctuation of taxi destinations there. The centroid of each cell is
                                                                            ple commute to and from their work. In almost all bundles, the
used as the geolocation of the corresponding time series.
                                                                            daily pattern is significantly different on Saturdays and Sundays,
   Synthetic Water Consumption (Synthetic). To examine the scala-
                                                                            which confirms the intuition that during weekends people do
bility of our method, we generated a synthetic dataset comprising
                                                                            not tend to commute in a routinely fashion. Overall, such visual
4 million geolocated time series by inflating the water consump-
                                                                            representations of information digested from massive time series
tion dataset. This was achieved by using the original time series
                                                                            data can easily catch users’ attention to important phenomena
as seeds and introducing some random variations in their loca-
                                                                            and ongoing trends, confirming the usefulness of our approach.
tion and pattern. We chose the water dataset so as to generate
a more densely populated dataset (Alicante is a medium-sized                5.3    Performance Results
city), in order to stress-test our visualization method.                    In order to evaluate the performance of our approach on larger
   5.1.2 Parameters. We built the BTSR-Tree index setting the               datasets, we built the index using the synthetic dataset and exam-
minimum and maximum number of entries per node to m = 60                    ined its response time for different zoom levels on the map. Since
and M = 200, respectively. Regarding the number of bundles,                 this is intended as an interactive application, where the summa-
we set k 0 = 5 for its leaf nodes. The number of bundles for the            rization method is triggered as soon as the user moves the map,
traversal algorithm is set to be equal to the number of bundles             response times must be adequately small. Ideally, the response
at the leafs, i.e., k = k 0 = 5. For an evaluation of the BTSR-Tree         time should be in the order of milliseconds. In our method, this
index under different parameter settings, please refer to [7].              is facilitated by the fact that the search along a path stops once
                                                                            it encounters a node whose MBR is contained in the actual map
5.2 Map Visualizations                                                      extent (rectangle). In this experiment, we measure the response
                                                                            time for different zoom levels, since zooming-in requires deeper
Our visualization method depicts the MBTS derived for the most
                                                                            traversal of the BTSR-tree index in order to locate the relevant
representative patterns of time series at the currently visible
                                                                            nodes. We use map scales to indicate the different zoom levels.
area of the map. Once our summarization method returns the
                                                                                Figure 5 depicts traversal costs for different map scales over
results, the corresponding MBRs contained in the current view
                                                                            the areas covered by the three datasets. More specifically, the
and zoom level are drawn on the map, along with the number
                                                                            water and synthetic datasets cover the area of the city of Alicante,
of the geolocated time series that belong to the selected bundle.
                                                                            Spain, whereas the taxi dataset the wider metropolitan area of
This number is depicted using circles, colored green for small
                                                                            New York City. Response time in all cases is equal or lower than
numbers, yellow for larger and red for more densely populated
                                                                            one second, which makes this method suitable for interactive
MBRs, thus easily conveying the local intensity of this pattern.
                                                                            visualization. The synthetic dataset, due to its very high density
2 http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml             is significantly slower than the rest, however still the results are




                                                                       97
              Figure 3: Visualizing water consumption patterns in the city center of Alicante (map scale 1:5000).




                     Figure 4: Visualization of taxi dropoff patterns in Manhattan, NYC (map scale 1:10000).


obtained in less than a second. The response for the water dataset               have to be retrieved. The worst case for the synthetic dataset is at
is almost instant due to its small size and very low density.                    scale 1:5000, which roughly corresponds to a large neighborhood
   Initially, in all cases, at the largest scale, the visible area of the        of the city, where many time series are located. For the taxi
map contains all the time series in the dataset, thus it only has to             dataset, the worst case is at 1:20000, which corresponds to the
retrieve information from the root of the index. Then, as we zoom                wider Manhattan area and then the response time gradually drops
in, more nodes have to be visited, as the MBRs of the accessed                   due to the lower dataset density. The number of nodes accessed
nodes begin to overlap with the map rectangle and their children                 in each case is proportional to the response times, ranging from




                                                                            98
                                                                              ACKNOWLEDGMENTS
                                                                              This work has been partially funded and supported by the General
                                                                              Secretariat for Research and Technology (GSRT), the Hellenic
                                                                              Foundation for Research and Innovation (HFRI), the EU Project
                                                                              City.Risks (grant No 653747), and the NSRF 2014-2020 project
                                                                              HELIX (grant No 5002781).

                                                                              REFERENCES
                                                                               [1] J. Aschoff. Circadian rhythms in man. Science, 148(3676):1427–1432, 1965.
                                                                               [2] L. Battle, R. Chang, and M. Stonebraker. Dynamic prefetching of data tiles for
                                                                                   interactive visualization. In SIGMOD, pages 1363–1375, 2016.
                                                                               [3] A. Camerra, T. Palpanas, J. Shieh, and E. J. Keogh. iSAX 2.0: Indexing and
                                                                                   mining one billion time series. In ICDM, pages 58–67, 2010.
                                                                               [4] A. Camerra, J. Shieh, T. Palpanas, T. Rakthanmanon, and E. J. Keogh. Beyond
                                                                                   one billion time series: indexing and mining very large time series collections
                                                                                   with i SAX2+. Knowl. Inf. Syst., 39(1):123–151, 2014.
                                                                               [5] K. Chan and A. W. Fu. Efficient time series matching by wavelets. In ICDE,
                                                                                   pages 126–133, 1999.
                                                                               [6] S. Chan, L. Xiao, J. Gerth, and P. Hanrahan. Maintaining interactivity while
                                                                                   exploring massive time series. In IEEE VAST, pages 59–66, 2008.
                                                                               [7] G. Chatzigeorgakidis, D. Skoutas, K. Patroumpas, S. Athanasiou, and S. Ski-
    Figure 5: Execution time for different map scales.                             adopoulos. Indexing geolocated time series data. In SIGSPATIAL, pages
                                                                                   19:1–19:10, 2017.
                                                                               [8] L. Chen, G. Cong, C. S. Jensen, and D. Wu. Spatial keyword query processing:
                                                                                   An experimental evaluation. PVLDB, 6(3):217–228, 2013.
                                                                               [9] Y. Chen, T. Suel, and A. Markowetz. Efficient query processing in geographic
                                                                                   web search engines. In SIGMOD, pages 277–288, 2006.
one node (the root) in case of the smaller map scale (all city) up to         [10] M. Christoforaki, J. He, C. Dimopoulos, A. Markowetz, and T. Suel. Text vs.
                                                                                   space: efficient geo-search query processing. In CIKM, pages 423–432, 2011.
165 at scale of 1:5000 for the synthetic dataset, one up to 53 for the        [11] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant
taxi dataset and one up to 15 for the water dataset. Interestingly,                spatial web objects. PVLDB, 2(1):337–348, 2009.
fewer node accesses are required in all cases at the very large               [12] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. Querying
                                                                                   and mining of time series data: experimental comparison of representations
scale of 1:500, since the respective small map area overlaps with                  and distance measures. PVLDB, 1(2):1542–1552, 2008.
fewer nodes and most of the search space is pruned.                           [13] B. Eravci and H. Ferhatosmanoglu. Diversity based relevance feedback for
   Consequently, we deem that our method performs adequately                       time series search. Proc. VLDB Endow., 7(2):109–120, 2013.
                                                                              [14] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases.
fast even against a heavily dense synthetic dataset, where a large                 In ICDE, pages 656–665, 2008.
number of time series are contained within a small area.                      [15] A. Graps. An introduction to wavelets. IEEE Comput. Sci. Eng., 2(2):50–61,
                                                                                   1995.
                                                                              [16] A. Guttman. R-trees: A dynamic index structure for spatial searching. In
                                                                                   SIGMOD, pages 47–57, 1984.
6    CONCLUSIONS AND FUTURE WORK                                              [17] A. Haar. Zur theorie der orthogonalen funktionensysteme. Mathematische
In this paper, we introduced a method for map-based visual ex-                     Annalen, 69(3):331–371, 1910.
                                                                              [18] S. Kashyap and P. Karras. Scalable kNN search on vertically stored time series.
ploration of geolocated time series data. To that end, we proposed                 In SIGKDD, pages 1334–1342, 2011.
a summarization approach over geolocated time series, which                   [19] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality
allows a visual analytics application to retrieve the required infor-              reduction for fast similarity search in large time series databases. Knowl. Inf.
                                                                                   Syst., 3(3):263–286, 2001.
mation. Such retrieval can be achieved at low latency, thus being             [20] J. Lin, E. J. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic
suitable for interactive exploration of large volumes of such data.                representation of time series. Data Min. Knowl. Discov., 15(2):107–144, 2007.
                                                                              [21] D. Mottin, M. Lissandrini, Y. Velegrakis, and T. Palpanas. New trends on
The results can be displayed on a map, depicting the relevant                      exploratory methods for data analytics. Proc. VLDB Endow., 10(12):1977–1980,
MBRs and the number of time series contained in each one, for a                    2017.
selected pattern detected in the time series data. Thanks to the              [22] R. Neamtu, R. Ahsan, E. Rundensteiner, and G. Sarkozy. Interactive time
                                                                                   series exploration powered by the marriage of similarity distances. Proc. VLDB
support of a robust hybrid indexing technique, the patterns de-                    Endow., 10(3):169–180, 2016.
tected at a given zoom level are calculated via k-means clustering            [23] T. Palpanas. Big sequence management: A glimpse of the past, the present,
over the time series that reside in the currently visible part of the              and the future. In SOFSEM, pages 63–80, 2016.
                                                                              [24] J. Paparrizos and L. Gravano. k-shape: Efficient and accurate clustering of
map. Our experiments on a large-scale synthetic dataset indicated                  time series. In SIGMOD, pages 1855–1870, 2015.
that the visualization can be rendered adequately fast for use in             [25] P. Paraskevopoulos, G. Pellegrini, and T. Palpanas. Tweeloc: A system for
                                                                                   geolocalizing tweets at fine-grain. In ICDM Workshops, pages 1178–1183, 2017.
interactive map-based applications. Additionally, we presented                [26] I. Popivanov and R. J. Miller. Similarity search over time-series data using
indicative demonstrations of the visualizations generated on two                   wavelets. In ICDE, pages 212–221, 2002.
real-world datasets from different domains, confirming that these             [27] K. Rong and P. Bailis. Asap: Prioritizing attention via time series smoothing.
                                                                                   Proc. VLDB Endow., 10(11):1358–1369, 2017.
visualizations are helpful in revealing patterns both on the time             [28] J. Shieh and E. J. Keogh. iSAX: indexing and mining terabyte sized time series.
series themselves as well as their geographic distribution.                        In SIGKDD, pages 623–631, 2008.
    Our ongoing and future work focuses on supporting more de-                [29] E. Wu, L. Battle, and S. R. Madden. The case for data visualization management
                                                                                   systems: Vision paper. Proc. VLDB Endow., 7(10):903–906, 2014.
tailed visual analytics and identifying more fine-grained patterns            [30] B. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms.
through visual exploration. One possible extension would be to                     In VLDB, pages 385–394, 2000.
                                                                              [31] K. Zoumpatianos, S. Idreos, and T. Palpanas. Indexing for interactive explo-
enable zooms along time, so that the user can identify patterns                    ration of big data series. In SIGMOD, pages 1555–1566, 2014.
and their spatial distribution, not only over the entire time series,         [32] K. Zoumpatianos, S. Idreos, and T. Palpanas. Rinse: Interactive data series
but also over particular intervals. Further, it would be interest-                 exploration with ads+. Proc. VLDB Endow., 8(12):1912–1915, 2015.
ing to drill-down in a particular summarized result and discover
whether there are differentiations in the spatial distributions of
its constituent, more detailed patterns.




                                                                         99