Towards In-Memory Sub-Trajectory Similarity Search
                       Omid Isfahani Alamdari                                                                    Mirco Nanni
                               University of Pisa                                                           ISTI-CNR Pisa, Italy
                              alamdari@di.unipi.it                                                         mirco.nanni@isti.cnr.it

                              Roberto Trasarti                                                                Dino Pedreschi
                            ISTI-CNR Pisa, Italy                                                             University of Pisa
                         roberto.trasarti@isti.cnr.it                                                    dino.pedreschi@di.unipi.it

ABSTRACT                                                                               between A and C is more than δ , the two trajectories are not
Spatial-temporal trajectory data contains rich information about                       considered similar, although they have very similar routes. In this
moving objects and has been widely used for a large number                             paper, we address the problem of sub-trajectory similarity search,
of real-world applications. However, the complexity of spatial-                        where we aim to find similar sub-trajectories in the dataset to the
temporal trajectory data, on the one hand, and the fast collection                     whole or a sub-trajectory of the query trajectory. An interesting
of datasets, on the other hand, has made it challenging to ef-                         application of this query is carpooling, where a user can share a
ficiently store, process, and query such data. In this paper, we                       ride with other users who have similar routes. For the example
propose a scalable method to analyze the sub-trajectory simi-                          shown in Figure 1, the carpooling system can match trajectories
larity search in an in-memory cluster computing environment.                           T1 and T2 and hence, suggest the user who follows T1 to share
Notably, we have extended Apache Spark with efficient trajectory                       the ride with the user who follows T2, with possibly a small
indexing, partitioning, and querying functionalities to support                        deviation from his/her routine path.
the sub-trajectory similarity query. Our experiments on a real                            The two state-of-the-art works [6] and [4] find similarities be-
trajectory dataset have shown the efficiency and effectiveness of                      tween whole trajectories and are unable to find similarities in the
the proposed method.                                                                   sub-trajectory level. Different from those works, our method
                                                                                       targets the more computationally expensive problem of sub-
                                                                                       trajectory similarity search and tries to find not only the sim-
1    INTRODUCTION                                                                      ilarities in whole trajectory curves but also in between every
Following the ubiquity of location-aware devices, collecting the                       sub-trajectory of data and query trajectories. Furthermore, our
location data of moving objects is now much easier thanks to                           work is different from [5] in the sense that we focus on in-memory
the developments in positioning systems and mobile communica-                          analytics to enhance the execution of interactive queries and at
tions. Many applications in the areas of transportation and urban                      the same time being able to use the sub-trajectory similarity
planning can benefit from this data to improve their quality of                        search as a primitive for more advanced trajectory pattern min-
service. Moreover, the location-based services (LBS) market that                       ing algorithms that try to identify objects that move together
primarily depends on the data gathered by location-aware de-                           closely. In this regard, our goal is to use less memory space and
vices is rapidly expanding. Analyzing the historical trajectories                      accelerate the local computations inside partitions.
of moving objects plays an essential role in the success of appli-
cations, such as carpooling and route recommendation systems.
However, the extensive use of LBS applications imposes signif-
icant storage and processing challenges to application servers.
Every day several gigabytes or even terabytes of moving objects
tracks are accumulated in servers, making their post-processing
an arduous task.
   One of the critical operations in trajectory data analytics is
similarity search where given a query trajectory Q and a distance
threshold δ , it returns all trajectories with distance at most δ
to Q, according to a distance function. We call this problem the
whole trajectory similarity search problem, in which trajectories
are matched as a whole. In other words, for every comparison                                   Figure 1: Two trajectories with similar routes
between a query and a candidate trajectory, the first and last
points of both trajectories are matched, and depending on the                             Parallel processing of trajectory data is inevitable while deal-
type of the distance function, either matching or coupling of the                      ing with massive amounts of trajectory data. The main goal of
in-between points is established.                                                      this paper is to introduce a novel approach to the sub-trajectory
   However, as illustrated in Figure 1, there are many situations                      similarity search problem in an in-memory cluster computing
where start and end points of trajectories are far from each other                     environment, i.e., Apache Spark.
while they have similar parts in-between. In this figure, the start
and end points of trajectory T1 (blue solid line) are A and B, and                     2    PROBLEM FORMULATION
for trajectory T2 (red dashed line) they are C and D. If the distance
                                                                                       In this section, we formally define trajectories, sub-trajectories,
Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceed-   and the sub-trajectory similarity search problem.
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, , Copenhagen,
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-                Definition 2.1. A trajectory T is a sequence of time-stamped
tribution 4.0 International (CC BY 4.0).                                               sample points ⟨p1, . . . , p |T | ⟩ where each pi is a triplet (x i , yi , ti )
                                                                                                                                               Real Partition
                                                                                                       Raw Trajectories        Local Indexes    Boundaries


                                                                                      RDD Partitions
                                                                                                                                                                Global
                                                                                                                                                                Index
                                                                                                              ..          ..        ..             ..
                                                                                                               .           .         .              .

        Figure 2: A trajectory and its simplifications


that indicates the spatial location (x i , yi ) of the moving object at                                            Worker Nodes                                 Master Node


timestamp ti .
    Even though the movement is a continuous phenomenon, a                          Figure 3: The distributed indexing structure.
trajectory is recorded by a finite set of discrete sample points. Let
T [i] be the i t h sample point of T . We consider that the move in-         4     DISTRIBUTED TRAJECTORY INDEXING
between two consecutive points T [i] and T [i + 1] is on a straight
                                                                             In this section, we elaborate on a bottom-up approach for building
line with constant speed and hence, the expected location of the
                                                                             the distributed trajectory index. Each data partition builds its
trajectory T between T [i] and T [i + 1] is obtained by a linear
                                                                             local index, and the global index summarizes the information
interpolation. We use T (t) to denote the position (x, y) of the
                                                                             about local indexes. We dedicate this section for describing how
object at timestamp t ∈ [t 1, t |T | ]. Furthermore, we represent the
                                                                             we partition trajectory data and build local and global indexes.
number of points in a trajectory T with |T | and its spatial length
             Í |T |−1
as SL(T ) = i=1 d(pi , pi+1 ) where d is the Euclidean distance.             4.1    Trajectory Partitioning
   Definition 2.2. A sub-trajectory T [i : j] is a contiguous subse-         We use a two-step partitioning approach, in which the data is first
quence of T starting at point T [i] and ending at point T [j], where         partitioned based on trajectory identifiers resulting in all points
1 ≤ i ≤ j ≤ |T |.                                                            of each trajectory being ended up in the same machine. Then, we
   Definition 2.3. An approximation line segment for trajec-                 utilize the Sort-Tile-Recursive (STR) [3] method that considers
tory (or sub-trajectory) T , denoted by ALS T , is the line segment          the distribution of points to compute the partition boundaries
p1p |T | connecting the first and last point of T .                          based on the sample of points drawn from the dataset.
                                                                                 After this step, the challenging task is to distribute the trajec-
    Definition 2.4. Given a query trajectory Q, a set of trajectories        tory data to different partitions in possibly different computing
D = {T1, . . . ,TN }, a distance function F , a distance threshold           nodes. Considering the complexities of trajectory data such as
δ > 0 and a minimum length threshold λ > 0, the sub-trajectory               skewness and inherent sequentiality, and avoiding the recon-
similarity search returns the set of trajectories R = {R 1, . . . , R M },   struction costs, we distribute the whole trajectory objects. That
s.t. for every R j ∈ R, the following holds:                                 is, all points of the same trajectory end up in the same partition.
                                                                             We assign a trajectory to the partition that its bounding rectangle
    (1) There exists a Ti ∈ D s.t. R j is a sub-trajectory of Ti ,
                                                                             has the largest intersection with the MBR of the trajectory. In
    (2) There exists a sub-trajectory Q[a : b] of Q such that
                                                                             case of ties, one is selected randomly. This technique avoids the
         F (R j , Q[a : b]) ≤ δ ,
                                                                             reconstruction of trajectories for similarity comparison that may
    (3) SL(R j ) ≥ λ
                                                                             require a huge amount of data shuffle among nodes.
3    SYSTEM OVERVIEW                                                         4.2    Local Indexing
In this section, we describe our proposed method for distributed
                                                                             Once partitioning is done, a local index should be constructed
processing of the sub-trajectory similarity search. After reading
                                                                             and cached for each partition to help in executing queries. In this
the trajectory dataset, three steps of sampling, partitioning, and
                                                                             step, we divide a trajectory at some significant points such that
indexing are executed. In the sampling phase, a uniform random
                                                                             the directional trend for each sub-trajectory is preserved. For this
sample of the trajectory points is drawn from the input points.
                                                                             reason, we use the popular Douglas-Peucker (DP) algorithm [1]
The idea is to acquire knowledge about the distribution of data
                                                                             to identify those points.
in the geographical extent of the input dataset. This sample of
                                                                                The original idea behind this algorithm is to approximate
points is used to build the partitioner object, which provides
                                                                             the trajectory with some points, known as splitting points and
information about the computed boundaries of partitions and
                                                                             discard other points. Different from the original algorithm, we
methods for locating objects in the partitions. Then, the physical
                                                                             segment the trajectory at those splitting points. We index each
partitioning of trajectory objects is performed using the parti-
                                                                             sub-trajectory T with ALS T using a query-only R-Tree. Thus, the
tioner, and trajectory objects are created inside partitions. In the
                                                                             ALS of sub-trajectories could be used for comparison and pruning.
final step, we build a 2-level distributed index over the trajectory
                                                                             In Figure 2 the whole trajectory is simplified as 3 approximate line
data. This index will be used by the query processor to discard
                                                                             segments, between points p1 , p7 , p14 and p22 . The line segments
the irrelevant data to the query, both at the global and local levels.
                                                                             approximate the original curve by a deviation of at most ϵ [1]:
Finally, the query processor verifies the similarity between the
candidate sub-trajectories that were not discarded and the query.               Lemma 4.1. Every point in the sub-trajectory T , segmented using
The structure of the distributed index is provided in Figure 3.              a tolerance threshold ϵ, has a distance at most ϵ to the ALS T .
                                                                                                                                             Q2


   Since the exact distance computation between trajectories is
expensive, in the search phase we perform most of the compar-                          T
isons with the ALS of trajectories to reduce the size of candidate
set (on which the exact distance is evaluated) as much as possible.
                                                                                                 A        B

4.3    Global Indexing                                                                                                   Q


To locate trajectories at the search phase, we need this global                                       D       C


index to keep track of real boundaries of data inside partitions
rather than computed boundaries after the sampling phase. The
real boundaries of all local indexes (i.e. the root node of each
local R-Tree) is fetched to the master node to build the global
R-Tree index. This index is the first access method utilized in the
search phase to prune partitions – and the trajectories inside –       Figure 4: Example of different arrangement of ALS of sub-
that are far away from the query trajectory.                           trajectories.

5     DISTRIBUTED SUB-TRAJECTORY
      SIMILARITY SEARCH                                                the ALS of data and query trajectories have minimum distance of
                                                                       more than 2ϵ + δ , and therefore the trajectories can be ignored.
The search procedure is composed of three steps. In the first
step, the global index is used to identify partitions that may            5.2.3 Arrangement of Query and Data Sub-trajectories. After
contain results and discard those that are sufficiently far away       the filtering of the previous step, the remaining sub-trajectories
from the query. In the second step, for each of the partitions         have ALS within the buffer around at least one of the query
identified, a task is initiated, and the partition’s local index is    segments. Figure 4 illustrates an arrangement between query and
used to find the trajectories whose Minimum Bounding Rectangle         data ALSs. The intersection area between the ϵ-buffers around
(MBR) overlaps with the query sub-trajectory. Before computing         query and data segments is determined by a polygon with vertices
the exact distance function, in the third step, we safely prune        A, B, C, D. We search for the similar parts of trajectories inside
some sub-trajectories by their ALS, and then for the remaining         these polygons.
we identify the relevant parts of sub-trajectories to the query
                                                                          5.2.4 Relevant Parts of Sub-trajectories. In this step, we find
sub-trajectory. For those relevant parts, we compute the exact
                                                                       the relevant parts of sub-trajectories by obtaining the parts of
distance function and collect the results.
                                                                       the original trajectories inside the intersection polygon. Those
                                                                       parts could be similar and the exact distance should be computed
5.1    Partition Pruning
                                                                       between them. We find the pairs of trajectory points pi and
Given a query trajectory Q and a distance threshold δ , we com-        pi+1 that overlap with the sides of the intersection polygon. We
pute the MBR of the Q and expand the MBR by δ to cover all             interpolate points where the line segment pi pi+1 exactly overlaps
trajectories which may be adjacent to the query trajectory by          the side and add them to the final candidate (sub)trajectory. We
its boundaries. Then we query the global index to obtain the           perform the same procedure for every overlapping point pairs.
partitions that have data boundary overlap with the expanded
MBR. It is required to initiate tasks only for these partitions and       5.2.5 Exact Distance Computation. The sub-trajectory pairs
send the query trajectory to them.                                     obtained from the previous step are the final candidates for the
                                                                       exact distance computation. We use discrete Fréchet distance [2]
5.2    Search within Partitions                                        for this purpose, which is a good measure for curve comparison
Given a query trajectory Q, a distance threshold δ and an error        and capable of maintaining the order of coupling points. If the
tolerance ϵ for simplification, simplify Q with the same algorithm     distance between a data and query sub-trajectory pair is not
described in Section 4.2 and the same ϵ used for segmenting data       larger then distance threshold δ , and its spatial length is higher
trajectories. We denote a (simplified) segment of the segmented        than λ, we add the data sub-trajectory to the results.
query as query segment.
                                                                       6     EXPERIMENTS
    5.2.1 Pruning by MBR of ALS. For each query segment, we            In this section, we describe the evaluation of the proposed method
first retrieve all data sub-trajectories from the local R-Tree whose   on a small cluster of machines. The objective of this experimental
MBR overlaps with its expanded MBR. Here, we expand the MBR            evaluation is to assess the performance of the proposed method
of the query segment by a buffer of 2ϵ + δ . The intuition is that,    in terms of query latency and index building time.
based on Lemma 4.1, the points of data sub-trajectories could be
in at most ϵ distance from their ALS and the same holds for the        6.1    Cluster Setup
query sub-trajectory. Thus, we add 2ϵ to account for the error
                                                                       All experiments were conducted on a cluster of 5 machines di-
bound of query and data sub-trajectory. The addition of δ to the
                                                                       vided as 1 master node and 4 worker nodes. Each machine has
expansion is justified in the same way for the partition pruning.
                                                                       a 4-core Intel Core i5-7400 @ 3.00GHz processor and 16 GB
This step prunes a large number of segments that could be safely
                                                                       main memory. From each machine, 4 GB of main memory and
discarded.
                                                                       1 CPU core is reserved for Operating System and Hadoop dae-
   5.2.2 Pruning by ALS. For the remaining data segments after         mons. Thus, the 4 worker nodes can provide a total of 48 GB of
pruning by MBR, we do a simple pruning based on the distance           RAM and 3 cores each. Each node runs Ubuntu 18.04.2 LTS with
computation between two line segments. Indeed, in some cases           Hadoop 2.7.2 and Spark 2.4.0. All implementations are in Scala
the MBR of data trajectories overlap with query segments, yet          programming language v2.11.6.
Figure 5: Query and index time with respect to data size.                Figure 6: Query and index time with respect to cluster size.

                                                                                      (a)                         (b)                        (c)
6.2    Dataset
The dataset used in the experiments is a total of 30 GB of GPS
traces of one year of car trajectories in the area of Tuscany, Italy.
Since the whole track of each user could be very long, if the dis-
tance or time interval between two subsequent points is higher
than predefined thresholds, we divide the track into two trajec-
tories, corresponding to meaningful trips. We used the cut-off
spatial and temporal thresholds of 300 meters and 30 minutes,
respectively.

6.3    Index Building Time                                                                  Figure 7: Visualization of results.
The index building is comprised of reading a sample of the dataset,
computing the partition boundaries, physical partitioning, build-        working with them is simple and fast. Our aim is to postpone
ing local indexes, and finally building the global index. The build-     the expensive exact distance computation until the end of a suite
ing of indexes takes a significant amount of time, but it should         of pruning techniques on ALSs. The pruning starts from simple
be noted that the generated indexes can be stored in HDFS and            line segment comparisons and continues with ALS matching
used later without the need for the recomputing. We study the            and extracting relevant parts of sub-trajectories. At the end, we
scalability of our method against the dataset size and cluster size,     compare the relevant parts and check whether they are similar
in building the index structure and querying. The right axes in          according to the distance threshold. We performed experiments
Figure 5 and Figure 6 show the time for building the index for           to analyze the performance of our method, which shows good
different sizes of data and different number of worker nodes in          results in answering queries, and visual representation of some
the cluster.                                                             of the results suggests that it can produce significant output.
                                                                            Our future works on this topic include experimenting the tool
6.4    Query Latency                                                     on larger datasets and larger parallel platforms, as well as explor-
For the query latency, we executed 50 queries randomly chosen            ing its applicability in applications such as traffic jam detection,
from the same dataset and the run-times are averaged. Since the          carpooling and other clustering-based ones.
latency of different queries may vary significantly, we also report
the latency for the 5% and for the 95% of the run-times. Depending       ACKNOWLEDGMENTS
on the location of the query, whether it is in a high density area       This work is partially supported by the European Community
or sub-urban area, different queries can have different run-times.       H2020 programme under the funding scheme Track &Know (Big
The left axes in Figure 5 and Figure 6 report the query latencies        Data for Mobility Tracking Knowledge Extraction in Urban Ar-
for different sizes of data and computing nodes. As an example,          eas), G.A. 780754, trackandknowproject.eu.
the query run-time for the 10GB dataset is 6.6 seconds in average
with a 5% − 95% interval of [3.9 − 12.9] seconds.                        REFERENCES
                                                                         [1] David Douglas and Thomas Peucker. 1973. Algorithms for the reduction of the
   6.4.1 Visualization of Results. Figure 7 shows three examples             number of points required to represent a digitized line or its caricature. The
of a query trajectory and one of the results from data. The dashed           Canadian Cartographer 10, 2 (1973), 112–122.
(red) and solid (blue) trajectories are query and data, respectively.    [2] Thomas Eiter and Heikki Mannila. 1994. Computing discrete Fréchet distance.
                                                                             Technical Report. Citeseer.
The markers show the starting point of the trajectories. The             [3] Scott T Leutenegger, Mario A Lopez, and Jeffrey Edgington. 1997. STR: A simple
highlighted part (in gray) of the data trajectory is the similar part.       and efficient algorithm for R-tree packing. In Proceedings 13th International
                                                                             Conference on Data Engineering. IEEE, 497–506.
This figure shows the effectiveness of our method in identifying         [4] Zeyuan Shang, Guoliang Li, and Zhifeng Bao. 2018. DITA: Distributed in-
the similar sub-trajectories in big trajectory dataset.                      memory trajectory analytics. In Proceedings of the 2018 International Conference
                                                                             on Management of Data. ACM, 725–740.
                                                                         [5] Panagiotis Tampakis, Christos Doulkeridis, Nikos Pelekis, and Yannis Theodor-
7     CONCLUSION                                                             idis. 2019. Distributed Subtrajectory Join on Massive Datasets. arXiv preprint
In this paper, we proposed a novel approach for sub-trajectory               arXiv:1903.07748 (2019).
                                                                         [6] Dong Xie, Feifei Li, and Jeff M. Phillips. 2017. Distributed Trajectory Similarity
similarity search in a big trajectory dataset. In our proposed               Search. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1478–1489. https://doi.org/10.
method, we rely on approximate line segments of trajectories, as             14778/3137628.3137655