Towards In-Memory Sub-Trajectory Similarity Search Omid Isfahani Alamdari Mirco Nanni University of Pisa ISTI-CNR Pisa, Italy alamdari@di.unipi.it mirco.nanni@isti.cnr.it Roberto Trasarti Dino Pedreschi ISTI-CNR Pisa, Italy University of Pisa roberto.trasarti@isti.cnr.it dino.pedreschi@di.unipi.it ABSTRACT between A and C is more than δ , the two trajectories are not Spatial-temporal trajectory data contains rich information about considered similar, although they have very similar routes. In this moving objects and has been widely used for a large number paper, we address the problem of sub-trajectory similarity search, of real-world applications. However, the complexity of spatial- where we aim to find similar sub-trajectories in the dataset to the temporal trajectory data, on the one hand, and the fast collection whole or a sub-trajectory of the query trajectory. An interesting of datasets, on the other hand, has made it challenging to ef- application of this query is carpooling, where a user can share a ficiently store, process, and query such data. In this paper, we ride with other users who have similar routes. For the example propose a scalable method to analyze the sub-trajectory simi- shown in Figure 1, the carpooling system can match trajectories larity search in an in-memory cluster computing environment. T1 and T2 and hence, suggest the user who follows T1 to share Notably, we have extended Apache Spark with efficient trajectory the ride with the user who follows T2, with possibly a small indexing, partitioning, and querying functionalities to support deviation from his/her routine path. the sub-trajectory similarity query. Our experiments on a real The two state-of-the-art works [6] and [4] find similarities be- trajectory dataset have shown the efficiency and effectiveness of tween whole trajectories and are unable to find similarities in the the proposed method. sub-trajectory level. Different from those works, our method targets the more computationally expensive problem of sub- trajectory similarity search and tries to find not only the sim- 1 INTRODUCTION ilarities in whole trajectory curves but also in between every Following the ubiquity of location-aware devices, collecting the sub-trajectory of data and query trajectories. Furthermore, our location data of moving objects is now much easier thanks to work is different from [5] in the sense that we focus on in-memory the developments in positioning systems and mobile communica- analytics to enhance the execution of interactive queries and at tions. Many applications in the areas of transportation and urban the same time being able to use the sub-trajectory similarity planning can benefit from this data to improve their quality of search as a primitive for more advanced trajectory pattern min- service. Moreover, the location-based services (LBS) market that ing algorithms that try to identify objects that move together primarily depends on the data gathered by location-aware de- closely. In this regard, our goal is to use less memory space and vices is rapidly expanding. Analyzing the historical trajectories accelerate the local computations inside partitions. of moving objects plays an essential role in the success of appli- cations, such as carpooling and route recommendation systems. However, the extensive use of LBS applications imposes signif- icant storage and processing challenges to application servers. Every day several gigabytes or even terabytes of moving objects tracks are accumulated in servers, making their post-processing an arduous task. One of the critical operations in trajectory data analytics is similarity search where given a query trajectory Q and a distance threshold δ , it returns all trajectories with distance at most δ to Q, according to a distance function. We call this problem the whole trajectory similarity search problem, in which trajectories are matched as a whole. In other words, for every comparison Figure 1: Two trajectories with similar routes between a query and a candidate trajectory, the first and last points of both trajectories are matched, and depending on the Parallel processing of trajectory data is inevitable while deal- type of the distance function, either matching or coupling of the ing with massive amounts of trajectory data. The main goal of in-between points is established. this paper is to introduce a novel approach to the sub-trajectory However, as illustrated in Figure 1, there are many situations similarity search problem in an in-memory cluster computing where start and end points of trajectories are far from each other environment, i.e., Apache Spark. while they have similar parts in-between. In this figure, the start and end points of trajectory T1 (blue solid line) are A and B, and 2 PROBLEM FORMULATION for trajectory T2 (red dashed line) they are C and D. If the distance In this section, we formally define trajectories, sub-trajectories, Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceed- and the sub-trajectory similarity search problem. ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, , Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- Definition 2.1. A trajectory T is a sequence of time-stamped tribution 4.0 International (CC BY 4.0). sample points ⟨p1, . . . , p |T | ⟩ where each pi is a triplet (x i , yi , ti ) Real Partition Raw Trajectories Local Indexes Boundaries RDD Partitions Global Index .. .. .. .. . . . . Figure 2: A trajectory and its simplifications that indicates the spatial location (x i , yi ) of the moving object at Worker Nodes Master Node timestamp ti . Even though the movement is a continuous phenomenon, a Figure 3: The distributed indexing structure. trajectory is recorded by a finite set of discrete sample points. Let T [i] be the i t h sample point of T . We consider that the move in- 4 DISTRIBUTED TRAJECTORY INDEXING between two consecutive points T [i] and T [i + 1] is on a straight In this section, we elaborate on a bottom-up approach for building line with constant speed and hence, the expected location of the the distributed trajectory index. Each data partition builds its trajectory T between T [i] and T [i + 1] is obtained by a linear local index, and the global index summarizes the information interpolation. We use T (t) to denote the position (x, y) of the about local indexes. We dedicate this section for describing how object at timestamp t ∈ [t 1, t |T | ]. Furthermore, we represent the we partition trajectory data and build local and global indexes. number of points in a trajectory T with |T | and its spatial length Í |T |−1 as SL(T ) = i=1 d(pi , pi+1 ) where d is the Euclidean distance. 4.1 Trajectory Partitioning Definition 2.2. A sub-trajectory T [i : j] is a contiguous subse- We use a two-step partitioning approach, in which the data is first quence of T starting at point T [i] and ending at point T [j], where partitioned based on trajectory identifiers resulting in all points 1 ≤ i ≤ j ≤ |T |. of each trajectory being ended up in the same machine. Then, we Definition 2.3. An approximation line segment for trajec- utilize the Sort-Tile-Recursive (STR) [3] method that considers tory (or sub-trajectory) T , denoted by ALS T , is the line segment the distribution of points to compute the partition boundaries p1p |T | connecting the first and last point of T . based on the sample of points drawn from the dataset. After this step, the challenging task is to distribute the trajec- Definition 2.4. Given a query trajectory Q, a set of trajectories tory data to different partitions in possibly different computing D = {T1, . . . ,TN }, a distance function F , a distance threshold nodes. Considering the complexities of trajectory data such as δ > 0 and a minimum length threshold λ > 0, the sub-trajectory skewness and inherent sequentiality, and avoiding the recon- similarity search returns the set of trajectories R = {R 1, . . . , R M }, struction costs, we distribute the whole trajectory objects. That s.t. for every R j ∈ R, the following holds: is, all points of the same trajectory end up in the same partition. We assign a trajectory to the partition that its bounding rectangle (1) There exists a Ti ∈ D s.t. R j is a sub-trajectory of Ti , has the largest intersection with the MBR of the trajectory. In (2) There exists a sub-trajectory Q[a : b] of Q such that case of ties, one is selected randomly. This technique avoids the F (R j , Q[a : b]) ≤ δ , reconstruction of trajectories for similarity comparison that may (3) SL(R j ) ≥ λ require a huge amount of data shuffle among nodes. 3 SYSTEM OVERVIEW 4.2 Local Indexing In this section, we describe our proposed method for distributed Once partitioning is done, a local index should be constructed processing of the sub-trajectory similarity search. After reading and cached for each partition to help in executing queries. In this the trajectory dataset, three steps of sampling, partitioning, and step, we divide a trajectory at some significant points such that indexing are executed. In the sampling phase, a uniform random the directional trend for each sub-trajectory is preserved. For this sample of the trajectory points is drawn from the input points. reason, we use the popular Douglas-Peucker (DP) algorithm [1] The idea is to acquire knowledge about the distribution of data to identify those points. in the geographical extent of the input dataset. This sample of The original idea behind this algorithm is to approximate points is used to build the partitioner object, which provides the trajectory with some points, known as splitting points and information about the computed boundaries of partitions and discard other points. Different from the original algorithm, we methods for locating objects in the partitions. Then, the physical segment the trajectory at those splitting points. We index each partitioning of trajectory objects is performed using the parti- sub-trajectory T with ALS T using a query-only R-Tree. Thus, the tioner, and trajectory objects are created inside partitions. In the ALS of sub-trajectories could be used for comparison and pruning. final step, we build a 2-level distributed index over the trajectory In Figure 2 the whole trajectory is simplified as 3 approximate line data. This index will be used by the query processor to discard segments, between points p1 , p7 , p14 and p22 . The line segments the irrelevant data to the query, both at the global and local levels. approximate the original curve by a deviation of at most ϵ [1]: Finally, the query processor verifies the similarity between the candidate sub-trajectories that were not discarded and the query. Lemma 4.1. Every point in the sub-trajectory T , segmented using The structure of the distributed index is provided in Figure 3. a tolerance threshold ϵ, has a distance at most ϵ to the ALS T . Q2 Since the exact distance computation between trajectories is expensive, in the search phase we perform most of the compar- T isons with the ALS of trajectories to reduce the size of candidate set (on which the exact distance is evaluated) as much as possible. A B 4.3 Global Indexing Q To locate trajectories at the search phase, we need this global D C index to keep track of real boundaries of data inside partitions rather than computed boundaries after the sampling phase. The real boundaries of all local indexes (i.e. the root node of each local R-Tree) is fetched to the master node to build the global R-Tree index. This index is the first access method utilized in the search phase to prune partitions – and the trajectories inside – Figure 4: Example of different arrangement of ALS of sub- that are far away from the query trajectory. trajectories. 5 DISTRIBUTED SUB-TRAJECTORY SIMILARITY SEARCH the ALS of data and query trajectories have minimum distance of more than 2ϵ + δ , and therefore the trajectories can be ignored. The search procedure is composed of three steps. In the first step, the global index is used to identify partitions that may 5.2.3 Arrangement of Query and Data Sub-trajectories. After contain results and discard those that are sufficiently far away the filtering of the previous step, the remaining sub-trajectories from the query. In the second step, for each of the partitions have ALS within the buffer around at least one of the query identified, a task is initiated, and the partition’s local index is segments. Figure 4 illustrates an arrangement between query and used to find the trajectories whose Minimum Bounding Rectangle data ALSs. The intersection area between the ϵ-buffers around (MBR) overlaps with the query sub-trajectory. Before computing query and data segments is determined by a polygon with vertices the exact distance function, in the third step, we safely prune A, B, C, D. We search for the similar parts of trajectories inside some sub-trajectories by their ALS, and then for the remaining these polygons. we identify the relevant parts of sub-trajectories to the query 5.2.4 Relevant Parts of Sub-trajectories. In this step, we find sub-trajectory. For those relevant parts, we compute the exact the relevant parts of sub-trajectories by obtaining the parts of distance function and collect the results. the original trajectories inside the intersection polygon. Those parts could be similar and the exact distance should be computed 5.1 Partition Pruning between them. We find the pairs of trajectory points pi and Given a query trajectory Q and a distance threshold δ , we com- pi+1 that overlap with the sides of the intersection polygon. We pute the MBR of the Q and expand the MBR by δ to cover all interpolate points where the line segment pi pi+1 exactly overlaps trajectories which may be adjacent to the query trajectory by the side and add them to the final candidate (sub)trajectory. We its boundaries. Then we query the global index to obtain the perform the same procedure for every overlapping point pairs. partitions that have data boundary overlap with the expanded MBR. It is required to initiate tasks only for these partitions and 5.2.5 Exact Distance Computation. The sub-trajectory pairs send the query trajectory to them. obtained from the previous step are the final candidates for the exact distance computation. We use discrete Fréchet distance [2] 5.2 Search within Partitions for this purpose, which is a good measure for curve comparison Given a query trajectory Q, a distance threshold δ and an error and capable of maintaining the order of coupling points. If the tolerance ϵ for simplification, simplify Q with the same algorithm distance between a data and query sub-trajectory pair is not described in Section 4.2 and the same ϵ used for segmenting data larger then distance threshold δ , and its spatial length is higher trajectories. We denote a (simplified) segment of the segmented than λ, we add the data sub-trajectory to the results. query as query segment. 6 EXPERIMENTS 5.2.1 Pruning by MBR of ALS. For each query segment, we In this section, we describe the evaluation of the proposed method first retrieve all data sub-trajectories from the local R-Tree whose on a small cluster of machines. The objective of this experimental MBR overlaps with its expanded MBR. Here, we expand the MBR evaluation is to assess the performance of the proposed method of the query segment by a buffer of 2ϵ + δ . The intuition is that, in terms of query latency and index building time. based on Lemma 4.1, the points of data sub-trajectories could be in at most ϵ distance from their ALS and the same holds for the 6.1 Cluster Setup query sub-trajectory. Thus, we add 2ϵ to account for the error All experiments were conducted on a cluster of 5 machines di- bound of query and data sub-trajectory. The addition of δ to the vided as 1 master node and 4 worker nodes. Each machine has expansion is justified in the same way for the partition pruning. a 4-core Intel Core i5-7400 @ 3.00GHz processor and 16 GB This step prunes a large number of segments that could be safely main memory. From each machine, 4 GB of main memory and discarded. 1 CPU core is reserved for Operating System and Hadoop dae- 5.2.2 Pruning by ALS. For the remaining data segments after mons. Thus, the 4 worker nodes can provide a total of 48 GB of pruning by MBR, we do a simple pruning based on the distance RAM and 3 cores each. Each node runs Ubuntu 18.04.2 LTS with computation between two line segments. Indeed, in some cases Hadoop 2.7.2 and Spark 2.4.0. All implementations are in Scala the MBR of data trajectories overlap with query segments, yet programming language v2.11.6. Figure 5: Query and index time with respect to data size. Figure 6: Query and index time with respect to cluster size. (a) (b) (c) 6.2 Dataset The dataset used in the experiments is a total of 30 GB of GPS traces of one year of car trajectories in the area of Tuscany, Italy. Since the whole track of each user could be very long, if the dis- tance or time interval between two subsequent points is higher than predefined thresholds, we divide the track into two trajec- tories, corresponding to meaningful trips. We used the cut-off spatial and temporal thresholds of 300 meters and 30 minutes, respectively. 6.3 Index Building Time Figure 7: Visualization of results. The index building is comprised of reading a sample of the dataset, computing the partition boundaries, physical partitioning, build- working with them is simple and fast. Our aim is to postpone ing local indexes, and finally building the global index. The build- the expensive exact distance computation until the end of a suite ing of indexes takes a significant amount of time, but it should of pruning techniques on ALSs. The pruning starts from simple be noted that the generated indexes can be stored in HDFS and line segment comparisons and continues with ALS matching used later without the need for the recomputing. We study the and extracting relevant parts of sub-trajectories. At the end, we scalability of our method against the dataset size and cluster size, compare the relevant parts and check whether they are similar in building the index structure and querying. The right axes in according to the distance threshold. We performed experiments Figure 5 and Figure 6 show the time for building the index for to analyze the performance of our method, which shows good different sizes of data and different number of worker nodes in results in answering queries, and visual representation of some the cluster. of the results suggests that it can produce significant output. Our future works on this topic include experimenting the tool 6.4 Query Latency on larger datasets and larger parallel platforms, as well as explor- For the query latency, we executed 50 queries randomly chosen ing its applicability in applications such as traffic jam detection, from the same dataset and the run-times are averaged. Since the carpooling and other clustering-based ones. latency of different queries may vary significantly, we also report the latency for the 5% and for the 95% of the run-times. Depending ACKNOWLEDGMENTS on the location of the query, whether it is in a high density area This work is partially supported by the European Community or sub-urban area, different queries can have different run-times. H2020 programme under the funding scheme Track &Know (Big The left axes in Figure 5 and Figure 6 report the query latencies Data for Mobility Tracking Knowledge Extraction in Urban Ar- for different sizes of data and computing nodes. As an example, eas), G.A. 780754, trackandknowproject.eu. the query run-time for the 10GB dataset is 6.6 seconds in average with a 5% − 95% interval of [3.9 − 12.9] seconds. REFERENCES [1] David Douglas and Thomas Peucker. 1973. Algorithms for the reduction of the 6.4.1 Visualization of Results. Figure 7 shows three examples number of points required to represent a digitized line or its caricature. The of a query trajectory and one of the results from data. The dashed Canadian Cartographer 10, 2 (1973), 112–122. (red) and solid (blue) trajectories are query and data, respectively. [2] Thomas Eiter and Heikki Mannila. 1994. Computing discrete Fréchet distance. Technical Report. Citeseer. The markers show the starting point of the trajectories. The [3] Scott T Leutenegger, Mario A Lopez, and Jeffrey Edgington. 1997. STR: A simple highlighted part (in gray) of the data trajectory is the similar part. and efficient algorithm for R-tree packing. In Proceedings 13th International Conference on Data Engineering. IEEE, 497–506. This figure shows the effectiveness of our method in identifying [4] Zeyuan Shang, Guoliang Li, and Zhifeng Bao. 2018. DITA: Distributed in- the similar sub-trajectories in big trajectory dataset. memory trajectory analytics. In Proceedings of the 2018 International Conference on Management of Data. ACM, 725–740. [5] Panagiotis Tampakis, Christos Doulkeridis, Nikos Pelekis, and Yannis Theodor- 7 CONCLUSION idis. 2019. Distributed Subtrajectory Join on Massive Datasets. arXiv preprint In this paper, we proposed a novel approach for sub-trajectory arXiv:1903.07748 (2019). [6] Dong Xie, Feifei Li, and Jeff M. Phillips. 2017. Distributed Trajectory Similarity similarity search in a big trajectory dataset. In our proposed Search. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1478–1489. https://doi.org/10. method, we rely on approximate line segments of trajectories, as 14778/3137628.3137655