1. Introduction

Scalable Distributed Trajectory Clustering Using Apache Spark

Stamatis Stefanopoulos

4 5

Charilaos Akasiadis

cakasiadis@iit.demokritos.gr 2 4

Nikos Pelekis

npelekis@unipi.gr 1 4

Dimitris Zissis

dzissis@aegean.gr 0 3 4 0 Department of Product & Systems Design Engineering, University of the Aegean , Konstantinoupoleos 2, Ermoupolis, Syros, 84100 , Greece 1 Department of Statistics & Insurance Science, University of Piraeus , 80, M. Karaoli & A. Dimitriou St., Piraeus, 18534 , Greece 2 II&T, NCSR “Demokritos”, Patr. Grigoriou & Neapoleos 27, Agia Paraskevi , 15341 , Greece 3 MarineTrafic.com 4 Proceedings of the Workshop on Big Mobility Data Analytics (BMDA) co-located with EDBT/ICDT 2023 Joint Conference 5 University of the Peloponnese, Erythrou Stavrou 28 & Karyotaki , Tripolis, 22131 , Greece

Trajectory clustering is an important problem, where position data of mobile objects, such as vehicles and vessels, is analyzed to extract knowledge utilized for a plethora of management tasks. Recently, a vast increase in the production of data gathering devices has taken place, allowing the collection of data in much larger volumes. This challenges the application of existing clustering algorithms, as they are not always able to handle large datasets due to their design. In particular, TRACLUS is one of the most well-known trajectory clustering algorithms that is a generalization of DBSCAN for trajectory line segments. However, due to the iterative approach and the repetitive usage of a spatial index inherited from DBSCAN, TRACLUS's performance degrades as the datasets increase in size and can be extremely slow in some cases. To tackle this shortcoming, we propose a distributed implementation of TRACLUS, built on Apache Spark, that can operate on very large datasets by applying diferent types of partitioning to the input data. Results from an empirical evaluation on real-world trajectories illustrate that our distributed variant achieves improved runtime performance and clustering eficiency.

eol>Trajectory Clustering Data Mining Distributed Computing

1. Introduction

Clustering algorithms are valuable components in contemporary data analysis workflows. Being an unsupervised learning method, clustering analysis unveils the structure of the data and can be used as the basis for further learning [1]. The main objective of clustering algorithms is to divide data into groups, where members of the same cluster should be as similar as possible, while members belonging to other clusters should be quite diferent [ 2]. Due to the high diversity of the clustering problems properties, as these emerge from the features of the data to be clustered, a wide variety of clustering algorithms exists. Following diferent worklfows, these algorithms incorporate various methods for problem solving, e.g. data partitioning, data hierarchy, fuzzy theory, distributedness, data density, graph theory, grid structures, fractal theory, or other data models [1].

Among these categories, density based clustering is widely used to handle real-world problems, as such algorithms can efectively discover arbitrarily shaped groups of data based on the density of their distribution in the -dimensional space. From the algorithms of this category, DBSCAN [3] is one of the most well-known, as it comprises a simple, understandable concept that is also resilient to noise. DBSCAN’s initial design applies clustering to a spatial distribution of points. However, there are other types of spatial information that would be valuable to consider. Moving object trajectories for example, also include timestamps due to their time series nature and thus constitute more complex data, that can currently be measured using a variety of sensors, for example the GPS of mobile devices that track routes. Clustering analysis helps to extract useful knowledge from such data, and there are many real-world applications where trajectories must be analysed in a density-based manner, such as movement anomaly detection [4] or trafic pattern extraction for vessels [5], vehicles [6, 7] or humans [8].

We focus on a generalisation of DBSCAN designed to cluster mobile object trajectories, through a partitioningand-grouping procedure. This approach is called TRACLUS [9] and is used to cluster parts of trajectories, line segments in particular. One major drawback of TRACLUS however, is its inability to handle large trajectory datasets, as it does not scale up well, making big data clustering analysis dificult or extremely slow [ 10]. This is mainly due to the continuous use of a spatial index to re- scheme for subtrajectories that groups them into nodes, trieve the nearby segments, and to the highly sequential in the philosophy of spatio-temporal partitioning techand repetitive design that is inherited from the DBSCAN niques. This way, clustering results are calculated by simapproach. To tackle this issue, paralellization techniques ply applying query operators to the particular index. The can be employed, especially in the algorithm parts that knowledge discovery process presented in [16], extracts induce the highest runtime cost. Such techniques could popular routes and movement patterns, and is shown to also be integrated to existing analytics engines. scale with big, real-world vessel mobility data, managing

Motivated by these concerns, in this work we propose to detect real maritime incidents amongst them. two methods to parallelize TRACLUS using the Apache TRACLUS [9] has been considered as one of the stateSpark analytics engine.The first approach employs ran- of-the art algorithms for trajectory clustering [14, 17], dom partitioning of the dataset, while the second per- as it is a generalisation of DBSCAN for density-based forms spatial partitioning, based on the position of the line segment clustering. TRACLUS partitions the trajectrajectory segments in the two-dimensional space. To tories to line segments before the clustering step, which show the validity of the results, a distance-based statis- in turn incorporates a specially designed distance metric. tical significance measure is used to assess the overall After clusters are extracted, representative trajectories clustering eficiency. This measure can be interpreted are generated for the ones that fulfil particular criteria. as an indicator of the algorithm’s confidence regarding This algorithm also inspired the design proposed by [18] the formation of a particular cluster. The proposed dis- that is an implementation less sensitive to input parametributed variants are both compared against the original, ters, as well as [10], where a version of TRACLUS with single-threaded implementation of TRACLUS, in terms partitioning and clustering routines designed for GPUs of runtime and clustering accuracy. Our experimental was introduced. In these approaches the trajectories evaluation illustrates the improved execution time of our are partitioned to line segments using the MDL princiapproach as input datasets grow in size. ple [19] and are organized in graphs where breadth-first

The rest of this paper is organised as follows: Sec- search is executed in parallel during the clustering task. tion 2 presents the related work on trajectory clustering A multi-threaded implementation of TRACLUS is also and distributed density-based techniques; Section 3 in- implemented for benchmarking. However, there is no cludes a detailed description of the proposed distributed openly available sourcecode of this variant and, furtherTRACLUS approach. Then, Section 4 presents the ex- more, this is not a fully parallelized implementation as perimental evaluation of the methods using real-world it only splits the distance between segment pairs related datasets and, finally, Section 5 presents the conclusions computations, by assigning diferent threads to diferent and some future work directions. pairs. Instead, the approach that we propose aims to fully parallelize particular phases of the algorithm that influence its overall runtime performance.

2. Related work 2.1. Trajectory clustering methods

Moving object trajectories is data of a particular form with distinct characteristics, which can be simply described as finite sequences of geolocations with timestamps [11]. Due to this unique form, special algorithms are designed for trajectory analysis that often incorporate trajectory clustering techniques. MovCloud [12] follows the MapReduce paradigm in cloud-based infrastructures, to index trajectory data hierarchically and proceed with the clustering using semantic information along with location coordinates. Considering trajectories as streaming data, the work of [13] utilizes both an online and an ofline phase; trajectory clusters are updated dynamically to reflect incoming changes and are clustered to extract the final clustering result respectively. The method introduced in [14] is also based on the MapReduce paradigm to solve distributed tasks, such as the joining of subtrajectories, trajectory segmentation and clustering, and outlier detection. The approach of [15], adopts an indexing

2.2. Distributed DBSCAN Many DBSCAN parallelization methods have been pro

posed in the literature so far, which are based on various paradigms. Among them, the most popular are the MapReduce-based methods, as the case of [20], which uses the disjoint-set data structure to break the sequential nature of DBSCAN and achieve better load balancing. The method generates a single tree for each point of the dataset and merges those generated at diferent partitions iteratively to calculate the overall clustering result. In PatchWork [21] the feature space is divided into a grid, and only cells containing more than a threshold of points are sorted by density and are kept for further processing. The most dense cell is considered the first cluster and nearby cells can extend it if their density overpasses a suficient percentage. In [ 22] authors proposed a MapReduce method that partitions the dataset without overlapping, and executes DBSCAN in the partitions during the map phase, while it unifies the datasets and re-checks the noise points during the reduce phase. where all dataset entries are examined. Moreover, the line segment clustering follows an iterative procedure, which increases the time complexity as the dataset increases in size. On the other hand, the representatives generation phase processes a smaller portion of the dataset, since not every entry is assigned to the final clusters. Thus, this step does not induce significant impacts to the runtime performance. For this reason, only the first and second phases are selected to be parallelized.

The strategy used for the distributed implementation is based on the design that most Apache Spark programs follow. A driver (main) program initiates processing and a number of workers perform tasks concurrently and independently during the distributed processing (map) phases. The workers store their results in accumulator variables in the driver’s memory and are processed by the driver when all workers have finished their jobs (reduce phase). An Apache Spark script may include multiple distributed processing phases. In our implementation, the main phases of the algorithm are shown in Table 1.

MR-DBSCAN [23] proposes a three-step MapReduce procedure where the data points are in parallel partitioned, clustered, and then merged and relabelled. This method uses binary space partitioning (BSP) [24], an overlapping scheme to dynamically partition the dataset so that a balanced processing load on the cluster nodes is achieved, while retaining common points for the cluster merging phase. Moreover, a new partitioning method, CBP is introduced, which splits the dataset based on cost criteria. In [25] a similar procedure is followed, where clustering occurs in parallel in small data blocks of equal size and the results are combined during the reduce phase, using intermediate key-value pairs and hierarchical merging of the local clusters. For varying density clustering, VDMR-DBSCAN [26] uses MapReduce to divide the dataset into overlapping groups, perform the clustering step, and merge the resulting local ones according to their density and to partition adjacency criteria.

TRACLUS’s relation to DBSCAN, along with its limited scalability and the absence of a published MapReducebased parallel implementation to handle big data, led us to combine elements of some of the existing parallelization techniques of DBSCAN we discussed. In particular we incorporate BSP [23], the MapReduce strategy [22, 26], and intermediate key-value pairs [25], in order to achieve improved performance for larger datasets.

3. Methodology We now present our distributed approach that is designed

to scale up as the dataset size increases. To begin, the original TRACLUS algorithm consists of three main phases:

Trajectory partitioning. The moving object trajectories are partitioned into line segments, aiming to compress information without losing precision. To achieve this, the 3.1. Distributed trajectory partitioning Minimum Description Length principle (MDL) [19] is used that minimizes the sum of two costs: the encoding cost During the trajectory partitioning phase, the dataset is for a hypothesis and, a second, regarding the data that divided randomly at the trajectory level and each worker is encoded according to this hypothesis. In our case, a receives a set of trajectories to process. The results are hypothesis corresponds to a specific subset of all possible added to a list accumulator variable, to be unified in a partitions of a trajectory. single line segment dataset by the driver program for

Line segment clustering. At this phase, the clustering further processing in the next phases, as shown in Fig. 1. task takes place by applying DBSCAN to line segments, instead of points. A composite function which combines 3.2. Distributed line segment clustering perpendicular, parallel and angle distances is used to calculate the interspace between the line segments and As already explained, line segment clustering is by far the define neighbourhoods. most time-consuming phase , with its time cost growing

Representatives generation. The output of the clustering significantly as the input dataset becomes larger. Thus, phase is a set of line segment clusters. For each one of a distributed implementation for this part is imperative. them, a representative trajectory is generated based on We describe the methods used for dataset partitioning, the line segments belonging to the cluster. i.e. random and spatial partitioning, and the merging

In theory, the single-threaded implementation of TR- strategies that unify the results produced by the workers. ACLUS is expected to require more time during the line The DBSCAN algorithm that uses R-tree spatial insegment clustering and the trajectory partitioning phases, dexes has a run time complexity of ( log ) [3], where no line segments belonging to two diferent clusters to be considered as “bridges”, i.e. indicators of meaningful cluster merges. To avoid this, after partitioning, each worker receives two partitions of the dataset to build its spatial index (Figure 3). This leads in double-sized spatial indexes per worker (as they are created by two diferent dataset partitions instead of a single one), containing common line segments between partitions, as the index contains the segments of the corresponding partition and another one. This is useful for merging the local clusters discovered by each worker individually. is the number of the dataset entries. TRACLUS inherits this complexity [9]. Now, the splitting of the dataset into > 2 separate and equal partitions yields a time complexity of:

( log ) Figure 3: Example using random partitioning (Figure 2). a. As · log < log , overall, time complexity de- Line segments belonging to the blue partition (partition 1). b. creases as increases. Line segments used for partition’s spatial index, contain also line segments from the orange partition (partition 2). 3.2.1. Random partitioning (dTRACLUS-R) 3.2.2. Spatial partitioning (dTRACLUS-S)

In the case of random partitioning of the line segments,

a balanced split of the initial dataset is created, assigning approximately the same number of members to each partition (see, Figure 2). Then, partitions are assigned to separate workers, where local R-tree spatial indexes are created and used during line segment clustering. mentioned spatial index is used. tial partitions. Spatial partitioning borders are depicted by the grey continuous lines. Common line segments between partitions are colored with yellow.

However, higher precision comes with some extra run

time and memory cost. Prior to partitioning, the whole dataset needs to be indexed in a global R-tree whose insert and search operations are equal to (log ). The R-tree is then queried every time a split on the dataset is attempted, to check if each side of the split contains approximately equal number of line segments. Since the search for better splits may require many iterations to converge to the most appropriate one, we stop when a balanced split is found, i.e. the number of line segments in one partition is between ± 5% of the line segments in the other partition. Let ≤

2 anticipate an average of tries per split, resulting to 1 in the best case, and () for the worst.

More formally, if is the number of the line segments contained in the dataset, the number of partitions, the number of maximum tries per split, the number of extra splits needed in case the number of partitions is not a power of 2, then for the spatial partitioning procedure 2 be the iteration limit, we the time complexity equals to: ( + − 1) · 2

· log

Thus, the total runtime complexity for the whole spatial partitioning phase becomes:

︁( (︀ ( + ) · ︀) · log ︁)

This constitutes an additional efort that is induced by the spatial partitioning method, however this increase is negligible compared to the complexity gain that we have from the complete distributed implementation. 3.3. Merging local clusters

A crucial step for the integrity of the method’s results is the merging of the discovered clusters. The process followed for the merging of local clusters into global ones is similar for the random and the spatial partitioning implemetations. The main diference lies in the way common members of the clusters emerge, which are later used for cluster merging. 3.3.1. Random partitioning (dTRACLUS-R) For the random partitioning implementation, to be able to merge the local clusters we use extra line segments to build the spatial indexes each worker uses. Note though that these line segments are not considered as part of the worker’s dataset. This means that these line segments can be returned by querying the spatial index, but are considered as not belonging to the partition dataset. During the line segment clustering phase, each time a cluster is expanded by one line segment, this line segment is checked if it belongs to the worker’s dataset. If it does not belong to it, the line segment ID and the cluster in which it was assigned are added in a special “duplicates” accumulator to be further processed by the driver program after the end of this distributed phase. When the line segment clustering phase is completed, duplicate values are removed from the “duplicates” accumulator and the final merging couples are determined. 3.3.2. Spatial partitioning (dTRACLUS-S) In the spatial partitioning case, the merging procedure is simpler, as each worker contains the same line segments in both its spatial index and its dataset partition, so there is no need to store possible duplicates in a special-purpose accumulator, as was the case in random partitioning. The common line segments that will be used as “bridges” for cluster merging are already specified by the spatial partitioning phase itself. This happens because line segments are two-dimensional objects that can span one or more spatial partitioning rectangles. This results to few common line segments between the spatial partitions. Thus, in the merging phase, the algorithm searches for these segments to use for the local cluster merging.

The final clusters obtained after the merging phase are expected to be slightly diferent from the clusters extracted from the single-threaded TRACLUS implementation. This is because in both cases, line segments that would normally be grouped to the same cluster, are dispersed in other partitions. It is possible that some of these partitions might contain so few cluster members, that they are considered as noise by the line segment clustering phase and never make it to the cluster merging phase. The rejected segments could either be considered as members of bigger clusters, or as common segments used to merge two or more clusters during the merging phase. This deviation is expected to be larger for the random partitioning implementation, as the dispersion of nearby line segments between the resulting partitions is higher than dispersion caused by spatial partitioning. 3.4. Clustering significance To test the proposed methods, we conducted experiments using a part of NOAA vessels dataset, with data of June The final phase generates the cluster representatives, 2019. The datasets used for the experiments contained apwhich act as cluster medoids and describe the average proximately 50, 100, 200, 500, 1000, 2000, 5000 and 10000 movement of the objects included in each discovered trajectories, representing a total of 35K to 6M points. cluster. Intuitively, a representative that lies close to the The average length of the datasets’ trajectories spanned segments of its corresponding cluster, describes it better between 587 to 734 points, having a standard deviation than one lying further away. Randomly generated data ranging between 311 to 321 points. in the wider area of the representative are expected to be Experiments were run on an Apache Spark standalone more distant to the representative than to the real cluster cluster installed on a VirtualBox virtual machine running members, as no spatial criterion is considered for their Linux Ubuntu 18.04 with 14 CPU cores and 28 Gigabytes generation, contrary to the real data selected by spatial of RAM. Our implementation was based on Alex Polneighborhood queries during the clustering phase. cyn’s TRACLUS implementation in Python [27], which

To evaluate the clustering eficiency of the single- was properly modified to run in a distributed manner threaded version and our proposed distributed one, we using PySpark according to the approaches we propose employ statistical significance tests, used to determine if in Section 3. For the spatial indexing part, we use the the distances of the cluster members to the cluster repre- Pyrtree [28], an R-tree implementation written in pure sentative are similar to the distances of random segments. Python. Our implementation is openly available online.1

To achieve this, for each generated cluster representative, its minimum bounding box is calculated. We generate random line segments inside this box, equal in number to the line segments that belong to the cluster. The next step is to calculate the Fréchet distance between the cluster members and each line segment of the representative, and store the minimum. The process is repeated for the randomly generated line segments. In the end, we come up with two distance distributions, which are tested using Z-test and Kolmogorov-Smirnov test, to obtain a measure of significant diference. This measure indicates if the randomly generated (fake) line segments are significantly further and dissimilar to the representative than the real segments of the cluster and can be used to distinguish between tightly (significant) and loosely (non-significant) connected clusters, acting as a measure of quality for the generated clusters and how well they are described by their representative. An example of this process is shown in Fig. 5.

4.1. Line segment clustering runtime The main goal of the distributed implementation is to

make TRACLUS run faster for large trajectory datasets. By evaluating the single-threaded implementation we observe that the line segment clustering phase requires most of the overall runtime to complete (Figure 6).

Single-thread TRACLUS phases 105 104 )s ond103 sce (e102 m it n uR101 100 0 2000 4000 6000 No. of Trajectories

Partitioning Indexing Clustering Representatives 8000 10000

Now, the distributed implementation of the line segment clustering phase significantly reduces the total runtime cost, as we can see in Figure 7. This outcome is anticipated based on the complexity analysis of the distributed implementation (cf. Section 3.2). The smaller size of the spatial indexes sent to the workers, enables them to complete respectively smaller processing tasks during the clustering phase in less time.

TRACLUS dTraClus-S (14 workers) dTraClus-R (14 workers) 102

103 No. of Trajectories 104

4.2. Representatives and significance

A way to determine the quality of the algorithm’s output This measure of statistical significance can be used for between the presented implementations is the number hyperparameter tuning, as a better choice for the epsilon of the representative trajectories generated at the final TRACLUS parameter and can lead to more significant phase, compared to the number of representatives gener- clustering in cases such as the ones presented in Figure 9. ated by the base line (single-threaded) implementation. There, the inclusion of the remote line segments in the A number of representatives close to the baseline im- clusters could be avoided. Moreover, this measure can plies both successful cluster discovery and good perfor- be used as a filtering criterion to indicate cases where mance on local cluster merging. However, as mentioned representatives lie away from their corresponding cluster in Section 3.3, there can be deviations between the single- members and should be rejected. threaded and the distributed implementation results due to the dispersion of line segments to diferent partitions. 4.3. Overall performance and scalability

Evaluation of the clustering significance shows that in most of the cases, significant scores for both the Z-test We compared the total runtime results of the baseline and Kolmogorov-Smirnov statistical significance tests single-threaded implementation with the distributed are achieved. This phenomenon is due to the given TR- ones and, as we can see in Figure 7, the distributed impleACLUS hyperparameters (big neighborhood ratio) and mentations outperform by far the single-threaded one. the sparsity and nature of the dataset (vessel trajectories) We can also see that the diference in execution time inwhich led to fairly dense clusters, compared to their sur- creases as the datasets grow with respect to the number roundings. For the committed experiments, the highly of trajectories and, consequently, the points they contain. significant clusterings illustrate tightly-connected clus- To improve the distributed implementation performance ters when plotted on a map, as shown in Figure 8. for big data and achieve better scalability, more workers can be added. To take full advantage of the extra workers, the dataset partitions should be equal or more than the number of the available workers.

Clusterings of no significance were also present in some rare cases. Plotting their structures reveals looselyconnected clusters, often dispersed across multiple “subclusters” and not assigned to coherent ones (Figure 9). 5. Conclusion and future work In this paper we proposed two distributed TRACLUS

implementations, based on random and spatial partitioning of the initial dataset. Experiments conducted on an Apache Spark cluster revealed that the proposed methods significantly outperform the original single-threaded TRACLUS implementation in terms of execution runtime as the dataset size increases. The clustering results of the distributed implementations maintain an acceptable clustering performance, which was evaluated using statistical significance tests. Possible improvements of the proposed methods could be a faster, possibly distributed, merging strategy for the local clusters. Moreover, studying the performance of the proposed methods with even larger datasets on multi-node, distributed Apache Spark [12] S. Ghosh, S. K. Ghosh, R. Buyya, Movcloud: A cloudclusters can reveal more about the scalability of the pro- enabled framework to analyse movement behaviors, posed implementations. in: IEEE Int. Conf. on Cloud Computing Technology and Science (CloudCom), 2019, pp. 239–246. [13] Z. Li, J. Lee, X. Li, J. Han, Incremental clustering Acknowledgments for trajectories, in: Database Systems for Advanced Applications: 15th Int. Conf., 2010, pp. 32–46.

This research is supported by European Union’s Hori- [14] P. Tampakis, N. Pelekis, C. Doulkeridis, Y. Theodorzon 2020 RIA programme under grant agreement No idis, Scalable distributed subtrajectory clustering, 957237, project ENABLING MARITIME DIGITALIZA- IEEE Int. Conf. on Big Data (2019). TION BY EXTREME-SCALE ANALYTICS, AI AND DIGI- [15] N. Pelekis, P. Tampakis, M. Vodas, C. Doulkeridis, TAL TWINS (VesselAI). Y. Theodoridis, On temporal-constrained subtrajectory cluster analysis, Data Mining and KnowlReferences edge Discovery 31 (2017).

[16] D. Zissis, K. Chatzikokolakis, G. Spiliopoulos, M. Vo[1] D. Xu, Y. Tian, A comprehensive survey of clus- das, A distributed spatial method for modeling martering algorithms, Annals of Data Science (2015) itime routes, IEEE Access 8 (2020) 47556–47568. 165–193. [17] O. L. Hsu, C. Lee, Common sub-trajectory cluster[2] A. Jain, R. Dubes, Algorithms for clustering data, ing via hypercubes in spatiotemporal space, IEEE

Prentice-Hall, Inc, Upper Saddle River, 1988. Access 8 (2020) 23369–23377. [3] M. Ester, H. Kriegel, J. Sander, X. Xu, A density- [18] C. Jiashun, A new trajectory clustering algorithm based algorithm for discovering clusters in large based on traclus, in: Int. Conf. on Computer Science spatial databases with noise, in: Proc. of the Sec- and Network Technology, 2012, pp. 783–787. ond Int. Conf. on Knowledge Discovery and Data [19] E. Keogh, S. Lonardi, C. Ratanamahatana, Towards Mining, KDD’96, AAAI Press, 1996, p. 226–231. parameter-free data mining, 2004, pp. 206–215. [4] R. Barnwal, S. Baride, S. Majumder, S. Ghosh, A [20] M. M. A. Patwary, D. Palsetia, A. Agrawal, W. Liao, density-based algorithm for detecting anomalous F. Manne, A. Choudhary, A new scalable parallel trajectories, in: Int. Conf. on Microelectronics, Com- dbscan algorithm using the disjoint-set data strucputing and Comm. (MicroCom), IEEE, 2016, pp. 1–4. ture, in: Int. Conf. on High Performance Comp., [5] I. Kontopoulos, I. Varlamis, K. Tserpes, A dis- Networking, Storage and Analysis, 2012, pp. 1–11. tributed framework for extracting maritime trafic [21] F. Gouineau, T. Landry, T. Triplet, Patchwork, a patterns, Int. Journal of Geographical Information scalable density-grid clustering algorithm, in: Proc.

Science 35 (2020) 1–26. of the 31st Annual ACM Symposium on Applied [6] X. Li, J. Han, J. Lee, H. Gonzalez, Trafic density- Computing, ACM, 2016, p. 824–831. based discovery of hot routes in road networks, in: [22] X. Hu, L. Liu, N. Qiu, M. Li, A mapreduce-based Advances in Spatial and Temporal Databases: 10th improvement algorithm for dbscan, Journal of AlInt. Symposium, 2007, pp. 441–459. gorithms & Computational Technology 12 (2017). [7] H. Munaga, M. Sree, J. Murthy, Dentrac: A density [23] Y. He, H. Tan, W. Luo, S. Feng, J. Fan, Mr-dbscan: based trajectory clustering tool, Int. Journal of a scalable mapreduce-based dbscan algorithm for Computer Applications 41 (2012) 17–21. heavily skewed data, Front. of Comp. Science 8 [8] D. Das, D. Mishra, Unsupervised anomalous trajec- (2014).

tory detection for crowded scenes, in: IEEE 13th [24] M. J. Berger, S. H. Bokhari, A partitioning strategy Int. Conf. on Industrial and Information Systems for nonuniform problems on multiprocessors, IEEE (ICIIS), 2018, pp. 27–31. Transactions on Computers C-36 (1987) 570–580. [9] J. Lee, J. Han, K. Whang, Trajectory clustering: [25] X. Fu, S. Hu, Y. Wang, Research of parallel dbscan A partition-and-group framework, SIGMOD ’07, clustering algorithm based on mapreduce, Int. JourACM, New York, NY, USA, 2007, p. 593–604. nal of Database Theory and Application 7 (2014) [10] H. Mustafa, C. Barrus, E. Leal, L. Gruenwald, Gtr- 41–48.

aclus: A local trajectory clustering algorithm for [26] S. Bhardwaj, S. Dash, Vdmr-dbscan: Varied density gpus, in: 2021 IEEE 37th Int. Conf. on Data Engi- mapreduce dbscan, in: Big Data Analytics: 4th Int. neering Workshops (ICDEW), 2021, pp. 30–35. Conf., volume 9498, 2015, pp. 134–150. [11] P. Sun, S. Xia, G. Yuan, D. Li, An overview of moving [27] A. Polcyn, traclus_impl, https://github.com/ object trajectory compression algorithms, Mathe- apolcyn/traclus_impl, 2016. matical Problems in Engineering 2016 (2016) 1–13. [28] A. S. Peleg, pyrtree, https://github.com/Rhoana/ pyrtree, 2018.