<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalable Distributed Trajectory Clustering Using Apache Spark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stamatis Stefanopoulos</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charilaos Akasiadis</string-name>
          <email>cakasiadis@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Pelekis</string-name>
          <email>npelekis@unipi.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Zissis</string-name>
          <email>dzissis@aegean.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Product &amp; Systems Design Engineering, University of the Aegean</institution>
          ,
          <addr-line>Konstantinoupoleos 2, Ermoupolis, Syros, 84100</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistics &amp; Insurance Science, University of Piraeus</institution>
          ,
          <addr-line>80, M. Karaoli &amp; A. Dimitriou St., Piraeus, 18534</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>II&amp;T, NCSR “Demokritos”, Patr. Grigoriou &amp; Neapoleos 27, Agia Paraskevi</institution>
          ,
          <addr-line>15341</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>MarineTrafic.com</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Proceedings of the Workshop on Big Mobility Data Analytics (BMDA) co-located with EDBT/ICDT 2023 Joint Conference</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of the Peloponnese, Erythrou Stavrou 28 &amp; Karyotaki</institution>
          ,
          <addr-line>Tripolis, 22131</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Trajectory clustering is an important problem, where position data of mobile objects, such as vehicles and vessels, is analyzed to extract knowledge utilized for a plethora of management tasks. Recently, a vast increase in the production of data gathering devices has taken place, allowing the collection of data in much larger volumes. This challenges the application of existing clustering algorithms, as they are not always able to handle large datasets due to their design. In particular, TRACLUS is one of the most well-known trajectory clustering algorithms that is a generalization of DBSCAN for trajectory line segments. However, due to the iterative approach and the repetitive usage of a spatial index inherited from DBSCAN, TRACLUS's performance degrades as the datasets increase in size and can be extremely slow in some cases. To tackle this shortcoming, we propose a distributed implementation of TRACLUS, built on Apache Spark, that can operate on very large datasets by applying diferent types of partitioning to the input data. Results from an empirical evaluation on real-world trajectories illustrate that our distributed variant achieves improved runtime performance and clustering eficiency.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Trajectory Clustering</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Distributed Computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Clustering algorithms are valuable components in
contemporary data analysis workflows. Being an
unsupervised learning method, clustering analysis unveils the
structure of the data and can be used as the basis for
further learning [1]. The main objective of clustering
algorithms is to divide data into groups, where
members of the same cluster should be as similar as possible,
while members belonging to other clusters should be
quite diferent [ 2]. Due to the high diversity of the
clustering problems properties, as these emerge from the
features of the data to be clustered, a wide variety of
clustering algorithms exists. Following diferent
worklfows, these algorithms incorporate various methods for
problem solving, e.g. data partitioning, data hierarchy,
fuzzy theory, distributedness, data density, graph theory,
grid structures, fractal theory, or other data models [1].</p>
      <p>Among these categories, density based clustering is
widely used to handle real-world problems, as such
algorithms can efectively discover arbitrarily shaped groups
of data based on the density of their distribution in the
-dimensional space. From the algorithms of this
category, DBSCAN [3] is one of the most well-known, as it
comprises a simple, understandable concept that is also
resilient to noise. DBSCAN’s initial design applies
clustering to a spatial distribution of points. However, there are
other types of spatial information that would be valuable
to consider. Moving object trajectories for example, also
include timestamps due to their time series nature and
thus constitute more complex data, that can currently be
measured using a variety of sensors, for example the GPS
of mobile devices that track routes. Clustering analysis
helps to extract useful knowledge from such data, and
there are many real-world applications where
trajectories must be analysed in a density-based manner, such
as movement anomaly detection [4] or trafic pattern
extraction for vessels [5], vehicles [6, 7] or humans [8].</p>
      <p>We focus on a generalisation of DBSCAN designed to
cluster mobile object trajectories, through a
partitioningand-grouping procedure. This approach is called
TRACLUS [9] and is used to cluster parts of trajectories, line
segments in particular. One major drawback of
TRACLUS however, is its inability to handle large trajectory
datasets, as it does not scale up well, making big data
clustering analysis dificult or extremely slow [ 10]. This is
mainly due to the continuous use of a spatial index to re- scheme for subtrajectories that groups them into nodes,
trieve the nearby segments, and to the highly sequential in the philosophy of spatio-temporal partitioning
techand repetitive design that is inherited from the DBSCAN niques. This way, clustering results are calculated by
simapproach. To tackle this issue, paralellization techniques ply applying query operators to the particular index. The
can be employed, especially in the algorithm parts that knowledge discovery process presented in [16], extracts
induce the highest runtime cost. Such techniques could popular routes and movement patterns, and is shown to
also be integrated to existing analytics engines. scale with big, real-world vessel mobility data, managing</p>
      <p>Motivated by these concerns, in this work we propose to detect real maritime incidents amongst them.
two methods to parallelize TRACLUS using the Apache TRACLUS [9] has been considered as one of the
stateSpark analytics engine.The first approach employs ran- of-the art algorithms for trajectory clustering [14, 17],
dom partitioning of the dataset, while the second per- as it is a generalisation of DBSCAN for density-based
forms spatial partitioning, based on the position of the line segment clustering. TRACLUS partitions the
trajectrajectory segments in the two-dimensional space. To tories to line segments before the clustering step, which
show the validity of the results, a distance-based statis- in turn incorporates a specially designed distance metric.
tical significance measure is used to assess the overall After clusters are extracted, representative trajectories
clustering eficiency. This measure can be interpreted are generated for the ones that fulfil particular criteria.
as an indicator of the algorithm’s confidence regarding This algorithm also inspired the design proposed by [18]
the formation of a particular cluster. The proposed dis- that is an implementation less sensitive to input
parametributed variants are both compared against the original, ters, as well as [10], where a version of TRACLUS with
single-threaded implementation of TRACLUS, in terms partitioning and clustering routines designed for GPUs
of runtime and clustering accuracy. Our experimental was introduced. In these approaches the trajectories
evaluation illustrates the improved execution time of our are partitioned to line segments using the MDL
princiapproach as input datasets grow in size. ple [19] and are organized in graphs where breadth-first</p>
      <p>The rest of this paper is organised as follows: Sec- search is executed in parallel during the clustering task.
tion 2 presents the related work on trajectory clustering A multi-threaded implementation of TRACLUS is also
and distributed density-based techniques; Section 3 in- implemented for benchmarking. However, there is no
cludes a detailed description of the proposed distributed openly available sourcecode of this variant and,
furtherTRACLUS approach. Then, Section 4 presents the ex- more, this is not a fully parallelized implementation as
perimental evaluation of the methods using real-world it only splits the distance between segment pairs related
datasets and, finally, Section 5 presents the conclusions computations, by assigning diferent threads to diferent
and some future work directions. pairs. Instead, the approach that we propose aims to
fully parallelize particular phases of the algorithm that
influence its overall runtime performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Trajectory clustering methods</title>
        <p>Moving object trajectories is data of a particular form
with distinct characteristics, which can be simply
described as finite sequences of geolocations with
timestamps [11]. Due to this unique form, special algorithms
are designed for trajectory analysis that often incorporate
trajectory clustering techniques. MovCloud [12] follows
the MapReduce paradigm in cloud-based infrastructures,
to index trajectory data hierarchically and proceed with
the clustering using semantic information along with
location coordinates. Considering trajectories as
streaming data, the work of [13] utilizes both an online and an
ofline phase; trajectory clusters are updated dynamically
to reflect incoming changes and are clustered to extract
the final clustering result respectively. The method
introduced in [14] is also based on the MapReduce paradigm
to solve distributed tasks, such as the joining of
subtrajectories, trajectory segmentation and clustering, and
outlier detection. The approach of [15], adopts an indexing</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Distributed DBSCAN</title>
        <sec id="sec-2-2-1">
          <title>Many DBSCAN parallelization methods have been pro</title>
          <p>posed in the literature so far, which are based on
various paradigms. Among them, the most popular are the
MapReduce-based methods, as the case of [20], which
uses the disjoint-set data structure to break the
sequential nature of DBSCAN and achieve better load balancing.
The method generates a single tree for each point of the
dataset and merges those generated at diferent
partitions iteratively to calculate the overall clustering result.
In PatchWork [21] the feature space is divided into a
grid, and only cells containing more than a threshold
of points are sorted by density and are kept for further
processing. The most dense cell is considered the first
cluster and nearby cells can extend it if their density
overpasses a suficient percentage. In [ 22] authors proposed
a MapReduce method that partitions the dataset
without overlapping, and executes DBSCAN in the partitions
during the map phase, while it unifies the datasets and
re-checks the noise points during the reduce phase.
where all dataset entries are examined. Moreover, the line
segment clustering follows an iterative procedure, which
increases the time complexity as the dataset increases in
size. On the other hand, the representatives generation
phase processes a smaller portion of the dataset, since not
every entry is assigned to the final clusters. Thus, this
step does not induce significant impacts to the runtime
performance. For this reason, only the first and second
phases are selected to be parallelized.</p>
          <p>The strategy used for the distributed implementation
is based on the design that most Apache Spark programs
follow. A driver (main) program initiates processing and
a number of workers perform tasks concurrently and
independently during the distributed processing (map)
phases. The workers store their results in accumulator
variables in the driver’s memory and are processed by the
driver when all workers have finished their jobs (reduce
phase). An Apache Spark script may include multiple
distributed processing phases. In our implementation,
the main phases of the algorithm are shown in Table 1.</p>
          <p>MR-DBSCAN [23] proposes a three-step MapReduce
procedure where the data points are in parallel
partitioned, clustered, and then merged and relabelled. This
method uses binary space partitioning (BSP) [24], an
overlapping scheme to dynamically partition the dataset
so that a balanced processing load on the cluster nodes is
achieved, while retaining common points for the cluster
merging phase. Moreover, a new partitioning method,
CBP is introduced, which splits the dataset based on cost
criteria. In [25] a similar procedure is followed, where
clustering occurs in parallel in small data blocks of equal
size and the results are combined during the reduce phase,
using intermediate key-value pairs and hierarchical
merging of the local clusters. For varying density
clustering, VDMR-DBSCAN [26] uses MapReduce to divide the
dataset into overlapping groups, perform the clustering
step, and merge the resulting local ones according to their
density and to partition adjacency criteria.</p>
          <p>TRACLUS’s relation to DBSCAN, along with its limited
scalability and the absence of a published
MapReducebased parallel implementation to handle big data, led us to
combine elements of some of the existing parallelization
techniques of DBSCAN we discussed. In particular we
incorporate BSP [23], the MapReduce strategy [22, 26],
and intermediate key-value pairs [25], in order to achieve
improved performance for larger datasets.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>We now present our distributed approach that is designed</title>
        <p>to scale up as the dataset size increases. To begin, the
original TRACLUS algorithm consists of three main phases:</p>
        <p>Trajectory partitioning. The moving object trajectories
are partitioned into line segments, aiming to compress
information without losing precision. To achieve this, the 3.1. Distributed trajectory partitioning
Minimum Description Length principle (MDL) [19] is used
that minimizes the sum of two costs: the encoding cost During the trajectory partitioning phase, the dataset is
for a hypothesis and, a second, regarding the data that divided randomly at the trajectory level and each worker
is encoded according to this hypothesis. In our case, a receives a set of trajectories to process. The results are
hypothesis corresponds to a specific subset of all possible added to a list accumulator variable, to be unified in a
partitions of a trajectory. single line segment dataset by the driver program for</p>
        <p>Line segment clustering. At this phase, the clustering further processing in the next phases, as shown in Fig. 1.
task takes place by applying DBSCAN to line segments,
instead of points. A composite function which combines 3.2. Distributed line segment clustering
perpendicular, parallel and angle distances is used to
calculate the interspace between the line segments and As already explained, line segment clustering is by far the
define neighbourhoods. most time-consuming phase , with its time cost growing</p>
        <p>Representatives generation. The output of the clustering significantly as the input dataset becomes larger. Thus,
phase is a set of line segment clusters. For each one of a distributed implementation for this part is imperative.
them, a representative trajectory is generated based on We describe the methods used for dataset partitioning,
the line segments belonging to the cluster. i.e. random and spatial partitioning, and the merging</p>
        <p>In theory, the single-threaded implementation of TR- strategies that unify the results produced by the workers.
ACLUS is expected to require more time during the line The DBSCAN algorithm that uses R-tree spatial
insegment clustering and the trajectory partitioning phases, dexes has a run time complexity of ( log ) [3], where
no line segments belonging to two diferent clusters to
be considered as “bridges”, i.e. indicators of meaningful
cluster merges. To avoid this, after partitioning, each
worker receives two partitions of the dataset to build its
spatial index (Figure 3). This leads in double-sized spatial
indexes per worker (as they are created by two diferent
dataset partitions instead of a single one), containing
common line segments between partitions, as the index
contains the segments of the corresponding partition and
another one. This is useful for merging the local clusters
discovered by each worker individually.
 is the number of the dataset entries. TRACLUS
inherits this complexity [9]. Now, the splitting of the dataset
into  &gt; 2 separate and equal partitions yields a time
complexity of:</p>
        <p>(  log  ) Figure 3: Example using random partitioning (Figure 2). a.
As  ·  log  &lt;  log , overall, time complexity de- Line segments belonging to the blue partition (partition 1). b.
creases as  increases. Line segments used for partition’s spatial index, contain also
line segments from the orange partition (partition 2).
3.2.1. Random partitioning (dTRACLUS-R)
3.2.2. Spatial partitioning (dTRACLUS-S)</p>
      </sec>
      <sec id="sec-3-2">
        <title>In the case of random partitioning of the line segments,</title>
        <p>a balanced split of the initial dataset is created, assigning
approximately the same number of members to each
partition (see, Figure 2). Then, partitions are assigned to
separate workers, where local R-tree spatial indexes are
created and used during line segment clustering.
mentioned spatial index is used.
tial partitions. Spatial partitioning borders are depicted by
the grey continuous lines. Common line segments between
partitions are colored with yellow.</p>
      </sec>
      <sec id="sec-3-3">
        <title>However, higher precision comes with some extra run</title>
        <p>time and memory cost. Prior to partitioning, the whole
dataset needs to be indexed in a global R-tree whose
insert and search operations are equal to (log ). The
R-tree is then queried every time a split on the dataset
is attempted, to check if each side of the split contains
approximately equal number of line segments. Since the
search for better splits may require many iterations to
converge to the most appropriate one, we stop when a
balanced split is found, i.e. the number of line segments
in one partition is between ± 5% of the line segments in
the other partition. Let  ≤</p>
        <p>2
anticipate an average of  tries per split, resulting to 1
in the best case, and () for the worst.</p>
        <p>More formally, if  is the number of the line segments
contained in the dataset,  the number of partitions, 
the number of maximum tries per split,  the number of
extra splits needed in case the number of partitions is not
a power of 2, then for the spatial partitioning procedure
2 be the iteration limit, we
the time complexity equals to:
( +  − 1) · 2</p>
        <p>· log</p>
      </sec>
      <sec id="sec-3-4">
        <title>Thus, the total runtime complexity for the whole spatial partitioning phase becomes:</title>
        <p>︁( (︀ ( + ) · ︀) · log 
︁)</p>
      </sec>
      <sec id="sec-3-5">
        <title>This constitutes an additional efort that is induced by the spatial partitioning method, however this increase is negligible compared to the complexity gain that we have from the complete distributed implementation.</title>
        <sec id="sec-3-5-1">
          <title>3.3. Merging local clusters</title>
          <p>A crucial step for the integrity of the method’s results
is the merging of the discovered clusters. The process
followed for the merging of local clusters into global ones
is similar for the random and the spatial partitioning
implemetations. The main diference lies in the way
common members of the clusters emerge, which are later
used for cluster merging.
3.3.1. Random partitioning (dTRACLUS-R)
For the random partitioning implementation, to be able
to merge the local clusters we use extra line segments to
build the spatial indexes each worker uses. Note though
that these line segments are not considered as part of the
worker’s dataset. This means that these line segments
can be returned by querying the spatial index, but are
considered as not belonging to the partition dataset.
During the line segment clustering phase, each time a cluster
is expanded by one line segment, this line segment is
checked if it belongs to the worker’s dataset. If it does
not belong to it, the line segment ID and the cluster in
which it was assigned are added in a special “duplicates”
accumulator to be further processed by the driver
program after the end of this distributed phase. When the
line segment clustering phase is completed, duplicate
values are removed from the “duplicates” accumulator and
the final merging couples are determined.
3.3.2. Spatial partitioning (dTRACLUS-S)
In the spatial partitioning case, the merging procedure is
simpler, as each worker contains the same line segments
in both its spatial index and its dataset partition, so there
is no need to store possible duplicates in a special-purpose
accumulator, as was the case in random partitioning. The
common line segments that will be used as “bridges” for
cluster merging are already specified by the spatial
partitioning phase itself. This happens because line segments
are two-dimensional objects that can span one or more
spatial partitioning rectangles. This results to few
common line segments between the spatial partitions. Thus,
in the merging phase, the algorithm searches for these
segments to use for the local cluster merging.</p>
          <p>The final clusters obtained after the merging phase
are expected to be slightly diferent from the clusters
extracted from the single-threaded TRACLUS
implementation. This is because in both cases, line segments that
would normally be grouped to the same cluster, are
dispersed in other partitions. It is possible that some of
these partitions might contain so few cluster members,
that they are considered as noise by the line segment
clustering phase and never make it to the cluster merging
phase. The rejected segments could either be considered
as members of bigger clusters, or as common segments
used to merge two or more clusters during the merging
phase. This deviation is expected to be larger for the
random partitioning implementation, as the dispersion
of nearby line segments between the resulting partitions
is higher than dispersion caused by spatial partitioning.
3.4. Clustering significance To test the proposed methods, we conducted experiments
using a part of NOAA vessels dataset, with data of June
The final phase generates the cluster representatives, 2019. The datasets used for the experiments contained
apwhich act as cluster medoids and describe the average proximately 50, 100, 200, 500, 1000, 2000, 5000 and 10000
movement of the objects included in each discovered trajectories, representing a total of 35K to 6M points.
cluster. Intuitively, a representative that lies close to the The average length of the datasets’ trajectories spanned
segments of its corresponding cluster, describes it better between 587 to 734 points, having a standard deviation
than one lying further away. Randomly generated data ranging between 311 to 321 points.
in the wider area of the representative are expected to be Experiments were run on an Apache Spark standalone
more distant to the representative than to the real cluster cluster installed on a VirtualBox virtual machine running
members, as no spatial criterion is considered for their Linux Ubuntu 18.04 with 14 CPU cores and 28 Gigabytes
generation, contrary to the real data selected by spatial of RAM. Our implementation was based on Alex
Polneighborhood queries during the clustering phase. cyn’s TRACLUS implementation in Python [27], which</p>
          <p>To evaluate the clustering eficiency of the single- was properly modified to run in a distributed manner
threaded version and our proposed distributed one, we using PySpark according to the approaches we propose
employ statistical significance tests, used to determine if in Section 3. For the spatial indexing part, we use the
the distances of the cluster members to the cluster repre- Pyrtree [28], an R-tree implementation written in pure
sentative are similar to the distances of random segments. Python. Our implementation is openly available online.1</p>
          <p>To achieve this, for each generated cluster
representative, its minimum bounding box is calculated. We
generate random line segments inside this box, equal in
number to the line segments that belong to the cluster. The
next step is to calculate the Fréchet distance between the
cluster members and each line segment of the
representative, and store the minimum. The process is repeated
for the randomly generated line segments. In the end,
we come up with two distance distributions, which are
tested using Z-test and Kolmogorov-Smirnov test, to
obtain a measure of significant diference. This measure
indicates if the randomly generated (fake) line segments
are significantly further and dissimilar to the
representative than the real segments of the cluster and can be used
to distinguish between tightly (significant) and loosely
(non-significant) connected clusters, acting as a measure
of quality for the generated clusters and how well they
are described by their representative. An example of this
process is shown in Fig. 5.</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>4.1. Line segment clustering runtime</title>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>The main goal of the distributed implementation is to</title>
        <p>make TRACLUS run faster for large trajectory datasets.
By evaluating the single-threaded implementation we
observe that the line segment clustering phase requires
most of the overall runtime to complete (Figure 6).</p>
        <p>Single-thread TRACLUS phases
105
104
)s
ond103
sce
(e102
m
it
n
uR101
100
0
2000
4000 6000
No. of Trajectories</p>
        <p>Partitioning
Indexing
Clustering
Representatives
8000 10000</p>
        <p>Now, the distributed implementation of the line
segment clustering phase significantly reduces the total
runtime cost, as we can see in Figure 7. This outcome is
anticipated based on the complexity analysis of the
distributed implementation (cf. Section 3.2). The smaller
size of the spatial indexes sent to the workers, enables
them to complete respectively smaller processing tasks
during the clustering phase in less time.</p>
        <p>TRACLUS
dTraClus-S (14 workers)
dTraClus-R (14 workers)
102</p>
        <p>103
No. of Trajectories
104</p>
        <sec id="sec-3-6-1">
          <title>4.2. Representatives and significance</title>
          <p>A way to determine the quality of the algorithm’s output This measure of statistical significance can be used for
between the presented implementations is the number hyperparameter tuning, as a better choice for the epsilon
of the representative trajectories generated at the final TRACLUS parameter and can lead to more significant
phase, compared to the number of representatives gener- clustering in cases such as the ones presented in Figure 9.
ated by the base line (single-threaded) implementation. There, the inclusion of the remote line segments in the
A number of representatives close to the baseline im- clusters could be avoided. Moreover, this measure can
plies both successful cluster discovery and good perfor- be used as a filtering criterion to indicate cases where
mance on local cluster merging. However, as mentioned representatives lie away from their corresponding cluster
in Section 3.3, there can be deviations between the single- members and should be rejected.
threaded and the distributed implementation results due
to the dispersion of line segments to diferent partitions. 4.3. Overall performance and scalability</p>
          <p>Evaluation of the clustering significance shows that in
most of the cases, significant scores for both the Z-test We compared the total runtime results of the baseline
and Kolmogorov-Smirnov statistical significance tests single-threaded implementation with the distributed
are achieved. This phenomenon is due to the given TR- ones and, as we can see in Figure 7, the distributed
impleACLUS hyperparameters (big neighborhood ratio) and mentations outperform by far the single-threaded one.
the sparsity and nature of the dataset (vessel trajectories) We can also see that the diference in execution time
inwhich led to fairly dense clusters, compared to their sur- creases as the datasets grow with respect to the number
roundings. For the committed experiments, the highly of trajectories and, consequently, the points they contain.
significant clusterings illustrate tightly-connected clus- To improve the distributed implementation performance
ters when plotted on a map, as shown in Figure 8. for big data and achieve better scalability, more workers
can be added. To take full advantage of the extra workers,
the dataset partitions should be equal or more than the
number of the available workers.</p>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>Clusterings of no significance were also present in some rare cases. Plotting their structures reveals looselyconnected clusters, often dispersed across multiple “subclusters” and not assigned to coherent ones (Figure 9).</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion and future work</title>
      <sec id="sec-4-1">
        <title>In this paper we proposed two distributed TRACLUS</title>
        <p>implementations, based on random and spatial
partitioning of the initial dataset. Experiments conducted on an
Apache Spark cluster revealed that the proposed
methods significantly outperform the original single-threaded
TRACLUS implementation in terms of execution runtime
as the dataset size increases. The clustering results of
the distributed implementations maintain an acceptable
clustering performance, which was evaluated using
statistical significance tests. Possible improvements of the
proposed methods could be a faster, possibly distributed,
merging strategy for the local clusters. Moreover,
studying the performance of the proposed methods with even
larger datasets on multi-node, distributed Apache Spark [12] S. Ghosh, S. K. Ghosh, R. Buyya, Movcloud: A
cloudclusters can reveal more about the scalability of the pro- enabled framework to analyse movement behaviors,
posed implementations. in: IEEE Int. Conf. on Cloud Computing Technology
and Science (CloudCom), 2019, pp. 239–246.
[13] Z. Li, J. Lee, X. Li, J. Han, Incremental clustering
Acknowledgments for trajectories, in: Database Systems for Advanced
Applications: 15th Int. Conf., 2010, pp. 32–46.</p>
        <p>This research is supported by European Union’s Hori- [14] P. Tampakis, N. Pelekis, C. Doulkeridis, Y.
Theodorzon 2020 RIA programme under grant agreement No idis, Scalable distributed subtrajectory clustering,
957237, project ENABLING MARITIME DIGITALIZA- IEEE Int. Conf. on Big Data (2019).
TION BY EXTREME-SCALE ANALYTICS, AI AND DIGI- [15] N. Pelekis, P. Tampakis, M. Vodas, C. Doulkeridis,
TAL TWINS (VesselAI). Y. Theodoridis, On temporal-constrained
subtrajectory cluster analysis, Data Mining and
KnowlReferences edge Discovery 31 (2017).</p>
        <p>[16] D. Zissis, K. Chatzikokolakis, G. Spiliopoulos, M.
Vo[1] D. Xu, Y. Tian, A comprehensive survey of clus- das, A distributed spatial method for modeling
martering algorithms, Annals of Data Science (2015) itime routes, IEEE Access 8 (2020) 47556–47568.
165–193. [17] O. L. Hsu, C. Lee, Common sub-trajectory
cluster[2] A. Jain, R. Dubes, Algorithms for clustering data, ing via hypercubes in spatiotemporal space, IEEE</p>
        <p>Prentice-Hall, Inc, Upper Saddle River, 1988. Access 8 (2020) 23369–23377.
[3] M. Ester, H. Kriegel, J. Sander, X. Xu, A density- [18] C. Jiashun, A new trajectory clustering algorithm
based algorithm for discovering clusters in large based on traclus, in: Int. Conf. on Computer Science
spatial databases with noise, in: Proc. of the Sec- and Network Technology, 2012, pp. 783–787.
ond Int. Conf. on Knowledge Discovery and Data [19] E. Keogh, S. Lonardi, C. Ratanamahatana, Towards
Mining, KDD’96, AAAI Press, 1996, p. 226–231. parameter-free data mining, 2004, pp. 206–215.
[4] R. Barnwal, S. Baride, S. Majumder, S. Ghosh, A [20] M. M. A. Patwary, D. Palsetia, A. Agrawal, W. Liao,
density-based algorithm for detecting anomalous F. Manne, A. Choudhary, A new scalable parallel
trajectories, in: Int. Conf. on Microelectronics, Com- dbscan algorithm using the disjoint-set data
strucputing and Comm. (MicroCom), IEEE, 2016, pp. 1–4. ture, in: Int. Conf. on High Performance Comp.,
[5] I. Kontopoulos, I. Varlamis, K. Tserpes, A dis- Networking, Storage and Analysis, 2012, pp. 1–11.
tributed framework for extracting maritime trafic [21] F. Gouineau, T. Landry, T. Triplet, Patchwork, a
patterns, Int. Journal of Geographical Information scalable density-grid clustering algorithm, in: Proc.</p>
        <p>Science 35 (2020) 1–26. of the 31st Annual ACM Symposium on Applied
[6] X. Li, J. Han, J. Lee, H. Gonzalez, Trafic density- Computing, ACM, 2016, p. 824–831.
based discovery of hot routes in road networks, in: [22] X. Hu, L. Liu, N. Qiu, M. Li, A mapreduce-based
Advances in Spatial and Temporal Databases: 10th improvement algorithm for dbscan, Journal of
AlInt. Symposium, 2007, pp. 441–459. gorithms &amp; Computational Technology 12 (2017).
[7] H. Munaga, M. Sree, J. Murthy, Dentrac: A density [23] Y. He, H. Tan, W. Luo, S. Feng, J. Fan, Mr-dbscan:
based trajectory clustering tool, Int. Journal of a scalable mapreduce-based dbscan algorithm for
Computer Applications 41 (2012) 17–21. heavily skewed data, Front. of Comp. Science 8
[8] D. Das, D. Mishra, Unsupervised anomalous trajec- (2014).</p>
        <p>tory detection for crowded scenes, in: IEEE 13th [24] M. J. Berger, S. H. Bokhari, A partitioning strategy
Int. Conf. on Industrial and Information Systems for nonuniform problems on multiprocessors, IEEE
(ICIIS), 2018, pp. 27–31. Transactions on Computers C-36 (1987) 570–580.
[9] J. Lee, J. Han, K. Whang, Trajectory clustering: [25] X. Fu, S. Hu, Y. Wang, Research of parallel dbscan
A partition-and-group framework, SIGMOD ’07, clustering algorithm based on mapreduce, Int.
JourACM, New York, NY, USA, 2007, p. 593–604. nal of Database Theory and Application 7 (2014)
[10] H. Mustafa, C. Barrus, E. Leal, L. Gruenwald, Gtr- 41–48.</p>
        <p>aclus: A local trajectory clustering algorithm for [26] S. Bhardwaj, S. Dash, Vdmr-dbscan: Varied density
gpus, in: 2021 IEEE 37th Int. Conf. on Data Engi- mapreduce dbscan, in: Big Data Analytics: 4th Int.
neering Workshops (ICDEW), 2021, pp. 30–35. Conf., volume 9498, 2015, pp. 134–150.
[11] P. Sun, S. Xia, G. Yuan, D. Li, An overview of moving [27] A. Polcyn, traclus_impl, https://github.com/
object trajectory compression algorithms, Mathe- apolcyn/traclus_impl, 2016.
matical Problems in Engineering 2016 (2016) 1–13. [28] A. S. Peleg, pyrtree, https://github.com/Rhoana/
pyrtree, 2018.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>