Hot Spot Analysis for Big Trajectory Data in Road Networks
                         Panagiota Keziou, Christos Doulkeridis
                         Department of Digital Systems, University of Piraeus, Greece


                                           Abstract
                                           Hot spot analysis is the problem of identifying statistically significant spatial clusters and is typically applied on point data. The aim
                                           of this paper is to discover hot spots in an urban environment, namely road segments with statistically significant amount of traffic
                                           congestion, using massive GPS data of moving objects. To this end, we adjust the Getis-Ord index for hot spot discovery to become
                                           applicable for road networks. Then, we propose two data-parallel algorithms for hot spot discovery in road networks implemented in
                                           Apache Spark; an exact algorithm that has scalability limitations for very large data sets, and a scalable approximate algorithm that
                                           balances between performance and accuracy. We provide experiments over a large real-life data set that indicate the salient features of
                                           our approach.

                                           Keywords
                                           Hot spot analysis, road networks, big data, urban mobility


                         1. Introduction                                                                                                       millions of GPS records per temporal interval. Therefore, we
                                                                                                                                               need to design scalable algorithms for hot spot analysis by
                         In recent years, the analysis of mobility data has drawn                                                              exploiting big data technologies. To this end, we introduce
                         the attention of both the academic society and the industry.                                                          two data-parallel algorithms for hot spot discovery in road
                         Tracking of moving objects on road networks is ubiquitous                                                             networks, with a goal to be applicable for massive trajectory
                         in fleet management applications and can support diverse                                                              data sets. The first algorithm returns the exact result, but
                         scenarios of urban planning, more efficient transportation,                                                           faces performance limitations for really large data sets due
                         reduction of greenhouse gas emissions, and improvement                                                                to the underlying exchange of information between pairs
                         of life quality. Due to the technological evolution of GPS                                                            of edges. Then, we present an approximate algorithm with
                         sensors and their integration in mobile devices and cars, it                                                          a cut-off threshold for excluding edges that are located too
                         is possible to collect vast amounts of spatio-temporal data.                                                          far away spatially or temporally, as their contribution to
                         Thus, our work is motivated by the need for identifying                                                               hot spot discovery is minimal. It turns out that our approxi-
                         traffic congestion from GPS data on the road network.                                                                 mate algorithm provides an interesting trade-off between
                            In this paper, we present an approach for identifying                                                              scalability and accuracy, returning hot spots that are very
                         traffic hot spots, using big trajectory data collected by GPS                                                         similar to ones discovered by the exact algorithm but orders
                         sensors. Through hot spot analysis, we aim to identify road                                                           of magnitude faster.
                         segments with statistically significant high levels of traffic                                                           In summary, we make the following contributions:
                         congestion. This means that we do not just aim to identify
                         road segments that have high congestion values; such road                                                                  • We formulate the problem of hot spot analysis in
                         segments may be of interest, but they are not necessarily                                                                    road networks, by the means of Getis-Ord Statistic,
                         statistically significant. Instead, a hot spot is statistically                                                              appropriately tailored for graphs.
                         significant if it has high congestion value but additionally is                                                            • We design and implement an exact algorithm in
                         surrounded by other segments with high congestion values.                                                                    Apache Spark that computes hot spots in parallel.
                            For this purpose, we employ the Getis-Ord Statistic index,                                                              • We present a scalable approximate algorithm that
                         a popular metric for hot spot analysis, which – to the best of                                                               discovers hot spots of high accuracy but significantly
                         our knowledge – has so far been used for spatial and spatio-                                                                 faster.
                         temporal data of objects with free movement. Instead, we                                                                   • We evaluate the performance and scalability of our
                         adapt the Getis-Ord index to become applicable for graphs                                                                    algorithms for different parameters, using a real-life
                         that represent the road network, such as the ones obtain-                                                                    data set from the Metropolitan Area of Athens.
                         able via OpenStreetMap. Moreover, to handle the temporal
                         dimension, we consider different temporal snapshots of the                                                              The rest of this paper is structured as follows: Section 2
                         road network whose duration is user-defined according to                                                              reviews related research efforts. Section 3 formulates the
                         the application requirements. Effectively, we have multiple                                                           problem under discussion. In Section 4, we present our
                         snapshots of the same graph, each corresponding to the                                                                approach for hot spot analysis over road networks, while in
                         same road network but for a different time interval and thus                                                          Section 5 we present the parallel algorithms for detecting
                         each road segment has different traffic congestion values.                                                            hot spots. Then, in Section 6 describes the results of our
                         In this way, we formulate the problem of hot spot discovery                                                           evaluation, while in Section 7, we conclude the paper.
                         in road networks using a model that consists of a sequence
                         of temporal snapshots of the same graph (road network).
                            Computing the hot spots using this model is a processing-
                                                                                                                                               2. Related Work
                         intensive operation, which relies on diffusion of information                                                         Existing works in hot spot analysis over mobility data can be
                         over (parts of) the graphs. Moreover, the necessary data                                                              broadly classified in two categories: (a) point-based hot spot
                         to infer traffic congestion on roads may be in the order of                                                           analysis, where mobility data consist of individual spatial
                                                                                                                                               or spatio-temporal points, and (b) trajectory-based hot spot
                          Published in the Proceedings of the Workshops of the EDBT/ICDT 2025                                                  analysis, where sequences of spatio-temporal points that
                          Joint Conference (March 25-28, 2025), Barcelona, Spain                                                               belong to the same user/vehicle/vessel (i.e., trajectories) are
                          $ keziou@hotmail.com (P. Keziou); cdoulk@unipi.gr (C. Doulkeridis)
                                   © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribu-       considered. As our work belongs to the latter category, we
                                   tion 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
                                                                                                                                           1
briefly review point-based approaches and put our main            propose the density-based algorithm FlowScan to discover
focus on trajectory-based approaches.                             patterns in the flow of traffic, referred to as “hot routes”. By
                                                                  defining the minimum common traffic that successive road
2.1. Point-based Hot Spot Analysis                                segments should share and the the number of edges that
                                                                  form a neighborhood, clusters of the moving objects are
Eftelioglu et al. [1] propose the CGC (Cubic Grid Circle) al-     detected. The algorithm is capable of recognizing complex
gorithm for the problem of detecting geographically robust        spatial patterns. The OPS methodology proposed by Häsner
hot spots, modeled as circles in a 2D space. The proposed         et al. [8] predicts near future traffic hot spots, based on
algorithm is an improvement of SaTScan algorithm, as it           recent positions of vehicles on the road network. A heuristic
detects non-contiguous or sparse center hot spots and elim-       method is employed to predict the future position of the
inates very small hot spots. In their studies, [2] and [3]        vehicles, and a weight is assigned to the nodes of the road
expand the methodology for ring-shaped and elliptical hot         network that reflects the likely traffic intensity within a
spots detection, respectively. In [4], a density-based clus-      specified time window. Finally, through an outlier detection
tering algorithm (sig-DBSCAN) improves the quality of re-         approach, hot spots nodes are detected and the DBSCAN
sults by discarding random patterns, while the use of Dual-       algorithm derives the sub-graphs that represent traffic hot
Convergence algorithm significantly reduces computational         spot regions.
cost. For the above studies, the statistical significance of         Sacharidis et al. [10] propose a framework for on-line
the proposed hot spots is assessed via Monte Carlo simula-        discovery of hot motion paths, in order to detect frequently
tion. In contrast, Nikitopoulos et al. [5] propose a parallel     travelled paths. Their study differentiates in the fact that
and scalable solution for hot spot analysis over big spatio-      the hotness of a road segment is determined as a function
temporal data. The spatio-temporal domain is divided into         of the amount of time the moving object has spent on the
3D cells and the Getis-Ord 𝐺*𝑖 statistic is used to detect the    path. Similarly, the studies in [11] and [12] detect hot spots
top-𝑘 statistically significant cells:                            without taking into account the road network restrictions.
                    ∑︀𝑛                 ∑︀𝑛                       In [11], the authors address the problem of hot spot analy-
             *         𝑗=1 𝑤𝑖,𝑗 𝑥𝑗 − 𝑋    𝑗=1 𝑤𝑖,𝑗                sis over big trajectory data in a parallel and scalable way.
           𝐺𝑖 = √︂ ∑︀                                       (1)
                       [𝑛   𝑛    2
                            𝑗=1 𝑤𝑖,𝑗 −(
                                       ∑︀ 𝑛         2
                                          𝑗=1 𝑤𝑖,𝑗 ) ]
                                                                  Their approach is based on spatio-temporal partitioning
                  𝑆                𝑛−1                            of the 3D space in cells. The Getis-Ord index is employed
                                                                  and appropriately tailored to reveal hot spots in terms of
where 𝑥𝑗 is the attribute value for cell 𝑗, 𝑤𝑖,𝑗 is the spatial
                                                                  spatio-temporal cells. Two algorithms are proposed: the
weight between cell 𝑖 and 𝑗, 𝑛 is equal to the total number
                                                                  THS algorithm defines the z-score of the cell under study
of cells, and 𝑋 and 𝑆 represent the mean and standard
                                                                  depending to the cell values of the whole grid, which is com-
deviation respectively:
                                                                  putationally expensive. The aTHS algorithm as the authors
                            ∑︀𝑛                                   mentions trade-off accuracy for efficiency in a controlled
                               𝑗=1 𝑥𝑗
                      𝑋=                                          manner and ignore the contribution of cells located out side
                                𝑛
                                                                  a predefined area of influence. Finally, Qiao et al. [12] in-
                                                                  troduce a cloud-based analytical framework for big mobile
                      √︃ ∑︀
                           𝑛     2
                           𝑗=1 𝑥𝑗                                 data from the view of user mobility in densely populated
                𝑆=                 − (𝑋)2
                             𝑛                                    areas, collected from 2G/3G/4G networks.
In our work, we adapt the Getis-Ord statistic to be appli-           Our work differs from these approaches as we consider
cable on graphs representing temporal snapshots of road           road segments as candidate hot spots and adapt a well-
networks. The attribute 𝑥𝑗 can be any indicator of traffic        known hot spot index to be applicable on temporal graphs
congestion, such as the average speed or the number of            that represent the road network for different temporal in-
vehicles per time unit.                                           tervals. In this way, we capture the effect of spatial and tem-
   The proposed approaches of [6] and [7] take into account       poral proximity of neighboring congested road segments.
the underlying road network. Tang et al. [6] present a dy-        This is a differentiating factor compared to related works
namic segmentation model of the road network, which al-           that operate over trajectories, as they produce hot spots at
lows the identification of statistically significant linear hot   different level of granularity (e.g., routes, paths, grid cells)
spots with high concentration of activities, such as pedes-       and using other metrics. To the best of our knowledge ours
trian accidents on the road network, which otherwise would        is the first work that shows how to apply hot spot analysis
not have been recognized. According to the proposed ap-           via Getis-Ord in road networks.
proach, the shortest paths between activities are identified
and a density index is calculated for each of them. The
algorithm returns all the paths that are statistically signifi-
                                                                  3. Problem Formulation
cant (based on the Monte Carlo method) and their density
value exceeds a threshold. In [7], the Network Isodistance        3.1. Preliminaries
Hotspot Detection (NIHD) problem is studied, which reveals        We consider a directed graph 𝐺(𝑉, 𝐸) that represents the
sub-graph hot spots with high concentration of activities,        road network, where an edge 𝑒𝑖 ∈ 𝐸 corresponds to a road
diffused isotropically along the network. The deducted com-       segment and is defined by two vertices 𝑣, 𝑣 ′ ∈ 𝑉 , which
putational cost of the algorithm is achieved through network      represent crossroads. The edge 𝑒𝑖 is directed, indicating the
partitioning and upper-bound pruning.                             flow of movement on the road network from 𝑣 to 𝑣 ′ . We
                                                                  use the terms edge and road segment interchangeably.
2.2. Trajectory-based Hot Spot Analysis                              To model the effect of time on traffic congestion, we
                                                                  consider a sequence 𝐺𝑇 = {𝐺1 , 𝐺2 , . . . , 𝐺𝑚 } of tempo-
In their studies, Häsner et al. [8] and Li et al. [9] take into   ral snapshots of the graph 𝐺, where each snapshot corre-
account the limitations of the road network. Li et al. [9]
          Figure 1: Left: Example of an edge 𝑒3𝑝 of temporal snapshot 𝐺𝑝 , which is affected only by subsequent edges 𝑒4𝑝 , 𝑒5𝑝 and 𝑒6𝑝
          (in blue). Right: Two temporal snapshots of 𝐺𝑝 and 𝐺𝑞 (𝑞 = 𝑝 + 1), where 𝑒3𝑝 in 𝐺𝑝 is affected also by the edges 𝑒3𝑞 , . . . , 𝑒6𝑞
          of the next snapshot 𝐺𝑞 .


sponds to a pre-defined temporal duration 𝑇 . A snapshot                      The weight 𝑤(𝑖𝑝),(𝑗𝑞) between two edges 𝑒𝑖𝑝 (edge 𝑒𝑖 at
𝐺𝑝 corresponds to an interval [𝑡𝑝 , 𝑡𝑝+1 ) of duration 𝑇 , i.e.,              snapshot 𝐺𝑝 ) and 𝑒𝑗𝑞 (edge 𝑒𝑗 at snapshot 𝐺𝑞 ) indicates
𝑡𝑝+1 − 𝑡𝑝 = 𝑇 . For example, if 𝑇 is set to 5 minutes, one                    the influence of edge 𝑒𝑗𝑞 to edge 𝑒𝑖𝑝 . Intuitively, if a road
such interval could be 9:00 am – 9:05 am. A temporal snap-                    segment is congested, this also affects other road segments
shot 𝐺𝑝 of the graph 𝐺 is used to represent the traffic for                   in its vicinity. More precisely, in this paper, we assume that
the specific time interval [𝑡𝑝 , 𝑡𝑝+1 ). Fig. 1 illustrates the               if a road segment 𝑒 is congested, this affects (or will affect
concept of temporal snapshots of a graph that represents                      in the near future) the previous road segments, i.e., those
the road network.                                                             leading to 𝑒. For example, in Fig. 1 (left), the edge 𝑒3𝑝 (in
                                                                              red color) is affected by the weights of edges 𝑒4𝑝 , 𝑒5𝑝 and
3.2. Problem Formulation                                                      𝑒6𝑝 (in blue color), whereas all other weights are considered
                                                                              equal to zero 𝑤(3𝑝)(1𝑝) = 𝑤(3𝑝)(2𝑝) = 0 .
The problem of hot spot analysis addressed in this work is to                    Moreover, when considering different temporal snap-
identify statistically significant road segments that indicate                shots of the graph 𝐺, the congestion of edges in snapshot
hot spots, due to traffic congestion. An edge 𝑒𝑖 at snapshot                  𝐺𝑖+1 affect edges of previous and next temporal snapshots
𝐺𝑝 , denoted 𝑒𝑖𝑝 ∈ 𝐸(𝐺𝑝 ), where 𝐸(𝐺𝑝 ) represents the                        {. . . , 𝐺𝑖+2 , 𝐺𝑖 , 𝐺𝑖−1 , . . . }. In Fig. 1 (right), edge 𝑒3𝑝 at
edges of the snapshot 𝐺𝑝 , is associated with an attribute                    snapshot 𝐺𝑝 is affected by edges 𝑒3𝑞 , . . . , 𝑒6𝑞 at 𝐺𝑞 .
value 𝑥𝑖𝑝 that indicates traffic congestion on 𝑒𝑖 during the                     According to the above, the problem of hot spot analysis
temporal interval of 𝐺𝑝 . As such, whether an edge 𝑒𝑖𝑝 is                     in road networks is to identify the 𝑘 most statistically sig-
considered hot spot is a function of the edge’s attribute value               nificant road segments (edges) according to the Getis-Ord
𝑥𝑖𝑝 , but also of other neighboring edges’ attribute values. By               statistic and can be formally stated as follows.
neighboring edges, we mean both spatially (nearby edges
on the graph 𝐺) and temporally (edges at previous/next                        Problem 1. (Hot spot analysis in road networks) Given a se-
temporal snapshots).                                                          quence 𝐺𝑇 of temporal snapshots of a directed graph 𝐺(𝑉, 𝐸)
   To express this dependence on neighboring edges, we                        representing the road network, find the top-𝑘 edges 𝐸𝑘 of 𝐺𝑇
                                                                                                                   *              *      *
adjust a commonly used function, namely the Getis-Ord                                             ⋃︀ statistic 𝐺𝑖𝑝 , such that: 𝐺𝑖𝑝 ≥ 𝐺𝑗𝑞 ,
                                                                              based on the Getis-Ord
statistic [13]. To the best of our knowledge, Getis-Ord has                   ∀𝑒𝑖𝑝 ∈ 𝐸𝑘 , 𝑒𝑗𝑞 ∈ ( 𝑚 𝑟=1 𝐸(𝐺   𝑟 ))  − 𝐸𝑘 .
been used so far to discover hot spots in cell-based parti-
tions for spatial (2D) and spatio-temporal (3D) domain, for                      Intuitively, for each road segment we compute a value
point [14, 5] and trajectory data [11]. Our work extends the                  that indicates if it is a hot spot. This value depends on
applicability of Getis-Ord for hot spot discovery in graphs                   the traffic congestion of the road segment itself, but also
representing the road network. Thus, for any edge 𝑒𝑖𝑝 of a                    on the traffic congestion of subsequent road segments (as
snapshot 𝐺𝑝 of 𝐺 we adapt Eq. 1 and define the Getis-Ord                      congestion is propagated backwards). Moreover, it depends
value of the 𝑖-th edge of the 𝑝-th temporal snapshot:                         on the traffic congestion of these road segments in the next
                                                                              temporal snapshots of the network, as illustrated in Fig. 1.
         ∑︀|𝐸| ∑︀𝑚                                 ∑︀|𝐸| ∑︀𝑚                     The novelty of our work lies in the identification of hot
           𝑗=1           𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) 𝑥𝑗𝑞 − 𝑋      𝑗=1     𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞)   spots in road networks using a statistically significant met-
𝐺*𝑖𝑝 =         √︂        ∑︀|𝐸| ∑︀𝑚               ∑︀|𝐸| ∑︀𝑚                    ric. In contrast, existing works (Section 2) either use ad-hoc
                                      2                       2
                    [𝑛           𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) − 𝑗=1    𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) ]
           𝑆               𝑗=1
                                             𝑛−1
                                                                              ways to compute hot spots or compute statistically signifi-
                                                         (2)                  cant hot spots as 2D/3D cells of predefined size. Adapting
where 𝑛 = |𝐸| · 𝑚 is equal to the total number of edges,                      the Getis-Ord statistic to become applicable for graphs is
𝑥𝑗𝑞 is the attribute value for edge 𝑒𝑗 at temporal snapshot                   the main technical contribution of our work. The challenge
𝐺𝑞 , 𝑤(𝑖𝑝),(𝑗𝑞) is the weight imposed by edge 𝑒𝑗𝑞 to edge                     relates to the definition of neighboring edges over multiple
𝑒𝑖𝑝 , and:                                                                    temporal snapshots of the same graph.
                          ∑︀|𝐸| ∑︀𝑚
                            𝑗=1    𝑞=1 𝑥𝑗𝑞
                    𝑋=                                   (3)
                    √︃
                                 𝑛                                            4. Hot Spot Analysis
                       ∑︀|𝐸| ∑︀𝑚     2
                         𝑗=1    𝑞=1 𝑥𝑗𝑞                                       This section presents our approach for hot spot analysis
              𝑆=                        − (𝑋)2           (4)
                              𝑛                                               in road networks. The input is a data set of GPS records,
each corresponding to a specific vehicle (based on identifier
vehID), a timestamp 𝑡, as well as the longitude 𝑥 and latitude
𝑦 values. Also, a graph 𝐺(𝑉, 𝐸) is available that represents
the road network, which consists of road segments, their
geometry, as well as additional information, such as direc-
tion, speed limit, etc. Such a graph can be typically obtained
from external sources, such as OpenStreetMap.
                                                                  Figure 2: Example showing how to compute the entry time for
                                                                  an edge given two positions on successive edges.
4.1. Map-matching
Data collected from GPS sensors typically contain small
errors, thereby making the mapping of GPS coordinates to          congestion in comparison with another edge 𝑒′𝑗𝑞 that is
the underlying road network hard. This problem is typically       further away. As in [18], we use a Gaussian kernel as weight
solved by applying a map-matching algorithm, which is re-         function, defined as follows:
sponsible for mapping each pair of GPS coordinates (longi-
tude, latitude) to another point located on the road network.                                         𝑑2(𝑖𝑝),(𝑗𝑞)
                                                                                 𝑤(𝑖𝑝),(𝑗𝑞) = 𝑒𝑥𝑝(−               )
Several sophisticated algorithms for map-matching have                                                    ℎ2
been developed, such as [15, 16].                                 where ℎ is the kernel’s bandwidth and 𝑑(𝑖𝑝),(𝑗𝑞) denotes the
   As the problem of map-matching is orthogonal to the one        distance between edges 𝑒𝑖𝑝 and 𝑒𝑗𝑞 . The distance function
we are trying to solve, we adopt a simplistic approach where      takes into account both spatial as well as temporal proximity.
each pair of GPS coordinates is assigned to the nearest edge.     As such, we define the distance as a linear combination of
In this way, each GPS record is enriched with an edge (road       spatial and temporal distance:
segment) of the graph, the coordinates of the two points that
define the road segment, its length, as well as with other
available information such as the speed limit.                           𝑑2(𝑖𝑝),(𝑗𝑞) = 𝜇𝑆 (𝑑𝑆          2   𝑇  𝑇
                                                                                            (𝑖𝑝),(𝑗𝑞) ) + 𝜇 (𝑑(𝑖𝑝),(𝑗𝑞) )
                                                                                                                         2


4.2. Estimating an Edge’s Traffic Congestion                      where 𝜇𝑆 , 𝜇𝑇 are scale factors for spatial and temporal dis-
                                                                  tance respectively, whereas 𝑑𝑆(𝑖𝑝),(𝑗𝑞) is the spatial distance
In order to compute the Getis-Ord value of an edge 𝑒𝑖𝑝 , the
                                                                  and 𝑑𝑇(𝑖𝑝),(𝑗𝑞) the temporal distance. The spatial distance
first step is to compute the edge’s attribute value 𝑥𝑖𝑝 , an
indicator of traffic congestion. This is computed as the aver-     (𝑖𝑝),(𝑗𝑞) is measured as the number of edges between 𝑒𝑖
                                                                  𝑑𝑆
age value of contributions from each vehicle that reported        and 𝑒𝑗 on graph 𝐺. The temporal distance is the difference
its position on 𝑒𝑖𝑝 during the timespan of 𝐺𝑝 . In the same       between temporal snapshots, e.g., edges at snapshots 𝐺𝑝
spirit as in [17], we define this contribution as:                and 𝐺𝑝+1 have temporal distance equal to 1.

                               𝑣𝑜𝑏𝑠
                          1−                                      5. Parallel Algorithms
                               𝑣𝑓 𝑓 𝑠
where 𝑣𝑜𝑏𝑠 is the observed speed of a vehicle on edge 𝑒𝑖𝑝 ,       In this section, we present two data-parallel algorithms
whereas 𝑣𝑓 𝑓 𝑠 is the free flow speed for edge 𝑒𝑖 . One way       designed and implemented in Apache Spark that compute
to compute 𝑣𝑜𝑏𝑠 is to exploit the enter (𝑡𝑖𝑛 ) and exit (𝑡𝑜𝑢𝑡 )   hot spots over very-large collections of GPS records in a
time of the vehicle on edge 𝑒𝑖𝑝 . Thus, we have:                  scalable way. The first algorithm is an exact algorithm,
                                                                  whereas the second computes approximate results but with
                                   𝑙
                     𝑣𝑜𝑏𝑠 =                                       high accuracy and much faster.
                              𝑡𝑜𝑢𝑡 − 𝑡𝑖𝑛
                                                                     In Apache Spark, the main data structure is the
where 𝑙 is the length of the edge, and 𝑡𝑖𝑛 (𝑡𝑜𝑢𝑡 ) is the en-     DataFrame, and its predecessor the RDD (Resilient Dis-
try (exit) time. Unfortunately, for a vehicle that provides       tributed Dataset) [19]. RDDs are immutable, distributed
its position on edge 𝑒𝑖𝑝 (after map-matching), we do not          collections of records, which support transformations and
know the exact entry and exit time. Therefore, we estimate        actions. Transformations produce new RDDs from a given
these values, in particular given two successive positions        RDD, with notable examples: map, flatMap, mapValues, fil-
we compute the entry time 𝑡 as follows:                           ter, etc. Actions perform some computation on an RDD
                                                                  and produce a new result (e.g., count). Our algorithms are
                                    𝑡2 − 𝑡1
                   𝑡 = 𝑡 1 + 𝑑1 ·                                 implemented in Spark to support parallel processing.
                                    𝑑1 + 𝑑2
where 𝑡1 and 𝑡2 correspond to the two timestamps, while           5.1. Pre-processing
𝑑1 and 𝑑2 are the distances of the two points from the entry
node, as shown in Fig. 2. In this way, we can estimate the        In the pre-processing phase, we perform map-matching (see
entry/exit times for each edge.                                   Section 4.1) and we split the data of each vehicle in a set of
                                                                  trajectories. For each vehicle, we create a new trajectory
                                                                  when we observe a temporal gap of 𝛿𝑡 in the timestamps
4.3. Defining the Weights                                         of reported positions. The value of 𝛿𝑡 is a parameter that
Given two edges 𝑒𝑖𝑝 and 𝑒𝑗𝑞 , where 𝑒𝑖 precedes 𝑒𝑗 on graph       depends on the sampling rate of the data set at hand.
𝐺, the weight 𝑤(𝑖𝑝),(𝑗𝑞) indicates the influence of traffic          As a result, the output of the pre-processing phase is a
congestion of edge 𝑒𝑗𝑞 to edge 𝑒𝑖𝑝 . The weight is a function     set of records in the following format: (vehID, trajID, 𝑥, 𝑦,
of the distance between the edges, as an edge 𝑒𝑗𝑞 that is         𝑡, 𝑒𝑖 , length, speed, 𝑣, 𝑣 ′ ), where each GPS record (vehID,
in close proximity with 𝑒𝑖𝑝 , has a stronger effect on 𝑒𝑖𝑝 ’s     𝑥, 𝑦, 𝑡) of the original data set is enriched with information
Algorithm 1 Computation of traffic congestion on edges          Algorithm 2 Computation of Getis-Ord index
 1: Input: The result of pre-processing: inputfile, num-         1: Input: The attrRDD: RDD[(𝑒𝑖 ,𝑡𝑝 ),𝑥𝑖𝑝 ], number of
    ber of Apache Spark partitions 𝑃 , temporal parameters:         Apache Spark partitions 𝑃 , temporal parameters: 𝑡𝑚𝑖𝑛 ,
    𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚                                             𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚, bandwidth ℎ, output of stage 2: 𝑋, 𝑆, 𝐺
 2: Output: RDD[(𝑒𝑖 ,𝑡𝑝 ),𝑥𝑖𝑝 ]                                  2: Output: RDD[(𝑒𝑖 ,𝑡𝑝 ),𝐺𝑖𝑝 ]
                                                                                             *

 3: function                                                     3: function
 4: dataRDD = sc.load(inputfile)                                 4: GetisOrdRDD = attrRDD.flatMap(lambda x : GetisOrd-
 5: attrRDD = dataRDD.groupByKey().                                 Calculations(x, 𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚, ℎ, 𝑋, 𝑆, 𝐺))).
 6:     mapValues(sort_grouped_values).                          5:     partitionBy(𝑃 , lambda k: hash_partitioner).
 7:     flatMap(lambda group: calc_attribute_value(group,        6:     reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1],
    𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚)).                                          x[2] + y[2])).
 8:     partitionBy(𝑃 , lambda k: hash_partitioner).             7:     mapValues(lambda x: Gi_formula(x, 𝑋, 𝑆))
 9:     mapValues(lambda x: (x,1)).                              8: end function
10:     reduceByKey(lambda x, y: (x[0] + y[0], x[1] +
    y[1])).
11:     mapValues(lambda x: x[0] / x[1])                        on the contributions of all vehicles (lines 9–11). It should
12: end function                                                be noted that the function calc_attribute_value() performs
                                                                the computation of 𝑥𝑖𝑝 , as described in Section 4.2.
                                                                   Then, in stage 2, we compute the aggregate values neces-
about the road segment (𝑒𝑖 , length, speed, 𝑣, 𝑣 ′ ) and each   sary for the computation of Getid-Ord. These include the
record is assigned with a trajectory identifier (trajID). The   mean 𝑋, the standard deviation 𝑆, and the total number
values 𝑣 and 𝑣 ′ correspond to the start and end vertices of    of edges 𝑛 for all temporal snapshots of the graph 𝐺. This
edge 𝑒𝑖 , while speed is the speed limit which is typically     information together with the graph 𝐺 is broadcast to all
provided in the road network data set and we use it as free     worker nodes in the cluster, so as to be available during the
flow speed 𝑣𝑓 𝑓 𝑠 .                                             computation of Getis-Ord.
                                                                   In stage 3, we compute the Getis-Ord index 𝐺*𝑖𝑝 per edge
5.2. The Exact Algorithm                                        𝑒𝑖𝑝 and temporal snapshot 𝐺𝑝 . Algorithm 2 provides the
                                                                pseudocode for the computation of Getis-Ord values. First,
The main stages of our algorithm are summarized below:          using flatMap (line 4), we transform attrRDD into a new
                                                                RDD with key the edge 𝑒𝑖𝑝 and value a triple:
     • Stage 1: A set of transformations and aggregations
       is applied to the input data set to assign a traffic                               2
                                                                            𝑤(𝑖𝑝),(𝑗𝑞) , 𝑤(𝑖𝑝),(𝑗𝑞) , 𝑤(𝑖𝑝),(𝑗𝑞) 𝑥𝑗𝑞
       congestion value 𝑥𝑖𝑝 to each edge 𝑒𝑖𝑝 .
     • Stage 2: The road network is represented in the          that corresponds to another edge 𝑒𝑗𝑞 . This computation is
       form of a graph and the mean value (𝑋) and stan-         performed by function GetisOrdCalculations(), which takes
       dard deviation (𝑆) of the traffic congestion variable    into account neighboring edges 𝑒𝑗𝑞 to the current edge 𝑒𝑖𝑝
       are computed. This data is assigned to a broadcast       and computes their contribution to 𝑒𝑖𝑝 in terms of weight,
       variable in order to become available to all the nodes   squared weight and weight times attribute value, which are
       of the cluster.                                          the basic constituent parts for computing Getis-Ord. Notice
     • Stage 3: The 𝐺*𝑖𝑝 z-score for each edge 𝑒𝑖𝑝 is com-      that this function takes as parameters the information about
       puted.                                                   temporal snapshots, the kernel’s bandwidth ℎ, as well the
     • Stage 4: The results of the previous step are filtered   broadcast data. Then, we perform hash partitioning based
       for a level of confidence and the top-𝑘 edges with       on edge in order to accumulate all records that correspond
       the highest 𝐺*𝑖𝑝 z-score are selected.                   the same edge for the same timestamp (line 5), in order to
                                                                compute the aggregate values (line 6):
It should be noted that stages 1 and 3 are the time-consuming
                                                                                     |𝐸| 𝑚
steps, therefore we delve into their details next. Stage 3 is                        ∑︁ ∑︁
the dominant cost, while stage 1 practically involves reading                                  𝑤(𝑖𝑝),(𝑗𝑞)
data from disk, but it is still much lower than the cost of                          𝑗=1 𝑞=1

stage 3.                                                                             |𝐸| 𝑚
    Algorithm 1 describes how the traffic congestion values
                                                                                     ∑︁ ∑︁      2
                                                                                               𝑤(𝑖𝑝),(𝑗𝑞)
are computed (i.e., stage 1). First, we create an RDD of key-                        𝑗=1 𝑞=1
value pairs (known as PairRDD) with trajID as key (group-
                                                                                   |𝐸| 𝑚
ByKey, line 5) and value the records that correspond to                            ∑︁ ∑︁
this trajectory in sorted temporal order (mapValues, line 5).                                𝑤(𝑖𝑝),(𝑗𝑞) 𝑥𝑗𝑞
                                                                                   𝑗=1 𝑞=1
Then, for subsequent edges of each trajectory we compute
the congestion value per vehicle per edge (using function       Thereafter, we compute the Getis-Ord value 𝐺*𝑖𝑝 per edge 𝑒𝑖
calc_attribute_value()) and we create a new PairRDD with        and per temporal snapshot 𝐺𝑝 using function Gi_formula()
a composite key that consists of the edge and the temporal      (line 7).
snapshot, while the value is the congestion value (flatMap,        Finally, in stage 4, the edges are ordered based on the
line 7). Subsequently, we partition the records based on edge   computed Getis-Ord value and only the top-𝑘 edges are
using hash partitioning (line 8). In other words, all records   returned. Obviously, 𝑘 is a parameter that is set according
that have been mapped to the 𝑖-th edge 𝑒𝑖𝑝 at snapshot 𝐺𝑝       to the application requirements.
are assigned to the same partition. This allows the compu-
tation of the average congestion value 𝑥𝑖𝑝 per edge based
                                                                  Parameters                                       Values
                                                                  Size of the data set (x106 GPS records)          2.5, 5, 7.4
                                                                  Duration (𝑇 ) of temporal snapshots (in hrs)     24, 48, 168
                                                                  Number of partitions (𝑃 )                        6, 8, 18, 60, 120
                                                                  Cut-off threshold (𝑐) in hops                    6, 12, 25, ∞


                                                                Table 1
                                                                Experimental parameters and values (default values in bold).


                                                                6. Experimental Evaluation
Figure 3: The data set used in the experimental study.
                                                                We evaluate the performance of our approach for urban hot
                                                                spot analysis. We implemented both algorithms in PySpark,
                                                                using Apache Spark 3.1.1 Core API.
5.3. Discussion
From the analysis of the exact algorithm we can derive          6.1. Experimental Setup
useful information about its complexity with respect to the
                                                                Platform. We deployed our code in Google Cloud Platform
input parameters. We focus on the number of records that
                                                                (GCP). The platform supports the process of big data in
need to be processed, as they influence the processing cost.
                                                                Apache Spark. The initiated cluster runs on Debian 10,
   In Algorithm 1, the size of dataRDD is 𝑂(𝑁 ), where 𝑁
                                                                while we used the version Spark 3.1.1 in Hadoop 3.2.2. The
denotes the number of GPS records. Then, the algorithm
                                                                cluster consisted of 4 nodes in total, 1 Master and 3 Workers.
builds the attrRDD which contains one record per edge per
                                                                Each node has 2vCPU 13GB RAM, 500GB hard drive and
temporal snapshot. If |𝐸| is used to denote the number of
                                                                375GB SSD.
edges of graph 𝐺 (the road network) and 𝑚 denotes the
                                                                   Data Set. We employed a real data set of anonymized
number of temporal snapshots of 𝐺, then attrRDD contains
                                                                GPS records collected by a fleet management provider. The
𝑂(|𝐸| · 𝑚) records. As such, its size is linearly dependent
                                                                data was collected over a a period of five months from July
on the size of the road network (in edges) and the number
                                                                to November 2018, consisting of 368,977 individual vehicle
of temporal snapshots.
                                                                trajectories moving at the Metropolitan Area of Athens
   In Algorithm 2, the attrRDD is given as input and an-
                                                                in Greece (Fig. 3). This data set is 1.1GB in total size and
other RDD called GetisOrdRDD is built. For a given tem-
                                                                contains approximately 7.4M (million) GPS records. We
poral snapshot, this contains 𝑂(|𝐸|2 ) records because it
                                                                also create two subsets of size 5M and 2.5M records. After
combines each edge with its neighboring edges with non-
                                                                pre-processing (Section 5.1), each record consists of the
zero congestion value (in worst case with all edges). When
                                                                fields: (vehID, trajID, 𝑥, 𝑦, 𝑡, 𝑒𝑖 , length, speed, 𝑣, 𝑣 ′ ). Even
other temporal snapshots are considered too, the number of
                                                                though the specific data set is sparse and medium-sized, it
records increases exponentially. Distributing these records
                                                                is still a real-world data set that allows the evaluation of our
and aggregating per edge is the main factor that affects the
                                                                approach.
processing cost of the exact algorithm.
                                                                   Metrics. Our main evaluation metric was the execution
                                                                time needed for each individual stage of the algorithm to be
5.4. The Approximate Algorithm                                  completed. The execution time is measured in seconds.
The previous algorithm computes the exact Getis-Ord z-             Evaluation Methodology. The efficiency of our algo-
scores, but it is computationally expensive because every       rithm was assessed based on four parameters: (a) the size
edge 𝑒𝑗𝑞 with non-zero congestion value (𝑥𝑗𝑞 ) has an influ-    of the data set, (b) the temporal window 𝑇 , which deter-
ence on the z-score of all other edges. However, edges at       mines the number of temporal snapshots, (c) the number
high distance (in spatial or temporal terms) are expected to    of Apache Spark partitions 𝑃 , and (d) the cut-off threshold
have extremely small impact on the computed z-score.            𝑐, used only by the approximate algorithm (for the exact
   Motivated by this observation, we propose the use of an      𝑐 = ∞), that defines the neighboring edges in the axes of
approximate algorithm that takes as input a cut-off threshold   time and space that contribute in Getis-Ord z-score, mea-
𝑐, which determines the subset of neighboring edges that        sured in hops. After different tests, we also set the kernel’s
will be considered when computing the Getis-Ord values of       bandwidth ℎ = 8 for the exact algorithm, while ℎ = 23 · 𝑐
a given edge. In spatial terms, 𝑐 corresponds to number of      for the approximate algorithm. Our experimental setup is
hops between two edges. In temporal terms, 𝑐 corresponds        summarized in Table 1.
to the number of (previous/next) temporal snapshots that
should be considered. In this way, we limit the number          6.2. Results for the Exact Algorithm
of records that need to be processed and do not allow this
                                                                Varying the size of the data set. In Fig. 4 (left), we demon-
number to increase exponentially. In turn, this drastically
                                                                strate the performance of the exact algorithm when increas-
reduced number of records that need to be processed makes
                                                                ing the data set size. We measure the time of stage 3, which
the algorithm work much faster, at the expense of slightly
                                                                is the most time-consuming part of the algorithm. As the
inaccurate results.
                                                                number of observations to be processed increases from 2.5M
   As shown in our experimental study, the approximate al-
                                                                to 7.4M, so does the total execution time of the algorithm
gorithm offers an interesting trade-off between performance
                                                                but not linearly. This is due to the fact that the larger data
and accuracy. Thus, we can compute hot spots with high
                                                                set implies more temporal snapshots, hence more edges in
accuracy but in a much shorter execution time.
          Figure 4: Left: execution time of stage 3 of the exact algorithm when increasing the data set size and varying the temporal
          duration 𝑇 of snapshots. Right: total execution time for the data set of 7.4M records for different number of partitions 𝑃 .


total, which all need to be processed for the computation of          shows that our approximate algorithm that uses the cut-off
the Getis-Ord value of an edge.                                       threshold can make the discovery of hot spots practically
   Varying the duration 𝑇 of temporal snapshots. The                  applicable for extremely large data sets.
temporal duration 𝑇 of 24, 48 and 168 hours for a period                 However, the remaining question is how accurate are
of five months leads to the generation of 153, 27 and 22              the results of the approximate algorithm? Fig. 6 shows
non-empty temporal snapshots respectively. Therefore, the             the hot spots discovered by the approximate and the exact
fewer the temporal partitions, the smaller the overall exe-           algorithm. This demonstrates visually that the results of the
cution time. In Fig. 4 (left) the temporal partitioning that          experiment for 𝑐 = 12 are similar to those obtained when
comes from 𝑇 = 48 hrs time window is twice as efficient               applying the exact algorithm.
in terms of execution time as the 𝑇 = 24 hrs time window.                To complement the visual evaluation, we also performed
The 𝑇 = 168 hrs time window is about seven times more                 a quantitative study. We obtained a ranked list of the top-
efficient than 𝑇 = 24 hrs, indicating a linear relationship           100 edges in terms of traffic congestion for 𝑐 = 12 and for
between execution time and temporal duration 𝑇 for the                the exact algorithm. Then, we computed the Spearman’s
data set of 2.5M and 5M records.                                      rank correlation for these two lists, in order to assess if the
   Varying the number 𝑃 of Spark partitions. Fig. 4                   same ordering is observed (i.e., if the ranking for 𝑐 = 12
(right) shows the total execution time of the exact algorithm         resembles the ranking obtained by the exact algorithm). If
and the time of stage 3 for different numbers of Spark par-           the two rankings were identical, we would obtain a value of
titions. The first observation is that the execution time of          1. In our case, we obtained a value of 0.82 for the Spearman’s
stage 3 practically determines the overall execution time.            rank correlation. This indicates that the two lists are highly
Regarding the effect of using a higher number of parti-               correlated, or (put differently) that the hot spots discovered
tions, this seems to reduce execution time. Using more                for 𝑐 = 12 are very similar to the ones of discovered by
partitions results in smaller subsets of data to be processed,        the exact algorithm. In the case of the results using 𝑐 =
which require less processing time. However, the reduction            6, there is less sensitivity in the recognition of extreme
is not that big (about 15% when using 120 partitions for              values, while in the case of 𝑐 = 24 there is hypersensitivity
𝑇 = 24hrs).                                                           resulting in more edges being recognized as hot spots. Our
                                                                      experiments indicate that using a cut-off threshold around
                                                                      𝑐 = 12 provides a nice trade-off between performance and
                                                                      accuracy.


                                                                      7. Conclusions
                                                                      In this paper, we formulate the problem of hot-spot dis-
                                                                      covery in road networks. We proposed an exact and an
                                                                      approximate algorithm for hot spot discovery, which are
                                                                      implemented as data-parallel algorithms on top of Apache
Figure 5: Execution time of the approximate algorithm for dif-        Spark. Our experiments using a real-life data set indicate
ferent cut-off values 𝑐 and temporal duration 𝑇 .                     that the approximate algorithm provides a viable solution
                                                                      that trades performance for accuracy. In the future, we will
                                                                      apply our algorithms to other real-life data sets of larger
                                                                      scale.
6.3. Results for the Approximate Algorithm
Fig. 5 demonstrates that the overall execution time is sig-           Acknowledgements
nificantly reduced for smaller values of 𝑐, since the weight
function is calculated only for the edges of the graph that           This research has received funding from the European
are 𝑐 hops away from an edge both spatially and temporally.           Union’s funded Project EMERALDS under grant agreement
Compared to the exact algorithm, the gain in execution time           no 101093051.
can be of orders of magnitude. For example, for 𝑇 = 24
hrs, the exact algorithm needs 27, 799 sec, whereas the ap-
proximate algorithm with 𝑐 = 6 takes only 659 sec. This
          Figure 6: Hot spots discovered by the approximate algorithm for 𝑐 = 12 (left), and for the exact algorithm, i.e., 𝑐 = ∞ (right).


References                                                                   N. Pelekis, Y. Theodoridis, Hot spot analysis over big
                                                                             trajectory data, in: Proc. of Big Data’18, 2018, pp.
 [1] E. Eftelioglu, X. Tang, S. Shekhar, Geographically                      761–770.
     robust hotspot detection: A summary of results, in:                [12] Y. Qiao, Y. Cheng, J. Yang, J. Liu, N. Kato, A mobility
     Proc. of ICDMW’15, 2015, pp. 1447–1456.                                 analytical framework for big mobile data in densely
 [2] E. Eftelioglu, S. Shekhar, D. Oliver, X. Zhou, M. R.                    populated area, IEEE Trans. on Veh. Techn. 66 (2017)
     Evans, Y. Xie, J. M. Kang, R. Laubscher, C. Farah, Ring-                1443–1455.
     shaped hotspot detection: A summary of results, in:                [13] J. K. Ord, A. Getis, Local spatial autocorrelation statis-
     Proc. of ICDM’14, 2014, pp. 815–820.                                    tics: Distributional issues and an application, Geo-
 [3] X. Tang, E. Eftelioglu, S. Shekhar, Elliptical hotspot                  graphical Analysis 27 (1995) 286–306.
     detection: A summary of results, in: Proc. of BigSpa-              [14] G. Makrai, Efficient method for large-scale spatio-
     tial’15, 2015, p. 15–24.                                                temporal hotspot analysis, in: Proc. of SIGSPATIAL’16,
 [4] Y. Xie, S. Shekhar, Significant DBSCAN towards sta-                     2016.
     tistically robust clustering, in: Proc. of SSTD’19, 2019,          [15] C. Y. Goh, J. Dauwels, N. Mitrovic, M. T. Asif, A. Oran,
     p. 31–40.                                                               P. Jaillet, Online map-matching based on hidden
 [5] P. Nikitopoulos, A.-I. Paraskevopoulos, C. Doulkeridis,                 markov model for real-time traffic sensing applica-
     N. Pelekis, Y. Theodoridis, BigCAB: Distributed hot                     tions, in: Proc. of ITSC’12, 2012, pp. 776–781.
     spot analysis over big spatio-temporal data using                  [16] Y. Lou, C. Zhang, Y. Zheng, X. Xie, W. Wang, Y. Huang,
     Apache Spark, in: Proc. of SIGSPATIAL’16, 2016.                         Map-matching for low-sampling-rate GPS trajectories,
 [6] X. Tang, E. Eftelioglu, D. Oliver, S. Shekhar, Significant              in: Proc. of SIGSPATIAL’09, 2009, pp. 352–361.
     linear hotspot discovery, IEEE Trans. Big Data 3 (2017)            [17] H. Xiong, A. Vahedian, X. Zhou, Y. Li, J. Luo, Predicting
     140–153.                                                                traffic congestion propagation patterns: A propagation
 [7] X. Tang, E. Eftelioglu, S. Shekhar, Detecting isodis-                   graph approach, in: Proc. of IWCTS@SIGSPATIAL’18,
     tance hotspots on spatial networks: A summary of                        2018, pp. 60–69.
     results, in: Proc. of SSTD’17, 2017, pp. 281–299.                  [18] B. Wu, R. Li, B. Huang, A geographically and tempo-
 [8] M. Häsner, C. Junghans, C. Sengstock, M. Gertz, On-                     rally weighted autoregressive model with application
     line hot spot prediction in road networks, in: Proc. of                 to housing prices, Int. J. Geogr. Inf. Sci. 28 (2014) 1186–
     BTW’11, 2011, pp. 187–206.                                              1204.
 [9] X. Li, J. Han, J. Lee, H. Gonzalez, Traffic density-based          [19] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma,
     discovery of hot routes in road networks, in: Proc. of                  M. McCauly, M. J. Franklin, S. Shenker, I. Stoica, Re-
     SSTD’07, 2007, pp. 441–459.                                             silient distributed datasets: A fault-tolerant abstrac-
[10] D. Sacharidis, K. Patroumpas, M. Terrovitis, V. Kantere,                tion for in-memory cluster computing, in: Proc. of
     M. Potamias, K. Mouratidis, T. Sellis, On-line discovery                NSDI, USENIX Association, 2012, pp. 15–28.
     of hot motion paths, in: Proc. of EDBT’08, 2008.
[11] P. Nikitopoulos, A.-I. Paraskevopoulos, C. Doulkeridis,