Hot Spot Analysis for Big Trajectory Data in Road Networks Panagiota Keziou, Christos Doulkeridis Department of Digital Systems, University of Piraeus, Greece Abstract Hot spot analysis is the problem of identifying statistically significant spatial clusters and is typically applied on point data. The aim of this paper is to discover hot spots in an urban environment, namely road segments with statistically significant amount of traffic congestion, using massive GPS data of moving objects. To this end, we adjust the Getis-Ord index for hot spot discovery to become applicable for road networks. Then, we propose two data-parallel algorithms for hot spot discovery in road networks implemented in Apache Spark; an exact algorithm that has scalability limitations for very large data sets, and a scalable approximate algorithm that balances between performance and accuracy. We provide experiments over a large real-life data set that indicate the salient features of our approach. Keywords Hot spot analysis, road networks, big data, urban mobility 1. Introduction millions of GPS records per temporal interval. Therefore, we need to design scalable algorithms for hot spot analysis by In recent years, the analysis of mobility data has drawn exploiting big data technologies. To this end, we introduce the attention of both the academic society and the industry. two data-parallel algorithms for hot spot discovery in road Tracking of moving objects on road networks is ubiquitous networks, with a goal to be applicable for massive trajectory in fleet management applications and can support diverse data sets. The first algorithm returns the exact result, but scenarios of urban planning, more efficient transportation, faces performance limitations for really large data sets due reduction of greenhouse gas emissions, and improvement to the underlying exchange of information between pairs of life quality. Due to the technological evolution of GPS of edges. Then, we present an approximate algorithm with sensors and their integration in mobile devices and cars, it a cut-off threshold for excluding edges that are located too is possible to collect vast amounts of spatio-temporal data. far away spatially or temporally, as their contribution to Thus, our work is motivated by the need for identifying hot spot discovery is minimal. It turns out that our approxi- traffic congestion from GPS data on the road network. mate algorithm provides an interesting trade-off between In this paper, we present an approach for identifying scalability and accuracy, returning hot spots that are very traffic hot spots, using big trajectory data collected by GPS similar to ones discovered by the exact algorithm but orders sensors. Through hot spot analysis, we aim to identify road of magnitude faster. segments with statistically significant high levels of traffic In summary, we make the following contributions: congestion. This means that we do not just aim to identify road segments that have high congestion values; such road • We formulate the problem of hot spot analysis in segments may be of interest, but they are not necessarily road networks, by the means of Getis-Ord Statistic, statistically significant. Instead, a hot spot is statistically appropriately tailored for graphs. significant if it has high congestion value but additionally is • We design and implement an exact algorithm in surrounded by other segments with high congestion values. Apache Spark that computes hot spots in parallel. For this purpose, we employ the Getis-Ord Statistic index, • We present a scalable approximate algorithm that a popular metric for hot spot analysis, which – to the best of discovers hot spots of high accuracy but significantly our knowledge – has so far been used for spatial and spatio- faster. temporal data of objects with free movement. Instead, we • We evaluate the performance and scalability of our adapt the Getis-Ord index to become applicable for graphs algorithms for different parameters, using a real-life that represent the road network, such as the ones obtain- data set from the Metropolitan Area of Athens. able via OpenStreetMap. Moreover, to handle the temporal dimension, we consider different temporal snapshots of the The rest of this paper is structured as follows: Section 2 road network whose duration is user-defined according to reviews related research efforts. Section 3 formulates the the application requirements. Effectively, we have multiple problem under discussion. In Section 4, we present our snapshots of the same graph, each corresponding to the approach for hot spot analysis over road networks, while in same road network but for a different time interval and thus Section 5 we present the parallel algorithms for detecting each road segment has different traffic congestion values. hot spots. Then, in Section 6 describes the results of our In this way, we formulate the problem of hot spot discovery evaluation, while in Section 7, we conclude the paper. in road networks using a model that consists of a sequence of temporal snapshots of the same graph (road network). Computing the hot spots using this model is a processing- 2. Related Work intensive operation, which relies on diffusion of information Existing works in hot spot analysis over mobility data can be over (parts of) the graphs. Moreover, the necessary data broadly classified in two categories: (a) point-based hot spot to infer traffic congestion on roads may be in the order of analysis, where mobility data consist of individual spatial or spatio-temporal points, and (b) trajectory-based hot spot Published in the Proceedings of the Workshops of the EDBT/ICDT 2025 analysis, where sequences of spatio-temporal points that Joint Conference (March 25-28, 2025), Barcelona, Spain belong to the same user/vehicle/vessel (i.e., trajectories) are $ keziou@hotmail.com (P. Keziou); cdoulk@unipi.gr (C. Doulkeridis) © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribu- considered. As our work belongs to the latter category, we tion 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 1 briefly review point-based approaches and put our main propose the density-based algorithm FlowScan to discover focus on trajectory-based approaches. patterns in the flow of traffic, referred to as “hot routes”. By defining the minimum common traffic that successive road 2.1. Point-based Hot Spot Analysis segments should share and the the number of edges that form a neighborhood, clusters of the moving objects are Eftelioglu et al. [1] propose the CGC (Cubic Grid Circle) al- detected. The algorithm is capable of recognizing complex gorithm for the problem of detecting geographically robust spatial patterns. The OPS methodology proposed by Häsner hot spots, modeled as circles in a 2D space. The proposed et al. [8] predicts near future traffic hot spots, based on algorithm is an improvement of SaTScan algorithm, as it recent positions of vehicles on the road network. A heuristic detects non-contiguous or sparse center hot spots and elim- method is employed to predict the future position of the inates very small hot spots. In their studies, [2] and [3] vehicles, and a weight is assigned to the nodes of the road expand the methodology for ring-shaped and elliptical hot network that reflects the likely traffic intensity within a spots detection, respectively. In [4], a density-based clus- specified time window. Finally, through an outlier detection tering algorithm (sig-DBSCAN) improves the quality of re- approach, hot spots nodes are detected and the DBSCAN sults by discarding random patterns, while the use of Dual- algorithm derives the sub-graphs that represent traffic hot Convergence algorithm significantly reduces computational spot regions. cost. For the above studies, the statistical significance of Sacharidis et al. [10] propose a framework for on-line the proposed hot spots is assessed via Monte Carlo simula- discovery of hot motion paths, in order to detect frequently tion. In contrast, Nikitopoulos et al. [5] propose a parallel travelled paths. Their study differentiates in the fact that and scalable solution for hot spot analysis over big spatio- the hotness of a road segment is determined as a function temporal data. The spatio-temporal domain is divided into of the amount of time the moving object has spent on the 3D cells and the Getis-Ord 𝐺*𝑖 statistic is used to detect the path. Similarly, the studies in [11] and [12] detect hot spots top-𝑘 statistically significant cells: without taking into account the road network restrictions. ∑︀𝑛 ∑︀𝑛 In [11], the authors address the problem of hot spot analy- * 𝑗=1 𝑤𝑖,𝑗 𝑥𝑗 − 𝑋 𝑗=1 𝑤𝑖,𝑗 sis over big trajectory data in a parallel and scalable way. 𝐺𝑖 = √︂ ∑︀ (1) [𝑛 𝑛 2 𝑗=1 𝑤𝑖,𝑗 −( ∑︀ 𝑛 2 𝑗=1 𝑤𝑖,𝑗 ) ] Their approach is based on spatio-temporal partitioning 𝑆 𝑛−1 of the 3D space in cells. The Getis-Ord index is employed and appropriately tailored to reveal hot spots in terms of where 𝑥𝑗 is the attribute value for cell 𝑗, 𝑤𝑖,𝑗 is the spatial spatio-temporal cells. Two algorithms are proposed: the weight between cell 𝑖 and 𝑗, 𝑛 is equal to the total number THS algorithm defines the z-score of the cell under study of cells, and 𝑋 and 𝑆 represent the mean and standard depending to the cell values of the whole grid, which is com- deviation respectively: putationally expensive. The aTHS algorithm as the authors ∑︀𝑛 mentions trade-off accuracy for efficiency in a controlled 𝑗=1 𝑥𝑗 𝑋= manner and ignore the contribution of cells located out side 𝑛 a predefined area of influence. Finally, Qiao et al. [12] in- troduce a cloud-based analytical framework for big mobile √︃ ∑︀ 𝑛 2 𝑗=1 𝑥𝑗 data from the view of user mobility in densely populated 𝑆= − (𝑋)2 𝑛 areas, collected from 2G/3G/4G networks. In our work, we adapt the Getis-Ord statistic to be appli- Our work differs from these approaches as we consider cable on graphs representing temporal snapshots of road road segments as candidate hot spots and adapt a well- networks. The attribute 𝑥𝑗 can be any indicator of traffic known hot spot index to be applicable on temporal graphs congestion, such as the average speed or the number of that represent the road network for different temporal in- vehicles per time unit. tervals. In this way, we capture the effect of spatial and tem- The proposed approaches of [6] and [7] take into account poral proximity of neighboring congested road segments. the underlying road network. Tang et al. [6] present a dy- This is a differentiating factor compared to related works namic segmentation model of the road network, which al- that operate over trajectories, as they produce hot spots at lows the identification of statistically significant linear hot different level of granularity (e.g., routes, paths, grid cells) spots with high concentration of activities, such as pedes- and using other metrics. To the best of our knowledge ours trian accidents on the road network, which otherwise would is the first work that shows how to apply hot spot analysis not have been recognized. According to the proposed ap- via Getis-Ord in road networks. proach, the shortest paths between activities are identified and a density index is calculated for each of them. The algorithm returns all the paths that are statistically signifi- 3. Problem Formulation cant (based on the Monte Carlo method) and their density value exceeds a threshold. In [7], the Network Isodistance 3.1. Preliminaries Hotspot Detection (NIHD) problem is studied, which reveals We consider a directed graph 𝐺(𝑉, 𝐸) that represents the sub-graph hot spots with high concentration of activities, road network, where an edge 𝑒𝑖 ∈ 𝐸 corresponds to a road diffused isotropically along the network. The deducted com- segment and is defined by two vertices 𝑣, 𝑣 ′ ∈ 𝑉 , which putational cost of the algorithm is achieved through network represent crossroads. The edge 𝑒𝑖 is directed, indicating the partitioning and upper-bound pruning. flow of movement on the road network from 𝑣 to 𝑣 ′ . We use the terms edge and road segment interchangeably. 2.2. Trajectory-based Hot Spot Analysis To model the effect of time on traffic congestion, we consider a sequence 𝐺𝑇 = {𝐺1 , 𝐺2 , . . . , 𝐺𝑚 } of tempo- In their studies, Häsner et al. [8] and Li et al. [9] take into ral snapshots of the graph 𝐺, where each snapshot corre- account the limitations of the road network. Li et al. [9] Figure 1: Left: Example of an edge 𝑒3𝑝 of temporal snapshot 𝐺𝑝 , which is affected only by subsequent edges 𝑒4𝑝 , 𝑒5𝑝 and 𝑒6𝑝 (in blue). Right: Two temporal snapshots of 𝐺𝑝 and 𝐺𝑞 (𝑞 = 𝑝 + 1), where 𝑒3𝑝 in 𝐺𝑝 is affected also by the edges 𝑒3𝑞 , . . . , 𝑒6𝑞 of the next snapshot 𝐺𝑞 . sponds to a pre-defined temporal duration 𝑇 . A snapshot The weight 𝑤(𝑖𝑝),(𝑗𝑞) between two edges 𝑒𝑖𝑝 (edge 𝑒𝑖 at 𝐺𝑝 corresponds to an interval [𝑡𝑝 , 𝑡𝑝+1 ) of duration 𝑇 , i.e., snapshot 𝐺𝑝 ) and 𝑒𝑗𝑞 (edge 𝑒𝑗 at snapshot 𝐺𝑞 ) indicates 𝑡𝑝+1 − 𝑡𝑝 = 𝑇 . For example, if 𝑇 is set to 5 minutes, one the influence of edge 𝑒𝑗𝑞 to edge 𝑒𝑖𝑝 . Intuitively, if a road such interval could be 9:00 am – 9:05 am. A temporal snap- segment is congested, this also affects other road segments shot 𝐺𝑝 of the graph 𝐺 is used to represent the traffic for in its vicinity. More precisely, in this paper, we assume that the specific time interval [𝑡𝑝 , 𝑡𝑝+1 ). Fig. 1 illustrates the if a road segment 𝑒 is congested, this affects (or will affect concept of temporal snapshots of a graph that represents in the near future) the previous road segments, i.e., those the road network. leading to 𝑒. For example, in Fig. 1 (left), the edge 𝑒3𝑝 (in red color) is affected by the weights of edges 𝑒4𝑝 , 𝑒5𝑝 and 3.2. Problem Formulation 𝑒6𝑝 (in blue color), whereas all other weights are considered equal to zero 𝑤(3𝑝)(1𝑝) = 𝑤(3𝑝)(2𝑝) = 0 . The problem of hot spot analysis addressed in this work is to Moreover, when considering different temporal snap- identify statistically significant road segments that indicate shots of the graph 𝐺, the congestion of edges in snapshot hot spots, due to traffic congestion. An edge 𝑒𝑖 at snapshot 𝐺𝑖+1 affect edges of previous and next temporal snapshots 𝐺𝑝 , denoted 𝑒𝑖𝑝 ∈ 𝐸(𝐺𝑝 ), where 𝐸(𝐺𝑝 ) represents the {. . . , 𝐺𝑖+2 , 𝐺𝑖 , 𝐺𝑖−1 , . . . }. In Fig. 1 (right), edge 𝑒3𝑝 at edges of the snapshot 𝐺𝑝 , is associated with an attribute snapshot 𝐺𝑝 is affected by edges 𝑒3𝑞 , . . . , 𝑒6𝑞 at 𝐺𝑞 . value 𝑥𝑖𝑝 that indicates traffic congestion on 𝑒𝑖 during the According to the above, the problem of hot spot analysis temporal interval of 𝐺𝑝 . As such, whether an edge 𝑒𝑖𝑝 is in road networks is to identify the 𝑘 most statistically sig- considered hot spot is a function of the edge’s attribute value nificant road segments (edges) according to the Getis-Ord 𝑥𝑖𝑝 , but also of other neighboring edges’ attribute values. By statistic and can be formally stated as follows. neighboring edges, we mean both spatially (nearby edges on the graph 𝐺) and temporally (edges at previous/next Problem 1. (Hot spot analysis in road networks) Given a se- temporal snapshots). quence 𝐺𝑇 of temporal snapshots of a directed graph 𝐺(𝑉, 𝐸) To express this dependence on neighboring edges, we representing the road network, find the top-𝑘 edges 𝐸𝑘 of 𝐺𝑇 * * * adjust a commonly used function, namely the Getis-Ord ⋃︀ statistic 𝐺𝑖𝑝 , such that: 𝐺𝑖𝑝 ≥ 𝐺𝑗𝑞 , based on the Getis-Ord statistic [13]. To the best of our knowledge, Getis-Ord has ∀𝑒𝑖𝑝 ∈ 𝐸𝑘 , 𝑒𝑗𝑞 ∈ ( 𝑚 𝑟=1 𝐸(𝐺 𝑟 )) − 𝐸𝑘 . been used so far to discover hot spots in cell-based parti- tions for spatial (2D) and spatio-temporal (3D) domain, for Intuitively, for each road segment we compute a value point [14, 5] and trajectory data [11]. Our work extends the that indicates if it is a hot spot. This value depends on applicability of Getis-Ord for hot spot discovery in graphs the traffic congestion of the road segment itself, but also representing the road network. Thus, for any edge 𝑒𝑖𝑝 of a on the traffic congestion of subsequent road segments (as snapshot 𝐺𝑝 of 𝐺 we adapt Eq. 1 and define the Getis-Ord congestion is propagated backwards). Moreover, it depends value of the 𝑖-th edge of the 𝑝-th temporal snapshot: on the traffic congestion of these road segments in the next temporal snapshots of the network, as illustrated in Fig. 1. ∑︀|𝐸| ∑︀𝑚 ∑︀|𝐸| ∑︀𝑚 The novelty of our work lies in the identification of hot 𝑗=1 𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) 𝑥𝑗𝑞 − 𝑋 𝑗=1 𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) spots in road networks using a statistically significant met- 𝐺*𝑖𝑝 = √︂ ∑︀|𝐸| ∑︀𝑚 ∑︀|𝐸| ∑︀𝑚 ric. In contrast, existing works (Section 2) either use ad-hoc 2 2 [𝑛 𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) − 𝑗=1 𝑞=1 𝑤(𝑖𝑝),(𝑗𝑞) ] 𝑆 𝑗=1 𝑛−1 ways to compute hot spots or compute statistically signifi- (2) cant hot spots as 2D/3D cells of predefined size. Adapting where 𝑛 = |𝐸| · 𝑚 is equal to the total number of edges, the Getis-Ord statistic to become applicable for graphs is 𝑥𝑗𝑞 is the attribute value for edge 𝑒𝑗 at temporal snapshot the main technical contribution of our work. The challenge 𝐺𝑞 , 𝑤(𝑖𝑝),(𝑗𝑞) is the weight imposed by edge 𝑒𝑗𝑞 to edge relates to the definition of neighboring edges over multiple 𝑒𝑖𝑝 , and: temporal snapshots of the same graph. ∑︀|𝐸| ∑︀𝑚 𝑗=1 𝑞=1 𝑥𝑗𝑞 𝑋= (3) √︃ 𝑛 4. Hot Spot Analysis ∑︀|𝐸| ∑︀𝑚 2 𝑗=1 𝑞=1 𝑥𝑗𝑞 This section presents our approach for hot spot analysis 𝑆= − (𝑋)2 (4) 𝑛 in road networks. The input is a data set of GPS records, each corresponding to a specific vehicle (based on identifier vehID), a timestamp 𝑡, as well as the longitude 𝑥 and latitude 𝑦 values. Also, a graph 𝐺(𝑉, 𝐸) is available that represents the road network, which consists of road segments, their geometry, as well as additional information, such as direc- tion, speed limit, etc. Such a graph can be typically obtained from external sources, such as OpenStreetMap. Figure 2: Example showing how to compute the entry time for an edge given two positions on successive edges. 4.1. Map-matching Data collected from GPS sensors typically contain small errors, thereby making the mapping of GPS coordinates to congestion in comparison with another edge 𝑒′𝑗𝑞 that is the underlying road network hard. This problem is typically further away. As in [18], we use a Gaussian kernel as weight solved by applying a map-matching algorithm, which is re- function, defined as follows: sponsible for mapping each pair of GPS coordinates (longi- tude, latitude) to another point located on the road network. 𝑑2(𝑖𝑝),(𝑗𝑞) 𝑤(𝑖𝑝),(𝑗𝑞) = 𝑒𝑥𝑝(− ) Several sophisticated algorithms for map-matching have ℎ2 been developed, such as [15, 16]. where ℎ is the kernel’s bandwidth and 𝑑(𝑖𝑝),(𝑗𝑞) denotes the As the problem of map-matching is orthogonal to the one distance between edges 𝑒𝑖𝑝 and 𝑒𝑗𝑞 . The distance function we are trying to solve, we adopt a simplistic approach where takes into account both spatial as well as temporal proximity. each pair of GPS coordinates is assigned to the nearest edge. As such, we define the distance as a linear combination of In this way, each GPS record is enriched with an edge (road spatial and temporal distance: segment) of the graph, the coordinates of the two points that define the road segment, its length, as well as with other available information such as the speed limit. 𝑑2(𝑖𝑝),(𝑗𝑞) = 𝜇𝑆 (𝑑𝑆 2 𝑇 𝑇 (𝑖𝑝),(𝑗𝑞) ) + 𝜇 (𝑑(𝑖𝑝),(𝑗𝑞) ) 2 4.2. Estimating an Edge’s Traffic Congestion where 𝜇𝑆 , 𝜇𝑇 are scale factors for spatial and temporal dis- tance respectively, whereas 𝑑𝑆(𝑖𝑝),(𝑗𝑞) is the spatial distance In order to compute the Getis-Ord value of an edge 𝑒𝑖𝑝 , the and 𝑑𝑇(𝑖𝑝),(𝑗𝑞) the temporal distance. The spatial distance first step is to compute the edge’s attribute value 𝑥𝑖𝑝 , an indicator of traffic congestion. This is computed as the aver- (𝑖𝑝),(𝑗𝑞) is measured as the number of edges between 𝑒𝑖 𝑑𝑆 age value of contributions from each vehicle that reported and 𝑒𝑗 on graph 𝐺. The temporal distance is the difference its position on 𝑒𝑖𝑝 during the timespan of 𝐺𝑝 . In the same between temporal snapshots, e.g., edges at snapshots 𝐺𝑝 spirit as in [17], we define this contribution as: and 𝐺𝑝+1 have temporal distance equal to 1. 𝑣𝑜𝑏𝑠 1− 5. Parallel Algorithms 𝑣𝑓 𝑓 𝑠 where 𝑣𝑜𝑏𝑠 is the observed speed of a vehicle on edge 𝑒𝑖𝑝 , In this section, we present two data-parallel algorithms whereas 𝑣𝑓 𝑓 𝑠 is the free flow speed for edge 𝑒𝑖 . One way designed and implemented in Apache Spark that compute to compute 𝑣𝑜𝑏𝑠 is to exploit the enter (𝑡𝑖𝑛 ) and exit (𝑡𝑜𝑢𝑡 ) hot spots over very-large collections of GPS records in a time of the vehicle on edge 𝑒𝑖𝑝 . Thus, we have: scalable way. The first algorithm is an exact algorithm, whereas the second computes approximate results but with 𝑙 𝑣𝑜𝑏𝑠 = high accuracy and much faster. 𝑡𝑜𝑢𝑡 − 𝑡𝑖𝑛 In Apache Spark, the main data structure is the where 𝑙 is the length of the edge, and 𝑡𝑖𝑛 (𝑡𝑜𝑢𝑡 ) is the en- DataFrame, and its predecessor the RDD (Resilient Dis- try (exit) time. Unfortunately, for a vehicle that provides tributed Dataset) [19]. RDDs are immutable, distributed its position on edge 𝑒𝑖𝑝 (after map-matching), we do not collections of records, which support transformations and know the exact entry and exit time. Therefore, we estimate actions. Transformations produce new RDDs from a given these values, in particular given two successive positions RDD, with notable examples: map, flatMap, mapValues, fil- we compute the entry time 𝑡 as follows: ter, etc. Actions perform some computation on an RDD and produce a new result (e.g., count). Our algorithms are 𝑡2 − 𝑡1 𝑡 = 𝑡 1 + 𝑑1 · implemented in Spark to support parallel processing. 𝑑1 + 𝑑2 where 𝑡1 and 𝑡2 correspond to the two timestamps, while 5.1. Pre-processing 𝑑1 and 𝑑2 are the distances of the two points from the entry node, as shown in Fig. 2. In this way, we can estimate the In the pre-processing phase, we perform map-matching (see entry/exit times for each edge. Section 4.1) and we split the data of each vehicle in a set of trajectories. For each vehicle, we create a new trajectory when we observe a temporal gap of 𝛿𝑡 in the timestamps 4.3. Defining the Weights of reported positions. The value of 𝛿𝑡 is a parameter that Given two edges 𝑒𝑖𝑝 and 𝑒𝑗𝑞 , where 𝑒𝑖 precedes 𝑒𝑗 on graph depends on the sampling rate of the data set at hand. 𝐺, the weight 𝑤(𝑖𝑝),(𝑗𝑞) indicates the influence of traffic As a result, the output of the pre-processing phase is a congestion of edge 𝑒𝑗𝑞 to edge 𝑒𝑖𝑝 . The weight is a function set of records in the following format: (vehID, trajID, 𝑥, 𝑦, of the distance between the edges, as an edge 𝑒𝑗𝑞 that is 𝑡, 𝑒𝑖 , length, speed, 𝑣, 𝑣 ′ ), where each GPS record (vehID, in close proximity with 𝑒𝑖𝑝 , has a stronger effect on 𝑒𝑖𝑝 ’s 𝑥, 𝑦, 𝑡) of the original data set is enriched with information Algorithm 1 Computation of traffic congestion on edges Algorithm 2 Computation of Getis-Ord index 1: Input: The result of pre-processing: inputfile, num- 1: Input: The attrRDD: RDD[(𝑒𝑖 ,𝑡𝑝 ),𝑥𝑖𝑝 ], number of ber of Apache Spark partitions 𝑃 , temporal parameters: Apache Spark partitions 𝑃 , temporal parameters: 𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚, bandwidth ℎ, output of stage 2: 𝑋, 𝑆, 𝐺 2: Output: RDD[(𝑒𝑖 ,𝑡𝑝 ),𝑥𝑖𝑝 ] 2: Output: RDD[(𝑒𝑖 ,𝑡𝑝 ),𝐺𝑖𝑝 ] * 3: function 3: function 4: dataRDD = sc.load(inputfile) 4: GetisOrdRDD = attrRDD.flatMap(lambda x : GetisOrd- 5: attrRDD = dataRDD.groupByKey(). Calculations(x, 𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚, ℎ, 𝑋, 𝑆, 𝐺))). 6: mapValues(sort_grouped_values). 5: partitionBy(𝑃 , lambda k: hash_partitioner). 7: flatMap(lambda group: calc_attribute_value(group, 6: reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1], 𝑡𝑚𝑖𝑛 , 𝑡𝑚𝑎𝑥 , 𝑇 , 𝑚)). x[2] + y[2])). 8: partitionBy(𝑃 , lambda k: hash_partitioner). 7: mapValues(lambda x: Gi_formula(x, 𝑋, 𝑆)) 9: mapValues(lambda x: (x,1)). 8: end function 10: reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])). 11: mapValues(lambda x: x[0] / x[1]) on the contributions of all vehicles (lines 9–11). It should 12: end function be noted that the function calc_attribute_value() performs the computation of 𝑥𝑖𝑝 , as described in Section 4.2. Then, in stage 2, we compute the aggregate values neces- about the road segment (𝑒𝑖 , length, speed, 𝑣, 𝑣 ′ ) and each sary for the computation of Getid-Ord. These include the record is assigned with a trajectory identifier (trajID). The mean 𝑋, the standard deviation 𝑆, and the total number values 𝑣 and 𝑣 ′ correspond to the start and end vertices of of edges 𝑛 for all temporal snapshots of the graph 𝐺. This edge 𝑒𝑖 , while speed is the speed limit which is typically information together with the graph 𝐺 is broadcast to all provided in the road network data set and we use it as free worker nodes in the cluster, so as to be available during the flow speed 𝑣𝑓 𝑓 𝑠 . computation of Getis-Ord. In stage 3, we compute the Getis-Ord index 𝐺*𝑖𝑝 per edge 5.2. The Exact Algorithm 𝑒𝑖𝑝 and temporal snapshot 𝐺𝑝 . Algorithm 2 provides the pseudocode for the computation of Getis-Ord values. First, The main stages of our algorithm are summarized below: using flatMap (line 4), we transform attrRDD into a new RDD with key the edge 𝑒𝑖𝑝 and value a triple: • Stage 1: A set of transformations and aggregations is applied to the input data set to assign a traffic 2 𝑤(𝑖𝑝),(𝑗𝑞) , 𝑤(𝑖𝑝),(𝑗𝑞) , 𝑤(𝑖𝑝),(𝑗𝑞) 𝑥𝑗𝑞 congestion value 𝑥𝑖𝑝 to each edge 𝑒𝑖𝑝 . • Stage 2: The road network is represented in the that corresponds to another edge 𝑒𝑗𝑞 . This computation is form of a graph and the mean value (𝑋) and stan- performed by function GetisOrdCalculations(), which takes dard deviation (𝑆) of the traffic congestion variable into account neighboring edges 𝑒𝑗𝑞 to the current edge 𝑒𝑖𝑝 are computed. This data is assigned to a broadcast and computes their contribution to 𝑒𝑖𝑝 in terms of weight, variable in order to become available to all the nodes squared weight and weight times attribute value, which are of the cluster. the basic constituent parts for computing Getis-Ord. Notice • Stage 3: The 𝐺*𝑖𝑝 z-score for each edge 𝑒𝑖𝑝 is com- that this function takes as parameters the information about puted. temporal snapshots, the kernel’s bandwidth ℎ, as well the • Stage 4: The results of the previous step are filtered broadcast data. Then, we perform hash partitioning based for a level of confidence and the top-𝑘 edges with on edge in order to accumulate all records that correspond the highest 𝐺*𝑖𝑝 z-score are selected. the same edge for the same timestamp (line 5), in order to compute the aggregate values (line 6): It should be noted that stages 1 and 3 are the time-consuming |𝐸| 𝑚 steps, therefore we delve into their details next. Stage 3 is ∑︁ ∑︁ the dominant cost, while stage 1 practically involves reading 𝑤(𝑖𝑝),(𝑗𝑞) data from disk, but it is still much lower than the cost of 𝑗=1 𝑞=1 stage 3. |𝐸| 𝑚 Algorithm 1 describes how the traffic congestion values ∑︁ ∑︁ 2 𝑤(𝑖𝑝),(𝑗𝑞) are computed (i.e., stage 1). First, we create an RDD of key- 𝑗=1 𝑞=1 value pairs (known as PairRDD) with trajID as key (group- |𝐸| 𝑚 ByKey, line 5) and value the records that correspond to ∑︁ ∑︁ this trajectory in sorted temporal order (mapValues, line 5). 𝑤(𝑖𝑝),(𝑗𝑞) 𝑥𝑗𝑞 𝑗=1 𝑞=1 Then, for subsequent edges of each trajectory we compute the congestion value per vehicle per edge (using function Thereafter, we compute the Getis-Ord value 𝐺*𝑖𝑝 per edge 𝑒𝑖 calc_attribute_value()) and we create a new PairRDD with and per temporal snapshot 𝐺𝑝 using function Gi_formula() a composite key that consists of the edge and the temporal (line 7). snapshot, while the value is the congestion value (flatMap, Finally, in stage 4, the edges are ordered based on the line 7). Subsequently, we partition the records based on edge computed Getis-Ord value and only the top-𝑘 edges are using hash partitioning (line 8). In other words, all records returned. Obviously, 𝑘 is a parameter that is set according that have been mapped to the 𝑖-th edge 𝑒𝑖𝑝 at snapshot 𝐺𝑝 to the application requirements. are assigned to the same partition. This allows the compu- tation of the average congestion value 𝑥𝑖𝑝 per edge based Parameters Values Size of the data set (x106 GPS records) 2.5, 5, 7.4 Duration (𝑇 ) of temporal snapshots (in hrs) 24, 48, 168 Number of partitions (𝑃 ) 6, 8, 18, 60, 120 Cut-off threshold (𝑐) in hops 6, 12, 25, ∞ Table 1 Experimental parameters and values (default values in bold). 6. Experimental Evaluation Figure 3: The data set used in the experimental study. We evaluate the performance of our approach for urban hot spot analysis. We implemented both algorithms in PySpark, using Apache Spark 3.1.1 Core API. 5.3. Discussion From the analysis of the exact algorithm we can derive 6.1. Experimental Setup useful information about its complexity with respect to the Platform. We deployed our code in Google Cloud Platform input parameters. We focus on the number of records that (GCP). The platform supports the process of big data in need to be processed, as they influence the processing cost. Apache Spark. The initiated cluster runs on Debian 10, In Algorithm 1, the size of dataRDD is 𝑂(𝑁 ), where 𝑁 while we used the version Spark 3.1.1 in Hadoop 3.2.2. The denotes the number of GPS records. Then, the algorithm cluster consisted of 4 nodes in total, 1 Master and 3 Workers. builds the attrRDD which contains one record per edge per Each node has 2vCPU 13GB RAM, 500GB hard drive and temporal snapshot. If |𝐸| is used to denote the number of 375GB SSD. edges of graph 𝐺 (the road network) and 𝑚 denotes the Data Set. We employed a real data set of anonymized number of temporal snapshots of 𝐺, then attrRDD contains GPS records collected by a fleet management provider. The 𝑂(|𝐸| · 𝑚) records. As such, its size is linearly dependent data was collected over a a period of five months from July on the size of the road network (in edges) and the number to November 2018, consisting of 368,977 individual vehicle of temporal snapshots. trajectories moving at the Metropolitan Area of Athens In Algorithm 2, the attrRDD is given as input and an- in Greece (Fig. 3). This data set is 1.1GB in total size and other RDD called GetisOrdRDD is built. For a given tem- contains approximately 7.4M (million) GPS records. We poral snapshot, this contains 𝑂(|𝐸|2 ) records because it also create two subsets of size 5M and 2.5M records. After combines each edge with its neighboring edges with non- pre-processing (Section 5.1), each record consists of the zero congestion value (in worst case with all edges). When fields: (vehID, trajID, 𝑥, 𝑦, 𝑡, 𝑒𝑖 , length, speed, 𝑣, 𝑣 ′ ). Even other temporal snapshots are considered too, the number of though the specific data set is sparse and medium-sized, it records increases exponentially. Distributing these records is still a real-world data set that allows the evaluation of our and aggregating per edge is the main factor that affects the approach. processing cost of the exact algorithm. Metrics. Our main evaluation metric was the execution time needed for each individual stage of the algorithm to be 5.4. The Approximate Algorithm completed. The execution time is measured in seconds. The previous algorithm computes the exact Getis-Ord z- Evaluation Methodology. The efficiency of our algo- scores, but it is computationally expensive because every rithm was assessed based on four parameters: (a) the size edge 𝑒𝑗𝑞 with non-zero congestion value (𝑥𝑗𝑞 ) has an influ- of the data set, (b) the temporal window 𝑇 , which deter- ence on the z-score of all other edges. However, edges at mines the number of temporal snapshots, (c) the number high distance (in spatial or temporal terms) are expected to of Apache Spark partitions 𝑃 , and (d) the cut-off threshold have extremely small impact on the computed z-score. 𝑐, used only by the approximate algorithm (for the exact Motivated by this observation, we propose the use of an 𝑐 = ∞), that defines the neighboring edges in the axes of approximate algorithm that takes as input a cut-off threshold time and space that contribute in Getis-Ord z-score, mea- 𝑐, which determines the subset of neighboring edges that sured in hops. After different tests, we also set the kernel’s will be considered when computing the Getis-Ord values of bandwidth ℎ = 8 for the exact algorithm, while ℎ = 23 · 𝑐 a given edge. In spatial terms, 𝑐 corresponds to number of for the approximate algorithm. Our experimental setup is hops between two edges. In temporal terms, 𝑐 corresponds summarized in Table 1. to the number of (previous/next) temporal snapshots that should be considered. In this way, we limit the number 6.2. Results for the Exact Algorithm of records that need to be processed and do not allow this Varying the size of the data set. In Fig. 4 (left), we demon- number to increase exponentially. In turn, this drastically strate the performance of the exact algorithm when increas- reduced number of records that need to be processed makes ing the data set size. We measure the time of stage 3, which the algorithm work much faster, at the expense of slightly is the most time-consuming part of the algorithm. As the inaccurate results. number of observations to be processed increases from 2.5M As shown in our experimental study, the approximate al- to 7.4M, so does the total execution time of the algorithm gorithm offers an interesting trade-off between performance but not linearly. This is due to the fact that the larger data and accuracy. Thus, we can compute hot spots with high set implies more temporal snapshots, hence more edges in accuracy but in a much shorter execution time. Figure 4: Left: execution time of stage 3 of the exact algorithm when increasing the data set size and varying the temporal duration 𝑇 of snapshots. Right: total execution time for the data set of 7.4M records for different number of partitions 𝑃 . total, which all need to be processed for the computation of shows that our approximate algorithm that uses the cut-off the Getis-Ord value of an edge. threshold can make the discovery of hot spots practically Varying the duration 𝑇 of temporal snapshots. The applicable for extremely large data sets. temporal duration 𝑇 of 24, 48 and 168 hours for a period However, the remaining question is how accurate are of five months leads to the generation of 153, 27 and 22 the results of the approximate algorithm? Fig. 6 shows non-empty temporal snapshots respectively. Therefore, the the hot spots discovered by the approximate and the exact fewer the temporal partitions, the smaller the overall exe- algorithm. This demonstrates visually that the results of the cution time. In Fig. 4 (left) the temporal partitioning that experiment for 𝑐 = 12 are similar to those obtained when comes from 𝑇 = 48 hrs time window is twice as efficient applying the exact algorithm. in terms of execution time as the 𝑇 = 24 hrs time window. To complement the visual evaluation, we also performed The 𝑇 = 168 hrs time window is about seven times more a quantitative study. We obtained a ranked list of the top- efficient than 𝑇 = 24 hrs, indicating a linear relationship 100 edges in terms of traffic congestion for 𝑐 = 12 and for between execution time and temporal duration 𝑇 for the the exact algorithm. Then, we computed the Spearman’s data set of 2.5M and 5M records. rank correlation for these two lists, in order to assess if the Varying the number 𝑃 of Spark partitions. Fig. 4 same ordering is observed (i.e., if the ranking for 𝑐 = 12 (right) shows the total execution time of the exact algorithm resembles the ranking obtained by the exact algorithm). If and the time of stage 3 for different numbers of Spark par- the two rankings were identical, we would obtain a value of titions. The first observation is that the execution time of 1. In our case, we obtained a value of 0.82 for the Spearman’s stage 3 practically determines the overall execution time. rank correlation. This indicates that the two lists are highly Regarding the effect of using a higher number of parti- correlated, or (put differently) that the hot spots discovered tions, this seems to reduce execution time. Using more for 𝑐 = 12 are very similar to the ones of discovered by partitions results in smaller subsets of data to be processed, the exact algorithm. In the case of the results using 𝑐 = which require less processing time. However, the reduction 6, there is less sensitivity in the recognition of extreme is not that big (about 15% when using 120 partitions for values, while in the case of 𝑐 = 24 there is hypersensitivity 𝑇 = 24hrs). resulting in more edges being recognized as hot spots. Our experiments indicate that using a cut-off threshold around 𝑐 = 12 provides a nice trade-off between performance and accuracy. 7. Conclusions In this paper, we formulate the problem of hot-spot dis- covery in road networks. We proposed an exact and an approximate algorithm for hot spot discovery, which are implemented as data-parallel algorithms on top of Apache Figure 5: Execution time of the approximate algorithm for dif- Spark. Our experiments using a real-life data set indicate ferent cut-off values 𝑐 and temporal duration 𝑇 . that the approximate algorithm provides a viable solution that trades performance for accuracy. In the future, we will apply our algorithms to other real-life data sets of larger scale. 6.3. Results for the Approximate Algorithm Fig. 5 demonstrates that the overall execution time is sig- Acknowledgements nificantly reduced for smaller values of 𝑐, since the weight function is calculated only for the edges of the graph that This research has received funding from the European are 𝑐 hops away from an edge both spatially and temporally. Union’s funded Project EMERALDS under grant agreement Compared to the exact algorithm, the gain in execution time no 101093051. can be of orders of magnitude. For example, for 𝑇 = 24 hrs, the exact algorithm needs 27, 799 sec, whereas the ap- proximate algorithm with 𝑐 = 6 takes only 659 sec. This Figure 6: Hot spots discovered by the approximate algorithm for 𝑐 = 12 (left), and for the exact algorithm, i.e., 𝑐 = ∞ (right). References N. Pelekis, Y. Theodoridis, Hot spot analysis over big trajectory data, in: Proc. of Big Data’18, 2018, pp. [1] E. Eftelioglu, X. Tang, S. Shekhar, Geographically 761–770. robust hotspot detection: A summary of results, in: [12] Y. Qiao, Y. Cheng, J. Yang, J. Liu, N. Kato, A mobility Proc. of ICDMW’15, 2015, pp. 1447–1456. analytical framework for big mobile data in densely [2] E. Eftelioglu, S. Shekhar, D. Oliver, X. Zhou, M. R. populated area, IEEE Trans. on Veh. Techn. 66 (2017) Evans, Y. Xie, J. M. Kang, R. Laubscher, C. Farah, Ring- 1443–1455. shaped hotspot detection: A summary of results, in: [13] J. K. Ord, A. Getis, Local spatial autocorrelation statis- Proc. of ICDM’14, 2014, pp. 815–820. tics: Distributional issues and an application, Geo- [3] X. Tang, E. Eftelioglu, S. Shekhar, Elliptical hotspot graphical Analysis 27 (1995) 286–306. detection: A summary of results, in: Proc. of BigSpa- [14] G. Makrai, Efficient method for large-scale spatio- tial’15, 2015, p. 15–24. temporal hotspot analysis, in: Proc. of SIGSPATIAL’16, [4] Y. Xie, S. Shekhar, Significant DBSCAN towards sta- 2016. tistically robust clustering, in: Proc. of SSTD’19, 2019, [15] C. Y. Goh, J. Dauwels, N. Mitrovic, M. T. Asif, A. Oran, p. 31–40. P. Jaillet, Online map-matching based on hidden [5] P. Nikitopoulos, A.-I. Paraskevopoulos, C. Doulkeridis, markov model for real-time traffic sensing applica- N. Pelekis, Y. Theodoridis, BigCAB: Distributed hot tions, in: Proc. of ITSC’12, 2012, pp. 776–781. spot analysis over big spatio-temporal data using [16] Y. Lou, C. Zhang, Y. Zheng, X. Xie, W. Wang, Y. Huang, Apache Spark, in: Proc. of SIGSPATIAL’16, 2016. Map-matching for low-sampling-rate GPS trajectories, [6] X. Tang, E. Eftelioglu, D. Oliver, S. Shekhar, Significant in: Proc. of SIGSPATIAL’09, 2009, pp. 352–361. linear hotspot discovery, IEEE Trans. Big Data 3 (2017) [17] H. Xiong, A. Vahedian, X. Zhou, Y. Li, J. Luo, Predicting 140–153. traffic congestion propagation patterns: A propagation [7] X. Tang, E. Eftelioglu, S. Shekhar, Detecting isodis- graph approach, in: Proc. of IWCTS@SIGSPATIAL’18, tance hotspots on spatial networks: A summary of 2018, pp. 60–69. results, in: Proc. of SSTD’17, 2017, pp. 281–299. [18] B. Wu, R. Li, B. Huang, A geographically and tempo- [8] M. Häsner, C. Junghans, C. Sengstock, M. Gertz, On- rally weighted autoregressive model with application line hot spot prediction in road networks, in: Proc. of to housing prices, Int. J. Geogr. Inf. Sci. 28 (2014) 1186– BTW’11, 2011, pp. 187–206. 1204. [9] X. Li, J. Han, J. Lee, H. Gonzalez, Traffic density-based [19] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, discovery of hot routes in road networks, in: Proc. of M. McCauly, M. J. Franklin, S. Shenker, I. Stoica, Re- SSTD’07, 2007, pp. 441–459. silient distributed datasets: A fault-tolerant abstrac- [10] D. Sacharidis, K. Patroumpas, M. Terrovitis, V. Kantere, tion for in-memory cluster computing, in: Proc. of M. Potamias, K. Mouratidis, T. Sellis, On-line discovery NSDI, USENIX Association, 2012, pp. 15–28. of hot motion paths, in: Proc. of EDBT’08, 2008. [11] P. Nikitopoulos, A.-I. Paraskevopoulos, C. Doulkeridis,