<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Path-based trafic flow prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Efstratios Karkanis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Pelekis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Chondrodima</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannis Theodoridis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Piraeus</institution>
          ,
          <addr-line>Piraeus</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Statistics and Insurance Science, University of Piraeus</institution>
          ,
          <addr-line>Piraeus</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Predicting trafic flow on a road network is crucial in various aspects of the transportation domain, encompassing safety and logistics. This paper introduces an innovative approach to forecast trafic flow, with a specific emphasis on predicting future trafic along paths within a road network. The proposed framework is tailored to accept GPS trajectory data as input, generating time series data illustrating trafic flow along designated pathways. It achieves this by utilizing only a subset of trajectories that strictly follow the paths without detours. This intuitive approach results in training data that better captures the essence of the forecasting problem. In the final step of our methodology, we employ state-of-the-art time series forecasting methods, including ensemble trees and recurrent neural networks. To validate the efectiveness of our approach, we evaluate it using a real-world dataset, demonstrating its capability in predicting trafic flow.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Mobility</kwd>
        <kwd>Trajectories</kwd>
        <kwd>Trafic flow</kwd>
        <kwd>Forecasting methods</kwd>
        <kwd>Ensemble trees</kwd>
        <kwd>Neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        network. On the other hand, specific vehicles, such as
buses and taxis, are also expected to contain sensors that
In the contemporary era, urbanisation has given rise to a periodically transmit their mobility status information,
conspicuous surge in the demand for public transporta- including their GPS location, along with the respective
tion within urban areas. This escalating need, fueled by sampling timestamp. The proposed framework aims to
population growth, economic advancement, and evolv- provide predictions for urban trafic flow along
predeing lifestyle preferences, underscores a significant trans- ifned paths by leveraging data collected from sensors
formation in the dynamics of urban mobility [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Con- embedded in specific vehicles, such as buses and taxis.
currently, challenges like trafic congestion, navigation In summary, the complete workflow of the proposed
optimisation, and air quality management have become framework involves several key steps. Starting with the
prevalent1. As a result, path-based trafic flow, which original trajectory data provided by vehicles, we create
in terms of this study refers to the number of vehicles a time series dataset that captures historical trafic flow
passing through a particular path (a set of consecutive values along predefined paths at uniform-sized time
inroad parts between intersections) on a road network, tervals. A key aspect of our methodology involves the
has increased substantially over the recent decades, em- measurement of trafic flow along each predefined path.
phasising the critical necessity for accurate forecasting Therefore, we perform a search procedure on trajectory
methods. data (the so-called Strict Path Query [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) to retrieve all
      </p>
      <p>
        Mobility data analytics and prediction has gained an trajectories that strictly cross a predefined path of a road
increasing interest in recent years, due to its time-critical network within a specicfi time interval. A trajectory
applications in urban, maritime and aviation domains strictly follows a path if and only if this path is a subset
[
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]. In particular, predicting trafic flow is chal- of the trajectory. In other words, the trajectory does not
lenging due to its non-linear nature. This complexity deviate from this predefined path. In this way, we are
arises because trafic flow is influenced by non-linear sure that we compute trafic flow accurately,
eliminatfactors, such as weather conditions, road trafic accidents ing trajectories that perform detours while crossing the
and public holidays, posing dificulty in accurate fore- path. This approach allows us to reliably assess trafic
casting. lfow throughout the entire length of the path, avoiding
      </p>
      <p>Typically, road networks are equipped with sensors potential inaccuracies that might arise from considering
that measure trafic flow at a specific point of the road trajectories with detours, which could lead to misleading
indications of trafic flow.</p>
      <p>Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Figure 1 illustrates a road network where letters a
Joint Conference (March 25-28, 2024), Paestum, Italy to l represent intersections. Four trajectories 1, 2, 3,
$ stratoskarkanis2@gmail.com (E. Karkanis); npelekis@unipi.gr and 4 are depicted, each marked with a distinct colour,
(yNt h.ePoedle@kiusn);ipeiv.gacrh(Yo.n@Thuenoidpoi.rgirdi(sE). Chondrodima); navigating within the network. Additionally, there is a</p>
      <p>Copyright © 2024 for this paper by its authors. Use permitted under Creative Commons License predefined path b –&gt; c –&gt; h –&gt; i –&gt; d, marked with orange
1The EurAottrpibeutaionn 4.0 GIntreerneatnional D(CCeaBYl:4.0). https://eur-lex.europa.eu/legal- colour. To assess trafic flow within this specified path
content/EN/TXT/?uri=COM:2019:640:FIN
during a particular time period, trajectory 3 is retained
as it follows this predefined path. Trajectories 1, 2 and
4 are excluded from consideration since they involve
detours. In this particular example, we argue that trafic
lfow within the orange path is represented better by only
considering 3, although all trajectories touch the path.</p>
      <p>
        Following the generation of time series data, we
construct suitable Machine Learning (ML) models to address
the specific problem at hand. In this research, we evaluate
two such models: an ensemble-based decision tree model
using XGBoost [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and a model based on Long
ShortTerm Memory (LSTM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], an evolution of the Recurrent
Neural Network (RNN). We also analyze the advantages
and disadvantages of using each model for forecasting
trafic flow along predefined paths.
      </p>
      <p>This paper aims to address the problem of efectively
predicting trafic flow on road networks. While several
studies have been conducted regarding this topic, the
novelty of our work stands on that it proposes a
distinct path-based methodology that focuses on solving
this problem. To the best of our knowledge, no similar
research has been published in this field yet.</p>
      <p>In practice, the role of the proposed framework is to
assist urban trafic management systems in guiding and
managing trafic flow, thus controlling trafic volume,
avoiding trafic congestion, and constructing an eficient
road network.</p>
      <p>To summarise, the main contributions of this paper
are listed as follows:
• We propose a novel methodology for predicting
trafic flow on paths within a road network. Our
methodology is generic as it based on timeseries
forecasting, and towards this direction, we build
an ensemble-based and RNN-based model.
• We demonstrate the eficacy of our proposal by a
thorough experimental study using a real-world
dataset.</p>
      <sec id="sec-1-1">
        <title>The rest of this paper is organised as follows: Section</title>
        <p>2 discusses related work. In Section 3, preliminary terms
and definitions are provided. In Section 4, we introduce
our methodology. Section 5 presents the experimental
study, followed by conclusions in Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Several studies approach the problem of predicting trafic
lfow on road networks as follows: given a dataset of
historical trafic flow values, which consists of information
about the number of vehicles moving through specific
sections of the road network between intersections or
multiple areas within the same road network monitored
by trafic sensors, the objective is to predict the future
trafic flow at these locations at a given time in the future.</p>
      <p>These studies mainly focus on ML approaches.</p>
      <p>
        Many researchers have tried algorithms like Support
Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], K-Nearest Neighbors
(KNN) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Bayesian networks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Autoregressive
Integrated Moving Average (ARIMA) [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ], Shallow
Neural Networks (SNNs)[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Deep Neural Networks
(DNNs) [
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref14">13, 11, 10, 14</xref>
        ] and hybrid implementations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
each with its strengths and weaknesses. For example,
SVM and KNN are capable of capturing complex
patterns in multidimensional data [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. SNNs, while useful
in many applications, struggle with non-linear patterns
of trafic flow [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. ARIMA models perform well in
predicting short-term trafic [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], making them useful for
scenarios where precise, immediate forecasts are needed.
      </p>
      <p>Last but not least, hybrid approaches leverage the unique
strengths of many models combined to achieve better
predictions.</p>
      <p>
        Due to the inherent volatility of trafic flow patterns,
DNNs become essential [
        <xref ref-type="bibr" rid="ref11 ref13">11, 13</xref>
        ]. Numerous studies in
the literature have recognized this necessity. In a specific
instance, researchers introduced a trafic flow
prediction framework centred on Fully Connected Neural
Networks (FCNN) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This model utilized historical trafic
lfow data to predict trafic conditions one timestep ahead.
      </p>
      <p>Notably, the authors enhanced model performance and
generalization by incorporating optimization techniques
such as Batch Normalization (BN) and dropout layers.</p>
      <p>
        Another aspect of addressing the trafic flow prediction
problem using DNNs involves the utilisation of Stacked
Autoencoders (SAEs) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The SAEs were used to distil
meaningful information from trafic data by compressing
input data into a lower-dimensional representation and
reconstructing the original data. Comparative evaluation
against alternative ML algorithms, such as SVM, Random
Walk (RW), and Radial Basis Function Neural Network
(RBFNN), underscored the superior performance of the
proposed SAE model, particularly in situations
characterised by high trafic flow volumes.
      </p>
      <p>
        However, the most common deep learning approach
proposed when utilising time-related historical data
involves the use of an LSTM model. For instance, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
this model was employed by researchers to take historical process. The model is compared with other algorithms,
trafic flow data as input and predict short-term trafic pat- such as the SVM and the XGBoost, without the use of WD
terns within a 15-minute time horizon in the future. The technique. Results indicate that the proposed approach
LSTM exhibited superior performance compared to SVM achieves superior performance.
and Feedforward Neural Networks (FNN), efectively ad- In summary, the examined studies showcase notable
dressing trafic flow’s non-linear and time-dependent progress in the domain of trafic flow prediction.
Howcharacteristics. In another study, scientists introduced ever, these researches primarily focus on predicting
futhe DeepCrowd model [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which integrates Convolu- ture trafic flow in specific locations, whether it is a single
tional LSTM Neural Networks with High-Dimensional road segment between intersections or many road
segAttention Mechanisms (HDAM) to capture spatial and ments within intersections, each treated independently.
temporal trafic data correlations. This difers from our objective, which involves
forecast
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] the authors present the Foresight cloud-based ing trafic flow within paths. To the best of our
knowlsystem for real-time spatio-temporal forecasting. The pa- edge, this study represents a novel contribution to the
per discusses practical aspects to implement a real-time ifeld.
system for real-world, while it presents experimental
results of the application of several ML methods to the
problem at hand. In addition, the authors adapt spatio- 3. Preliminaries
temporal ML methods to incorporate dynamic urban
events and vehicle-level flow information into the pre- In this section, we introduce some basic terms required
dictive models. in the rest of the paper and formulate the problem at
      </p>
      <p>
        Despite the widespread use of LSTM, SAE, and FCNN hand.
models for handling time-related data, another study in- In our setting, trajectories are moving on a road
nettroduces the power of Graph Neural Networks (GNNs) work. A road network is defined as a directed graph,
when dealing with trafic data. The proposed methodol- denoted as G=(V,E), with V ={1, 2, ...} representing
ogy, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], leverages historical data, incorporating factors the set of vertices in the graph G and E={1, 2, ...}
influencing trafic flow and speed, including time, space, representing the set of edges in the graph G. In a general
weather conditions and activities such as public holi- context, when considering a graph that represents a road
days. The chosen model for this approach is the Graph network, set V is understood to denote the intersections
Hierarchical Convolutional Recurrent Neural Network within the network, while set E signifies the road
seg(GHCRNN), which is specifically designed for urban ar- ments between two intersections. Within a road network
eas. The GHCRNN model integrates both spatial and represented as a graph G, a path, also a pathway, is
dehierarchical information to improve prediction accuracy. ifned to be a sequence of at least two consecutive edges
It includes convolutional layers for spatial analysis, pool- in G.
ing units for hierarchical structure representation and an In the context of a road network, diverse moving
obencoder-decoder architecture utilising Gated Recurrent jects (MO) navigate through it. An MO is defined as any
Unit (GRU) modules. The experimentation shows the entity that travels along the edges of a road network
repproficiency of GHCRNN in handling extensive graphs resented as a graph G. Examples of MOs include cars,
and delivering precise predictions. buses and taxis.
      </p>
      <p>
        Besides the powerfulness of neural networks, the fol- The route of a MO is defined as a sequence
lowing study proves that ML implementations also ad- &lt;1, 2, ..., &gt;, where each element  is a triple (, ,
dress the trafic flow prediction problem with high accu- ) denoting the timestamp of the MO’s sampling and
racy. The XGBoost algorithm with the Wavelet Decom- the respective GPS coordinates.
position (WD) technique are used to predict short-term Based on the provided definitions, the problem
adtrafic flow [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This method aims to extract additional dressed in this study is formulated as follows: given (i)
information from the feature to be predicted, combining a graph G representing a road network, (ii) a dataset D
low and high-frequency details during the reconstruction containing a set of routes of MOs traveling along the road
network represented by G and (iii) a set P of predefined
paths within G, the goal of trafic flow prediction is to Once the split trajectories are processed through the
predict the occupancy of each path in P, i.e. the num- map-matching algorithm, the information given as
outber of the MOs that will traverse each predefined path put encompasses the edges of the road network that the
within a specific time horizon in the future using D as its trajectory traversed in chronological order.
Additionknowledge base. ally, it includes information about the time interval the
trajectory lied on each edge.
      </p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>In this section, we introduce the proposed
methodology for eficiently resolving the challenge of forecasting
trafic flow along a pathway within a road network. The
workflow of the proposed methodology is depicted in
Figure 2. The methodology starts by processing a collection
of GPS data that contain routes of MOs, and more
specifically by performing a route-splitting (i.e., segmentation)
operation. Route splitting is the process of partitioning
the entire route of a specific MO into distinct trajectories,
guided by specific conditions. Subsequently, the
resulting trajectories undergo map matching onto the road
network. The next step in the methodology is to
measure historical trafic flow volumes along each path that
belongs in the set of predefined paths and at
uniformsized time intervals, resulting in the creation of a time
series dataset for each predefined path. After generating
the time series data, we extract temporal features from
timestamps to enhance the prediction accuracy of the
forecasting models. Finally, prediction procedures of
trafifc flow magnitude are conducted using two ML models,
based on the XGBoost and the LSTM RNN approaches.</p>
      <sec id="sec-3-1">
        <title>4.1. Dataset Preprocessing</title>
        <sec id="sec-3-1-1">
          <title>In the initial phase, it is crucial to acknowledge the poten</title>
          <p>tial existence of routes of MOs that contain elements that
happen to be recorded at sparse timestamps. To address
this problem, a necessary step involves breaking down
MO routes into trajectories. The time gap between any
two consecutive elements within the same trajectory is
constrained to a specific time value, denoted as Dt. A
route undergoes a split when consecutive pairs of
elements within the route exhibit a time diference greater
than the specified value Dt.</p>
          <p>The subsequent phase involves the map-matching
process. The spatial locations of the trajectories may not
precisely align with the road network due to potential
noise during sampling or technical errors. Hence, the
usage of a map-matching algorithm becomes imperative
to ensure accurate alignment with the road network.</p>
          <p>
            For precise map-matching of trajectory coordinates
(lat and lon) onto the road network, the Valhalla Meili
API2, recognized for its highly eficient map-matching
algorithm [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], is utilized.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>2https://valhalla.github.io/valhalla/api/map-matching/api</title>
          <p>reference/</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Generating Trafic Flow Time Series</title>
        <p>
          The subsequent stage in the proposed methodology
involves the generation of time series data using the
trajectory data. Time series data represent the trafic flow
inside each pathway under study at equal-sized time
intervals. The utilisation of the Strict Path Queries (SPQs)
algorithm [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is crucial for this task. This type of query
identifies all trajectories strictly following a specified
path within a defined time interval. The start of this
interval signifies the minimum acceptable timestamp for
the trajectory to have entered the designated path, while
the end of the interval indicates the tolerated timestamp
for the trajectory to have exited it.
        </p>
        <p>The first parameter that the SPQ algorithm requires is
a path in which past trafic flow values will be measured
at uniform-sized time intervals. The second parameter
essential for the SPQ algorithm is a time interval.
Because the SPQ algorithm is employed to generate time
series data, uniform time intervals must be established.
To achieve this, the total sampling duration of
trajectories within dataset D is divided by a positive number,
represented as Q. This division results in the creation
of smaller time sub-intervals, each characterized by a
duration of Q time units.</p>
        <p>For each sub-interval, the SPQ algorithm measures
trafic flow along the predefined path, producing a time
series with trafic flow volume per sub-interval.</p>
        <p>To improve the quality of time series data, we also
extract time-related features from the timestamps. These
features encompass the day (numerical representation,
e.g., 1, 2, 3), the day of the week (e.g., Monday, Tuesday),
the hour (numerical representation from 0 to 23), and
the minutes (numerical representation from 0 to 59). To
account for the cyclical nature of time, cyclic encoding
is applied to these time-related features, capturing their
periodic characteristics using semitones and cosines.</p>
        <p>Incorporating additional time-related information, we
introduce the 3-hour-interval attribute (numerical
representation from 1 to 8) to signify the specific 3-hour
interval of the day to which a particular recording within
the time series belongs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Forecasting Models</title>
        <sec id="sec-3-3-1">
          <title>Resulting time series data serve as the input for the two</title>
          <p>
            ML models, XGBoost [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] and LSTM RNN [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], we evaluate
in this study. These models are specifically trained to
predict trafic flow on each predefined path. This task is
categorized as a supervised ML process.
          </p>
          <p>To structure the data into input and output sets, the
sliding window technique is applied. This technique
involves dividing the time series into subsets using a
moving or sliding window. As the window progresses
step by step through the time series, it captures fixed-size
segments of historical values. The length of this window
determines the amount of past data used to forecast the
subsequent value in the time series and corresponds to
parameter past-observations that actually indicates the
count of historical observations utilized for prediction.</p>
          <p>Figure 3 illustrates the operation of the sliding window
technique. The table at the left displays samples of trafic
lfow on a single path at 4 uniform-sized time intervals
1, 2, 3 and 4. For instance, equal to 4 at time interval
1, 5 at 2, and so on. If the decision is made to input
trafic flow values from the two preceding time steps
into the forecasting model to predict the next value, the
window length is set to 2. Adhering to these criteria, the
time series undergoes a transformation into the format
showcased in the table at the right.
trees, the model takes as input a vector with L features
in order to predict future trafic flow. Each subsequent
tree is designed to improve upon its predecessor. The
ultimate prediction for a regression task is derived by
consolidating the predictions made by each individual
tree within the ensemble. This aggregation process
ensures that the final forecast benefits from the collective
contributions of all the trees in the model.</p>
          <p>On the other hand, Figure 5 depicts the proposed LSTM
model. The neural network is designed to take a vector
with L features as input. This vector is then reshaped in a
3-dimensional tensor. The information traverses through
3 LSTM layers with 50, 25, and 12 units, respectively.
Dropout layers are strategically employed right after the
LSTM layers to deactivate certain neurons and prevent
overfitting during the training process. Subsequently, the
information passes through 3 fully connected layers with
6, 3, and 1 neuron, respectively, before producing the
ifnal output y. ReLu activation functions are utilized in
all LSTM layers, while the fully connected layers employ
a linear activation function.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experimental Study</title>
      <p>Through the implementation of the sliding window
technique, we can generate input and output sets for In this section, we evaluate the proposed methodology
diferent values of the past-observations parameter. These using a real-world trajectory dataset. The source code
sets are utilized for training and evaluating both machine that implements the proposed methodology is available
learning models. Each model receives an input vector at GitHub3.
with L features, including historical trafic flow values, In particular, we used the Cabspoting dataset4, a
poputemporal characteristics with their cyclic encodings, and lar urban mobility dataset consisting of navigation routes
the 3-hour-interval. Furthermore, the models incorporate from 537 taxis in San Francisco, CA, USA, recorded from
the mean and variance of historical trafic flow values May 17, 2008, to June 10, 2008. The routes are captured
as input. The goal of each model is to predict the trafic at an average sampling rate of approximately 30 seconds.
lfow in the next sub-interval with duration Q. The entire dataset encompasses 11,220,491 GPS records.</p>
      <p>To improve eficiency, a correlation matrix is utilized. The research was conducted using a combination of
This matrix helps identify and remove attributes that computational resources. Data preparation and time
seredundantly convey similar information, ensuring only ries generation procedures were performed on a Dell
those with low correlation are included. Features with Inspiron 15 3000 series laptop equipped with an Intel
low correlation are chosen for the training and testing Core i5-4210U processor and 8GB of RAM. For model
phases of XGBoost and LSTM models.</p>
      <p>In Figure 4, the architecture of the selected XGBoost
model is illustrated. Using a set of 100 XGBoost decision</p>
      <sec id="sec-4-1">
        <title>3https://github.com/stratoskar/Path_Based_Trafic_Flow_Prediction</title>
        <p>4https://ieee-dataport.org/open-access/crawdad-epflmobility;
https://stamen.com/work/cabspotting/
training and forecasting processes an NVIDIA Tesla T4 time interval. By repeating this procedure for each path
GPU hosted on Google Colab was utilized. This GPU and every interval, we create a trafic flow time series for
provided the necessary computational power to train the each path.</p>
        <p>ML models eficiently. The application of the SPQ algorithm filters out the</p>
        <p>Following the proposed methodology, the segmenta- trajectories that deviate while moving along a specific
tion of taxi routes into trajectories was performed using path. Figure 6 illustrates the diference in trafic flow data
a time threshold of Dt=90 seconds. The identification of obtained when using the SPQ algorithm on the set of 100
the route of a taxi before splitting is encoded as taxi-id, predefined paths and data collected without the use of
while the subsequent identification of each newly gener- SPQ on the same set of predefined paths. In the absence
ated trajectory is denoted as traj-id. Thus, each trajectory of SPQ restrictions, trafic flow on each route is
calcuin the dataset is uniquely identified by a pair of taxi-id lated by counting the trajectories that have traversed all
and traj-id values. The produced trajectory dataset en- path edges at least once within a defined time interval.
compasses a total of 521,135 unique trajectories. However, when SPQ restrictions are applied, it is evident</p>
        <p>The Valhalla API receives GPS coordinates for each el- that the collected trafic flow data are reduced.
ement in a trajectory as input and subsequently produces
a potential map-matched trajectory that aligns with the
road network of San Francisco. Following this procedure,
the resulting dataset includes details, such as the unique
trajectory identifier (denoted by taxi-id and traj-id) and
the exact sequence of road network edges traversed by
the trajectory. The updated dataset from the Valhalla
Meili map-matching algorithm also provides timestamps
denoted as t_enter, indicating the moments when the
trajectory enters the specified edges. Figure 6: Diference in average trafic flow across all
prede</p>
        <p>The next step involves constructing a time series fined paths grouped by timestamps: dataset 1 (blue) is formed
dataset focusing on the recording of trafic flow by path using SPQ algorithm, whereas dataset 2 (orange) is developed
and time interval. The available map-matched trajec- without adhering to SPQ rules.
tory data spans from May 17th, starting at 10:00:04, to To build robust and highly precise ML models, it is
cruJune 10th, concluding at 09:00:04. This time frame is seg- cial to selectively choose attributes from the time series
mented into smaller intervals of half an hour each, i.e. Q data with low correlation. To accomplish this, a
correla= 30 min. tion matrix is utilized. The correlation matrix presented</p>
        <p>To quantify trafic flow along pathways, we choose to in Figure 7 indicates that the temporal characteristics
create a set of 100 unique and distinct paths. The length day, dayofweek, hour and minute share common
inforof these paths, determined by the number of edges, spans mation with their corresponding cyclic encoded features
from 2 to 20 edges. Importantly, these paths are not day_of_week_sin, day_of_week_cos, hour_sin, hour_cos,
randomly generated; rather, they are derived from the minute_sin, minute_cos, day_sin and day_cos. As a result,
trajectory data. This ensures that each path created is a temporal attributes day, dayofweek, hour, and minute are
subset of one or more trajectories. eliminated from the time series data. The rest of the
at</p>
        <p>The construction of a time series utilizes the SPQ al- tributes present in the heatmap representation are taken
gorithm. For each call of the SPQ algorithm, the input into account in the training and evaluation procedures
consists of the specific path for which trafic flow needs of the ML models.
to be counted, along with the corresponding half-hour Time series data is divided into training and testing
used to evaluate the performance of XGBoost in terms
of RMSE and MAE regression evaluation metrics.</p>
        <p>Concerning the LSTM model, the time series data
within the training set undergo a reshaping process,
resulting in a 3-dimensional tensor. The output y of the
LSTM model represents the predicted trafic flow value
in the time series. The LSTM model is then evaluated on
the test set using the same evaluation metrics, RMSE and
MAE.</p>
        <p>To determine the optimal value for the
pastobservations parameter, we employ varying sliding
window lengths. Both models are assessed on the same test
set for diferent past-observations values, specifically 2,
3, 4, 5, and 6. The outcomes are presented in Table 1
using the data created using the SPQ algorithm. Table
2 indicates the result of the same experiment using the
data collected when the usage of SPQ is absent. The
comparison of the two tables makes clear that by applying
the proposed methodology we succeed lower RMSE and
MAE scores.
sets, with the training set encompassing the earliest data
in the time series, spanning from May 17th to June 2nd
inclusive. The remaining data forms the testing set.</p>
        <p>The subsequent phase involves fine-tuning
XGBoost’s hyperparameters utilizing the Grid Search Cross- Forecasts are produced using the known values from
Validation method5. The hyperparameters targeted for the test set. Based on the conclusions drawn from Table
optimization encompass regularization parameters, such 1, the XGBoost model is configured with a setting of 5 (6,
as gamma, alpha, lambda, the learning rate influencing respectively) for the past-observations parameter. Figures
the backpropagation algorithm, as well as the maximum 8 and 9 showcase the predictions made by both models
depth of each decision tree. Employing the best combi- on the test set. The blue colour represents the observed
nation of hyperparameter values, we train the XGBoost sum of trafic flow values in the test set, aggregated based
model utilizing an ensemble of 100 trees. The test set is on timestamps. Conversely, the orange and red colours
represent the predicted sum of trafic flow values across
5https://scikit-learn.org/stable/modules/grid_search.html all 100 predefined paths in the test set, organized by
timestamp and generated using the trained XGBoost and
LSTM models, respectively. It appears that both models
efectively capture the trend and seasonality present in
the time series data of the test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <sec id="sec-5-1">
        <title>In this study, we presented a generic methodology for</title>
        <p>forecasting trafic flow volumes on pathways, employing
ML algorithms, with a particular emphasis on the
XGBoost and the LSTM models. When tested on a real-world
dataset containing taxi routes in the San Francisco area,
both models demonstrated good performance, which is
evidently better than the case when all data is used.</p>
        <p>A limitation that constrained the scope of our study is
the reliance only on taxi trafic data resulted in an
incomplete representation of the city’s overall trafic
dynamics, overlooking the movements of other transportation
modes, such as buses and private cars. Furthermore, a
more holistic comprehension of trafic flow prediction
dynamics could be achieved by integrating additional
data, such as vehicle accidents and weather conditions,
which are known to impact trafic flow patterns.</p>
        <p>Lastly, for future research endeavours, it is crucial to
simulate scenarios using diverse modelling approaches,
including Graph Neural Networks (GNNs), to assess their
efectiveness in capturing complex relationships within
the road network. This comprehensive approach aims to
contribute to a more nuanced and adaptable predictive
framework for future trafic flow predictions.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>This work was supported by the Horizon Europe research and innovation programmes under GA 101070416 (Green.Dat.AI) &amp; 101093051 (EMERALDS) funded by EU.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Olmos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mateo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hernando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <article-title>Urban dynamics through the lens of human mobility</article-title>
          ,
          <source>Nature Computational Science</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          <fpage>611</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelekis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Theodoridis</surname>
          </string-name>
          ,
          <source>Mobility Data Management and Exploration</source>
          , Springer New York,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vouros</surname>
          </string-name>
          , G. Andrienko,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelekis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Artikis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Jousselme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cordero</surname>
          </string-name>
          ,
          <article-title>Big Data Analytics for Time-Critical Mobility Forecasting</article-title>
          , Springer Nature,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelekis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Theodoridis</surname>
          </string-name>
          ,
          <article-title>A survey on big data processing frameworks for mobility analytics</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>50</volume>
          (
          <year>2021</year>
          )
          <fpage>18</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Chondrodima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelekis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pikrakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Theodoridis</surname>
          </string-name>
          ,
          <article-title>An eficient lstm neural network-based framework for vessel location forecasting</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>4872</fpage>
          -
          <lpage>4888</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Krogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pelekis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Theodoridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Torp</surname>
          </string-name>
          ,
          <article-title>Pathbased queries on trajectory data</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems</source>
          , SIGSPATIAL '14,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, US,
          <year>2014</year>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Staudemeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <article-title>Understanding lstm - a tutorial into long short-term memory recurrent neural networks</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1909</year>
          .09586.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <article-title>Short-term trafic lfow prediction based on xgboost</article-title>
          ,
          <source>in: 2018 IEEE 7th Data Driven Control and Learning Systems Conference (DDCLS)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>854</fpage>
          -
          <lpage>859</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          , L. Pan,
          <article-title>Predicting short-term trafic flow by long short-term memory recurrent neural network</article-title>
          , in: 2015 IEEE International Conference on Smart City/SocialCom/SustainCom,
          <year>2015</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tampubolon</surname>
          </string-name>
          , P.-A. Hsiung,
          <article-title>Supervised deep learning based for trafic flow prediction</article-title>
          ,
          <source>in: 2018 International Conference on Smart Green Technology in Electrical and Information Systems (ICSGTEIS)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <article-title>Graph hierarchical convolutional recurrent neural network (ghcrnn) for vehicle condition prediction</article-title>
          ,
          <year>2019</year>
          . arXiv:
          <year>1903</year>
          .06261.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Trafifc flow prediction with big data: A deep learning approach</article-title>
          ,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>865</fpage>
          -
          <lpage>873</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tsubouchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shibasaki</surname>
          </string-name>
          ,
          <article-title>Deepcrowd: A deep model for large-scale citywide crowd density and flow prediction</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>276</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Conlan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oakley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Demirci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sfyridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ferhatosmanoglu</surname>
          </string-name>
          ,
          <article-title>Real-time spatio-temporal forecasting with dynamic urban event and vehiclelevel flow information</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3379</volume>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>A practical guide to an opensource map-matching approach for big gps data</article-title>
          ,
          <source>SN Computer Science</source>
          <volume>3</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>