<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Physics: Conference
Series</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/TITS.2014.2371993</article-id>
      <title-group>
        <article-title>Models for Predicting Traffic Flow</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kirill Smelyakov</string-name>
          <email>kyrylo.smelyakov@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olha Klochko</string-name>
          <email>olha.klochko.cpe@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoia Dudar</string-name>
          <email>zoia.dudar@nure.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>14 Nauky Ave., Kharkiv, 61166</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>16</volume>
      <issue>4</issue>
      <fpage>1761</fpage>
      <lpage>1770</lpage>
      <abstract>
        <p>Traffic is one of the most important aspects of managing cities and transportation infrastructure. Fast and accurate traffic forecasting can help address various transportation-related problems, such as congestion, increased air pollution, and road safety. In this paper, we investigate the use of quantile regression and its modifications, such as KNN Quantile Regression, Random Forest Quantile Regression, Gradient Boosting Quantile Regression, and XGBoost Quantile Regression, for traffic interval prediction using Uber data on traffic in Kyiv in January 2020. The results showed that the Gradient Boosting Quantile Regression model performed best overall, while the KNN and Random Forest algorithms worked well for lower quantiles and XGBoost worked best for the median. The findings of this paper can be used to improve traffic forecasting, which is an important task for traffic management authorities, logistics and transportation companies, and other stakeholders. Keywords: traffic flow, quantile regression, speed prediction, machine learning. COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20-21, 2023, Kharkiv, Ukraine. ORCID: 0000-0001-9938-5489 (K. Smelyakov); 0009-0008-5355-4222 (O. Klochko); 0000-0001-5728-9253 (Z. Dudar).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Almost all cities in the world face serious congestion problems. Excessive traffic flow leads to the
paralysis of the urban transportation system on a daily basis, which creates great inconvenience and a
negative impact on people's travel. Different countries are actively taking appropriate measures, i.e.
redirecting traffic, limiting the number or expanding the scale of the road network, but these measures
may have little effect [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Intelligent transport systems are used to manage traffic flows, allowing real-time data collection and
processing of information about the road network, including traffic speed, number of vehicles for a
certain period, traffic density, road network occupancy, and public transport schedules.</p>
      <p>
        There are several reasons for the need to regulate urban traffic flows in Ukrainian cities [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
increasing urbanization, growing congestion on the road network, poor quality of public transport
services, inconvenient routes, long travel times, etc. These problems are especially acute in the largest
cities and encourage citizens to increasingly choose a car for daily correspondence, which in turn
increases delays, travel time, and leads to environmental pollution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The relevance of this work lies in finding tools for managing and monitoring these processes in cities.
According to the developing but still insufficient scientific literature, which focuses on how the
dynamism of intelligent transport systems affects urban innovation and how traffic management tools
can be activated to obtain optimal results, it is important to analyze urban transport systems as a dynamic
whole.</p>
      <p>The aim of the paper is to investigate the efficiency of quantile regression models for predicting
traffic flow from historical data, using the example of the average hourly speed of cars on a particular road
segment, to evaluate the accuracy of the prediction, and to describe the applicability of the results to improving the road
traffic system.</p>
      <p>2023 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Traffic forecasting is an important task in the field of transportation logistics and road traffic
management. Research in the field of traffic prediction uses machine learning methods, in particular
quantile regression methods.</p>
      <p>
        The authors of work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] describe internal and external, static and dynamic factors affecting traffic
conditions. Internal factors:
• driving behavior (dynamic);
• vehicle information (static);
• vehicle condition (dynamic).
      </p>
      <p>External factors:
• traffic flow condition (dynamic);
• weather conditions (dynamic);
• traffic rules and regulations (static);
• traffic signals and events (dynamic).</p>
      <p>Dynamic factors are known to change over time, so they are more difficult to model than static
factors. Thus, in forecasting, historical and current information on dynamic factors should usually be
considered together. Finally, this section analyzes the main factors affecting various forecasts. The first
is classified in terms of the vehicle, which represents the internal factors of the vehicle and the external
factors of the environment, as shown in Figure 1 and Figure 2.</p>
      <p>
        Most studies that estimate the traffic flow of an entire road network rely on one or more road
network properties, and the results are not always promising [
        <xref ref-type="bibr" rid="ref5 ref6">5-7</xref>
        ]; recent research has also examined evaluating the efficiency of network
transfer and tuning the parameters of an intelligent system [8,9]. One approach combines five topological
indicators and road length to estimate traffic flow using multiple
regression [10]. Six measures are used to estimate traffic flow: road length, closeness, betweenness,
degree, PageRank, and clustering coefficient [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It is worth noting that each measure
requires a different correlation for different types of traffic data.
      </p>
      <p>Big data methods are used in a wide range of fields and industries, including e-commerce,
healthcare, transportation, energy, government, and education [11,12]. They drive innovation and improve
efficiency across many different industries, leading to significant advances in technology and business
practices.</p>
      <p>The KNN algorithm has been applied to short-term urban traffic forecasting. It performs well
in handling sudden changes and the non-linearity of urban traffic flow due to its
non-parametric regression characteristics [13,14]. However, the long execution time of a KNN
forecasting system reduces forecasting efficiency. To solve this problem, a two-stage
search algorithm was proposed that finds and identifies the best decision input set from historical
data using two similarity measures. Experimental results show that this method effectively improves
the prediction performance of the system while preserving the accuracy of the
original prediction. The ideas presented there can be further explored with additional data, such as
weather conditions or emergencies, more complex urban topologies, and different types of forecasting
methods [15].</p>
      <p>Works [16-18] apply quantile regression to predict traffic based on smartphone data. They compared
different quantile regression methods, including nearest neighbors, random forests, and gradient
boosting, and found that the gradient boosting method gave the best results. A number of
statistical methods were also used to predict the 5th, 10th, 25th, 50th, 75th, and 90th percentiles of traffic speeds. Comparing
the models, the authors concluded that the nearest neighbor method and random
forests showed the best performance for traffic prediction using quantile regression.</p>
      <p>Also several quantile regression methods were compared to predict traffic speed on a highway
[19,20]. The authors compared different methods, such as the nearest neighbor method, random forests,
gradient boosting, and XGBoost. They compared the results with several other congestion prediction
methods and concluded that the random forest method is effective in predicting quantile values of road
congestion. According to the results of the study, the XGBoost method showed the best performance
for predicting speed quantiles on the highway.</p>
      <p>The quantile regression method can also be combined with other methods to improve
forecast accuracy: article [21] describes an algorithm for short-term nonparametric probabilistic
quantile regression forecasting that incorporates the advantages of a hybrid neural network and quantile
regression.</p>
      <p>Approaching the quantile regression problem [22,23] from a multitask perspective addresses the
unpleasant problem of crossing quantiles, while greatly outperforming current quantile regression
methods. The authors note that jointly modeling the mean and several conditional quantiles improves
predictions of the conditional expectation, due to the additional information and the regularization effect
of the added quantiles.</p>
      <p>The literature also contains studies using artificial neural networks [24], such as long short-term
memory. One work addresses the lack of traffic speed data and proposes a method for predicting traffic
speed based on traffic flow measured at the preceding and following moments. The performance of five
prediction models was compared: KNN, support vector regression (SVR), classification trees,
long short-term memory (LSTM), and back propagation (BP) [25]. The method based on
the LSTM model achieves the best result.</p>
      <p>In general, many studies use quantile regression methods to predict traffic speeds and traffic
congestion. Different methods are used, such as the nearest neighbor method, random forests, gradient
boosting, and XGBoost. Each of these methods has its own advantages and disadvantages, so the choice
of method depends on the specific task and the amount of data, i.e. searching in Big Data Warehouses
[26].</p>
      <p>The effect of the dataset on evaluating urban traffic prediction was analyzed as well and
experimental results show that the predictive effect of the multiscale model is much better than that of
the single-scale prediction and fully reflects the data set, adding more information is of greater research
value [27]. These resources may provide further insights and perspectives on the use of data science
and machine learning techniques for predicting and analyzing transportation patterns and trends.</p>
      <p>Consequently, research in quantile regression for traffic prediction is ongoing, and allows for the
development of increasingly accurate and efficient methods to solve this important problem.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method and materials</title>
      <p>Consider the dataset, machine learning methods, and metrics that were used in the further experiments.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset Description</title>
      <p>As mentioned earlier, traffic data are collected by many organizations involved in transportation,
logistics, and mapping services. However, due to certain restrictions, such data are usually not publicly
available; most traffic data are provided by taxi services. Up-to-date data were also needed, as
most open datasets store information on traffic speeds only up to 2012. Since it was decided to use Kyiv data
to build the model, the necessary information was sought on the resources of
well-known taxi services.</p>
      <p>There are several large taxi services in Kyiv, one of the largest is Uber [28]. An important fact is
that in 2018, the company launched the Uber Movement resource, which provides access to data on the
speed of taxi movement of this service over time. It contains data from January 2018 to March 2020.
The data is divided into sets, each of which contains information about the average taxi speed on a
segment of the region's road for each hour of each day of a particular month. The data includes only
those observations for which there is data on at least 5 unique trips on the segment in question at the
time point in question (Figure 3).</p>
      <p>It includes the following fields:
• year - year of observation;
• month - number of the observation month (from 1, which corresponds to January, to 12,
which corresponds to December);
• day - day of observation (from 1 to 31);
• hour - hour of observation in local time (from 0 to 23);
• utc_timestamp - date and time of observation in UTC (Coordinated Universal Time) format;
• osm_way_id - OpenStreetMap road identifier for the corresponding segment;
• osm_start_node_id - the corresponding OpenStreetMap node identifier for the start of the segment;
• osm_end_node_id - the corresponding OpenStreetMap node identifier for the end of the segment;
• speed_kph_mean - the average speed of Uber vehicles on the corresponding road segment in km/h;
• speed_kph_stddev - standard deviation of the speed on the corresponding road segment in km/h.
The road segment is fully defined by the OpenStreetMap road identifier, together with the start and end
node identifiers in OpenStreetMap. These data can be used to obtain the name of the
street where the segment is located, as well as its location. Uber Movement also provides these data, but
as a separate set.</p>
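      <p>As an illustration of the schema above, an hourly speed file can be loaded with pandas. The following is a sketch of ours, not the authors' code: a tiny inline sample stands in for the real CSV, and the concrete values are invented for illustration, while the column layout follows the field list above.</p>

```python
import io
import pandas as pd

# A tiny inline sample standing in for a real Uber Movement hourly speed CSV.
sample_csv = io.StringIO(
    "year,month,day,hour,utc_timestamp,osm_way_id,"
    "osm_start_node_id,osm_end_node_id,speed_kph_mean,speed_kph_stddev\n"
    "2020,1,15,8,2020-01-15T06:00:00.000Z,26205862,287342805,287342806,24.3,6.1\n"
    "2020,1,15,9,2020-01-15T07:00:00.000Z,26205862,287342805,287342806,19.8,5.4\n"
)

# parse_dates turns the UTC column into proper timestamps.
df = pd.read_csv(sample_csv, parse_dates=["utc_timestamp"])
print(df[["day", "hour", "speed_kph_mean"]])
```

      <p>For the real data, pd.read_csv on the downloaded file with parse_dates=["utc_timestamp"] would follow the same pattern.</p>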
    </sec>
    <sec id="sec-5">
      <title>3.2. Machine learning methods</title>
      <p>There are various types of regression. Regression models aim to fit a target variable expressed
as a numerical vector, and statisticians have developed increasingly sophisticated regression
techniques. Quantile regression (QR) is a procedure for estimating the parameters of a linear
relationship between explanatory variables and a given quantile level of the variable being
explained [29, 30].</p>
      <p>Unlike ordinary least squares, quantile regression makes no assumption about the error distribution.
This yields more information: regression parameters for any quantile of the distribution of the dependent
variable. In addition, such a model is much less sensitive to outliers in the data and to violations of
assumptions about the nature of the distributions.</p>
      <p>Quantile regression is a regression that intentionally introduces a bias into the result. Instead of
looking for the mean of the predicted variable, quantile regression aims to find the median and any
other quantiles (which are sometimes called percentiles). The classic and most straightforward
prediction is that based on mean values: the respective over- and under-prediction weights must be
equal, otherwise the prediction becomes biased (more accurately, biased relative to the mean value).</p>
      <p>The first refinement of this approach is the median prediction: the corresponding over- and
under-prediction frequencies must be equal, otherwise the prediction becomes biased relative to the median.
At this point, the notion of an unbiased prediction shifts from equal weights to equal probability.
This shift is not obvious, but it can make a huge numerical difference in some situations. The median
represents the threshold at which the distribution splits with 50/50 probability.
However, other frequency ratios can be considered as well, for example
80/20, 90/10, or any other, as long as they total 100%.</p>
      <p>Quantiles are a generalization of the median value to any percentage expression. For τ, whose value
is between 0 and 1, the quantile regression Q(τ) represents the threshold value at which the probability
of a value below the threshold is equal to τ [31].</p>
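      <p>Equivalently, the τ-quantile can be characterized as the minimizer of the pinball (quantile) loss. The following formulation is an addition of ours for clarity, in notation consistent with the definition above:</p>

```latex
\rho_\tau(u) = \max\bigl(\tau u,\; (\tau - 1)\,u\bigr),
\qquad
\hat{q}_\tau = \arg\min_{q}\; \mathbb{E}\left[\rho_\tau(Y - q)\right].
```

      <p>For τ = 0.5 this reduces to half the absolute error, whose minimizer is the median; for larger τ, under-prediction is penalized more heavily than over-prediction.</p>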
      <p>
        The strict mathematical definition of a quantile is the following: if Y is a random variable with a
distribution function F(y) or a distribution density f(y), then the quantile qτ of order τ ∈ [0, 1] of a
one-dimensional distribution is the value yτ of the random variable Y for which the distribution function
takes the value τ, or at which it "jumps" from a value less than τ to a value greater than τ. For a
continuous distribution, the quantile of order τ, where τ ∈ [0, 1], is defined as the solution of the equation
F(qτ) = ∫_{−∞}^{qτ} f(y) dy = τ. (1)
      </p>
      <p>K-nearest neighbors (KNN) is a nonparametric regression tool that estimates the
conditional mean for a new observation, x0, by identifying the k observed data points that are closest
to the new observation for which a prediction is needed. The response values of these nearest
observations are then averaged together [14-16]. The k-nearest neighbor prediction is more formally
computed by the following equation:
ŷ(x0) = (1/k) ∑_{xi ∈ Nk(x0)} yi, (2)</p>
      <p>where Nk(x0) is the neighborhood of x0 defined by the k closest points xi in the training data. Since
most observed data will likely have just one or no observations at a candidate x0, the observed response
values of the closest neighbors serve as an approximation of the conditional distribution Y|x = x0. Thus,
averaging across these observed values gives an estimate of the conditional mean at x0.</p>
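      <p>Replacing the average over the neighborhood with an empirical quantile of the neighbors' responses turns this into a simple KNN quantile predictor. The following is a minimal sketch of ours (not the authors' implementation) of that idea:</p>

```python
import numpy as np

def knn_quantile_predict(X_train, y_train, x0, k=5, tau=0.5):
    """Predict the tau-quantile at x0 from the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distances from x0 to every training point.
    dists = np.linalg.norm(X_train - np.asarray(x0, dtype=float), axis=1)
    # Indices of the k closest observations.
    nearest = np.argsort(dists)[:k]
    # The empirical tau-quantile of their responses replaces the mean.
    return np.quantile(y_train[nearest], tau)

# Toy data: one feature, five observations.
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
y = [10.0, 12.0, 14.0, 50.0, 52.0]
print(knn_quantile_predict(X, y, [1.5], k=3, tau=0.5))  # median of {10, 12, 14} -> 12.0
```

      <p>With tau = 0.5 this reduces to the local median; other values of tau give the lower or upper conditional quantiles of the neighborhood.</p>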
      <p>A regression tree, like KNN, is a nonparametric prediction method that approximates the conditional
mean by using available data close in proximity to the point one wishes to predict. For continuous
predictors, regression trees split the predictor space into high dimensional rectangles rather than using
neighbors.</p>
      <p>The random forest (RF) model is a highly valuable and widely applied nonparametric form of regression.
The trees provide a natural way to approximate f(X) automatically without much prior thought about
what the true form of f(X) looks like. Its bagging nature lends itself to better prediction accuracy than
a single regression tree and also allows categorical predictors to be incorporated where other nonparametric
tools, like KNN, do not.</p>
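      <p>One rough way to obtain quantile predictions from a fitted forest is to take the empirical quantile of the individual trees' predictions. This is a sketch of ours and only an approximation of true quantile regression forests, which use the leaf-level training samples rather than per-tree point predictions:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_quantile_predict(forest, X, tau=0.5):
    """Approximate tau-quantile prediction from a fitted sklearn forest:
    the empirical tau-quantile of the individual trees' predictions."""
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return np.quantile(per_tree, tau, axis=0)

# Synthetic hourly-speed data: hour of day vs. a noisy speed in km/h.
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(300, 1))
y = 30 + 5 * np.sin(X[:, 0]) + rng.normal(0, 3, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
lo = forest_quantile_predict(rf, [[12.0]], tau=0.1)
hi = forest_quantile_predict(rf, [[12.0]], tau=0.9)
print(lo, hi)  # lower and upper bands around the noon speed
```

      <p>Dedicated packages implement the exact leaf-sample version; the per-tree shortcut above merely illustrates how an ensemble yields a predictive interval rather than a single point.</p>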
      <p>XGBoost is a machine learning algorithm based on decision trees and a gradient
boosting framework. In prediction tasks that use unstructured data (such as images or text), artificial
neural networks outperform all other algorithms or frameworks, but for structured or
tabular data of small size, decision-tree-based algorithms take precedence [32,33].</p>
      <p>XGBoost and Gradient Boosting Machines (GBM) are tree ensemble methods that boost
weak learners (most commonly binary decision trees) using a
gradient descent architecture [33]. XGBoost improves on the GBM framework through
system optimization and algorithm refinements.</p>
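      <p>To target quantiles with boosting, the pinball loss can be supplied as the training objective. XGBoost accepts user-defined objectives returning a gradient and hessian per example (newer releases also ship a built-in quantile objective); the sketch below is our illustration, not the authors' code, and the exact callback signature depends on the xgboost API in use. Since the pinball loss is piecewise linear, its true second derivative is zero, so a small constant hessian is substituted, as is common practice:</p>

```python
import numpy as np

def pinball_objective(tau, hess_const=1.0):
    """Factory for a quantile (pinball) training objective in the
    (gradient, hessian) form that gradient-boosting callbacks expect."""
    def objective(y_true, y_pred):
        err = y_true - y_pred
        # Derivative of rho_tau(y_true - y_pred) w.r.t. y_pred:
        # -tau where err is positive, (1 - tau) where err is negative.
        grad = np.where(err > 0, -tau, 1.0 - tau)
        # Constant stand-in hessian for the piecewise-linear loss.
        hess = np.full_like(grad, hess_const)
        return grad, hess
    return objective

obj = pinball_objective(tau=0.9)
grad, hess = obj(np.array([20.0, 20.0]), np.array([10.0, 30.0]))
print(grad)  # -0.9 for the under-prediction, 0.1 for the over-prediction
```

      <p>At tau = 0.9 the gradient pushes under-predictions up nine times harder than it pushes over-predictions down, which is exactly what biases the fitted function toward the upper quantile.</p>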
    </sec>
    <sec id="sec-6">
      <title>3.3. Machine learning metrics</title>
      <p>To assess the prediction accuracy of the chosen models, we use four statistical scores in this paper:
Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root
Mean Squared Error (RMSE) [34]. They are calculated as follows:
MAE = (1/N) ∑_{i=1}^{N} |y_i − ŷ_i|,
MSE = (1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²,
RMSE = √MSE,
MAPE = (100%/N) ∑_{i=1}^{N} |y_i − ŷ_i| / |y_i|,
where N is the number of data points, y_i the observed value, and ŷ_i the predicted value.</p>
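      <p>These four scores translate directly into code; a minimal numpy sketch (ours, mirroring the formulas above term by term):</p>

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """MAE, MSE, RMSE, and MAPE as defined in this section."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # MAPE assumes no observed value is zero.
    mape = 100.0 * np.mean(np.abs(err) / np.abs(y_true))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

scores = regression_scores([20.0, 40.0], [18.0, 44.0])
print(scores)  # MAE 3.0, MSE 10.0, RMSE ~3.162, MAPE ~10.0
```

      <p>sklearn.metrics provides equivalent functions (mean_absolute_error, mean_squared_error, mean_absolute_percentage_error); the explicit formulas are shown here only to match the definitions above.</p>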
    </sec>
    <sec id="sec-7">
      <title>4. Experiment</title>
      <p>In general, the first step is to collect data. This can be data on some traffic characteristics. This can
be done with the help of special devices that can be installed either outside the vehicle, such as radars,
or inside, such as GPS trackers.</p>
      <p>Recognition systems are often used to add data from surveillance cameras to determine traffic on
the roads using computer vision technology. The data can be collected at a single point or part of the
road, or at a set of observation points or road sections.</p>
      <p>The experiment used the Uber Movement service, which provides data in the public domain,
in particular for academic purposes [28].</p>
      <p>However, there is one non-obvious aspect here. On the one hand, the speed of a taxi generally
reflects the speed of the traffic in which it is moving. However, this is not always the case:
in particular, one study shows that taxis move somewhat slower than the surrounding traffic, which
would logically imply that taxis slow traffic down.</p>
      <p>For the experiment, the Uber Movement speed dataset for Kyiv in January 2020 was downloaded,
along with the corresponding .geojson file containing the OpenStreetMap data.</p>
      <p>The data of the main streets of the central part of Kyiv were selected for the study. In a set with
segment meta-information, the importance of a street is determined by the osmhighway parameter.</p>
      <p>In particular, the values trunk, primary, and secondary denote the main roads, while trunk_link,
primary_link, and secondary_link denote the main connections between streets that do not have their
own name, such as exits from overpasses.</p>
      <p>Roads that fit this description are shown in Figure 4. These are arterial streets bounded on one side
by the so-called "small ring road" and on the other side by the Dnipro River.</p>
      <p>The main programming tool in this work is the Python language, for several reasons. First, it is
quite easy to use because it has an intuitive syntax, and is therefore widely used by professionals at every
level of their science and engineering careers. Second, it offers a large selection of libraries and frameworks
for ML models. Finally, it is a modern programming language that can easily be
integrated with other languages if necessary. Python 3.10 and Jupyter Notebook were used to
simulate the prediction algorithm. The libraries used for data analysis and visualization were NumPy, Pandas,
scikit-learn, matplotlib, and seaborn.</p>
      <p>Missing values can be due to sensor failures or, for example, the absence of cars on a given road segment at a
given time in the case of data from taxi services. For correct operation, the data under study must be
pre-processed, in particular by restoring the missing values in some way. It is also
important for researchers to pay attention to how the traffic data on the road network are organized. Given
their origin, they usually contain temporal and spatial dependencies of varying complexity. In
particular, they are characterized by seasonality in time, both daily and weekly.
Neighboring road segments also affect the traffic of the target segment, which indicates obvious spatial
dependencies in this kind of data. If these aspects are ignored, it is difficult to build an adequate
forecasting system. Unnecessary fields were therefore removed: the year and month, as the dataset
contains data for only one year and month, and the identifiers of the road segments and their start and end nodes,
since they were duplicated.</p>
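      <p>The field-dropping step described above could look like the following pandas sketch. The column names follow the Uber Movement schema from Section 3.1; the sample values are invented, and this is our illustration rather than the authors' code:</p>

```python
import pandas as pd

# Two sample rows with the Uber Movement columns kept for modeling.
df = pd.DataFrame({
    "year": [2020, 2020], "month": [1, 1],
    "day": [15, 15], "hour": [8, 9],
    "osm_way_id": [26205862, 26205862],
    "osm_start_node_id": [287342805, 287342805],
    "osm_end_node_id": [287342806, 287342806],
    "speed_kph_mean": [24.3, 19.8],
})

# year/month are constant (a single month of data), and on a single-street
# subset the segment identifiers are duplicated, so they carry no signal.
dropped = ["year", "month", "osm_way_id", "osm_start_node_id", "osm_end_node_id"]
df = df.drop(columns=dropped)
print(list(df.columns))  # ['day', 'hour', 'speed_kph_mean']
```
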
      <p>Collected data are often imperfect and cannot be used immediately. They may contain gaps or
unnecessary information that would overburden the algorithm and, as a result, degrade the accuracy of
the prediction. Gaps in collected telecommunications data have various causes: system
problems, packet loss, interference, etc. Other data-related issues include sensor measurement errors
and outliers. For the experiment, data were taken from one street, Umanskaya Street, which yields
412 examples, with 10 to 14 measurements per day. The chosen target variable
is the traffic in one particular cell. The features selected for the work are the time (hour) and the day. The
chosen quantiles were 0.1, 0.25, 0.5, 0.75, and 0.9.</p>
      <p>We started by preprocessing the data and splitting it into training (70%) and testing (30%) sets. We then
implemented the KNN Quantile Regression, Random Forest Quantile Regression,
Gradient Boosting Quantile Regression, and XGBoost Quantile Regression algorithms using the scikit-learn
and xgboost libraries. The parameter settings of all models used in the experiment are shown in Table 1.</p>
      <p>We also defined a function to compute the quantile losses and to plot the predicted versus actual
values. The quality of forecasting was assessed with the MAE, MSE, MAPE, and RMSE metrics from
the sklearn.metrics library.</p>
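      <p>A quantile-loss helper of the kind mentioned above might look like the following sketch (ours, not the authors' code), using the pinball formulation:</p>

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau):
    """Mean pinball loss: tau-weighted penalty for under- vs. over-prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    # max(tau * err, (tau - 1) * err) equals tau*err when err is positive
    # and (tau - 1)*err when err is negative.
    return np.mean(np.maximum(tau * err, (tau - 1.0) * err))

# At tau = 0.9, under-predicting by 10 costs nine times more than over-predicting.
print(round(quantile_loss([20.0], [10.0], 0.9), 6))  # 9.0
print(round(quantile_loss([20.0], [30.0], 0.9), 6))  # 1.0
```

      <p>Evaluating each model with the loss at its own target quantile, alongside the generic MAE/MSE/MAPE/RMSE scores, is what makes the per-quantile comparisons in Section 5 meaningful.</p>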
      <p>Machine learning requires a lot of RAM; to speed up access to it, a processor supporting four memory
channels rather than the two found in typical consumer systems is preferable. To perform machine learning efficiently, it
is also important to consider the number of cores and the memory size of the graphics card. Since deep
learning consists largely of linear operations, with many simple computations running at the same time, graphics
processors are better suited for it: they are designed for massive parallel calculations,
while CPUs are designed for sequential ones. The wall training time of a model depends strongly
on processing power; a machine with an Intel Core i7 processor and 8 GB of RAM was used to
measure the wall training time with the given parameters.</p>
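      <p>Wall training time can be measured with a simple timer around the fit call. The sketch below is ours, with synthetic data, not the authors' benchmark harness:</p>

```python
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data, just to have something to fit.
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(400, 1))
y = 30 + rng.normal(0, 3, 400)

start = time.perf_counter()
GradientBoostingRegressor(loss="quantile", alpha=0.5, n_estimators=100).fit(X, y)
elapsed = time.perf_counter() - start
print(f"wall training time: {elapsed:.3f} s")
```

      <p>time.perf_counter measures elapsed wall-clock time, so the result reflects the machine's overall throughput, which is why the hardware configuration above matters for the comparison in Table 10.</p>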
      <p>Overall, the comparison provided insights into the strengths and weaknesses of each algorithm, and
the results could be used to select the most appropriate algorithm.</p>
    </sec>
    <sec id="sec-8">
      <title>5. Results</title>
      <p>As a result of the experiment, we got the following plot of actual and predicted data with KNN
Quantile Regression in Figure 4.</p>
      <p>Then, we got the following plot of actual and predicted data with Random Forest Quantile
Regression in Figure 5.</p>
      <p>Table 4 shows more detailed information about predicted values with Random Forest Quantile
Regression for all quantiles.</p>
      <sec id="sec-8-1">
        <p>Table 5 shows error metrics for predicted values with Random Forest Quantile Regression for all
quantiles.</p>
        <p>Then, we got the following plot of actual and predicted data with Gradient Boosting Quantile
Regression in Figure 6.</p>
        <p>Table 6 shows more detailed information about predicted values with Gradient Boosting Quantile
Regression for all quantiles.</p>
        <p>Table 8 shows more detailed information about predicted values with XGBoost Quantile Regression
for all quantiles.</p>
        <p>The training times for all regression algorithms can be seen in Table 10.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>6. Discussions</title>
      <sec id="sec-9-1">
        <p>Before discussing the results, it is worth mentioning some assumptions and limitations of this
work. The assumptions of this study include the following:
• the study was conducted on data taken from the Uber Movement service, which contains the
average speed of taxi traffic on a certain road segment for each hour of each day of a particular month;
• the study used machine learning methods, in particular quantile regression
methods such as KNN, Random Forest, Gradient Boosting, and XGBoost;
• the study used quality metrics such as MSE, RMSE, MAE, and MAPE to clearly assess the
effectiveness of the regression methods under consideration.</p>
        <p>The limitations include the following points:
• the study was conducted on a limited data set, which may affect its overall
representativeness;
• the regression methods under consideration may require additional parameters and hyperparameters,
which need to be optimized to obtain better forecasting results;
• some factors affecting traffic may be difficult to measure or unavailable for data collection
(unpredictable changes in traffic, e.g. due to accidents or weather conditions), which may limit
the accuracy of the regression models.</p>
        <p>The training time of the model is an important parameter; the experiments established
that the fastest algorithms for this task are XGBoost and KNN.</p>
        <p>We observed that the KNN and Random Forest algorithms performed relatively well for lower quantiles,
but their performance degraded for higher quantiles (Figure 8). Table 2 provides the actual and
predicted values for five quantiles for the KNN model. The predicted values are generally
higher than the actual values, and the difference between the predicted and actual values increases as
the quantile level increases.</p>
        <p>These methods work well for lower quantiles, since they can model complex nonlinear data given large
amounts of training data. However, for higher quantiles, where the data are sparser and the values lie
near the tails of the distribution, these methods may be less efficient. For such cases, methods
that build more complex models and make fewer assumptions about the distribution of the data, such
as gradient boosting, may be better.</p>
        <p>Gradient Boosting and XGBoost performed rather poorly for lower quantiles and had the smallest loss near
the median (the 0.5 quantile). The MAE, MSE, and RMSE decrease as the quantile increases, indicating
that the model performs better at higher quantiles. However, the MAPE increases as the quantile
increases, indicating that the relative error of the model is higher at higher quantiles. These methods
are based on sequentially adding weak models to the ensemble to improve predictive ability.
They are commonly used to reduce the MSE of the prediction, which is a standard metric in many
regression problems; in quantile regression, however, where the target variables are quantiles, MSE
may not be a suitable criterion.</p>
        <p>The Gradient Boosting and XGBoost methods usually show good results for the median, as this target
is quite close to what the MSE optimizes. However, for lower quantiles, where predictions should be more conservative,
these methods may be less effective. This may be because gradient boosting and XGBoost
use tree models, which tend to overfit, that is, they can capture
interactions between variables that are specific to the training data set. This may lead to less accurate
predictions for lower quantiles, where the distribution of the data may be more complex and interactions
between variables more important.</p>
        <p>As the results show, KNN and Random Forest achieve reasonable accuracy but produce a
narrow prediction interval that does not cover all possible values.</p>
        <p>The KNN method produces a narrow interval in quantile regression because it predicts the target
variable from the most similar values in the training data set. If the training data set is representative
of the target variable, the nearest neighbour method can provide reasonably accurate
quantile estimates. However, it may be less accurate when the training data set is not
representative of the target variable, as may happen when the speed of cars depends on many other
factors, such as weather conditions, traffic, day of the week, etc.</p>
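        The neighbourhood-based prediction described above can be sketched directly: take the empirical quantile of the k nearest neighbours' targets instead of their mean (synthetic data; the helper `knn_quantile` is illustrative, not a library function):

```python
# Sketch: KNN quantile regression -- predict the empirical q-quantile
# of each query point's k nearest neighbours. Synthetic data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 30 + 2 * X_train[:, 0] + rng.normal(0, 5, 200)

def knn_quantile(X_query, q, k=20):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_query)             # indices of the k neighbours
    # Empirical q-quantile over each query point's neighbourhood.
    return np.quantile(y_train[idx], q, axis=1)

X_query = np.array([[5.0]])
lo, med, hi = (knn_quantile(X_query, q)[0] for q in (0.1, 0.5, 0.9))
```

The interval [lo, hi] can only be as wide as the spread among the k local neighbours, which is one way to see why KNN intervals come out narrow when the local sample underrepresents the tails.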
        <p>The Random Forest method can also produce a narrow interval in quantile regression because this
method is based on an ensemble of decision trees, which can be very flexible in modeling non-linear
relationships between the dependent and independent variables. In addition, with many trees
in the ensemble, high prediction accuracy can be achieved. However, if the decision trees are too deep
or the number of trees is too large, the model may overfit and its overall generalization
ability may deteriorate. It is therefore important to carefully tune model hyperparameters such as tree
depth and the number of trees in the ensemble.</p>
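        The effect of tree depth on overfitting can be checked with a simple train/validation comparison; a hedged sketch on synthetic data (the split and depth values are illustrative choices, not the paper's tuning procedure):

```python
# Sketch: comparing a shallow vs. an unrestricted Random Forest to see
# how tree depth drives overfitting. Synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(400, 1))
y = 30 + 2 * X[:, 0] + rng.normal(0, 5, 400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

errors = {}
for depth in (2, None):  # shallow trees vs. fully grown trees
    rf = RandomForestRegressor(n_estimators=100, max_depth=depth,
                               random_state=0)
    rf.fit(X_tr, y_tr)
    errors[depth] = (
        np.mean((rf.predict(X_tr) - y_tr) ** 2),    # training MSE
        np.mean((rf.predict(X_val) - y_val) ** 2),  # validation MSE
    )
```

Fully grown trees drive the training error far below the shallow forest's, while the validation error does not improve correspondingly; the gap between the two is the overfitting that depth tuning controls.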
        <p>In quantile regression, a larger MSE indicates that the model's errors are more widely dispersed
around the predicted quantile values; in other words, the model may be overestimating or
underestimating the actual quantile values by a larger margin.</p>
        <p>Since quantile regression is concerned with predicting specific quantiles of the target variable, a
model with a larger MSE may still be appropriate if the goal is to identify extreme or outlier values of
the target variable: a larger MSE can reflect the model's ability to capture the
variability in the tails of the target variable's distribution. Quantile regression, after all, models
the entire conditional distribution of the response variable rather than just its
mean.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>7. Conclusions</title>
      <p>In this work, we explored the use of different quantile regression models for predicting speed based
on Uber data in Kyiv, Ukraine during January 2020. We compared the performance of KNN quantile
regression, Random Forest quantile regression, Gradient Boosting quantile regression, and XGBoost
quantile regression, measuring errors and draw plots for each model.</p>
      <p>Our results show that all four models performed well in predicting speed. We found that the
KNN and Random Forest algorithms work relatively well for lower quantiles but their effectiveness
declines for higher quantiles, while the Gradient Boosting and XGBoost methods showed poor results for lower
quantiles and the smallest losses near the median. The KNN and Random Forest methods have a narrow
prediction interval and do not cover all possible values. Overall, the Gradient Boosting quantile
regression model performed the best, with the lowest overall mean absolute error and mean
squared error.</p>
      <p>Traditionally, mean regression models have been used for this purpose, but quantile regression
provides a more comprehensive approach as it allows for the prediction of multiple quantiles, providing
a fuller picture of the traffic flow distribution.</p>
      <p>The results of this work can help to identify the most effective methods of traffic forecasting, which
can reduce the time spent on forecasting and increase the accuracy of forecasts. In addition, the
conclusions of this work can be used to develop new traffic forecasting algorithms that will be more
efficient and accurate. The study shows that the proposed quantile regression models (KNN, random
forest, gradient boosting, and XGBoost) outperform the traditional linear regression model in traffic
flow prediction.</p>
      <p>In the future, we plan to compare the effectiveness of other machine learning
methods that can be applied to traffic forecasting, such as neural networks. We also intend to
incorporate additional factors, such as data on weather, events in the city, road
works and accidents, which can help identify key stress points in cities, and to extend the
model to data from previous years in the hope of identifying seasonal patterns in urban mobility.
In addition, it will be important to test the effectiveness of the developed models on real data and
compare them with existing traffic forecasting systems to assess their potential usefulness and practical
relevance.</p>
    </sec>
    <sec id="sec-11">
      <title>8. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Khaled</given-names>
            <surname>Shaaban</surname>
          </string-name>
          , Mazen Elamin, Mohammed Alsoub,
          <source>Intelligent Transportation Systems in a Developing Country: Benefits and Challenges of Implementation</source>
          , Transportation Research Procedia, vol.
          <volume>55</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>1373</fpage>
          -
          <lpage>1380</lpage>
          , doi: 10.1016/j.trpro.2021.07.122.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <article-title>Kiev traffic index</article-title>
          . URL: https://www.tomtom.com/en_gb/traffic-index/kiev-traffic.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Daunoras</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bagdonas</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gargasas</surname>
            <given-names>V.</given-names>
          </string-name>
          .
          <article-title>City transport monitoring and routes optimal management system</article-title>
          .
          <source>Transport</source>
          .
          <year>2008</year>
          .
          <volume>23</volume>
          (
          <issue>2</issue>
          ). p.
          <fpage>144</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Zewei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Ziru Yang, Yuanjian Zhang, Yanjun Huang, Hong Chen,
          <string-name>
            <given-names>Zhuoping</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A comprehensive study of speed prediction in transportation system: From vehicle to traffic, iScience</article-title>
          , vol.
          <volume>25</volume>
          ,
          <issue>3</issue>
          , 18 March
          <year>2022</year>
          , doi: 10.1016/j.isci.2022.103909.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>"A Multiple Regression Approach for Traffic Flow Estimation,"</article-title>
          <source>in IEEE Access</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>35998</fpage>
          -
          <lpage>36009</lpage>
          ,
          <year>2019</year>
          , doi: 10.1109/ACCESS.2019.2904645.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>"Short-Term Traffic Flow Prediction Method for Urban Road Sections Based on Space-Time Analysis and GRU," in IEEE Access</article-title>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>143025</fpage>
          -
          <lpage>143035</lpage>
          ,
          <year>2019</year>
          , doi: 10.1109/ACCESS.2019.2941280.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>