=Paper=
{{Paper
|id=Vol-2718/paper06
|storemode=property
|title=Analysis of Delay Patterns and Correlations in Railway Traffic Data
|pdfUrl=https://ceur-ws.org/Vol-2718/paper06.pdf
|volume=Vol-2718
|authors=Roland Krisztián Szabó,Tomáš Horváth,Ádám Tarcsi
|dblpUrl=https://dblp.org/rec/conf/itat/SzaboHT20
}}
==Analysis of Delay Patterns and Correlations in Railway Traffic Data==
<pdf width="1500px">https://ceur-ws.org/Vol-2718/paper06.pdf</pdf>
<pre>
            Analysis of delay patterns and correlations in railway traffic data

                                      Roland Krisztián Szabó, Tomáš Horváth, and Ádám Tarcsi

                      Eötvös Loránd University, Faculty of Informatics, Budapest, Pázmány Péter stny. 1/C., 1117,
                         rolandszabo@inf.elte.hu tomas.horvath@inf.elte.hu ade@inf.elte.hu

Abstract: Traffic can be a huge challenge for most commuters regardless of their chosen
means of transportation. For example, delays and congestion are inevitable during rush hours.
Every mode of commuting has its own characteristics when it comes to delays: cars and buses
suffer from traffic jams, and similar principles apply to railways as well. However, the
causes of railway delays are less straightforward and need further investigation. In our
experience, most passengers are not aware of the reasons behind train delays, even though
they encounter them multiple times a day. In this paper we present possible answers based on
data collected from the publicly available APIs of Hungarian State Railways over the past
1.5 years.

Table 1: Details of a train entry in a snapshot

  Field name     Example
  Date           "2019.10.29 20:09:38"
  Elvira ID      "5614115_191029"
  Operator       "MAV"
  Line           "40"
  Train number   "55808"
  Relation       "Budapest-Keleti - Pécs"
  Latitude       46.26418
  Longitude      18.10566
  Delay          5

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

1     Datasets

The idea of delay analysis and prediction originates from paper [1], where a simpler version
of this concept was used as a module in a smart alarm clock application. During the
development of the application multiple data sources were investigated, some of which turned
out to be unusable. This section discusses the details of the selected data sources.

1.1   Traffic

We found that the most reliable publicly available data source for traffic is the official
map of Hungarian State Railways [2], where all trains can be tracked in real time. We created
a small automated script that runs on a virtual private server, takes a snapshot of the map
approximately every minute and stores the result in a JSON file.
   Traffic data have been collected since January 2019, which means there are roughly 1 year
and 6 months of available information (approx. 130 million records). Due to the COVID-19
outbreak a data freeze was applied at the end of March 2020 because of the extraordinary
circumstances that affect transportation all over the world. Hungarian State Railways
canceled many trains and only 30% of a train's capacity can be used in order to prevent the
spread of the disease. This new situation significantly alters the operation of the railway
system. It would have introduced a lot of noise into the existing dataset and might not be
relevant at all in a few months. Should the pandemic be over, its effects could be analyzed
later on, but currently this is out of the scope of this paper.
   A snapshot of the map contains the following information about each of the trains that
were present at the time the snapshot was taken (Table 1).

1.2   Weather

In addition to the traffic data we also collected the corresponding weather data for every
train, because we suspect that weather has an influence on the delays as well. It was not
easy to find a free provider capable of handling the necessary number of requests, but after
many trials we decided to use OpenWeatherMap [3]. Its free tier gives access to 60
location-based weather requests per minute, which is still not enough for every individual
train, but is sufficient to place virtual weather stations all over Hungary with a resolution
of approximately 35.5 km.

Definition 1. Virtual weather station. A virtual weather station is a GPS position which can
be queried for up-to-date local weather information.

Calculating the coordinates of the virtual weather stations. The first task is to distribute
the available 60 slots uniformly such that every train can be assigned to the closest virtual
weather station. Finding an exact solution to the problem would have been infeasible,
therefore we decided to develop an approximation algorithm. For this we used the GeoNames
geographical database [4], which contains POIs in Hungary and is available for download free
of charge under the Creative Commons Attribution 4.0 license.

Figure 1: Reconstructed railway network of Hungary
   The algorithm (Algorithm 1) uses a k-d tree, a space-partitioning data structure that
allows fast nearest neighbor searches [5]. The k-d tree is used to place 60 virtual weather
stations on the map as follows: in each step the most populated remaining POI is selected and
its neighboring POIs within the given radius are eliminated. This results in an approximately
uniform placement of virtual weather stations, which are located in densely populated areas
where accurate weather information benefits more people.
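
The greedy selection of Algorithm 1 can be sketched in Python as follows. This is a minimal
illustration, not the authors' implementation: a linear scan with the haversine distance
stands in for the k-d tree radius search, the Poi class, the function names and the sample
coordinates and population figures are illustrative assumptions, and the 35.5 km radius is
borrowed from the resolution mentioned in Section 1.2.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


@dataclass(frozen=True)
class Poi:
    """Hypothetical POI record: a name, a GPS position and a population."""
    name: str
    lat: float
    lon: float
    population: int


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS positions in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def place_stations(pois, radius_km):
    """Visit POIs by descending population and keep a POI only if no
    previously kept station lies within radius_km of it."""
    stations = []
    for poi in sorted(pois, key=lambda p: p.population, reverse=True):
        # Linear scan used here instead of kdt.searchRadius(poi, radius).
        if all(haversine_km(poi.lat, poi.lon, s.lat, s.lon) > radius_km
               for s in stations):
            stations.append(poi)
    return stations


def nearest_station(stations, lat, lon):
    """Assign a train position to its closest virtual weather station."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s.lat, s.lon))


# Illustrative input only; populations are rough placeholder values.
pois = [
    Poi("Budapest", 47.4979, 19.0402, 1_750_000),
    Poi("Budaörs", 47.4618, 18.9585, 29_000),
    Poi("Debrecen", 47.5316, 21.6273, 200_000),
    Poi("Pécs", 46.0727, 18.2323, 140_000),
]
stations = place_stations(pois, radius_km=35.5)
train_station = nearest_station(stations, 46.26418, 18.10566)  # position from Table 1
```

In this toy input Budaörs is dropped because it lies only a few kilometres from Budapest,
which has already been selected. With the full GeoNames data the radius would have to be
tuned until the number of stations fits the 60 available request slots.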

Algorithm 1 Approximation algorithm for virtual weather station placement
Funct getVirtualWeatherStationPositions(pois, radius)
  pois ← pois.sort("population", "desc")
  kdt ← kdTree<POI>("haversine")
  for poi ∈ pois do
      nn ← kdt.searchRadius(poi, radius)
      if nn = ∅ then
          kdt.add(poi)
      end if
  end for
  return kdt

2     Analysis

2.1   Reconstruction of the railway network

In the traffic dataset there are millions of recorded GPS coordinates and the majority of
them can be safely discarded after the necessary information has been extracted. For this
task we used the representative point extraction and updating algorithm by Zhongyi Ni et al.
[6], which is able to calculate the most significant points along a route and, due to its
online nature, is also capable of refining these points as new data becomes available. The
task is to determine the smallest number of points (called the representative points) that
can accurately represent such a route.
   The above-mentioned algorithm can determine the points describing a route (Figure 1) with
an arbitrary resolution. However, the recorded GPS trajectories are noisy and may contain
significantly misplaced outliers. The representative point extraction algorithm is able to
handle noise properly, but it also creates new representative points in case of outliers.
Therefore a support-based post-processing step is needed, which removes representative
points that are encountered rarely.

2.2   Conflicting trains

As a dimensionality reduction method it is beneficial to obtain the set of trains that might
affect the delay of another train. It can also be used to model delay-chain propagation.

Definition 2. Conflicting trains. A set of trains is said to be in conflict when their routes
have common representative points.

   The algorithm (Algorithm 2) determines the set of representative points which are within a
given distance radius of a specific representative point rp and returns the set of trains
that travel through those points. The temporal dimension is not taken into consideration,
because we only use the intersection of the conflicting trains with the currently traveling
trains.

Algorithm 2 Algorithm for determining conflicting trains
Funct getConflictingTrains(allRps, rp, radius)
  kdt ← kdTree<RP>("haversine", allRps)
  nn ← kdt.searchRadius(rp, radius)
  return ⋃_{n ∈ nn} {n.trainId}

2.3   Association rules

We realized that the grouping of trains can be considered as a frequent itemset mining
problem, therefore we used the Apriori algorithm [7] for itemset mining and association rule
learning.

Definition 3. Delayed train. A train is officially considered to be delayed when its delay is
greater than or equal to 5 minutes.

   The algorithm requires transactions, which can be constructed based on the snapshots of
the map. For each snapshot a transaction is made based on the set of conflicting delayed
trains (Definition 2) in the given snapshot.
   Association rules were generated for departure delays (Table 2), for which the snapshots
taken upon the scheduled departure are used. The meaning of a rule is that if all trains in
the antecedent set are delayed then the consequent train is likely to depart late with the
given metrics.
   Low support values are due to the fact that train 2749 is only included in a transaction
when it is delayed. The
frequent itemset containing train 2749 has a support value of 0.36, which means the train
departs late roughly 36% of the time.

Table 2: A subset of association rules generated for train 2749

  Antecedents            Consequent   Supp.   Conf.
  2879,2739              2749         0.21    0.81
  2879,7039,2739         2749         0.19    0.84
  2879,2859,2739         2749         0.17    0.82
  2879,7039,2859,2739    2749         0.16    0.84
  2879,700,7039          2749         0.15    0.80
  2879,700,2859          2749         0.15    0.83
  2879,6299,2739         2749         0.14    0.87
  2879,7039,6299,2739    2749         0.14    0.89
  2879,700,2739          2749         0.14    0.83
  700,7039,2859,2879     2749         0.13    0.85
  2879,2740,2739         2749         0.12    0.93
  2879,2740,2859         2749         0.12    0.84

2.4   Sequential rules

Besides traditional association rule mining we can also consider consecutive snapshots of a
specific train on a given day, which takes the temporal information into consideration as
well (Table 3). Sequential pattern mining is almost the same as association rule mining, but
instead of working directly with a single transaction we consider consecutive transactions
recorded in time.
   Sequential rules can also be mined for the departure delay, but they are more meaningful
if we mine them along the entire route of the train. The rule A ⇒ B means that when the
trains in A are delayed then train B will also become delayed in the future. In the case of
association rules we talked about trains that are usually delayed together, but now we have
an additional temporal dimension.
   In order to test this method we used the SPMF open-source data mining library [8] with the
RuleGrowth algorithm [9].
   The support values are much higher in this case, because rules are mined along the entire
route of the train. The support value of the frequent itemset containing train 2749 is 0.71,
which means that even though the train departed late only 36% of the time, it got delayed
71% of the time during its trip.

Table 3: A subset of sequential rules generated for train 2749

  Antecedents    Consequent   Supp.   Conf.
  2879           2749         0.63    0.97
  7049           2749         0.61    0.97
  2669,2879      2749         0.60    0.97
  2669,6099      2749         0.61    0.90
  2669           2749         0.67    0.90
  6099           2749         0.63    0.87
  2649           2749         0.61    0.86
  7009           2749         0.61    0.84

2.5   Other factors

In addition to delay propagation there might be other factors that contribute to the delay
of trains, like weather and temporality. In this section some of these factors are analyzed
with possible explanations and conclusions.

Definition 4. Average delay. The average of the trains' average delays along their route on a
given day.

Definition 5. Average minimum (maximum) delay. The average of the trains' minimum (maximum)
delays along their route on a given day.

Month. Summer is the only season with a clear peak in the average delays, followed by autumn
(Figure 2). This effect might be caused by maintenance works, but there is no available
historical maintenance data to confirm this theory. It is likely not caused by the number of
passengers, since there is no school in Hungary during the summer, which significantly
reduces the number of passengers in the rush hours. Higher temperatures also seem to have an
effect on the delays.

Figure 2: Average delays grouped by month

Day of the week. Looking at the average of all trains, Monday and Friday have the largest
delays on average, while weekends have somewhat lower average delays (Figure 3). It would be
useful to have a dataset on the number of passengers, because the larger number of
passengers may cause delay peaks at the beginning and at the end of the workweek. The number
of passengers may also correlate with the lower delays during the weekend, but we suspect
that these are more likely caused by the sparser schedule, which effectively reduces delay
propagation.

Figure 3: Average delays grouped by day of the week

Holidays. Holidays do not seem to have a significant effect on the average delays (Figure 4).
The peaks were mostly predictable from the previous observations: Pentecost Monday and Saint
Stephen's Day have slightly higher average delays, but they are both in the summer, which has
the highest average delay among the seasons, and Good Friday is a Friday, which has an
above-average delay compared to the other days of the week. In conclusion, events do not seem
to cause extraordinary delays, because they can be planned ahead.

Figure 4: Average delays grouped by holidays in 2019

Time of the day. In the chart of delays grouped by hour, the rush hours clearly stand out as
delay peaks (Figure 5). It can be concluded that as the number of passengers and the density
of the schedule increase, the average delay increases as well. According to our analysis,
most relations follow this pattern.
   Two other peaks can be observed between 23:00 and 01:00. Understanding them requires
domain knowledge. The reason behind these peaks is that only a very small number of trains
travel at that time in the country (sometimes fewer than 10), and when some of them are
delayed, the impact on the average is huge.

Figure 5: Average delays grouped by hour

Temperature. The chart shows that the average delays increase as the temperature approaches
either -10 or +30 degrees Celsius (Figure 6). Due to the distribution of trains, the ends of
the chart are noisy, but the trendline can be easily seen.

Figure 6: Average delays grouped by temperature
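
The grouped averages discussed in this section can be sketched in a few lines of Python. This
is only an illustration under assumed inputs: the record layout (hour, temp_c, delay), the
5-degree temperature buckets and the sample values are hypothetical and do not come from the
dataset. The sketch reports the group size next to each mean, which makes the small-sample
effect described for the late-night hours visible.

```python
from collections import defaultdict
from statistics import mean


def average_delay_by(records, key):
    """Group delays by key(record); return {group: (mean delay, count)}."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record["delay"])
    return {g: (mean(d), len(d)) for g, d in sorted(groups.items())}


def temperature_bucket(record, width=5):
    """Lower edge of the width-degree Celsius bucket containing the reading."""
    return int(record["temp_c"] // width) * width


# Hypothetical records: hour of day, temperature (Celsius), delay (minutes).
records = [
    {"hour": 8, "temp_c": 12.0, "delay": 4},
    {"hour": 8, "temp_c": 14.5, "delay": 6},
    {"hour": 8, "temp_c": 31.0, "delay": 8},
    {"hour": 8, "temp_c": 9.0, "delay": 2},
    {"hour": 23, "temp_c": -3.0, "delay": 0},
    {"hour": 23, "temp_c": -1.0, "delay": 30},
]

by_hour = average_delay_by(records, key=lambda r: r["hour"])
by_temp = average_delay_by(records, key=temperature_bucket)
```

In this toy input the 23:00 group contains only two records, so a single 30-minute delay
lifts its average well above the rush-hour group, mirroring the explanation given for the
night-time peaks.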
Weather type. As far as the type of weather is concerned, precipitation usually increases the
delays (Figure 7). The most troublesome types are related to snow in the winter and
unexpected thunderstorms in the summer. Nearly all rain types have higher average delays than
clear sky.

Figure 7: Average delays grouped by weather type

2.6   Delay heatmap

An interesting visualization method is to generate a heatmap of delay changes (Figure 8). It
allows us to see where the delay accumulates during the trip, and these peaks might suggest
track problems, busy stations, or other hidden issues that we are not aware of.

Figure 8: Delay heatmap of train 2749 between Monor and Budapest-Nyugati

   In this figure, red means a larger average increase of delay and blue means a lower
average increase of delay. The green patches represent moderate average increases of delay.

3     Departure delay prediction

Traveling in an unreliable environment on a daily basis can be nerve-wracking. The official
mobile application of Hungarian State Railways has a delay forecasting mechanism, but it is
quite limited in its current form. When a train is already moving, the schedule is
automatically adjusted by its current delay. This estimation is not very reliable in the long
term, but it can give an idea about the scale of the expected delay under the current
circumstances.
   Another problem is that the forecast lacks a very important indicator, as it cannot tell
whether the train is going to depart late or on time. The forecast is only available after
the train has already departed. The two main goals are to find a method to predict the
departure delay and to improve the long-term reliability of the delay forecast mechanism
already present in the application.
   Departure delay prediction is a special problem, because we do not have any information
yet about the train we are interested in. Whether the train is going to depart late or on
time can only be predicted based on its observable environment. The input for the departure
delay prediction problem is a set of snapshots taken at the scheduled departure time, for
which the target value is the delay of the train on its first appearance on the day.

3.1   Association rules

The first idea is to apply the previously mined association rules (Table 2) and see whether
we can predict if a train is going to be delayed upon departure.
   The algorithm (Algorithm 3) of the model is very simple: it only requires a set of
association rules extracted based on the input for departure delays. A train is considered to
be delayed if it is the consequent of a rule for which all the antecedent trains are delayed
in a given snapshot of the map. The hyper-parameters of the model are the minimum support and
minimum confidence of the rules.

Algorithm 3 Algorithm for predicting the departure delay (association rules)
Funct predictDepartureDelayAr(snapshot, rules)
  delayedTrains ← getDelayedTrains(snapshot)
  for rule ∈ rules do
      if rule.getAntecedents() ⊆ delayedTrains then
          return True
      end if
  end for
  return False

   The results (Table 4) are encouraging, but our analysis showed that there is simply not
enough information for the association rule mining algorithm in its current form, which
causes underfitting. Trains are categorized as either delayed or on time, which cannot
properly handle the following situation: when an antecedent train is delayed by more than a
given threshold (for example 10 minutes), the consequent train can depart on time, as the two
trains are far away from each other and a slot becomes available for the consequent train.
Otherwise, the delay of the antecedent train propagates to the consequent train.
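
Algorithm 3 can be sketched in Python as follows. This is a minimal illustration rather than
the authors' code: representing a snapshot as a mapping from train number to delay in
minutes, the Rule class and filtering the rules by minimum support and confidence at
prediction time are assumptions made for the example. The two sample rules are taken from
Table 2 and the 5-minute threshold from Definition 3.

```python
from dataclasses import dataclass

DELAY_THRESHOLD_MIN = 5  # Definition 3: delayed means a delay of at least 5 minutes


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one association rule."""
    antecedents: frozenset
    consequent: str
    support: float
    confidence: float


def get_delayed_trains(snapshot):
    """Return the train numbers whose delay reaches the threshold."""
    return {train for train, delay in snapshot.items()
            if delay >= DELAY_THRESHOLD_MIN}


def predict_departure_delay_ar(snapshot, rules, min_support=0.0, min_confidence=0.0):
    """Predict a late departure if any sufficiently strong rule has all of
    its antecedent trains delayed in the snapshot."""
    delayed = get_delayed_trains(snapshot)
    for rule in rules:
        if rule.support < min_support or rule.confidence < min_confidence:
            continue  # assumed filtering step; the paper tunes these at mining time
        if rule.antecedents.issubset(delayed):
            return True
    return False


# Two rules for train 2749 from Table 2.
rules_2749 = [
    Rule(frozenset({"2879", "2739"}), "2749", 0.21, 0.81),
    Rule(frozenset({"2879", "2740", "2739"}), "2749", 0.12, 0.93),
]
# Assumed snapshot format: train number -> current delay in minutes.
snapshot = {"2879": 7, "2739": 5, "2740": 0, "700": 12}

late = predict_departure_delay_ar(snapshot, rules_2749)
late_strict = predict_departure_delay_ar(snapshot, rules_2749, min_confidence=0.9)
```

With the default thresholds the first rule fires, because trains 2879 and 2739 are both
delayed. Requiring a confidence of at least 0.9 leaves only the second rule, which does not
fire because train 2740 is on time. The sketch also shows the limitation discussed above:
the 12-minute delay of train 700 and the 5-minute delay of train 2739 are treated
identically once they pass the threshold.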
Table 4: Departure delay prediction metrics for train 2749 using the association rules on
the test set

              Precision   Recall   F1-score   Support
  On time     0.85        0.85     0.85       194
  Late        0.72        0.71     0.72       105
  Accuracy                         0.80       299

3.2   Train embedding

In order to solve the underfitting problem that affects the association and sequential rules,
a different approach is necessary. An indicator of whether a train is delayed or not is not
enough; the exact numeric values are needed instead. It is also important to have an input
with a fixed length for the algorithms.
   The solution (Algorithm 4, Table 5) is that each conflicting train that travels when a
specific train departs is considered as a unique feature with its current delay. If a
previously encountered conflicting train is not present at the time, its delay becomes 0, as
it likely will not affect the delay-chain. First, the algorithm is called with an empty state
vector and a subset of trains from a snapshot. If the identifier of a train is not contained
in the state vector then it is appended to it. For each train identifier in the state we
determine whether it is present in the current input or not and we append its current delay
to the embedding. If a train is not found in the input, its current delay is considered to be
0. The returned state can then be re-used to embed another set of trains. Before training,
the embeddings can be safely padded with zeros to have a common length.

Table 5: An example train embedding

  406   472   580   609   619   709   2617   ...
  3     18    1     3     3     0     2      ...
  3     20    2     3     4     0     1      ...

Algorithm 4 Algorithm for creating train embeddings
Funct embed(state, trains)
  embedding ← []
  for train ∈ trains do
      if train.getId() ∉ state then
          state.append(train.getId())
      end if
  end for
  for trainId ∈ state do
      if trainId ∈ trains then
          embedding.append(trains.getDelay(trainId))
      else
          embedding.append(0)
      end if
  end for
  return state, embedding

   The embeddings usually have high dimensions, but they can be visualized using
dimensionality reduction methods (Figure 9). In the figure, trains that depart on time are
denoted by green squares and trains that depart late are marked as red squares.

Figure 9: Embeddings for train 2749 visualized using 3-dimensional PCA

3.3   Support-vector machine

A support-vector machine [10] tries to find a hyperplane in an n-dimensional space which
separates, and therefore classifies, the data points. This hyperplane should have maximum
margin, which means it should have maximal distance between the two classes, so that future
data points can be classified more reliably.
   For each train, a unique model is trained. In our example, the space is 45-dimensional and
the hyperplane separates the on-time departures from the late ones. For the SVM experiment
we implemented a grid-search (Table 6) and executed it on the training set with 5-fold
cross-validation with the following parameters:

Table 6: Parameter grid for the SVM experiment

  Parameter name           Possible values
  Regul. parameter (C)     0.001, 0.01, 0.1, 1, 10
  Kernel                   linear, poly, rbf, sigmoid
  Kernel coeff. (gamma)    0.001, 0.01, 0.1, 1
  Indep. term (coef0)      0.0, 0.001, 0.01, 0.1, 1, 10

   The grid-search optimizes the hyper-parameters of the SVM model on the training dataset,
which results in better
metrics during the evaluation phase. The best parameters were C=1, coef0=10, gamma=0.01 and
kernel=poly (Table 7).

Table 7: SVM prediction metrics for train 2749

              Precision   Recall   F1-score   Support
  On time     0.92        0.97     0.94       194
  Late        0.94        0.84     0.88       105
  Accuracy                         0.92       299

3.4   Random Forest Classifier

A random forest [11] is an ensemble model which fits multiple decision trees and outputs the
mode of their predictions. Each decision tree splits the dataset a variable number of times
based on the delays of the conflicting trains and outputs whether the train is going to
depart late or not.
   The training methodology was similar to that of the SVM: we ran a grid-search (Table 8)
with 5-fold cross-validation on the training set with the following parameters:

Table 8: Parameter grid for the Random Forest Classifier

  Parameter name       Possible values
  n_estimators         200, 600, 1200, 1800
  max_depth            10, 50, 100, unlimited
  min_samples_split    2, 5, 10
  min_samples_leaf     1, 2, 4
  bootstrap            True, False

   The results (Table 9) are slightly better than the SVM's, and the trained model also helps
with the explainability

deep neural networks. As mentioned before, the delay estimation in the official mobile
application is not very reliable in the long term, therefore it would be beneficial to find
a method for predicting the real schedule. The reason behind choosing deep neural networks
is that state-of-the-art multivariate time series forecasting methods tend to use these
technologies [12].
   The general delay prediction task will be formulated as a regression problem instead of
classification, because it is more informative for the end-user and there are significantly
more data available when we consider snapshots after departure as well.

4.1   Input

For each day the snapshots containing a given train t are collected in ascending order by
their timestamp (Table 10). It must be noted that the number of snapshots can differ from day
to day, because a delayed train travels for a longer period of time. In this section a daily
collection of ordered snapshots for a given train t will be referred to as the input.

Table 10: A subset of the input for train 2749 on a given day

  Date              Train   Delay   Lat     Lon     ...
  19-06-16 06:39    2749    0       47.35   19.43   ...
  19-06-16 06:40    2749    0       47.35   19.43   ...
  19-06-16 06:40    2749    0       47.35   19.42   ...
  19-06-16 06:41    2749    1       47.36   19.42   ...

   Based on the data, there are two kinds of preprocessed inputs for each day: a vector of
auxiliary features and a matrix of time-series features (Table 11).
of the delay-chains, which is useful to prevent them
                                                                  The auxiliary features include an indicator whether the
from occurring in the future. The best parameters were
                                                               given day is a weekday, an indicator whether the given
bootstrap=true, max_depth=10, min_samples_leaf=2,
                                                               train departs during the rush hours and the one-hot en-
min_samples_split=10 and n_estimators=200.
                                                               coded representation of the month upon departure.
                                                                  For the time-series features a similar train embedding
Table 9: Random Forest Classifier prediction metrics for       is used as before, the only difference is that this time the
train 2749                                                     train we are interested in is also included in the embed-
                                                               ding. The embedding has a much higher dimensionality,
               Precision   Recall    F1-score    Support       because conflicting trains are embedded over the entire
    On time      0.97       1.00       0.98        194         route of the train. In order to keep the input dimension
    Late         1.00       0.90       0.95        105         manageable, only the characteristics of the train embed-
    Accuracy                           0.97        299         dings are used.
                                                                  An entry in the time-series input contains the current
                                                               delay of the train, the classified weather and the mean,
                                                               standard deviation, minimum and maximum values of the
4     Generic delay prediction                                 delays of the conflicting trains.
                                                                  Let’s suppose that train t traveled during k days over the
Based on the research experiences with departure delays        interval covered by the dataset and the number of time-
it is time to solve the prediction problem in general with     series features for all of its snapshots are m. Thus the 3D
 Table 11: Preprocessed time-series input on a given day                    Figure 10: Comparison of different 10-minute prediction
                                                                            models for train 2749 on a given day
      Delay      Weather         Mean         STD        Min          Max
      0          3               1.20         2.69       0            13
      0          3               1.25         2.65       0            13
      0          3               1.32         2.62       0            13
      1          3               1.35         2.60       0            13


input of the network has dimensions (k, li , m) where li is
the number of snapshots on the ith day.
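The preprocessing described in Section 4.1 can be sketched as follows. This is
a minimal illustration under stated assumptions, not the authors' code: the
function names, the rush-hour windows in RUSH_HOURS and the use of the
population standard deviation are hypothetical choices that this part of the
paper does not specify, and the conflicting delays in the usage lines are
made-up values.

```python
from datetime import datetime
from statistics import mean, pstdev
from typing import List, Sequence

# Hypothetical rush-hour windows as [start_hour, end_hour); not taken from the paper.
RUSH_HOURS = ((6, 9), (15, 18))


def auxiliary_features(departure: datetime) -> List[int]:
    """Weekday flag, rush-hour flag and a 12-element one-hot month vector."""
    is_weekday = int(departure.weekday() < 5)  # Monday=0 ... Sunday=6
    is_rush_hour = int(any(lo <= departure.hour < hi for lo, hi in RUSH_HOURS))
    month_one_hot = [int(month == departure.month) for month in range(1, 13)]
    return [is_weekday, is_rush_hour] + month_one_hot


def time_series_entry(own_delay: int, weather_class: int,
                      conflicting_delays: Sequence[float]) -> List[float]:
    """One row in the layout of Table 11: Delay, Weather, Mean, STD, Min, Max.

    conflicting_delays must be non-empty.
    """
    return [
        own_delay,
        weather_class,
        mean(conflicting_delays),
        pstdev(conflicting_delays),  # population std; an assumption of this sketch
        min(conflicting_delays),
        max(conflicting_delays),
    ]


# Departure time of the first snapshot in Table 10 (16 June 2019, a Sunday).
aux = auxiliary_features(datetime(2019, 6, 16, 6, 39))
# Made-up delays of four conflicting trains, for illustration only.
row = time_series_entry(0, 3, [0, 0, 1, 13])
```

Stacking one such row per snapshot gives the (l_i, m) matrix of a single day
with m = 6, while the auxiliary vector stays fixed for the whole day.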


4.2   Output

For each entry in the time-series input the corresponding output is the
vector of true delays n minutes into the future, where n ∈ {5, 10, 20, 30}.
Let the train we are interested in be t and let d_t(X_i) denote its delay at
snapshot X_i. For each snapshot X_i the output component y_{ij} equals the
delay of train t at snapshot X_{i+n_j} (j = 1..4), where n_j is the j-th
element of (5, 10, 20, 30). For out-of-bounds indices the delay of t at the
last snapshot of the day is used instead.

      y_i^{true} = [d_t(X_{i+5}), d_t(X_{i+10}), d_t(X_{i+20}), d_t(X_{i+30})]

Table 12: Evaluation of the generic prediction model on train 2749

                        n = 5    n = 10    n = 20    n = 30
      LSTM MSE          1.03     2.01      4.23      7.07
      LSTM R²           0.97     0.95      0.91      0.84
      Official MSE      1.28     3.03      7.63      12.92
      Official R²       0.97     0.93      0.83      0.72
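The target definition of Section 4.2 can be sketched as follows. This is a
minimal illustration, assuming that the delays of one day are available as a
plain list ordered by snapshot and that the offsets are applied to snapshot
indices exactly as in the formula for the true output; the name build_targets
is hypothetical.

```python
from typing import List, Sequence

HORIZONS = (5, 10, 20, 30)  # the offsets n_j of Section 4.2


def build_targets(delays: Sequence[int],
                  horizons: Sequence[int] = HORIZONS) -> List[List[int]]:
    """Return one target vector per snapshot.

    Component j of row i is the delay at snapshot i + n_j; indices past the
    end of the day are clamped to the last snapshot.
    """
    last = len(delays) - 1
    return [[delays[min(i + n, last)] for n in horizons]
            for i in range(len(delays))]


# Toy day with 12 snapshots whose delay equals the snapshot index.
targets = build_targets(list(range(12)))
```

With this toy input the first row is [5, 10, 11, 11]: the 20 and 30 step
offsets fall outside the day and are replaced by the last observed delay.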


Official model. The main goal of the generic delay prediction task is to
obtain a more accurate forecast than the one currently available in the
official mobile application. In order to have a meaningful comparison, we
have to recreate the model of the official forecast method and calculate its
loss and other metrics alongside our model. Fortunately, the official model
is not complicated: it simply substitutes the current delay for all future
horizons.

      y_i^{official} = [d_t(X_i), d_t(X_i), d_t(X_i), d_t(X_i)]

LSTM model. Our model has to support both the auxiliary and the time-series
features, therefore a multi-input network is necessary. This problem is
similar to the image captioning task, where an image is chosen as an
auxiliary feature and the words of the generated caption form a sequence
[13, 14]. Because the number of snapshots varies from day to day, a recurrent
neural network (RNN) is needed, which can also handle the temporal nature of
the data. The output of an RNN depends not only on the current input but on
the previous outputs as well. Its memory is very useful for the prediction of
the delays, because it can learn complicated delay patterns.
   The RNN can also have a preset initial state, where we can store the
representation of the auxiliary features, and the resulting network models
P(X_{i+1} | X_{0:i}, auxiliary) [15]. This auxiliary condition allows us to
have a single network for all trains if we include a train identifier, but
due to resource constraints this was not used during the research.


4.3    Evaluation

The first evaluation was performed on train 2749. Out of the 236 available
occurrences only the 231 were used in which the maximum delay along the route
was less than 30 minutes. There is simply not enough data for the outliers,
where the delay may occasionally exceed 250 minutes.
   The following metrics (Figure 10, Table 12) were calculated using 3-fold
cross-validation.
   For this train the advantage of the LSTM model over the official model
grows as n increases.
   On average, the proposed LSTM model outperforms the official model, but
only when significant outliers are omitted from the dataset. The proposed
model cannot yet reliably learn extreme delays because they are rare, whereas
the official model handles them easily because it simply carries the current
delay forward. This is not a huge issue, because a train that is delayed that
much usually skips its trip on that day entirely and passengers are informed
on multiple platforms.


4.4    Conclusion

The analysis and the machine learning models presented in this paper could be
useful for the betterment of railway services in Hungary and they may also
increase the satisfaction of the passengers. Hungarian State Railways also
expressed their interest in the continuation of the research project in
cooperation with our university.
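The scoring of the official model against the true delays, as reported in
Table 12, can be outlined as follows. This is a sketch under stated
assumptions rather than the evaluation code of the paper: it scores a single
day instead of running 3-fold cross-validation, the function names are
hypothetical, and the toy delays and shortened horizons in the usage line are
made up.

```python
from typing import Dict, List, Sequence, Tuple

HORIZONS = (5, 10, 20, 30)


def true_delays(delays: Sequence[float], n: int) -> List[float]:
    """Delay n snapshots ahead, clamped to the last snapshot of the day."""
    last = len(delays) - 1
    return [delays[min(i + n, last)] for i in range(len(delays))]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; NaN when y_true is constant."""
    avg = sum(y_true) / len(y_true)
    ss_tot = sum((t - avg) ** 2 for t in y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot


def evaluate_official(delays: Sequence[float],
                      horizons: Sequence[int] = HORIZONS
                      ) -> Dict[int, Tuple[float, float]]:
    """Score the carry-forward baseline: every forecast is the current delay."""
    baseline = list(delays)
    return {n: (mse(true_delays(delays, n), baseline),
                r2(true_delays(delays, n), baseline))
            for n in horizons}


# Toy day in which the delay grows by one minute per snapshot.
scores = evaluate_official([0, 1, 2, 3, 4, 5], horizons=(1, 2))
```

Feeding the predictions of a trained model into the same mse and r2 functions
gives figures that can be compared with the baseline horizon by horizon.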
References

 [1] Roland Krisztián Szabó. Smart alarm clock based on traffic and weather
     information, 2018.
 [2] MÁV Szolgáltató Központ Zrt. MÁV-START térkép, 2020. [Online; accessed
     16-January-2020].
 [3] Openweather Ltd. OpenWeatherMap, 2020. [Online; accessed
     16-January-2020].
 [4] GeoNames Team. GeoNames dump (Hungary), 2020. [Online; accessed
     16-January-2020].
 [5] Jon Louis Bentley. Multidimensional binary search trees used for
     associative searching. Commun. ACM, 18(9):509–517, September 1975.
 [6] Zhongyi Ni, Lijun Xie, Tian Xie, Binhua Shi, and Yao Zheng. Incremental
     road network generation based on vehicle trajectories. ISPRS
     International Journal of Geo-Information, 7(10), 2018.
 [7] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining
     association rules in large databases. In Proceedings of the 20th
     International Conference on Very Large Data Bases, VLDB ’94, pages
     487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.
 [8] Philippe Fournier-Viger. SPMF open-source data mining library, 2020.
     [Online; accessed 17-March-2020].
 [9] Philippe Fournier-Viger, Roger Nkambou, and Vincent Shin-Mu Tseng.
     Rulegrowth: Mining sequential rules common to several sequences by
     pattern-growth. In Proceedings of the 2011 ACM Symposium on Applied
     Computing, SAC ’11, pages 956–961, New York, NY, USA, 2011. Association
     for Computing Machinery.
[10] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach.
     Learn., 20(3):273–297, September 1995.
[11] Tin Kam Ho. Random decision forests. In Proceedings of the Third
     International Conference on Document Analysis and Recognition
     (Volume 1) - Volume 1, ICDAR ’95, page 278, USA, 1995. IEEE Computer
     Society.
[12] Papers with Code. Multivariate Time Series Forecasting, 2020. [Online;
     accessed 08-May-2020].
[13] Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for
     generating image descriptions. CoRR, abs/1412.2306, 2014.
[14] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show
     and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[15] Philippe Rémy. Conditional RNN, 2019. [Online; accessed 17-March-2020].

EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle
Control Technologies – The Project is supported by the Hungarian Government
and co-financed by the European Social Fund.

</pre>