=Paper=
{{Paper
|id=Vol-2578/BMDA8
|storemode=property
|title=MOBI-AID: A Big Data Platform for Real-Time Analysis of On Board Unit Data
|pdfUrl=https://ceur-ws.org/Vol-2578/BMDA8.pdf
|volume=Vol-2578
|authors=Arnau Dillen,Giovanni Buroni,Yann-Aël Le Borgne,Karl Determe,Gianluca Bontempi
|dblpUrl=https://dblp.org/rec/conf/edbt/DillenBBDB20
}}
==MOBI-AID: A Big Data Platform for Real-Time Analysis of On Board Unit Data==
<pdf width="1500px">https://ceur-ws.org/Vol-2578/BMDA8.pdf</pdf>
<pre>
 MOBI-AID: A Big Data Platform for Real-Time Analysis of On
                     Board Unit Data
                   Arnau Dillen                                          Giovanni Buroni                                    Yann-Aël Le Borgne
    Machine Learning Group, Université                       Machine Learning Group, Université                    Machine Learning Group, Université
            Libre de Bruxelles                                       Libre de Bruxelles                                    Libre de Bruxelles
            Brussels, Belgium                                        Brussels, Belgium                                     Brussels, Belgium
         arnau.dillen@ulb.ac.be                                 giovanni.buroni@ulb.ac.be

                                               Karl Determe                                    Gianluca Bontempi
                                              Brussels Mobility                        Machine Learning Group, Université
                                              Brussels, Belgium                                Libre de Bruxelles
                                                                                               Brussels, Belgium
                                                                                               gbonte@ulb.ac.be

ABSTRACT                                                                                and communication technologies, more traffic data, especially
Every day large amounts of goods are transported by heavy-                              moving sensors data, are collected and made openly available by
goods vehicles over the road network. Being able to monitor and                         both public and private companies, allowing the development of
analyse heavy-goods vehicle traffic is essential to define poli-                        data-intensive approaches for traffic analysis.
cies able to minimize the impact of negative effects. However,                             In Belgium, traffic data is gathered for heavy-goods vehicles
this requires dealing with large amounts of data and often a                            (HGV) by Bruxelles Mobilité 1 , the public administration responsi-
dense road network, especially in an urban setting. This paper                          ble for equipment and infrastructure related to mobility issues
introduces a platform that makes use of state-of-the-art big data                       in the Brussels Capital Region (BCR). They continuously receive
technologies to process data pertaining to the positions and prop-                      data on HGV positions, which is normally used to charge HGVs
erties of heavy-goods vehicles. This platform aims to provide                           for kilometers driven on toll roads in Belgium. Every day, an
policy-makers and other stakeholders with the tools that allow                          average of 19 Gigabytes of data are therefore accumulated and
large-scale analysis of heavy-goods vehicle data in a near real-                        need to be processed in a timely manner, in order to monitor
time fashion. Additionally, the platform allows for forecasting of                      HGV traffic in Brussels.
future traffic conditions based on historical data.                                        Bruxelles Mobilité currently stores this data in a centralized
                                                                                        PostgreSQL [12] database which is set up to handle geographical
                                                                                        data through the PostGIS [14] extension (see figure 1). However,
1    INTRODUCTION                                                                       this solution is unable to cope with the massive amounts of data
Road freight transport is an essential aspect of any country’s                          that are ingested on a daily basis. While it would be possible
infrastructure policy due to its economic, environmental and                            to optimize queries and create database indices to minimize the
social impact. Among other issues, freight vehicles are respon-                         time it takes to retrieve a solution to a query, the main issue with
sible for a large part of the congestion on urban road networks                         a classical relational database system lies in the constant updates
(economic impact), pollutant emissions such as carbon dioxide                           and additions of rows. Even the most performant database on the
(environmental impact) and physical consequences of pollutant                           fastest hardware will become a bottleneck. Additionally, reading
emissions on public health (social impact) [1].                                         and writing this much data from and to a regular file system
   Urban planners and policy-makers therefore demand Intelligent                        is too slow.
Transportation Systems (ITS) which are able to foresee
the mobility behavior and support the definition of appropriate
policies [26, 29]. Tools such as accurate traffic forecasting models
[30], advanced mobility indicators of freight transport [15] and
more general mobility models [3] can assist policy makers in
making appropriate decisions.
   Traffic on a road network exhibits features which are com-
mon to most complex systems: self-organization, emergence of
transient space-time patterns based on local and global feedback
loops, which makes analysis of these types of data difficult. Due
to this, few studies [3, 18, 29, 31] address a complete transporta-
tion network including both freeways and urban contexts or
limit themselves to offline analysis [15, 25]. One of the main
reasons is the scarce availability of data gathered from point de-
tectors or interval detectors and the lack of methods able to tackle
the traffic prediction problem at a larger scale [2, 29]. However,
thanks to the more ubiquitous availability of new information
                                                                                        Figure 1: Current Brussels mobility architecture for ingesting
                                                                                        and storing HGV data.

Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceedings
of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
                                                                                        1 https://mobilite-mobiliteit.brussels/en
    The Machine Learning Group (MLG) of the Université Libre          amounts of data. Finally, we present a prototype web interface
de Bruxelles (ULB) collaborates with Bruxelles Mobilité to design     that would be used by policy makers and data scientists to get
a big data architecture able to provide near real-time processing     insights on traffic conditions, from the processed data. This would
and querying of the incoming data. For example, a query that          assist policy makers in making informed decisions regarding
retrieves the number of trucks on each street is required to make     urban planning with relation to road infrastructure and freight
forecasts on future traffic conditions.                               transport.
    An initial version of the architecture was implemented in
[5] and has been collecting data on the MLG cluster for some          2.1     Viapass and On Board Unit (OBU) Data
time now. We were able to successfully collect and process large      As of 1 April 2016, heavy-goods vehicles with a Maximum
amounts of data thanks to the joint use of an Apache Hadoop cluster   Authorized Mass (MAM) exceeding 3.5 tonnes must pay a kilometer
[21] and Apache Spark [32]. However, big data technologies are        charge for driving on certain toll roads in Belgium.
evolving fast and an appropriate interface to visualize               Any vehicle that is not exempt from the toll must
and analyze traffic related data is necessary. Data aggregation is    have an On Board Unit (OBU) installed. The public organization
necessary to get a high level view and loading large amounts of       in charge of supervising the kilometer charge is called Viapass2 .
data into the interface client is slow and impedes responsiveness     With the aid of GPS/GNSS satellite technology and mobile data,
of the interface. These are important aspects to take into account    the OBU records the distance that a HGV travels on Belgian public
when deciding what visualizations the interface should provide        roads. Mobile wireless technology is used to send the number
and which data should be loaded to the client.                        of kilometers charged to the Viapass data center, after which an
    The aim of this research is to perform network-scale analysis     invoice is issued to the owner of the vehicle.
and forecasting in near real-time. The presented architecture            Because of their evident value as a mobility indicator, the OBU
makes it possible to produce real-time forecasts from incoming        data are also made available to several mobility agencies,
data using both well-established [29, 30] and state-of-the-art        including Bruxelles Mobilité, which uses this data to analyze
[2, 18] methods on a network-wide scale. It also enables analyses     freight traffic in the Brussels Capital Region (BCR). The BCR is a
that were previously computed offline, such as identifying            region separate from the Flanders region, within which it is
important points of congestion caused by HGV traffic in changing      geographically located, and consists of 19 administrative
conditions, to be performed in real-time. In addition to the          districts named communes. These districts will be referred to as
previously mentioned models, a fair amount of related work            such for the remainder of this article. The models and analyses
proposes forecasting models that could be candidates for              in this paper use OBU data from HGVs within the BCR and its
real-time forecasting on road networks and their different            communes.
sections [13, 23, 28]. A large corpus of literature discusses            On average more than nine thousand HGVs are recorded every
this issue.                                                           working day in the larger Brussels Metropolitan Area [4]. Each
                                                                      OBU device sends an update to the server approximately every
    The main contributions of this paper are twofold. In a first      30 seconds. An OBU record contains an anonymous identifier,
place it introduces an extension to the big data architecture that    which is reset every day at 4 a.m., the timestamp at which the
was implemented in [5] which enables near real-time process-          position was recorded, the GPS coordinates (latitude, longitude),
ing of the incoming data. Secondly, it proposes a design for a        the speed (km/h) and the direction (degrees). Additionally, the
dashboard that enables analyses and visualization of data, which      data includes vehicle characteristics such as the weight category
is implemented as a web interface. Together, these make up a          (MAM), country code and European emission standards classifi-
platform that provides the tools that are necessary to Bruxelles      cation of the engine (EURO value). This results in an average of
Mobilité to monitor the traffic of HGVs in Brussels and provide       19GB of data incoming on a daily basis and several terabytes of
insights that should be useful in establishing future policies        data being generated every year.
related to transportation of goods within the BCR. The platform
was named the MOBIlity Advanced Indicators Dashboard (MOBI-AID),      2.2     Design of The Big Data Architecture
after the project that supports this research. Additionally,          Handling such large amounts of data requires an architecture that
the work done in this research could also serve as an example for     can process the incoming data fast enough and store processed
other cities and potentially whole countries to deploy their own      data in an efficient manner. A well-known architecture that meets
platforms to assist in decision making on policies with regards       these requirements is the Lambda architecture [19, 20, 27] which
to road freight transport.                                            has proven itself in several settings [10, 16] and is used in
                                                                      practice by Twitter among others [17]. An overview of our current
2   METHODS AND IMPLEMENTATION                                        implementation of the architecture can be found in figure 2.
The data that are gathered concern all HGVs that are currently           With this architecture, three separate layers can be distin-
present in the Belgian territory. At this time we are only inter-     guished, which each handle different aspects of the platform. The
ested in HGVs that are present in the Brussels Capital Region,        speed layer takes care of processing incoming data in a timely
which still concerns thousands of HGVs on a daily basis. To get       manner and sends the processed data to the serving layer for
useful insights from this data, a platform is necessary that can      visualization and analysis. This layer handles the real-time aspect
handle such large amounts of data and present forecasts or the        of the platform. The batch layer stores immutable data (i.e. ob-
results of analyses in a meaningful way. For this purpose, next to    servations) and processes it for later user queries on historical
the data, two essential components were identified to implement       data. The serving layer consists of multiple views that are each
the envisioned platform.                                              used to fulfill a specific type of user queries. For example, data
   The remainder of this section is structured as follows. Firstly,   that are stored in a specific format which is used for a specific
we will describe the gathered data. Secondly, we discuss the          visualization, or predetermined queries that retrieve data that
architecture that allows processing and storage of such large         2 https://www.viapass.be/
                                                                         compression and fast query access. To process the raw CSV files,
                                                                         Apache Spark [9] is used to deduplicate the observations and
                                                                         store them in HDFS as Parquet files. HDFS takes care of
                                                                         distributing file data over the different nodes of the cluster.
                                                                         With this approach, these operations can be processed in parallel
                                                                         and distributed over multiple compute nodes thanks to the
                                                                         integration of Hadoop and Spark. Using Spark we can efficiently
                                                                         run SQL queries and advanced analytics on the data by
                                                                         parallelizing a large part of the computations. An overview of
                                                                         this process is shown in figure 3.

[Figure 2 diagram. Labels: Incoming data; Speed layer (Data stream,
Processed data); Batch layer (Immutable data, Long term storage);
Serving layer (Real-time View, Historical View).]

         Figure 2: Overview of the Lambda architecture.


are required for a specific analysis. This layer can also merge the
information that comes from both speed and batch layers, such
as discrepancies between the real-time traffic conditions and the
typical case for example. In our current implementation there
are two views available. The real-time view provides data that
comes from the speed layer directly. The historical view uses the
data from the batch layer to query for events and states that have
been observed in the past.
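
As a minimal sketch of how the serving layer could merge the two views,
the following Python snippet computes the discrepancy between real-time
HGV counts and typical counts per street segment. The function name, the
segment identifiers and the dictionary-based views are assumptions made
for illustration; they are not taken from the deployed platform.

```python
def view_discrepancy(realtime_counts, typical_counts):
    """Return real-time minus typical HGV count for every street segment.

    Both arguments are hypothetical views: dictionaries that map a street
    segment ID to a count. Segments missing from one view count as zero.
    """
    segments = set(realtime_counts) | set(typical_counts)
    return {
        seg: realtime_counts.get(seg, 0) - typical_counts.get(seg, 0.0)
        for seg in segments
    }


# Hypothetical real-time view (speed layer) and historical view (batch layer).
realtime = {"seg-1": 12, "seg-2": 3}
typical = {"seg-1": 8.0, "seg-2": 5.0, "seg-3": 1.0}
discrepancy = view_discrepancy(realtime, typical)
```

A positive value indicates more HGVs than usual on a segment, a negative
value fewer.
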
    The initial implementation of this architecture was deployed
on an Apache Hadoop [21] cluster, which is an open-source frame-
work for distributed computing that is widely used for big data
processing [7, 24, 27, 32]. The data are collected with a Python
script that queries the Viapass servers for new data at a fixed time
interval, which is currently set to two minutes. The script loads           Figure 3: Data retrieval and the batch layer pipeline.
the data in a GeoPandas [11] DataFrame (a data structure with
named columns and index-based rows), which is an extension                  In experiments with an alternative implementation of the
of the well-known Pandas library for the Python programming              batch layer the CSV data is read into a PostGIS database that
language, to support geometric data types and functions. The             stores the daily route of a HGV with a given ID. The route is
DataFrame contains all observations that were collected by Via-          stored as a LineString object (i.e. a sequence of points) con-
pass since the last data request.                                        structed from all available observations for a given HGV ID on
    Observations consist of a HGV’s current position as a geome-         a given day. In the same database information on Brussels com-
try point, which is represented by a given latitude and longitude,       munes is stored, both geographical (e.g. commune boundaries)
together with the unique ID that was assigned to the HGV for             and non-geographical (e.g. name, population, etc.). Using the ge-
that day. Additionally, an observation contains a timestamp of           ographical operations that are provided by PostGIS, information
when it was recorded by the OBU and the HGV’s characteristics,           such as the number of HGVs in a given commune at a given time
which were described in section 2.1. Observations are augmented          can efficiently be queried. This alternative batch layer implemen-
with the current date and time to indicate when the observation          tation was created, because the current approach lacks data types
was received by our servers. This is done because there is no            and functions that are optimized for operations with regards to
guarantee that the observations within the retrieved batch will          space and time. Ideally we would like to use both approaches in
all be for the current day, as it is not uncommon to have                conjunction, for example by storing raw observations in Parquet
observations from previous days come in. As it cannot be known when      format and aggregating these observations over a day to form the
all observations for a day have been received, the system needs          route of a truck over that day, to take advantage of the strengths
to take this into account.                                               of both approaches.
    The observations that were retrieved by the script are                  However, while PostGIS introduces the concept of space with
subsequently split by the day on which the observation was recorded      geographic data types and functions, it lacks a concept of both
and then saved to CSV files on the local file system. The files          space and time taken together without having to introduce ad-
are stored in a folder that corresponds to the day on which the          ditional complexity. PostGIS is not optimized for queries that
observations were recorded. These CSV files are used to run sim-         involve both space and time dimensions taken together. This
ulations of the Lambda architecture by reading batches of data           means that while the sequence of HGV positions can be stored
that represent incoming data from Viapass and sending them to            for a certain day, the associated time at which the HGV was at
the appropriate layers. In real-world scenarios, the incoming data       that position can not be stored without introducing additional
would be sent directly to the appropriate layers of the Lambda           fields or dimensions and having to make certain assumptions
architecture.                                                            about the data. This results in a loss of speed and data efficiency,
    For the currently deployed implementation of the batch layer,        which is one of the essential aspects of this platform. For this
we aggregate the CSV files per day and store them on Hadoop              reason, we are currently investigating MobilityDB [33], a further
Distributed File System (HDFS) in Parquet format. HDFS allows            extension of PostGIS whose data types capture the concept of a
distributed storage with replication and improved read and               position at a certain time. This would allow us to perform the
write speeds compared to regular file systems. The Parquet file          necessary queries without
format is a column-oriented format that provides efficient data          being concerned with the underlying representation of the data
[Figure 4 panels. (a) Labels: Batch N, Batch K, Application, State.
(b) Two hour-of-day windows, 3 a.m. - 4 a.m. and 4 a.m. - 5 a.m., each with
columns hour-of-day, Average Velocity, Average Flow; Measurement Time: 04:00:00.]

           (a) State updates according to incoming data stream.
           (b) Transition from the 3 a.m. hour-of-the-day window to the
           4 a.m. window when data comes in that was sampled at 4 a.m.

                                    Figure 4: Stateful streaming as implemented in the pipeline.
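
The stateful update illustrated in Figure 4 can be sketched in plain
Python as follows. This is only an illustration under stated assumptions:
the class and attribute names are hypothetical, the sketch tracks an
observation count and a running mean velocity per hour-of-the-day for a
single street segment, and it does not use the Spark streaming API that
the pipeline relies on.

```python
from datetime import datetime


class SegmentState:
    """Hourly state of one street segment for the current day (hypothetical)."""

    def __init__(self):
        self.day = None
        self._reset()

    def _reset(self):
        # One slot per hour-of-the-day, initialized to zero values.
        self.counts = [0] * 24
        self.mean_velocity = [0.0] * 24

    def update(self, timestamp, velocity):
        """Fold one observation into the running mean of its hour."""
        day = timestamp.date()
        if self.day is None or day > self.day:
            # First observation of a new day: re-initialize the state.
            self.day = day
            self._reset()
        elif day < self.day:
            # Assumption: late observations from earlier days are skipped here.
            return
        hour = timestamp.hour
        self.counts[hour] += 1
        n = self.counts[hour]
        # Incremental running mean: m_n = m_(n-1) + (x - m_(n-1)) / n
        self.mean_velocity[hour] += (velocity - self.mean_velocity[hour]) / n


state = SegmentState()
state.update(datetime(2020, 3, 30, 3, 59, 30), 40.0)  # falls in the 3 a.m. window
state.update(datetime(2020, 3, 30, 4, 0, 0), 50.0)    # first 4 a.m. observation
state.update(datetime(2020, 3, 30, 4, 0, 30), 30.0)
```

The incremental form avoids storing individual observations, so the state
per segment stays at a fixed size regardless of how much data arrives.
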


and optimization of the geographic functions. We are currently                rather than at the currently observed values, to make forecasts.
in the process of experimenting with the mentioned alternatives               As an example, if the data is hourly and the forecasting target is
to identify the most appropriate approach for the batch layer.                9 a.m. on Monday, then given a window size of 1, the observation
    The speed layer of our Lambda architecture implementation                 of last Monday at 9 a.m. will be returned as the predicted forecast.
uses the Apache Kafka [8] streaming platform to store incom-                  A window of size 2 means returning the average of the obser-
ing data from queries to Viapass as a continuous stream of data.              vations of the last two Mondays at the same hour and so on for
For the purpose of initial simulations, a Python script reads a               larger window sizes. However, while simple and explainable, this
batch of observations from the stored CSV files into a GeoPandas              approach is rather naive, as it does not take the current traffic
DataFrame. As a preprocessing step, a different DataFrame, which              conditions or information that is known in advance, such as a
was loaded in memory beforehand, contains the geographic in-                  special event that is planned for example, into account. More
formation of a set of Brussels street segments. We used a subset              advanced Machine Learning methods could incorporate this type
of Brussels streets for testing, however, in practice this would              of additional information for improved forecasting.
contain all streets in Brussels. By performing a join of the two                  The final results are written to a JSON file, which is formatted
DataFrames with the within geographic function provided by                    according to the GeoJSON [6] specification. In this format, every
GeoPandas, we obtain a new DataFrame where every observa-                     street segment is described by a LineString instance that cor-
tion also contains the internal ID of the street segment the HGV              responds to the path of the street segments. In addition to this,
was on at that time. These data are sent to Kafka for processing              each street segment is annotated with HGV counts and average
in the next step of the streaming pipeline.                                   velocities for each hour-of-the-day as properties. The outputted
    At the receiving end of the data stream, the streaming facilities         file serves as the real-time view for the considered street seg-
that are provided by Spark are used to process the data, which                ments and can be read by the dashboard for display on a map, or
can directly be integrated with a Kafka stream. Incoming data is              to perform further analysis using the data, such as identifying
processed accordingly and used to update the current state of all             the busiest streets at the current time for example.
street segments that are being kept track of. This approach, which
is referred to as stateful streaming, is illustrated in figure 4a. The        2.3          Implementation of The MOBI-AID
state of a street segment is represented by the average number of
HGVs and the average velocity of passing HGVs for every hour-
                                                                                           Dashboard
of-the-day of the current day. For every new day at midnight,                 To provide an interface that would allow stakeholders to monitor
the state for each street segment is re-initialized to zero values            the current traffic situation for HGVs in Brussels or perform his-
for all properties. Values are subsequently updated continuously              torical analyses for future planning, a dashboard interface was
with a running mean for the current hour-of-the-day. Values for               implemented. A web interface provides this dashboard and was
past hours-of-the-day will contain the mean observed statistics               implemented with the Django [22] web framework, additionally
for that day and future values will be zero until the current time            making use of the first-party GeoDjango extension. Using this
falls within the window for that hour-of-the-day. This process is             extension provides a direct integration with databases such as
illustrated by figure 4b.                                                     PostGIS and other useful geographical tools. These technologies
    In addition to keeping track of the observed values, forecasts            were chosen for their flexibility, maturity and due to the fact
are also made for future hours-of-the-day. Currently, predictions             that they required minimal additional learning, given our com-
are made using a type of model that is referred to as a persistence           puter science backgrounds. The fact that these components are
model, more specifically, a sliding window persistence model.                 also very low level allows us to easily experiment with different
With this type of model a forecast is based directly on previously            alternative approaches.
observed values for the same day-of-the-week and hour-of-day.                    The web interface is comprised of three pages: Home, Dashboard
In this implementation, the data is divided in one week seasons,              and About. The Home page provides an overview of the available
meaning that predictions look at the data for the whole week                  features and displays a map that shows real-time HGV counts
for the different communes that compose the Brussels Capital
Region. Hovering over a specific commune shows the total
number of HGVs that were most recently observed in this commune.
The HGV counts per commune are also shown in a table beside
the map, where they are also divided by weight category. Figure
5 shows a prototype implementation for the home page with the
user hovering over the Brussels City commune. The About page
provides more detailed information on the web interface and
contains the documentation on the dashboard. It also mentions
the sources of our funding and the project supporters.
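
The per-commune table on the Home page could be derived along the
following lines. This is a hedged sketch: the input format (one tuple per
HGV with its last observed commune and weight category), the function
name and the weight category labels are assumptions for illustration and
do not come from the platform.

```python
from collections import defaultdict


def counts_per_commune(last_seen):
    """Count HGVs per commune, split by weight category.

    `last_seen` is a hypothetical iterable of
    (hgv_id, commune, weight_category) tuples, one per HGV.
    """
    table = defaultdict(lambda: defaultdict(int))
    for _hgv_id, commune, category in last_seen:
        table[commune][category] += 1
    return {commune: dict(cats) for commune, cats in table.items()}


# Illustrative input; the category labels are placeholders.
last_seen = [
    ("a1", "Brussels City", "3.5-12t"),
    ("b2", "Brussels City", ">32t"),
    ("c3", "Ixelles", "3.5-12t"),
    ("d4", "Brussels City", "3.5-12t"),
]
table = counts_per_commune(last_seen)
# Total per commune, as would be shown when hovering over the map.
totals = {commune: sum(cats.values()) for commune, cats in table.items()}
```
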


                                                                         Figure 6: Work-in-progress Real-time tab of the MOBI-
                                                                         AID dashboard.


                                                                         for a certain hour-of-the-day on a certain day-of-the-week. The
                                                                         user can also select at which level of aggregation they want to see
                                                                         information displayed on the map. The currently provided levels
                                                                         of aggregation are commune level, street level and at the level
                                                                         of individual HGVs. Individual HGVs can not be shown when
                                                                         looking at the typical traffic situation, as concrete HGV positions
                                                                         evidently vary with time. However, in this case clusters would be
                                                                         shown at locations where HGVs are often present at the chosen
                                                                         hour-of-day and day-of-the-week. Figure 7 shows the work-in-
                                                                         progress Maps tab, without the website header, footer and the
                                                                         tab-selection menu. Note that the selection controls should be
                                                                         separated based on the previously selected type of visualization.
   Figure 5: Prototype home page of the web interface.                   These controls would also be shown on the map rather than
                                                                         above, as is currently the case.
   The Dashboard page provides the core functionality of the
web application. This page consists of several tabs, each of which
provides a certain type of visualization or allows specific analyses
to be performed. In its current implementation, the dashboard consists
of the following tabs: Real-time, Maps, Charts, Analytics and
Predictions.
   The Real-time tab is composed of several panels that display
different types of real-time information, which are retrieved from
the Lambda architecture’s real-time view. In this tab, users can
select the type of information they want to see, which will then
be displayed on the map. A table next to the map displays a
user-selected overview of the information that is displayed on the map.
For example, the ten busiest streets can be displayed in this
table. Figure 6 shows the current prototype for the Real-time
tab.
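
As an illustration of the table described above, the ten busiest
streets could be derived from a window of observations roughly as
follows. This is a minimal sketch, not the MOBI-AID implementation:
the (street_id, vehicle_id) record layout and the function name
busiest_streets are assumptions made for the example.

```python
from collections import defaultdict


def busiest_streets(observations, top_n=10):
    """Rank streets by the number of distinct vehicles seen on them.

    `observations` is an iterable of (street_id, vehicle_id) pairs; this
    record layout is hypothetical and chosen only for illustration.
    """
    vehicles_per_street = defaultdict(set)
    for street_id, vehicle_id in observations:
        vehicles_per_street[street_id].add(vehicle_id)
    counts = [(street, len(seen)) for street, seen in vehicles_per_street.items()]
    # Sort by descending count, then by street ID to get a stable order.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:top_n]
```

Counting distinct vehicles rather than raw observations avoids ranking a
street highly only because a single HGV reported its position many times.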
   Note that in this figure the time window for collecting statistics
is 15 minutes, as opposed to the one-hour window that is used
for the state of a street. This window corresponds to the interval
between consecutive updates of the state rather than the hour-of-day
window that is being updated in the state. Additionally,
note that streets in the table are identified by IDs. In practice we
would use street names in the final implementation.
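
The difference between the 15-minute update interval and the one-hour
state window can be made concrete with a small sketch. It assumes,
purely for illustration, that the state is a counter keyed by street ID
and hour; the function name update_state and the counts below are made
up and do not come from the MOBI-AID code or data.

```python
from collections import Counter
from datetime import datetime


def update_state(state, window_start, street_counts):
    """Fold one 15-minute window of per-street counts into the hourly state."""
    # Every update is added to the hour bucket that contains the window.
    hour_key = window_start.replace(minute=0, second=0, microsecond=0)
    for street_id, count in street_counts.items():
        state[(street_id, hour_key)] += count
    return state


state = Counter()
# Two updates fall in the 08:00 hour window, one in the 09:00 window.
update_state(state, datetime(2018, 9, 24, 8, 0), {"s1": 4})
update_state(state, datetime(2018, 9, 24, 8, 15), {"s1": 6, "s2": 1})
update_state(state, datetime(2018, 9, 24, 9, 0), {"s1": 2})
```

Each update touches only the hour bucket it belongs to, so the state for
an hour is refined several times before the next hour begins.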
   The Maps tab contains a large map that shows historical data          Figure 7: Work-in-progress Maps tab of the dashboard.
about the observed HGV traffic as selected by the user. We distin-       Without site headers and dashboard tab-selection side
guish two distinct ways to look at historical data in this situation.    menu.
The user can select to either look at the data at a specific time on
a specific date, or they can choose to look at data that is typical
3     EVALUATION OF THE INITIAL                                        3.2    Results
      PLATFORM
For the MOBI-AID dashboard to provide an optimal user-experience
and be a useful contribution to the field of big mobility data, two
main aspects are of particular importance. These essential fea-
tures are adequate performance of the real-time data processing
pipeline and the usability of the web interface. To evaluate perfor-
mance, scalability tests were performed with a simulated stream
that is read from the data which is currently being collected from
Bruxelles Mobilité. The user interface was evaluated through user
testing and feedback.


3.1    Experimental setting
Scalability testing was already performed with a previous version
of the architecture in [5]. These experiments were performed
on the Hadoop big data cluster of the MLG. This cluster is made
up of 10 slave nodes, each with 24 CPU cores, managed by a
master node which is the point of access for users and handles
user interaction (interactive node). The resource manager YARN,        Figure 8: Overview of the SparkUI stream statistics for the
which is an integral part of the Hadoop ecosystem, allocated 150       simulation.
cores and 805 GB of RAM for the purpose of these tests.
   Preliminary experiments with the new real-time architecture            Figure 8 shows an overview of some relevant statistics col-
were run on a local machine with a 2.3 GHz Intel Core i5 CPU           lected by SparkUI. Here, the most informative charts are the
with 4 cores and 16 GB of RAM. This hardware setup is far from         top (input rate) and third from the top (processing time) ones.
the processing power that is available on the cluster and will         The variation in input rate shows that data ingestion peaks at
have much slower IO due to the absence of Hadoop. However, it          certain points in a day; this illustrates the variation in HGV traffic
should give an initial insight into potential real-time capabilities   depending on hour-of-the-day. The most important aspect of this
of the implemented pipeline. Note that the code that is used in        figure is that the processing time for a batch is below the batch
these simulations has not yet been optimized, as implementing          interval. As can be seen in the figure, the average batch process-
the architecture was the priority in this phase. There are also        ing time is 1.6 seconds, which is well below the batch interval
some overheads introduced by the simulation environment, such          of 5 seconds. The second chart from the top shows scheduling
as running docker containers and local applications from the           delay, i.e. delay between scheduling of the job and the start of
testing machine sharing CPU cycles.                                    processing, which always remained 0 as batches were always
   The implemented simulation uses previously collected data           processed within the batch interval. For this reason the bottom
that was stored in CSV files. These files contain collected            chart (total delay) is the same as the processing time chart, since
observations for three days, being the 23rd, 24th and 25th of          processing time is the only source of delay.
September 2018. As the simulation was performed on limited
hardware and accelerates the ingestion of data compared to the
real situation, these files were filtered beforehand to only contain
observations concerning three predetermined streets. New data
is sampled from these files to simulate incoming data over one
hour windows. This is a much larger sampling rate than in the          (a) Table showing the different tasks of the job, distributed over 4
real case, as we want to accelerate the simulations and are mostly     cores.
interested in the correct functioning of the pipeline. The batch
interval within which the processing should be completed was
set to 10 seconds. This means that the simulation has to process
the incoming batches 360 times faster than in the real case. This            (b) Event timeline of the parallel execution of the job.
is one of the main reasons why the number of observed streets
was so severely limited for the simulation. To evaluate the            Figure 9: Some important information provided by
simulation, the output provided by the SparkUI interface, which is     SparkUI on the Spark job that processes a single batch of
used to inspect the state of Spark execution, was analyzed. A          data.
snapshot of SparkUI after running the simulation is shown in
figures 8 and 9.
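
The simulation settings described above can be summarized in a short
sketch: grouping stored observations into one-hour windows, computing
how much faster than real time those windows are replayed, and checking
that a batch finishes within the batch interval. This is an
illustration under assumptions, not the code used in the experiments:
the function names and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import groupby


def acceleration_factor(window, batch_interval):
    """Return how many times faster than real time a window is replayed."""
    return window / batch_interval


def hourly_windows(rows):
    """Yield (hour_start, rows) for time-sorted (timestamp, payload) rows."""
    def hour_of(row):
        return row[0].replace(minute=0, second=0, microsecond=0)
    for hour_start, group in groupby(rows, key=hour_of):
        yield hour_start, list(group)


def within_batch_interval(processing_time, batch_interval):
    """True if a batch was processed before the next one is due."""
    return processing_time < batch_interval


# One hour of data per 10-second batch, as in the simulation settings.
factor = acceleration_factor(timedelta(hours=1), timedelta(seconds=10))
print(factor)  # 360.0

# Made-up rows standing in for the CSV observations.
rows = [
    (datetime(2018, 9, 23, 0, 5), "obs-1"),
    (datetime(2018, 9, 23, 0, 50), "obs-2"),
    (datetime(2018, 9, 23, 1, 10), "obs-3"),
]
windows = list(hourly_windows(rows))
```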
   Regarding user evaluation of the web interface, informal user
evaluations were performed. Stakeholders from Bruxelles Mobilité
were shown the work-in-progress interface and asked to provide
informal feedback on the application. Additionally, colleagues
with expertise in the area of data visualization, especially
regarding mobility data, also gave their initial feedback on the
currently provided functionalities.
   Figure 9 shows essential information which SparkUI provides
on a specific Spark job. Figure 9a shows that the job which
processes a batch was parallelized over four tasks that are each
handled by a different CPU core. Figure 9b shows the timeline of
events that are part of handling a Spark job. The blue parts of
the timeline correspond to scheduling of the job, the red parts
to deserialization of the data and the green parts to actually
processing the incoming records. The timeline shows that most of
processing time is actually spent on scheduling and deserialization
of the tasks. This is because the number of records in a batch
in this experiment is much smaller than in the real-world data
stream. Figure 10 shows the same timeline as figure 9b when
running the same task on the full dataset, i.e. with significantly
more records in the processed batch. In this experiment 8 cores
were allocated.

                                                                     [Figure 11 diagram labels: Incoming data; Speed layer (Data
                                                                     stream, Processed data); Batch layer (Immutable data, Long
                                                                     term storage); Serving layer (Real-time View, Merged View,
                                                                     Historical View)]

                                                                     Figure 11: Overview of the future lambda architecture
                                                                     with a merged data view.
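
The merged view in Figure 11 is meant to combine the real-time and
historical views, for example to show discrepancies between current
and typical traffic. A minimal sketch of that idea follows. It assumes
that both views can be reduced to a mapping from street ID to HGV count
for the same hour-of-day and day-of-week; the function name merged_view
and this data layout are assumptions, not part of the platform.

```python
def merged_view(realtime_view, historical_view):
    """Per-street difference between the current and the typical count.

    Both views are assumed to be dicts mapping a street ID to an HGV count;
    a street missing from one view is treated as having a count of 0.
    """
    streets = set(realtime_view) | set(historical_view)
    return {
        street: realtime_view.get(street, 0) - historical_view.get(street, 0)
        for street in streets
    }
```

A positive value marks a street that is busier than usual, a negative
value one that is quieter than usual.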


Figure 10: The event timeline of a Spark job when perform-
ing the simulation with all observations of a day. Ran with
8 cores allocated.
                                                                     MLG is currently in the process of migrating to a new cluster
  Regarding user evaluation of the web interface, the general        which should provide the necessary facilities for large-scale ex-
consensus was that the current interface can already provide         periments. The goal of these experiments would be to move be-
some basic insights, but requires more advanced tools and vi-        yond simulation. Concretely, we would hook up the implemented
sualizations to provide an added value to our potential users,       pipeline to the actual stream of incoming data.
compared to equivalent tools that are currently available.              Implementing and experimenting with more advanced Ma-
                                                                     chine Learning approaches for forecasting will also be an impor-
3.3    Discussion                                                    tant task in providing more nuanced predictions. Additionally,
The results from the performed experiments indicate that the         integrating existing mobility indicators and advanced ITS models
current architecture is promising for use in a real-life scenario.   from related research will provide appropriate metrics to policy
Taking the results from the previous experiments in [5] and the      makers. The platform should be able to perform such processing
well-known reliability of the used technologies into account, it     in real time and use the forecasts to simulate the impact of a
is expected that given appropriate hardware and optimization,        policy.
there should be no issue in dealing with the amounts of data we         Next to this, a finalized web interface will provide stakeholders
are working with.                                                    with the necessary tools to make informed decisions on how to
   Initial tests with the full data set were also performed on       optimize traffic of goods in the Brussels Capital Region. Further
the same hardware as the preliminary experiments. Results are        extending the current interface with feedback from the users
promising given the single node setting, but further experiments     should allow us to provide this ideal interface. Concretely, further
are needed to assess the architecture on a cluster setting. How-     versions of the real-time tab will also include other visualizations
ever, these preliminary results let us anticipate that no perfor-    besides the map, such as relevant charts and differences with the
mance issues should be expected when using the full processing       typical traffic situation at this hour-of-the-day. The final version
power of a big data cluster.                                         of this tab should allow users to easily spot anomalies in the
   SparkUI was an important tool in debugging and analyzing          current traffic situation compared to historical observations.
performance of the implemented pipeline. The insights it                Prototypes for the Charts, Analytics and Predictions tabs
provides into the execution of jobs enable detailed monitoring of    have not been implemented yet. It is currently under review
how well the implemented code for a big data project performs        whether these should be separate tabs, or if they should be com-
in the Hadoop + Spark environment. These insights are espe-          bined into a single general Analysis tab. Conceptually, the Charts
cially useful for assessing whether the implemented pipeline will    tab would contain several types of charts that show useful infor-
perform well, even without the use of big-data capable hardware.     mation, such as the typical distribution of HGVs over communes
For example, it is with the help of SparkUI that we can clearly      for example. The analytics tab would contain tools that allow the
see that the scheduling and deserialization overheads that can be    user to perform a specific analysis, such as constructing a model
seen in figure 9b become insignificant when working with larger      of traffic flow based on the available data. The predictions tab
data batches, as shown by the results seen in figure 10.             would put more emphasis on training and using the previously
                                                                     mentioned forecasting methods to predict future states of the
4     FUTURE WORK                                                    HGV traffic in Brussels. These models could then be used by
Future work consists of finalizing the pipeline architecture and     policy makers to simulate effects of certain decisions, such as
connecting the different components of the MOBI-AID big data         modifying existing roads for example. Determining where the
platform together. One possible extension that is currently envi-    functionality that is envisioned should live will be one of the
sioned is to add a merged view that uses data from both the speed    next steps in the design of the interface.
and batch layers to, for example, show discrepancies between            After the full prototype of the web interface has been im-
the real-time traffic conditions and typical conditions. Figure 11   plemented, extensive user studies and formal retrieval of user
visualizes this extension of our current implementation.             requirements will be done to get a better insight as to what the
   Given this finalized implementation, we will perform exten-       final web interface should provide. Iterating further and using
sive experiments on the MLG big data cluster which is pow-           agile software development methods should allow us to provide
ered by Apache Hadoop, as opposed to a regular office machine.       the end-users with the tools they need in a user friendly manner.
   Finally, packaging the platform for deployment will give the
different stakeholders the envisioned platform that fits their
requirements and allow them to easily deploy it on their own
hardware. This platform should also scale to be used for the whole
country and, given appropriate data, it could also be used for
other countries.

ACKNOWLEDGMENTS
Arnau Dillen, Giovanni Buroni, Yann-Aël Le Borgne and Gianluca
Bontempi acknowledge the support of Programme Opérationnel
FEDER 2014-2020 de la Région de Bruxelles Capitale
(ICITY MOBI-AID project). The authors are also grateful to
Bruxelles Mobilité for having provided the OBU data necessary for
the work.

REFERENCES
 [1] Stephen Anderson, Julian Allen, and Michael Browne. 2005. Urban logistics––how can it
     meet policy makers’ sustainability objectives? Journal of Transport Geography 13, 1
     (2005), 71 – 81. https://doi.org/10.1016/j.jtrangeo.2004.11.002 Sustainability and the
     Interaction Between External Effects of Transport (Part Special Issue, pp. 23-99).
 [2] J. S. Angarita-Zapata, A. D. Masegosa, and I. Triguero. 2019. A Taxonomy of Traffic
     Forecasting Regression Problems From a Supervised Learning Perspective. IEEE Access 7
     (2019), 68185–68205. https://doi.org/10.1109/ACCESS.2019.2917228
 [3] Hugo Barbosa, Marc Barthelemy, Gourab Ghoshal, Charlotte R. James, Maxime Lenormand,
     Thomas Louail, Ronaldo Menezes, José J. Ramasco, Filippo Simini, and Marcello Tomasini.
     2018. Human mobility: Models and applications. Physics Reports 734 (2018), 1 – 74.
     https://doi.org/10.1016/j.physrep.2018.01.001 Human mobility: Models and applications.
 [4] Giovanni Buroni, Yann-Aël Le Borgne, Gianluca Bontempi, and Karl Determe. 2018. Cluster
     Analysis of On-Board-Unit Truck Big Data from the Brussels Capital Region. 21st IEEE
     International Conference on Intelligent Transportation Systems (2018).
 [5] Giovanni Buroni, Yann-Aël Le Borgne, Gianluca Bontempi, and Karl Determe. 2018.
     On-Board-Unit Data: A Big Data Platform for Scalable storage and Processing. 1–5.
     https://doi.org/10.1109/CloudTech.2018.8713342
 [6] Howard Butler, Martin Daly, Allan Doyle, Sean Gillies, Hagen Stefan, and Tim Schaub.
     2016. GeoJSON. Internet Engineering Task Force. https://tools.ietf.org/html/rfc7946
 [7] Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis
     Mazzer, and Gianluca Bontempi. 2018. SCARFF: A scalable framework for streaming credit
     card fraud detection with spark. Information fusion 41 (2018), 182–194.
 [8] Apache Kafka Committers. 2019. Apache Kafka. Apache Software Foundation.
     https://kafka.apache.org/
 [9] Apache Spark Committers. 2019. Apache Spark. Apache Software Foundation.
     https://spark.apache.org/
[10] Konstantinos Demertzis, Lazaros Iliadis, and Vardis-Dimitris Anezakis. 2019. A Machine
     Hearing Framework for Real-Time Streaming Analytics Using Lambda Architecture. In
     Engineering Applications of Neural Networks, John Macintyre, Lazaros Iliadis, Ilias
     Maglogiannis, and Chrisina Jayne (Eds.). Springer International Publishing, Cham,
     246–261.
[11] GeoPandas developers. 2019. GeoPandas. GeoPandas developers.
     http://geopandas.org/index.html#
[12] PostgreSQL Developers. 2019. PostgreSQL. The PostgreSQL Global Development Group.
     https://www.postgresql.org
[13] Anzhelika Dombalyan, Viktor Kocherga, Elena Semchugova, and Nikolai Negrov. 2017.
     Traffic Forecasting Model for a Road Section. Transportation Research Procedia 20
     (2017), 159 – 165. https://doi.org/10.1016/j.trpro.2017.01.040 12th International
     Conference on Organization and Traffic Safety Management in large cities,
     SPbOTSIC-2016, 28-30 September 2016, St. Petersburg, Russia.
[14] PostGIS Development Group. 2019. PostGIS. The Open Source Geospatial Foundation.
     https://postgis.net/
[15] S. Hadavi, S. Verlinde, W. Verbeke, C. Macharis, and T. Guns. 2019. Monitoring
     Urban-Freight Transport Based on GPS Trajectories of Heavy-Goods Vehicles. IEEE
     Transactions on Intelligent Transportation Systems 20, 10 (Oct 2019), 3747–3758.
     https://doi.org/10.1109/TITS.2018.2880949
[16] M. Kiran, P. Murphy, I. Monga, J. Dugan, and S. S. Baveja. 2015. Lambda architecture
     for cost-effective batch and speed big data processing. In 2015 IEEE International
     Conference on Big Data (Big Data). 2785–2792.
     https://doi.org/10.1109/BigData.2015.7364082
[17] Narayan Kumar. 2017. Twitter’s tweets analysis using Lambda Architecture.
     https://blog.knoldus.com/twitters-tweets-analysis-using-lambda-architecture/.
[18] I. Lana, J. Del Ser, M. Velez, and E. I. Vlahogianni. 2018. Road Traffic Forecasting:
     Recent Advances and New Challenges. IEEE Intelligent Transportation Systems Magazine
     10, 2 (Summer 2018), 93–109. https://doi.org/10.1109/MITS.2018.2806634
[19] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of massive
     datasets. Cambridge university press.
[20] Nathan Marz and James Warren. 2015. Big Data: Principles and best practices of scalable
     real-time data systems. New York; Manning Publications Co.
[21] Apache Hadoop Project Members. 2019. Apache Hadoop. Apache Software Foundation.
     https://hadoop.apache.org/
[22] Django Team Members. 2019. Django. Django Software Foundation.
     https://www.djangoproject.com/
[23] David Myr. 2003. Real time vehicle guidance and traffic forecasting system. US Patent
     6,615,130.
[24] Daiga Plase, Laila Niedrite, and Romans Taranovs. 2016. Accelerating data queries on
     Hadoop framework by using compact data formats. In Advances in Information, Electronic
     and Electrical Engineering (AIEEE), 2016 IEEE 4th Workshop on. IEEE, 1–7.
[25] Mohammed A. Quddus, Chao Wang, and Stephen G. Ison. 2010. Road Traffic Congestion and
     Crash Severity: Econometric Analysis Using Ordered Response Models. Journal of
     Transportation Engineering 136, 5 (2010), 424–435.
     https://doi.org/10.1061/(ASCE)TE.1943-5436.0000044
[26] John Ratcliffe and Ela Krawczyk. 2011. Imagineering city futures: The use of
     prospective through scenarios in urban planning. Futures 43, 7 (2011), 642 – 653.
     https://doi.org/10.1016/j.futures.2011.05.005 Alternative City Futures.
[27] Dilpreet Singh and Chandan K Reddy. 2015. A survey on platforms for big data analytics.
     Journal of Big Data 2, 1 (2015), 8.
[28] Hongyu Sun, Henry X. Liu, Heng Xiao, Rachel R. He, and Bin Ran. 2003. Use of Local
     Linear Regression Model for Short-Term Traffic Forecasting. Transportation Research
     Record 1836, 1 (2003), 143–150. https://doi.org/10.3141/1836-18
[29] CP Van Hinsbergen, JW Van Lint, and FM Sanders. 2007. Short term traffic prediction
     models. In PROCEEDINGS OF THE 14TH WORLD CONGRESS ON INTELLIGENT TRANSPORT SYSTEMS
     (ITS), HELD BEIJING, OCTOBER 2007.
[30] JWC Van Lint and CPIJ Van Hinsbergen. 2012. Short-term traffic and travel time
     prediction models. Artificial Intelligence Applications to Critical Transportation
     Issues 22, 1 (2012), 22–41.
[31] Eleni I. Vlahogianni, Matthew G. Karlaftis, and John C. Golias. 2014. Short-term
     traffic forecasting: Where we are and where we’re going. Transportation Research Part
     C: Emerging Technologies 43 (2014), 3 – 19. https://doi.org/10.1016/j.trc.2014.01.005
     Special Issue on Short-term Traffic Flow Forecasting.
[32] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur
     Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al.
     2016. Apache spark: a unified engine for big data processing. Commun. ACM 59, 11
     (2016), 56–65.
[33] Esteban Zimányi, Mahmoud Sakr, Arthur Lesuisse, and Mohamed Bakli. 2019. MobilityDB: A
     Mainstream Moving Object Database System. In Proceedings of the 16th International
     Symposium on Spatial and Temporal Databases (SSTD ’19). ACM, New York, NY, USA,
     206–209. https://doi.org/10.1145/3340964.3340991

</pre>