MOBI-AID: A Big Data Platform for Real-Time Analysis of On Board Unit Data

Arnau Dillen, Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium, arnau.dillen@ulb.ac.be
Giovanni Buroni, Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium, giovanni.buroni@ulb.ac.be
Yann-Aël Le Borgne, Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
Karl Determe, Brussels Mobility, Brussels, Belgium
Gianluca Bontempi, Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium, gbonte@ulb.ac.be

Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

Every day large amounts of goods are transported by heavy-goods vehicles over the road network. Being able to monitor and analyse heavy-goods vehicle traffic is essential to define policies that minimize its negative effects. However, this requires dealing with large amounts of data and often a dense road network, especially in an urban setting. This paper introduces a platform that makes use of state-of-the-art big data technologies to process data pertaining to the positions and properties of heavy-goods vehicles. This platform aims to provide policy-makers and other stakeholders with the tools that allow large-scale analysis of heavy-goods vehicle data in a near real-time fashion. Additionally, the platform allows for forecasting of future traffic conditions based on historical data.

1 INTRODUCTION

Road freight transport is an essential aspect of any country's infrastructure policy due to its economic, environmental and social impact. Among other issues, freight vehicles are responsible for a large part of the congestion on urban road networks (economic impact), pollutant emissions such as carbon dioxide (environmental impact) and physical consequences of pollutant emissions on public health (social impact) [1].

Urban planners and policy-makers therefore demand Intelligent Transportation Systems (ITS) which are able to foresee the mobility behavior and support the definition of appropriate policies [26, 29]. Tools such as accurate traffic forecasting models [30], advanced mobility indicators of freight transport [15] and more general mobility models [3] can assist policy makers in making appropriate decisions.

Traffic on a road network exhibits features which are common to most complex systems: self-organization and the emergence of transient space-time patterns based on local and global feedback loops, which makes analysis of these types of data difficult. Because of this, few studies [3, 18, 29, 31] address a complete transportation network that includes both freeways and urban contexts, and others limit themselves to offline analysis [15, 25]. One of the main reasons is the scarce availability of data gathered from point detectors or interval detectors and the lack of methods able to tackle the traffic prediction problem at a larger scale [2, 29]. However, thanks to the more ubiquitous availability of new information and communication technologies, more traffic data, especially moving sensors data, are collected and made openly available by both public and private companies, allowing the development of data-intensive approaches for traffic analysis.

In Belgium, traffic data is gathered for heavy-goods vehicles (HGV) by Bruxelles Mobilité (https://mobilite-mobiliteit.brussels/en), the public administration responsible for equipment and infrastructure related to mobility issues in the Brussels Capital Region (BCR). The administration continuously receives data on HGV positions, which is normally used to charge HGVs for kilometers driven on toll roads in Belgium. Every day, an average of 19 Gigabytes of data are therefore accumulated and need to be processed in a timely manner, in order to monitor HGV traffic in Brussels.

Bruxelles Mobilité currently stores this data in a centralized PostgreSQL [12] database which is set up to handle geographical data through the PostGIS [14] extension (see figure 1). However, this solution is unable to cope with the massive amounts of data that are ingested on a daily basis. While it would be possible to optimize queries and create database indices to minimize the time it takes to answer a query, the main issue with a classical relational database system lies in the constant updates and additions of rows. Even the most performant database on the fastest hardware will become a bottleneck. Additionally, reading from and writing to a regular file system is too slow for the amounts of data involved.

Figure 1: Current Brussels mobility architecture for ingesting and storing HGV data.

The Machine Learning Group (MLG) of the Université Libre de Bruxelles (ULB) collaborates with Bruxelles Mobilité to design a big data architecture able to provide near real-time processing and querying of the incoming data. For example, a query that retrieves the number of trucks on each street is required to make forecasts on future traffic conditions. An initial version of the architecture was implemented in [5] and has been collecting data on the MLG cluster for some time now. We were able to successfully collect and process large amounts of data thanks to the joint use of an Apache Hadoop cluster [21] and Apache Spark [32]. However, big data technologies are evolving fast and an appropriate interface to visualize and analyze traffic related data is necessary. Data aggregation is necessary to get a high-level view, and loading large amounts of data into the interface client is slow and impedes the responsiveness of the interface. These are important aspects to take into account when deciding what visualizations the interface should provide and which data should be loaded to the client.

The aim of this research is to be able to perform network-scale analysis and forecasting in near real-time. The presented architecture makes it possible to produce real-time forecasts based on incoming data using both well-established [29, 30] and state-of-the-art [2, 18] methods on a network-wide scale. It also enables analyses that were previously computed offline, such as identifying important points of congestion caused by HGV traffic in changing conditions, to be performed in real time. Next to the previously mentioned models, there is a fair amount of related work that proposes possible forecasting models which could be candidates for a real-time forecasting model on road networks and their different sections [13, 23, 28]. A large corpus of literature discusses this issue.

The main contributions of this paper are twofold. First, it introduces an extension to the big data architecture that was implemented in [5] which enables near real-time processing of the incoming data. Secondly, it proposes a design for a dashboard that enables analyses and visualization of data, which is implemented as a web interface. Together, these make up a platform that provides the tools that Bruxelles Mobilité needs to monitor the traffic of HGVs in Brussels and provide insights that should be useful in establishing future policies related to transportation of goods within the BCR. The platform was named the MOBIlity Advanced Indicators Dashboard (MOBI-AID), after the project that supports this research. Additionally, the work done in this research could also serve as an example for other cities and potentially whole countries to deploy their own platforms to assist in decision making on policies with regards to road freight transport.

2 METHODS AND IMPLEMENTATION

The data that are gathered concern all HGVs that are currently present in the Belgian territory. At this time we are only interested in HGVs that are present in the Brussels Capital Region, which still concerns thousands of HGVs on a daily basis. To get useful insights from this data, a platform is necessary that can handle such large amounts of data and present forecasts or the results of analyses in a meaningful way. For this purpose, next to the data, two essential components were identified to implement the envisioned platform.

The remainder of this section is structured as follows. Firstly, we will describe the gathered data. Secondly, we discuss the architecture that allows processing and storage of such large amounts of data. Finally, we present a prototype web interface that would be used by policy makers and data scientists to get insights on traffic conditions from the processed data. This would assist policy makers in making informed decisions regarding urban planning with relation to road infrastructure and freight transport.

2.1 Viapass and On Board Unit (OBU) Data

Since 1 April 2016, heavy-goods vehicles having a Maximum Authorized Mass (MAM) exceeding 3.5 tonnes must pay a kilometer charge for driving on certain toll roads in Belgium. Any vehicle that is not exempt from the toll must have an On Board Unit (OBU) installed. The public organization in charge of supervising the kilometer charge is called Viapass (https://www.viapass.be/). With the aid of GPS/GNSS satellite technology and mobile data, the OBU records the distance that a HGV travels on Belgian public roads. Mobile wireless technology is used to send the number of kilometers charged to the Viapass data center, after which an invoice is issued to the owner of the vehicle.

Because of their evident value as a mobility indicator, the OBU data are also made available to several mobility agencies, including Bruxelles Mobilité which uses this data to analyze freight traffic in the Brussels Capital Region (BCR). The BCR is a region separate from the Flanders region, within which it is geographically located, and consists of 19 administrative districts named communes. These districts will be referred to as communes for the remainder of this article. The models and analyses used in this paper will use OBU data from HGVs within the BCR and its communes.

On average more than nine thousand HGVs are recorded every working day in the larger Brussels Metropolitan Area [4]. Each OBU device sends an update to the server approximately every 30 seconds. An OBU record contains an anonymous identifier, which is reset every day at 4 a.m., the timestamp at which the position was recorded, the GPS coordinates (latitude, longitude), the speed (km/h) and the direction (degrees). Additionally, the data includes vehicle characteristics such as the weight category (MAM), country code and European emission standards classification of the engine (EURO value). This results in an average of 19 GB of data incoming on a daily basis and several terabytes of data being generated every year.

2.2 Design of The Big Data Architecture

Handling such large amounts of data requires an architecture that can process the incoming data fast enough and store processed data in an efficient manner. A well-known architecture that meets these requirements is the Lambda architecture [19, 20, 27] which has proven itself in several settings [10, 16] and is used in practice by Twitter among others [17]. An overview of our current implementation of the architecture can be found in figure 2.

With this architecture, three separate layers can be distinguished, which each handle different aspects of the platform. The speed layer takes care of processing incoming data in a timely manner and sends the processed data to the serving layer for visualization and analysis. This layer handles the real-time aspect of the platform. The batch layer stores immutable data (i.e. observations) and processes it for later user queries on historical data. The serving layer consists of multiple views that are each used to fulfill a specific type of user query. Examples are data that are stored in a specific format which is used for a specific visualization, or predetermined queries that retrieve data that are required for a specific analysis. This layer can also merge the information that comes from both speed and batch layers, such as discrepancies between the real-time traffic conditions and the typical case.

In our current implementation there are two views available. The real-time view provides data that comes from the speed layer directly. The historical view uses the data from the batch layer to query for events and states that have been observed in the past.

Figure 2: Overview of the Lambda architecture. (Diagram labels: incoming data; speed layer with data stream and processed stream data; batch layer with immutable long-term data storage; serving layer with real-time view and historical view.)

compression and fast query access. To process the raw CSV files, Apache Spark [9] is used to deduplicate the observations and store them in HDFS as Parquet files. HDFS takes care of distributing file data over the different nodes of the cluster. With this approach, these operations can be processed in parallel and distributed over multiple compute nodes thanks to the integration of Hadoop and Spark. Using Spark we can efficiently run SQL queries and advanced analytics on the data by parallelizing a large part of the computations. An overview of this process is shown in figure 3.
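The deduplication of observations mentioned above can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the Spark job that runs on the cluster. The `Observation` fields mirror the OBU record described in Section 2.1, but the field names and the choice of (vehicle ID, recording timestamp) as the duplicate key are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """One OBU record; field names are illustrative, not the Viapass schema."""
    vehicle_id: str        # anonymous identifier, reset every day at 4 a.m.
    recorded_at: datetime  # timestamp at which the position was recorded
    latitude: float
    longitude: float
    speed_kmh: float
    direction_deg: float
    mam_category: str      # weight category (MAM)
    country_code: str
    euro_value: str        # European emission standard of the engine


def deduplicate(observations):
    """Keep the first observation seen for each (vehicle_id, recorded_at) key.

    Assumption: two records with the same vehicle ID and recording timestamp
    describe the same observation.
    """
    seen = set()
    unique = []
    for obs in observations:
        key = (obs.vehicle_id, obs.recorded_at)
        if key not in seen:
            seen.add(key)
            unique.append(obs)
    return unique
```

Keeping the first occurrence preserves arrival order, which matters when late observations from previous days are mixed into a batch. A distributed job would typically express the same idea as a drop-duplicates operation on the key columns.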
The initial implementation of this architecture was deployed on an Apache Hadoop [21] cluster, which is an open-source framework for distributed computing that is widely used for big data processing [7, 24, 27, 32]. The data are collected with a Python script that queries the Viapass servers for new data at a fixed time interval, which is currently set to two minutes. The script loads the data in a GeoPandas [11] DataFrame (a data structure with named columns and index-based rows). GeoPandas is an extension of the well-known Pandas library for the Python programming language that supports geometric data types and functions. The DataFrame contains all observations that were collected by Viapass since the last data request.

Observations consist of a HGV's current position as a geometry point, which is represented by a given latitude and longitude, together with the unique ID that was assigned to the HGV for that day. Additionally, an observation contains a timestamp of when it was recorded by the OBU and the HGV's characteristics, which were described in section 2.1. Observations are augmented with the current date and time to indicate when the observation was received by our servers. This is done because there is no guarantee that the observations within the retrieved batch will all be for the current day, as it is not uncommon to have observations from previous days come in. As it cannot be known when all observations for a day have been received, the system needs to take this into account.

The observations that were retrieved by the script are subsequently split by the day on which the observation was recorded and then saved to CSV files on the local file system. The files are stored in a folder that corresponds to the day on which the observations were recorded. These CSV files are used to run simulations of the Lambda architecture by reading batches of data that represent incoming data from Viapass and sending them to the appropriate layers. In real-world scenarios, the incoming data would be sent directly to the appropriate layers of the Lambda architecture.

For the currently deployed implementation of the batch layer, we aggregate the CSV files per day and store them on the Hadoop Distributed File System (HDFS) in Parquet format. HDFS allows distributed storage with replication and improved read and write speeds compared to regular file systems. The Parquet file format is a column-oriented format that provides efficient data

Figure 3: Data retrieval and the batch layer pipeline.

In experiments with an alternative implementation of the batch layer, the CSV data is read into a PostGIS database that stores the daily route of a HGV with a given ID. The route is stored as a LineString object (i.e. a sequence of points) constructed from all available observations for a given HGV ID on a given day. In the same database, information on Brussels communes is stored, both geographical (e.g. commune boundaries) and non-geographical (e.g. name, population, etc.). Using the geographical operations that are provided by PostGIS, information such as the number of HGVs in a given commune at a given time can efficiently be queried. This alternative batch layer implementation was created because the current approach lacks data types and functions that are optimized for operations with regards to space and time. Ideally we would like to use both approaches in conjunction, for example by storing raw observations in Parquet format and aggregating these observations over a day to form the route of a truck over that day, to take advantage of the strengths of both approaches.

However, while PostGIS introduces the concept of space with geographic data types and functions, it lacks a concept of both space and time taken together without having to introduce additional complexity. PostGIS is not optimized for queries that involve both space and time dimensions taken together. This means that while the sequence of HGV positions can be stored for a certain day, the associated time at which the HGV was at that position cannot be stored without introducing additional fields or dimensions and having to make certain assumptions about the data. This results in a loss of speed and data efficiency, which are essential aspects of this platform. For this reason, we are currently investigating MobilityDB [33], a further extension of PostGIS whose data types introduce the concept of a position at a certain time. This would allow us to perform the necessary queries without being concerned with the underlying representation of the data and optimization of the geographic functions. We are currently in the process of experimenting with the mentioned alternatives to identify the most appropriate approach for the batch layer.

The speed layer of our Lambda architecture implementation uses the Apache Kafka [8] streaming platform to store incoming data from queries to Viapass as a continuous stream of data. For the purpose of initial simulations, a Python script reads a batch of observations from the stored CSV files into a GeoPandas DataFrame. For preprocessing, a second DataFrame, which was loaded in memory beforehand, contains the geographic information of a set of Brussels street segments. We used a subset of Brussels streets for testing; in practice this would contain all streets in Brussels. By performing a join of the two DataFrames with the within geographic function provided by GeoPandas, we obtain a new DataFrame where every observation also contains the internal ID of the street segment the HGV was on at that time. These data are sent to Kafka for processing in the next step of the streaming pipeline.

At the receiving end of the data stream, the streaming facilities that are provided by Spark, which can directly be integrated with a Kafka stream, are used to process the data. Incoming data is processed and used to update the current state of all street segments that are being kept track of. This approach, which is referred to as stateful streaming, is illustrated in figure 4a. The state of a street segment is represented by the average number of HGVs and the average velocity of passing HGVs for every hour-of-the-day of the current day. For every new day at midnight, the state for each street segment is re-initialized to zero values for all properties. Values are subsequently updated continuously with a running mean for the current hour-of-the-day. Values for past hours-of-the-day will contain the mean observed statistics for that day and future values will be zero until the current time falls within the window for that hour-of-the-day. This process is illustrated by figure 4b.

Figure 4: Stateful streaming as implemented in the pipeline. (a) State updates according to incoming data stream. (b) Transition from the 3 a.m. hour-of-the-day window to the 4 a.m. window when data comes in that was sampled at 4 a.m. (Diagram labels: application state holding average velocity and average flow per hour-of-day, for the windows 3 a.m. - 4 a.m. and 4 a.m. - 5 a.m.; batch N; batch K; measurement time 04:00:00.)

In addition to keeping track of the observed values, forecasts are also made for future hours-of-the-day. Currently, predictions are made using a type of model that is referred to as a persistence model, more specifically, a sliding window persistence model. With this type of model a forecast is based directly on previously observed values for the same day-of-the-week and hour-of-day. In this implementation, the data is divided in one week seasons, meaning that predictions look at the data for the whole week rather than at the currently observed values, to make forecasts. As an example, if the data is hourly and the forecasting target is 9 a.m. on Monday, then given a window size of 1, the observation of last Monday at 9 a.m. will be returned as the predicted forecast. A window of size 2 means returning the average of the observations of the last two Mondays at the same hour, and so on for larger window sizes. However, while simple and explainable, this approach is rather naive, as it does not take into account the current traffic conditions or information that is known in advance, such as a planned special event. More advanced Machine Learning methods could incorporate this type of additional information for improved forecasting.

The final results are written to a JSON file, which is formatted according to the GeoJSON [6] specification. In this format, every street segment is described by a LineString instance that corresponds to the path of the street segment. In addition to this, each street segment is annotated with HGV counts and average velocities for each hour-of-the-day as properties. The resulting file serves as the real-time view for the considered street segments and can be read by the dashboard for display on a map, or to perform further analysis using the data, such as identifying the busiest streets at the current time.

2.3 Implementation of The MOBI-AID Dashboard

To provide an interface that would allow stakeholders to monitor the current traffic situation for HGVs in Brussels or perform historical analyses for future planning, a dashboard interface was implemented. A web interface provides this dashboard and was implemented with the Django [22] web framework, additionally making use of the first-party GeoDjango extension. Using this extension provides a direct integration with databases such as PostGIS and other useful geographical tools. These technologies were chosen for their flexibility and maturity, and because they required minimal additional learning, given our computer science backgrounds. The fact that these components are also very low level allows us to easily experiment with different alternative approaches.

The web interface comprises three pages: Home, Dashboard and About. The Home page provides an overview of the available features and displays a map that shows real-time HGV counts for the different communes that compose the Brussels Capital Region. Hovering over a specific commune will show the total number of HGVs that have last been observed in this commune. The HGV counts per commune are also shown in a table beside the map, where they are also divided by weight category. Figure 5 shows a prototype implementation for the home page with the user hovering over the Brussels City commune.

The About page provides more detailed information on the web interface and contains the documentation on the dashboard. It also mentions the sources of our funding and the project supporters.

Figure 5: Prototype home page of the web interface.

The Dashboard page provides the core functionality of the web application. This page consists of several tabs, each of which provides a certain type of visualization or allows specific analyses to be performed. In its current implementation, the dashboard consists of the following tabs: Real-time, Maps, Charts, Analytics and Predictions.

The Real-time tab is composed of several panels that display different types of real-time information, which are retrieved from the Lambda architecture's real-time view. In this tab, users can select the type of information they want to see, which will then be displayed on the map. A table next to the map displays a user selected overview of the information that is displayed on the map. For example, the top ten busiest streets can be displayed in this table. Figure 6 shows the current prototype for the Real-time tab. Note that in this figure the time window for collecting statistics is 15 minutes, as opposed to the one hour window that is used for the state of a street. This window corresponds to the interval between consecutive updates of the state rather than the hour-of-day window that is being updated in the state. Additionally, note that streets in the table are identified by IDs. In the final implementation we would use street names.

Figure 6: Work-in-progress Real-time tab of the MOBI-AID dashboard.

The Maps tab contains a large map that shows historical data about the observed HGV traffic as selected by the user. We distinguish two distinct ways to look at historical data in this situation.

for a certain hour-of-the-day on a certain day-of-the-week. The user can also select at which level of aggregation they want to see information displayed on the map. The currently provided levels of aggregation are commune level, street level and the level of individual HGVs. Individual HGVs cannot be shown when looking at the typical traffic situation, as concrete HGV positions evidently vary with time. However, in this case clusters would be shown at locations where HGVs are often present at the chosen hour-of-day and day-of-the-week. Figure 7 shows the work-in-progress Maps tab, without the website header, footer and the tab-selection menu. Note that the selection controls should be separated based on the previously selected type of visualization. These controls would also be shown on the map rather than above, as is currently the case.

Figure 7: Work-in-progress Maps tab of the dashboard. Without site headers and dashboard tab-selection side menu.
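Both the "typical" traffic view of the Maps tab and the sliding window persistence forecast of Section 2.2 rest on the same operation: averaging past observations that share a day-of-the-week and hour-of-day. The following is a minimal sketch of that operation, assuming hourly values held in memory as (timestamp, value) pairs. The function names and the sample counts are hypothetical and are not taken from the platform's code or data.

```python
from datetime import datetime
from statistics import mean


def seasonal_key(ts):
    """Return (day-of-week, hour-of-day) for a timestamp; Monday is 0."""
    return ts.weekday(), ts.hour


def persistence_forecast(history, target, window=1):
    """Sliding window persistence forecast for `target`.

    `history` is an iterable of (timestamp, value) pairs, one per hour.
    The forecast is the mean of the `window` most recent values observed
    before `target` on the same day-of-week and hour-of-day. Returns None
    when no matching observation exists.
    """
    matches = sorted(
        ((ts, value) for ts, value in history
         if ts < target and seasonal_key(ts) == seasonal_key(target)),
        key=lambda pair: pair[0],
    )
    recent = [value for _, value in matches[-window:]]
    return mean(recent) if recent else None


# Made-up HGV counts for one street segment (illustrative values only).
history = [
    (datetime(2018, 9, 10, 9), 40),  # Monday, 9 a.m.
    (datetime(2018, 9, 11, 9), 70),  # Tuesday, 9 a.m. (ignored for a Monday target)
    (datetime(2018, 9, 17, 9), 50),  # Monday, 9 a.m.
]
target = datetime(2018, 9, 24, 9)    # next Monday, 9 a.m.

print(persistence_forecast(history, target, window=1))  # 50
print(persistence_forecast(history, target, window=2))  # 45
```

With a window of 1 the forecast is simply last Monday's value; larger windows smooth over more weeks at the cost of reacting more slowly to change. Letting the window cover all available history gives a simple estimate of the typical value for that day-of-the-week and hour-of-day.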
The user can select to either look at the data at a specific time on a specific date, or they can choose to look at data that is typical

3 EVALUATION OF THE INITIAL PLATFORM

For the MOBI-AID dashboard to provide an optimal user experience and be a useful contribution to the field of big mobility data, two main aspects are of particular importance: adequate performance of the real-time data processing pipeline and the usability of the web interface. To evaluate performance, scalability tests were performed with a simulated stream that is read from the data which is currently being collected from Bruxelles Mobilité. The user interface was evaluated through user testing and feedback.

3.1 Experimental setting

Scalability testing was already performed with a previous version of the architecture in [5]. These experiments were performed on the Hadoop big data cluster of the MLG. This cluster is made up of 10 worker nodes, each with 24 CPU cores, managed by a master node which is the point of access for users and handles user interaction (interactive node). The resource manager YARN, which is an integral part of the Hadoop ecosystem, allocated 150 cores and 805 GB of RAM for the purpose of these tests.

Preliminary experiments with the new real-time architecture were run on a local machine with a 2.3 GHz Intel Core i5 CPU with 4 cores and 16 GB of RAM. This hardware setup is far from the processing power that is available on the cluster and will have much slower IO due to the absence of Hadoop. However, it should give an initial insight into the potential real-time capabilities of the implemented pipeline. Note that the code that is used in these simulations has not yet been optimized, as implementing the architecture was the priority in this phase. There are also some overheads introduced by the simulation environment, such as running Docker containers and local applications from the testing machine sharing CPU cycles.

The implemented simulation uses previously collected data that was stored in CSV files. These files contain collected observations for three days: 23, 24 and 25 September 2018. As the simulation was performed on limited hardware and accelerates the ingestion of data compared to the real situation, these files were filtered beforehand to only contain observations concerning three predetermined streets. New data is sampled from these files to simulate incoming data over one hour windows. This is a much larger sampling rate than in the real case, as we want to accelerate the simulations and are mostly interested in the correct functioning of the pipeline. The batch interval within which the processing should be completed was set to 10 seconds. This means that the simulation has to process the incoming batches 360 times faster than in the real case. This is one of the main reasons why the number of observed streets was so severely limited for the simulation. To evaluate the simulation, the output provided by the SparkUI interface, which is used to inspect the state of Spark execution, was analyzed. A snapshot of SparkUI after running the simulation is shown in figures 8 and 9.

Regarding user evaluation of the web interface, informal user evaluations were performed. Stakeholders from Bruxelles Mobilité were shown the work-in-progress interface and asked to provide informal feedback on the application. Additionally, colleagues with expertise in the area of data visualization, especially regarding mobility data, also gave their initial feedback on the currently provided functionalities.

3.2 Results

Figure 8: Overview of the SparkUI stream statistics for the simulation.

Figure 8 shows an overview of some relevant statistics collected by SparkUI. Here, the most informative charts are the top (input rate) and third from the top (processing time) ones. The variation in input rate shows that data ingestion peaks at certain points in the day, which illustrates the variation in HGV traffic depending on hour-of-the-day. The most important aspect of this figure is that the processing time for a batch is below the batch interval. As can be seen in the figure, the average batch processing time is 1.6 seconds, which is well below the batch interval of 5 seconds. The second chart from the top shows scheduling delay, i.e. the delay between scheduling of the job and the start of processing, which always remained 0 as batches were always processed within the batch interval. For this reason the bottom chart (total delay) is the same as the processing time chart, since processing time is the only source of delay.

Figure 9: Some important information provided by SparkUI on the Spark job that processes a single batch of data. (a) Table showing the different tasks of the job, distributed over 4 cores. (b) Event timeline of the parallel execution of the job.

Figure 9 shows essential information which SparkUI provides on a specific Spark job. Figure 9a shows that the job which processes a batch was parallelized over four tasks that are each handled by a different CPU core. Figure 9b shows the timeline of events that are part of handling a Spark job. The blue parts of the timeline correspond to scheduling of the job, the red parts to deserialization of the data and the green parts to actually processing the incoming records. The timeline shows that most of the processing time is actually spent on scheduling and deserialization of the tasks. This is because the number of records in a batch in this experiment is much smaller than in the real-world data stream. Figure 10 shows the same timeline as figure 9b when running the same task on the full dataset, i.e. with significantly more records in the processed batch.
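As a side check on the experimental setting of Section 3.1, the acceleration factor and the condition for a streaming job to keep up can be written out in a few lines. This is a sketch of the arithmetic only, using the figures quoted in the text. The function names are illustrative and are not part of the pipeline.

```python
def speedup_factor(real_window_s, batch_interval_s):
    """How much faster than real time a batch must be processed when
    `real_window_s` seconds of data arrive every `batch_interval_s` seconds."""
    return real_window_s / batch_interval_s


def keeps_up(processing_time_s, batch_interval_s):
    """A micro-batch stream keeps up when each batch finishes before the
    next one is due, so that no scheduling delay accumulates."""
    return processing_time_s < batch_interval_s


ONE_HOUR_S = 60 * 60

# One-hour windows of observations replayed with a 10-second batch interval.
print(speedup_factor(ONE_HOUR_S, 10))  # 360.0

# Average processing time of 1.6 s against the 5 s interval quoted with Figure 8.
print(keeps_up(1.6, 5))  # True
```

The second check corresponds to the scheduling delay chart in SparkUI: as long as the processing time stays below the batch interval, the scheduling delay remains zero.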
In the full-dataset experiment shown in figure 10, 8 cores were allocated.

Figure 10: The event timeline of a Spark job when performing the simulation with all observations of a day. Ran with 8 cores allocated.

Regarding user evaluation of the web interface, the general consensus was that the current interface can already provide some basic insights, but requires more advanced tools and visualizations to provide added value to our potential users, compared to equivalent tools that are currently available.

3.3 Discussion

The results from the performed experiments indicate that the current architecture is promising for use in a real-life scenario. Taking the results from the previous experiments in [5] and the well-known reliability of the used technologies into account, we expect that, given appropriate hardware and optimization, there should be no issue in dealing with the amounts of data we are working with.

Initial tests with the full data set were also performed on the same hardware as the preliminary experiments. Results are promising given the single-node setting, but further experiments are needed to assess the architecture in a cluster setting. However, these preliminary results let us anticipate that no performance issues should be expected when using the full processing power of a big data cluster.

SparkUI was an important tool in debugging and analyzing the performance of the implemented pipeline. The insights it provides into the execution of jobs enable detailed monitoring of how well the implemented code for a big data project performs in the Hadoop + Spark environment. These insights are especially useful for assessing whether the implemented pipeline will perform well, even without the use of big-data capable hardware. For example, it is with the help of SparkUI that we can clearly see that the scheduling and deserialization overheads visible in figure 9b become insignificant when working with larger data batches, as shown by the results in figure 10.

4 FUTURE WORK

Future work consists of finalizing the pipeline architecture and connecting the different components of the MOBI-AID big data platform together. One possible extension that is currently envisioned is to add a merged view that uses data from both the speed and batch layers to, for example, show discrepancies between the real-time traffic conditions and typical conditions. Figure 11 visualizes this extension of our current implementation.

Figure 11: Overview of the future lambda architecture with a merged data view.

Given this finalized implementation, we will perform extensive experiments on the MLG big data cluster, which is powered by Apache Hadoop, as opposed to a regular office machine. MLG is currently in the process of migrating to a new cluster which should provide the necessary facilities for large-scale experiments. The goal of these experiments would be to move beyond simulation. Concretely, we would hook up the implemented pipeline to the actual stream of incoming data.

Implementing and experimenting with more advanced Machine Learning approaches for forecasting will also be an important task in providing more nuanced predictions. Additionally, integrating existing mobility indicators and advanced ITS models from related research will provide appropriate metrics to policy makers. The platform should be able to perform such processing in real time and use the forecasts to simulate the impact of a policy.

Next to this, a finalized web interface will provide stakeholders with the necessary tools to make informed decisions on how to optimize traffic of goods in the Brussels Capital Region. Further extending the current interface with feedback from the users should allow us to provide this ideal interface. Concretely, further versions of the real-time tab will also include other visualizations besides the map, such as relevant charts and differences with the typical traffic situation at this hour-of-the-day. The final version of this tab should allow users to easily spot anomalies in the current traffic situation compared to historical observations.

Prototypes for the Charts, Analytics and Predictions tabs have not been implemented yet. It is currently under review whether these should be separate tabs, or if they should be combined into a single general Analysis tab. Conceptually, the Charts tab would contain several types of charts that show useful information, such as the typical distribution of HGVs over communes. The Analytics tab would contain tools that allow the user to perform a specific analysis, such as constructing a model of traffic flow based on the available data. The Predictions tab would put more emphasis on training and using the previously mentioned forecasting methods to predict future states of the HGV traffic in Brussels. These models could then be used by policy makers to simulate effects of certain decisions, such as modifying existing roads. Determining where the envisioned functionality should live will be one of the next steps in the design of the interface.

After the full prototype of the web interface has been implemented, extensive user studies and formal retrieval of user requirements will be done to get a better insight into what the final web interface should provide. Iterating further and using agile software development methods should allow us to provide the end-users with the tools they need in a user-friendly manner.

Finally, packaging the platform for deployment will give the different stakeholders the envisioned platform that fits their requirements and allow them to easily deploy it on their own hardware. This platform should also scale to be used for the whole country and, given appropriate data, it could also be used for other countries.

ACKNOWLEDGMENTS

Arnau Dillen, Giovanni Buroni, Yann-Aël Le Borgne and Gianluca Bontempi acknowledge the support of Programme Opérationnel FEDER 2014-2020 de la Région de Bruxelles Capitale (ICITY MOBI-AID project). The authors are also grateful to Bruxelles Mobilité for having provided the OBU data necessary for the work.

REFERENCES

[1] Stephen Anderson, Julian Allen, and Michael Browne. 2005. Urban logistics––how can it meet policy makers’ sustainability objectives? Journal of Transport Geography 13, 1 (2005), 71–81. https://doi.org/10.1016/j.jtrangeo.2004.11.002 Sustainability and the Interaction Between External Effects of Transport (Part Special Issue, pp. 23-99).
[2] J. S. Angarita-Zapata, A. D. Masegosa, and I. Triguero. 2019. A Taxonomy of Traffic Forecasting Regression Problems From a Supervised Learning Perspective. IEEE Access 7 (2019), 68185–68205. https://doi.org/10.1109/ACCESS.2019.2917228
[3] Hugo Barbosa, Marc Barthelemy, Gourab Ghoshal, Charlotte R. James, Maxime Lenormand, Thomas Louail, Ronaldo Menezes, José J. Ramasco, Filippo Simini, and Marcello Tomasini. 2018. Human mobility: Models and applications. Physics Reports 734 (2018), 1–74. https://doi.org/10.1016/j.physrep.2018.01.001
[4] Giovanni Buroni, Yann-Aël Le Borgne, Gianluca Bontempi, and Karl Determe. 2018. Cluster Analysis of On-Board-Unit Truck Big Data from the Brussels Capital Region. 21st IEEE International Conference on Intelligent Transportation Systems (2018).
[5] Giovanni Buroni, Yann-Aël Le Borgne, Gianluca Bontempi, and Karl Determe. 2018. On-Board-Unit Data: A Big Data Platform for Scalable storage and Processing. 1–5. https://doi.org/10.1109/CloudTech.2018.8713342
[6] Howard Butler, Martin Daly, Allan Doyle, Sean Gillies, Hagen Stefan, and Tim Schaub. 2016. GeoJSON. Internet Engineering Task Force. https://tools.ietf.org/html/rfc7946
[7] Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, and Gianluca Bontempi. 2018. SCARFF: A scalable framework for streaming credit card fraud detection with spark. Information fusion 41 (2018), 182–194.
[8] Apache Kafka Committers. 2019. Apache Kafka. Apache Software Foundation. https://kafka.apache.org/
[9] Apache Spark Committers. 2019. Apache Spark. Apache Software Foundation. https://spark.apache.org/
[10] Konstantinos Demertzis, Lazaros Iliadis, and Vardis-Dimitris Anezakis. 2019. A Machine Hearing Framework for Real-Time Streaming Analytics Using Lambda Architecture. In Engineering Applications of Neural Networks, John Macintyre, Lazaros Iliadis, Ilias Maglogiannis, and Chrisina Jayne (Eds.). Springer International Publishing, Cham, 246–261.
[11] GeoPandas developers. 2019. GeoPandas. GeoPandas developers. http://geopandas.org/index.html#
[12] PostgreSQL Developers. 2019. PostgreSQL. The PostgreSQL Global Development Group. https://www.postgresql.org
[13] Anzhelika Dombalyan, Viktor Kocherga, Elena Semchugova, and Nikolai Negrov. 2017. Traffic Forecasting Model for a Road Section. Transportation Research Procedia 20 (2017), 159–165. https://doi.org/10.1016/j.trpro.2017.01.040 12th International Conference on Organization and Traffic Safety Management in large cities, SPbOTSIC-2016, 28-30 September 2016, St. Petersburg, Russia.
[14] PostGIS Development Group. 2019. PostGIS. The Open Source Geospatial Foundation. https://postgis.net/
[15] S. Hadavi, S. Verlinde, W. Verbeke, C. Macharis, and T. Guns. 2019. Monitoring Urban-Freight Transport Based on GPS Trajectories of Heavy-Goods Vehicles. IEEE Transactions on Intelligent Transportation Systems 20, 10 (Oct 2019), 3747–3758. https://doi.org/10.1109/TITS.2018.2880949
[16] M. Kiran, P. Murphy, I. Monga, J. Dugan, and S. S. Baveja. 2015. Lambda architecture for cost-effective batch and speed big data processing. In 2015 IEEE International Conference on Big Data (Big Data). 2785–2792. https://doi.org/10.1109/BigData.2015.7364082
[17] Narayan Kumar. 2017. Twitter’s tweets analysis using Lambda Architecture. https://blog.knoldus.com/twitters-tweets-analysis-using-lambda-architecture/.
[18] I. Lana, J. Del Ser, M. Velez, and E. I. Vlahogianni. 2018. Road Traffic Forecasting: Recent Advances and New Challenges. IEEE Intelligent Transportation Systems Magazine 10, 2 (Summer 2018), 93–109. https://doi.org/10.1109/MITS.2018.2806634
[19] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of massive datasets. Cambridge university press.
[20] Nathan Marz and James Warren. 2015. Big Data: Principles and best practices of scalable real-time data systems. New York; Manning Publications Co.
[21] Apache Hadoop Project Members. 2019. Apache Hadoop. Apache Software Foundation. https://hadoop.apache.org/
[22] Django Team Members. 2019. Django. Django Software Foundation. https://www.djangoproject.com/
[23] David Myr. 2003. Real time vehicle guidance and traffic forecasting system. US Patent 6,615,130.
[24] Daiga Plase, Laila Niedrite, and Romans Taranovs. 2016. Accelerating data queries on Hadoop framework by using compact data formats. In Advances in Information, Electronic and Electrical Engineering (AIEEE), 2016 IEEE 4th Workshop on. IEEE, 1–7.
[25] Mohammed A. Quddus, Chao Wang, and Stephen G. Ison. 2010. Road Traffic Congestion and Crash Severity: Econometric Analysis Using Ordered Response Models. Journal of Transportation Engineering 136, 5 (2010), 424–435. https://doi.org/10.1061/(ASCE)TE.1943-5436.0000044
[26] John Ratcliffe and Ela Krawczyk. 2011. Imagineering city futures: The use of prospective through scenarios in urban planning. Futures 43, 7 (2011), 642–653. https://doi.org/10.1016/j.futures.2011.05.005 Alternative City Futures.
[27] Dilpreet Singh and Chandan K Reddy. 2015. A survey on platforms for big data analytics. Journal of Big Data 2, 1 (2015), 8.
[28] Hongyu Sun, Henry X. Liu, Heng Xiao, Rachel R. He, and Bin Ran. 2003. Use of Local Linear Regression Model for Short-Term Traffic Forecasting. Transportation Research Record 1836, 1 (2003), 143–150. https://doi.org/10.3141/1836-18
[29] CP Van Hinsbergen, JW Van Lint, and FM Sanders. 2007. Short term traffic prediction models. In PROCEEDINGS OF THE 14TH WORLD CONGRESS ON INTELLIGENT TRANSPORT SYSTEMS (ITS), HELD BEIJING, OCTOBER 2007.
[30] JWC Van Lint and CPIJ Van Hinsbergen. 2012. Short-term traffic and travel time prediction models. Artificial Intelligence Applications to Critical Transportation Issues 22, 1 (2012), 22–41.
[31] Eleni I. Vlahogianni, Matthew G. Karlaftis, and John C. Golias. 2014. Short-term traffic forecasting: Where we are and where we’re going. Transportation Research Part C: Emerging Technologies 43 (2014), 3–19. https://doi.org/10.1016/j.trc.2014.01.005 Special Issue on Short-term Traffic Flow Forecasting.
[32] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. 2016. Apache spark: a unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[33] Esteban Zimányi, Mahmoud Sakr, Arthur Lesuisse, and Mohamed Bakli. 2019. MobilityDB: A Mainstream Moving Object Database System. In Proceedings of the 16th International Symposium on Spatial and Temporal Databases (SSTD ’19). ACM, New York, NY, USA, 206–209. https://doi.org/10.1145/3340964.3340991