A Preliminary Assessment of the Traffic Measures in Madrid City

Instituto de Estudios Fiscales

Avda. Cardenal Herrera Oria, Madrid, Spain

mpilar.rey@ief.hacienda.gob.es

A potential source for producing reliable statistical information is the huge amount of data files created by the activity of electronic sensing devices. In particular, datasets collecting data from traffic sensors can be downloaded from the open data portal offered by the local government of Madrid City. The traffic sensors are a rich source of information, providing data not only on the vehicle count but also on, e.g., vehicle speed. However, processing the data at the required level of granularity involves complex workloads that exceed the capabilities of traditional analytical data processing technologies and require big data specific tools. The first part of the paper is devoted to the steps in producing short-term indicators of the evolution of the traffic flow variable in Madrid using the Spark big data platform. Taking advantage of the information on the sensors' geographical location, the indicators are then analyzed to assess the impact of some recent local government measures aimed at reducing pollution and traffic congestion.

Keywords: Big Data, Short-term Indicators, Spark Platform, Traffic measures

The local government of Madrid City offers an open data portal designed for users to explore and download its publicly accessible data. The datasets available include data from traffic sensors located at strategic points on the roads and streets of Madrid City. These traffic sensors are a rich source of information, providing data not only on the vehicle count but also, e.g., on vehicle speed and on the sensor's geographical location. A number of studies on traffic sensors [6,5] report that they provide, in general, accurate traffic measures.

The volume of the downloaded information cannot be processed using conventional statistical software and requires procedures specifically developed for this purpose. Apache Spark [13], an open source analytics engine for Big Data processing, has been used on a single node for the first steps of collecting and pre-processing the data. The first aim of the paper is to study the traffic in the city from 2016 onwards, constructing daily indicators of its evolution. Monitoring the real evolution is a more difficult task than it appears at first glance. In order to obtain good enough indicators, and before the final calculations to compute the indexes, various steps are required to detect and correct logical inconsistencies in the data, impute missing information, and summarize at different granularity levels.

Once the indicators are available, the traffic evolution can be analyzed to learn significant patterns of behavior. The information on the sensors' geographical location may help at this stage to discover similarities and differences between zones in Madrid City. In addition, combining all these data makes it possible to evaluate the results of the recent traffic measures taken by the local government to improve the levels of air pollution within the city and surrounding areas.

The remainder of this paper is organized as follows: the next section presents a summary of the steps taken to construct the indicators; section 3 analyses the high-frequency series obtained; section 4 performs the assessment of the traffic measures; and, finally, a number of remarks and conclusions are given in section 5.

2 Construction of the daily indicators

The raw data used as the source for computing the time series consist of the datasets made available in the portal after the end of each month. Each dataset includes the figures of the previous month, for each of the more than 4000 sensors, of a number of variables measured in 15-minute intervals. This makes around 150 million data points for each year and each variable. Besides the previously cited Apache Spark, Python [10] has been used for all calculations and analyses once the indicators have been obtained.
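As a rough check of the order of magnitude quoted above, and assuming exactly 4000 sensors that each report every 15 minutes (96 readings per day) over a 365-day year, the number of data points per variable would be:

```latex
$$4000 \times 96 \times 365 = 140\,160\,000 \approx 1.4 \times 10^{8}$$
```

With somewhat more than 4000 sensors, this is consistent with the figure of around 150 million data points per year and variable.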

Although the datasets provide information on more variables, this paper studies only a single variable, the intensity measured by the number of vehicles per time unit, as an example of the analysis that could be performed. A daily intensity indicator will be computed for the whole city, and also split into the urban area and the M30 ring road. For this purpose, the calculations are performed in several stages. Given that the names and/or categories of the intensity and sensor type variables have changed in the datasets over time, a first preprocessing step treats the data to make them homogeneous. After this, as the daily level of time granularity has been chosen, the total number of vehicles per sensor and day is calculated.
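The aggregation from 15-minute readings to daily totals per sensor can be sketched as follows. This is only a minimal illustration in plain Python, not the Spark job used for the paper; the record layout, the sensor identifiers and the use of `None` for a reading that was not transmitted are all assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records: (sensor id, timestamp, vehicles counted in the interval).
# None marks a reading that was not transmitted (assumed convention).
readings = [
    ("s1", "2016-01-01 00:00:00", 12),
    ("s1", "2016-01-01 00:15:00", 8),
    ("s1", "2016-01-02 00:00:00", 5),
    ("s2", "2016-01-01 00:00:00", 30),
    ("s2", "2016-01-01 00:15:00", None),
]


def daily_totals(records):
    """Return two dicts keyed by (sensor id, date):
    the total number of vehicles and the number of valid readings."""
    totals = defaultdict(int)
    valid_readings = defaultdict(int)
    for sensor_id, stamp, vehicles in records:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        key = (sensor_id, day)
        # Touch both keys so that a day with only missing readings still appears.
        totals[key] += 0
        valid_readings[key] += 0
        if vehicles is not None:
            totals[key] += vehicles
            valid_readings[key] += 1
    return dict(totals), dict(valid_readings)


totals, valid_readings = daily_totals(readings)
```

Keeping the count of valid readings next to each total is a design choice made here so that a later validation step can decide whether the daily total is usable.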

As the next stage, data editing must be performed to ensure completeness and validity, because the transmission of information from some sensor nodes may sometimes fail. To detect these failures, daily data with more than a certain proportion of missing readings are not validated. These data, together with the missing data, are imputed by a procedure described later.
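A validation rule of this kind might look like the sketch below. The paper does not state the proportion used, so the 25% threshold is purely hypothetical, as are the function and constant names.

```python
READINGS_PER_DAY = 96      # 24 hours x 4 readings of 15 minutes
MAX_MISSING_SHARE = 0.25   # hypothetical threshold; the actual value is not given


def is_validated(n_valid_readings, expected=READINGS_PER_DAY,
                 max_missing=MAX_MISSING_SHARE):
    """A sensor-day is validated when the share of missing readings
    does not exceed the chosen threshold."""
    missing_share = 1.0 - n_valid_readings / expected
    return missing_share <= max_missing


def split_by_validation(valid_readings):
    """Split {(sensor, day): number of valid readings} into the keys that
    are validated and the keys that will need imputation."""
    validated = {key for key, n in valid_readings.items() if is_validated(n)}
    to_impute = set(valid_readings) - validated
    return validated, to_impute
```

For example, `split_by_validation({("s1", "d1"): 96, ("s2", "d1"): 10})` keeps the first sensor-day and sends the second one to imputation.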

Since the intensity of the traffic on a road is defined by the number of vehicles passing along the road in a period of time, the natural way to measure the intensity in an area would be the average number of vehicles over all the roads and streets located in the area. As there are not sensors on all the roads and streets, it can be approximated by the average number of vehicles over all the sensors located in the area during the period. But the transmission of information from some nodes may sometimes fail due to environmental interference, physical damage or lack of power. Therefore, changes in the averages could be caused by changes in the location and/or activity of the sensors and not necessarily by changes in the traffic intensity in the area.

Being flow data, a simple aggregative index [9] could be used to compute the evolution of the intensity. Instead, to solve the previous problem in measuring the evolution, the indicators are computed as change estimators or chain-linked indexes

$$I_t = I_{t-1} \cdot \frac{\sum_k x_{kt}}{\sum_k x_{k,t-1}} \tag{1}$$

where $x_{kt}$ is the total number of vehicles recorded by sensor $k$ on day $t$ and the sum is extended to the $k$ sensors having data validated for both periods $t$ and $t-1$. The indexes $I_0$ for the first period, the first of January 2016, are calculated as the average over the sensors in the area of the total number of vehicles on this day. Once the indexes of a day are computed and before calculating the indexes of the following day, the sensors $i$ having missing data on this day are imputed as

$$\hat{x}_{it} = x_{i,t-1} \cdot \frac{\sum_k x_{kt}}{\sum_k x_{k,t-1}} \tag{2}$$

where the sum is also extended to all $k$ sensors having data validated for both periods $t$ and $t-1$. Then the imputed values are validated and the indexes are re-calculated, obtaining the same values as before. In this way, the imputed data are available for the calculation of the indexes of the following day. It can be shown that, using this simple method of imputation, the indexes are always computed using all the information available, and they are not deteriorated by a repeated lack of information on some sensors.
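The chain-linked index and the imputation rule described above can be illustrated with a small sketch. The sensor names and figures are invented; a sensor that is absent from the dictionary of a day stands for a missing or non-validated value.

```python
def chain_link_day(prev_index, prev_values, today_values):
    """One step of the chain-linked index with imputation.

    prev_values: {sensor: daily total} for day t-1 (observed or imputed).
    today_values: {sensor: daily total} for day t, validated sensors only.
    Returns the index for day t and the day-t values completed by imputation.
    """
    common = [s for s in today_values if s in prev_values]
    if not common:
        raise ValueError("no sensor has validated data on both days")
    ratio = (sum(today_values[s] for s in common)
             / sum(prev_values[s] for s in common))
    index_today = prev_index * ratio
    completed = dict(today_values)
    for sensor, previous in prev_values.items():
        if sensor not in completed:
            # Imputation: carry the previous value forward with the common change.
            completed[sensor] = previous * ratio
    return index_today, completed


# Day 0: all three sensors report; the base index is their average.
day0 = {"a": 100.0, "b": 200.0, "c": 300.0}
i0 = sum(day0.values()) / len(day0)

# Day 1: sensor "c" is missing.
day1 = {"a": 110.0, "b": 220.0}
i1, day1_completed = chain_link_day(i0, day0, day1)

# Re-computing the index with the imputed value included leaves it unchanged.
i1_again, _ = chain_link_day(i0, day0, day1_completed)
```

In this toy case the common sensors grow by 10%, so the index moves from 200 to about 220 and sensor "c" is imputed at about 330; recomputing with the imputed value reproduces the same index, which is the property noted in the text.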

After the imputations are computed in this way, there is a remaining problem: there are days for which there are no data for any sensor, so the indexes cannot be calculated. The series of daily changes are then used to complete the missing days by means of time series predictions. The first attempt at forecasting was made using LSTM (Long Short-Term Memory) deep neural networks [8], a class of artificial neural networks able to exhibit temporal dynamic behavior. These networks have proven able to outperform state-of-the-art univariate time series forecasting methods. However, in our case, with fewer than 4 years of data, forecasts from ARIMA models, following the Box-Jenkins methodology [1], have obtained better results in terms of minimum mean square error of forecast.
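The way predicted daily changes complete the index can be sketched as below. Two things are assumptions here: the daily change is taken to be the ratio between the indexes of consecutive days, and the predictor is a deliberately naive stand-in (the mean of past changes observed on the same day of the week), not the ARIMA or LSTM models discussed in the text.

```python
def fill_missing_changes(changes, period=7):
    """Replace None entries with the mean of the earlier observed changes
    at the same position in the weekly cycle (naive placeholder predictor)."""
    filled = list(changes)
    for t, value in enumerate(changes):
        if value is None:
            history = [c for u, c in enumerate(changes[:t])
                       if c is not None and u % period == t % period]
            if not history:
                raise ValueError(f"no earlier observation to predict day {t}")
            filled[t] = sum(history) / len(history)
    return filled


def rebuild_index(first_index, changes):
    """Chain the daily changes starting from the index of the first day."""
    index = [first_index]
    for change in changes:
        index.append(index[-1] * change)
    return index


# Two invented weeks of daily changes; one day has no data for any sensor.
observed = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 2.0,
            1.0, 1.0, 1.0, 1.0, 1.0, None, 2.0]
completed = fill_missing_changes(observed)
index = rebuild_index(100.0, completed)
```

Any forecasting model can replace `fill_missing_changes` as long as it returns one predicted change per missing day.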

As a final stage, once the microdata have been imputed and the missing daily changes have been predicted, intensity indicators are computed for the whole city, the M30 ring road and the urban area.

3 High-frequency series analysis

Even though the temporal granularity chosen is 1-day intervals, another aspect to consider is the distribution of the vehicle flows within the day. The traffic intensity for each combination of day of the week and hour may show interesting patterns. For this purpose, the average traffic intensity per day of the week and hour has been calculated for each sensor, and these averages have then been divided by the maximum traffic intensity found at this sensor in an hour. The method provides an approximate idea of the average level of occupancy during the week of the road or street on which the sensor is located.
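A minimal sketch of this weekly profile for one sensor is given below. The record layout is hypothetical, and the normalization is one possible reading of the text: the averages are divided by the largest single hourly count observed at the sensor (dividing by the largest average would be an alternative reading).

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24  # 168 slots, Monday 00h first


def weekly_profile(hourly_records):
    """hourly_records: iterable of (timestamp, vehicles in that hour) for one sensor.
    Returns 168 averages by day of week and hour, scaled by the largest
    hourly count observed at the sensor."""
    sums = [0.0] * HOURS_PER_WEEK
    counts = [0] * HOURS_PER_WEEK
    peak = 0.0
    for stamp, vehicles in hourly_records:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        slot = moment.weekday() * 24 + moment.hour
        sums[slot] += vehicles
        counts[slot] += 1
        peak = max(peak, vehicles)
    averages = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [a / peak for a in averages] if peak > 0 else averages


# Invented readings: two Mondays and one Saturday at 9 a.m.
records = [
    ("2016-01-04 09:00", 100),  # Monday
    ("2016-01-11 09:00", 60),   # Monday
    ("2016-01-09 09:00", 20),   # Saturday
]
profile = weekly_profile(records)
```

Here the Monday 9 a.m. slot averages 80 vehicles and the observed peak is 100, so its normalized value is 0.8, while the Saturday slot is 0.2.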

Fig. 4 shows an example of the profile for a particular sensor (tick marks indicate noon for each day), where the decay at the weekend and a peak around 9 a.m. each weekday can be seen. These profiles form 168-dimensional points. Clusters of these points have been built using the K-means algorithm with the Euclidean distance [4] to explore and summarize the results. Fig. 5 shows the centers of the clusters for k = 10 clusters.
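For illustration, a compact K-means with Euclidean distance is sketched below in plain Python. It is not the implementation used for the paper: the initialization is a simple deterministic farthest-point rule chosen here for reproducibility, and the toy profiles have 3 values instead of 168.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points,
                       key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def k_means(points, k, max_iterations=100):
    """Alternate assignment and center update until the labels stop changing."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy normalized profiles: three lightly used and three heavily used sensors.
profiles = [
    [0.1, 0.2, 0.1], [0.2, 0.2, 0.1], [0.1, 0.3, 0.2],
    [0.8, 0.9, 0.7], [0.9, 1.0, 0.8], [0.7, 0.9, 0.8],
]
centers, labels = k_means(profiles, k=2)
```

With the real data, `profiles` would hold one 168-value list per sensor and `k` would be 10, as in Fig. 5.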

Although the elbow method [11] for determining the optimal number of clusters is not totally conclusive, this number does not have a big impact on the results: similar graphs and conclusions could be obtained with another number of clusters. As general patterns for all roads or streets, besides a decay at weekends, it is found that the traffic intensity decreases during night hours (from 1 to 5 a.m.), especially on weekdays, and that there are generally decays around noon and 3 p.m. Besides these general features, there are big differences between the levels of occupancy, ranging from light in clusters 2 and 5 to heavy in clusters 4 and 6. It can also be seen that sensors in clusters 3 and 10 have maximum traffic on weekdays at morning commuting hours, while sensors in clusters 4, 8 and 9 peak in the afternoon hours. Therefore, there are two aspects that may characterize the sensors' weekly behavior and may be of interest to explore and describe: the global level of occupancy, and the time of the day at which the intensity on weekdays is the highest.

Instead of visually studying the graphs to assign a level of occupancy to each sensor, the sensors are automatically classified into three levels, depending on the computed area under the weekly profile curve normalized by its maximum. Fig. 6 shows the average level of occupancy obtained from the sensors in the boroughs of Madrid City. It can be seen that most of the areas with Light traffic intensity are outside the central part of the city.
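A possible form of this classification is sketched below. The area is computed with the trapezoidal rule and rescaled to lie between 0 and 1; the cut-off values and the name of the intermediate level ("Medium") are hypothetical, since the paper does not report them.

```python
def occupancy_score(profile):
    """Trapezoidal area under the normalized weekly profile,
    divided by the length of the curve so that it lies in [0, 1]."""
    area = sum((a + b) / 2 for a, b in zip(profile, profile[1:]))
    return area / (len(profile) - 1)


def occupancy_level(profile, light_below=0.3, heavy_from=0.5):
    """Map the score to one of three levels (cut-offs are assumptions)."""
    score = occupancy_score(profile)
    if score < light_below:
        return "Light"
    if score < heavy_from:
        return "Medium"
    return "Heavy"
```

For instance, a flat profile at 0.2 would be classified as Light and a flat profile at 0.8 as Heavy under these assumed cut-offs.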

Similarly, the sensors can be automatically classified into three groups depending on the time of the day at which the intensity on weekdays is the highest. A sensor belongs to the Morning commuting group when its weekday Morning commuting average exceeds its Afternoon average by more than 20%, and to the Afternoon group when its Afternoon average exceeds its Morning commuting average by more than 20%, Morning commuting being between 7 and 9 a.m. and Afternoon between 2 and 9 p.m.; otherwise it belongs to the All day group. Fig. 7 provides an idea of the typical weekday profile of usage of the roads and streets.
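The grouping rule can be written down directly, as in the following sketch. It assumes a 168-value weekly profile that starts on Monday at 00h, and it treats the time windows as half-open (hours starting at 7 and 8 a.m., and at 2 to 8 p.m.), which is an interpretation of the text.

```python
MORNING_HOURS = range(7, 9)      # hours starting at 7 and 8 a.m. (assumed reading)
AFTERNOON_HOURS = range(14, 21)  # hours starting at 2 p.m. up to 8 p.m. (assumed reading)


def usage_group(profile, margin=0.20):
    """Classify a 168-value weekly profile (Monday 00h first) by the
    weekday period in which the intensity is clearly the highest."""
    def weekday_mean(hours):
        values = [profile[day * 24 + hour] for day in range(5) for hour in hours]
        return sum(values) / len(values)

    morning = weekday_mean(MORNING_HOURS)
    afternoon = weekday_mean(AFTERNOON_HOURS)
    if morning > (1 + margin) * afternoon:
        return "Morning commuting"
    if afternoon > (1 + margin) * morning:
        return "Afternoon"
    return "All day"


def toy_profile(morning_level, afternoon_level, base=0.3):
    """Build an invented profile with given weekday morning/afternoon levels."""
    profile = [base] * 168
    for day in range(5):
        for hour in MORNING_HOURS:
            profile[day * 24 + hour] = morning_level
        for hour in AFTERNOON_HOURS:
            profile[day * 24 + hour] = afternoon_level
    return profile
```

With these helpers, `usage_group(toy_profile(0.9, 0.5))` falls in the Morning commuting group, while equal levels in both periods give the All day group.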

4 Assessment of the impact of the traffic measures

In recent years, the local government of Madrid City has taken some measures aimed at reducing pollution. Although the current understanding of the air pollution impacts of traffic congestion on roads is limited [14], it seems that vehicle emissions and traffic-related pollution are typically among the largest contributors to air pollution in cities. This paper studies just one variable, the traffic intensity, and, consequently, the evaluation refers exclusively to the effects on traffic reduction, and not directly to the effects on air pollution. The most important traffic measures taken are summarized in Table 1.

As the measures have been taken gradually, a first assessment of the impact on the whole city can be made from the annual average rates in Table 2. The global indicator reflects the behavior of the whole Madrid City area, and the other indicators (M30 and Urban) also extend over their whole areas. For this reason, it is unlikely that any effect of the traffic measures will be found, because the measures refer to only some zones and there may also be opposite effects in other parts.

To check the hypothesis of a possible effect on any of the indicators, ARIMA models with intervention analysis [2,3] have been used. Thus, a basic multiplicative ARIMA model with weekly seasonality has been fitted to each series using the Scikit-learn software library [7]. Some additive outliers and a specific variable to measure the effect of Easter, a relevant moving holiday for daily data, have also been included as regressors. Then, different intervention variables, trying to capture the effects of the traffic measures (with different structures and different dates), have been tested. But the corresponding parameter estimates have never been significantly different from zero.
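In generic notation, a model of this kind can be written as follows. The symbols are introduced here for illustration only, and the orders of the polynomials and of the differences are not reported in the paper:

```latex
$$y_t = \omega\,\xi_t^{(T)} + \sum_j \beta_j z_{jt} + N_t,
\qquad
\phi(B)\,\Phi(B^{7})\,(1-B)^{d}\,(1-B^{7})^{D}\,N_t = \theta(B)\,\Theta(B^{7})\,a_t$$
```

Here $y_t$ is the daily indicator (possibly in logarithms), $\xi_t^{(T)}$ is an intervention variable for a measure starting at date $T$ (for example a step that is 0 before $T$ and 1 from $T$ onwards), the $z_{jt}$ are the Easter and additive outlier regressors, $B$ is the backshift operator, and $a_t$ is white noise. Testing the effect of a measure amounts to testing whether $\omega$ differs significantly from zero.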

In any case, the assessment is better referred to zones that can be affected by the measures. The information about the geographical location of the sensors, also provided in the open data portal of Madrid City, can be used for this. Two zones that are probably affected have been considered: Madrid Central, the area with borders defined by the local government and to which some of the traffic measures refer, and another area defined as a crown of 300 meters surrounding Madrid Central, which will be named Crown. The delimitation of the zones appears in Fig. 8.
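Assigning sensors to the two zones can be sketched with elementary geometry, as below. The sketch assumes that sensor locations and the Madrid Central border are available in a projected coordinate system measured in meters; the square used as border is invented and only stands in for the real polygon.

```python
import math


def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(point, a, b):
    """Euclidean distance from point to the segment a-b."""
    px, py = point
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def zone_of(point, border, crown_width=300.0):
    """Return 'Madrid Central', 'Crown' or 'Outside' for a sensor location."""
    if point_in_polygon(point, border):
        return "Madrid Central"
    n = len(border)
    nearest = min(distance_to_segment(point, border[i], border[(i + 1) % n])
                  for i in range(n))
    return "Crown" if nearest <= crown_width else "Outside"


# Invented 1 km square standing in for the real border (coordinates in meters).
toy_border = [(0.0, 0.0), (1000.0, 0.0), (1000.0, 1000.0), (0.0, 1000.0)]
```

A GIS library would normally be used for this step; the plain-Python version only shows the logic of the inside test and of the 300-meter buffer.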

What can be done now is to compute new indicators, following the rules in section 2, for the two zones, including in each one the data of the sensors within the corresponding area. Thus, intensity indicators for Madrid Central and Crown zones appear in Fig. 9 and Fig. 10.

For a first assessment, Table 3 shows the annual average rates, where possible effects now appear. There is a gradual reduction in Madrid Central, probably reflecting the cumulative effect of the different measures. The Crown area, for its part, shows a clear increase in 2018, plausibly the result of a substitution or border effect. Nevertheless, this may revert as a result of the latest traffic measures in 2019.

Fig. 11 and Fig. 12 present the corresponding monthly average and monthly average annual rates, respectively, of the traffic intensity in the Madrid Central and Crown zones.
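The two summaries can be computed from a daily indicator as in the sketch below. It reads "monthly average annual rate" as the percentage change of a monthly average over the same month one year earlier, which is an assumption; the daily values are invented.

```python
from collections import defaultdict


def monthly_averages(daily_index):
    """daily_index: {(year, month, day): value} -> {(year, month): mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (year, month, _day), value in daily_index.items():
        sums[(year, month)] += value
        counts[(year, month)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def annual_rates(monthly):
    """Percentage change of each monthly average over the same month
    of the previous year, when that month is available."""
    return {
        (year, month): 100.0 * (value / monthly[(year - 1, month)] - 1.0)
        for (year, month), value in monthly.items()
        if (year - 1, month) in monthly
    }


daily = {
    (2018, 3, 1): 100.0, (2018, 3, 2): 110.0,
    (2019, 3, 1): 90.0, (2019, 3, 2): 99.0,
}
monthly = monthly_averages(daily)
rates = annual_rates(monthly)
```

In this toy series the March average falls from 105 to 94.5, an annual rate of about -10%.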

With the aim of providing more detailed explanations, both series have been treated in a similar way to the previous ones in order to find possible effects of the traffic measures. That is, basic multiplicative ARIMA models [2,3] with weekly seasonality, an Easter variable, and additive outliers have been fitted, and then different intervention variables have been tested using the Scikit-learn [7] software library. Although at first glance, from Fig. 12, one of the most important measures, the start of Madrid Central in March 2019, seems to be having some effects (both indicators show annual decreases from April 2019), no significant effects have been found. Nor have any other significant interventions related to the traffic measures been found, probably because their gradual implementation may already be described by the ARIMA model.

Another interesting analysis is to see whether there has been any effect on the weekly patterns of behavior of the roads and streets located in both areas. To simplify the study, the period since the complete implementation of all measures (starting on March 16, 2019) is compared to an equivalent period in 2017 (March 16, 2017, to August 30, 2017), when hardly any traffic measure had begun to work.

As a summary result, Fig. 13 classifies the sensors according to whether they have experienced an improvement or a worsening in the level of occupancy, computed as described in section 3, over these two years.
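One simple way to derive such a classification is to compare the occupancy level of each sensor in the two periods, as sketched below. The three level names, their ordering and the "No change" category are assumptions made for the example.

```python
# Assumed ordering of the occupancy levels, from least to most occupied.
LEVELS = ["Light", "Medium", "Heavy"]


def occupancy_change(level_2017, level_2019):
    """Compare the occupancy level of one sensor between the two periods."""
    before = LEVELS.index(level_2017)
    after = LEVELS.index(level_2019)
    if after < before:
        return "Improvement"
    if after > before:
        return "Worsening"
    return "No change"


def classify_sensors(levels_2017, levels_2019):
    """Apply the comparison to every sensor present in both periods."""
    return {
        sensor: occupancy_change(levels_2017[sensor], levels_2019[sensor])
        for sensor in levels_2017
        if sensor in levels_2019
    }


changes = classify_sensors(
    {"s1": "Heavy", "s2": "Light", "s3": "Medium", "s4": "Medium"},
    {"s1": "Medium", "s2": "Medium", "s3": "Medium"},
)
```

Sensors without data in one of the two periods, such as "s4" above, are left out of the comparison.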

In general terms, after March 15, 2019 the level of occupancy has improved in the area of Madrid Central, with some exceptions. The border effect is concentrated in specific zones of the Crown area, while other parts of this area have also experienced improvements in the level of traffic intensity.

Finally, Fig. 14 shows only the sensors that have changed their profile of usage, calculated and defined as in section 3, between the same periods in 2017 and 2019. It must be noted that the sensors within Madrid Central have not changed to the "All day" profile of usage, which supports the view that the zone is now not occupied at all hours. On the contrary, some sensors in the Crown area have worsened their level of occupancy and, at the same time, now have an "All day" profile of usage.

5 Final remarks

This paper uses data from traffic sensors published in the Madrid City open data portal to evaluate the impact of the traffic measures taken in recent years in Madrid. Since the first aim is to study the behavior of the traffic intensity over time, the difficulties and complexities in measuring its evolution, which require specific procedures, must be stressed.

The results obtained are very preliminary, first because only one of the variables available has been considered, and second because more periods would be needed to accurately measure the possible impacts.

Although the main objective of the traffic measures taken is to reduce air pollution, what has been assessed here is the impact on the traffic volume, because it is considered one of the largest contributors to air pollution in cities. What has been found is that the actions implemented from 2017 seem to have reduced traffic congestion in Madrid Central and other areas, especially from 2019. At the same time, in 2018 a first collateral border effect of increasing traffic intensity in the surrounding zones may exist, although this effect may revert in the coming months as a consequence of the latest actions undertaken.

Taking advantage of the spatial aspects of the information available, the methods proposed can be used to assess the effects of other traffic actions at the same or at a more detailed geographical level. The scope of the analysis can be widened when data from more periods are available, and also by extending the procedures to other variables available in the open data portal.

1. Box, G.E., Jenkins, G.M., Reinsel, G.: Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco (1970)

2. Box, G.E., Tiao, G.C.: Intervention analysis with applications to economic and environmental problems. Journal of the American Statistical Association 70(349), 70–79 (1975)

3. Chen, C., Liu, L.M.: Joint estimation of model parameters and outlier effects in time series. Journal of the American Statistical Association 88(421), 284–297 (1993)

4. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. Oakland, CA, USA (1967)

5. Medina, J.C., Benekohal, R.F., Ramezani, H.: Field evaluation of smart sensor vehicle detectors at intersections, volume 1: Normal weather conditions. Tech. rep. (2012)

6. Mimbela, L.E.Y., Klein, L.A.: Summary of vehicle detection and surveillance technologies used in intelligent transportation systems (2000)

7. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)

8. Schmidhuber, J., Hochreiter, S.: Long short-term memory. Neural Comput 9(8), 1735–1780 (1997)

9. Stone, R., Prais, S.: Systems of aggregative index numbers and their compatibility. The Economic Journal 62(247), 565–583 (1952)

10. Team, P.C.: Python: A dynamic, open source programming language. Python Software Foundation (2017)

11. Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)

12. Welch, P.: The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics 15(2), 70–73 (1967)

13. Zaharia, M., Xin, R.S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M.J., et al.: Apache Spark: a unified engine for big data processing. Communications of the ACM 59(11), 56–65 (2016)

14. Zhang, K., Batterman, S.: Air pollution and health risks due to vehicle traffic. Science of the Total Environment 450, 307–316 (2013)