LADaS 2018 - Latin America Data Science Workshop Influence of Virtual Road Traffic Sensors of Oporto for Origin-Destination Matrix Estimation Luciano U. Pando, Ricardo Lüders, Keiko V. O. Fonseca, Marcelo O. Rosa 1 Federal University of Technology – Paraná (UTFPR) Postal code 80230-901 – Curitiba – PR – Brazil luciano.pando@ifpr.edu.br, {luders,keiko,mrosa}@utfpr.edu.br Abstract. The knowledge of urban mobility patterns is important to maintain good public services as well as to improve city planning. These mobility pat- terns can be characterized by using expensive fieldwork or analyzing the huge amount of data available from services and environmental monitoring in smart cities. The origin-destination matrix estimation (ODME) aims to estimate the traffic of vehicles between origin and destination areas in the city of OPorto in Portugal from traffic observed from sensors installed at roads. This estimation is stated as an optimization problem and solved by linear programming. The results have shown that the number and location of sensors are important issues to be considered. 1. Introduction The origin-destination matrix (ODM) represents flows of vehicles or persons between different origin and destination points. This information is useful for city planning and is usually obtained through expensive and laborious field work which quickly becomes outdated [Chen et al. 2005]. An example of origin destination field work was accom- plished in 2016 in Curitiba with an approximate cost of 6 million Brazilian currency [Curitiba 2016]. Alternatives to the ODM estimation (ODME) have been discussed for several decades, and many solutions have been proposed. A survey about the subject can be found in [Bera and Rao 2011]. The ODME is stated as an optimization problem where traffic estimates should match flow measures of vehicles from sensors. This problem is modeled using mixed integer linear problem (MILP) and solved with the GUSEK 0.2.24 solver. Moreover, sensors’ position at roads is an important issue. This is referred to as the network sensor location problem (NSLP) in the literature. In this paper, we are interested on testing different strategies for sensor distribution at roads of Oporto and its effects on ODME. Section 2 states the ODME and NSLP problems as well as performance metrics for ODME. Section 3 presents the MILP model and Section 4 shows the results. Conclusions are show in Section 5. 2. Background In this section, the problem of origin-destination matrix estimation is presented as well as the underlying network sensor location problem. The performance metrics used for evaluation of ODME are presented as well. 49 LADaS 2018 - Latin America Data Science Workshop 2.1. Origin-destination matrix estimation (ODME) The origin-destination matrix (ODM) contains information about flow of vehicles be- tween two geographical areas in a city. Let assume that there are n origin-destination (OD) pairs with a flow qw of vehicles in a particular pair w 2 W (the set of all OD pairs). These OD pairs are usually located at the borders of a region [Teknomo and Fernandez 2014]. Moreover, each flow of vehicles for an OD pair dis- tributes over city roads using different routes. Each route is composed by links (or edges) identified as city roads. These links are eventually monitored by traffic sensors which gives the counts of vehicles in each link. As a link can be shared by different routes, counts in each link can be due to more than one OD pair. Hence, a link count is not a direct flow measure for a particular OD pair (except if it is not shared). The origin-destination matrix estimation (ODME) can be considered as a dual problem of the static traffic assignment (STA) [Bierlaire 2002]. By giving the flow of vehicles in an OD pair, STA gives the flow of vehicles in each link of a route used by this pair. ODME on the contrary, provides the flow of vehicles for OD pairs from the flow observed in links. The flow of vehicles va in a link a is given by (1). It is the sum of n flows qw multiplied by a given probability pwa of using link a by an OD pair w. Additionally, (2) is a non-negativity constraint of qw . X va = qw .pwa (1) w2W qw 0 (2) By assuming that a link z is monitored by sensors, an estimate v z of the flow of vehicles in z should approximate the measured flow vz according to (3). Similarly, q w is approximated to qw according to (4). 8z (v z ⇡ vz ) (3) 8w (q w ⇡ qw ) (4) According to [Bierlaire 2002], ODME is generally complex because the number of unknown variables (flow of vehicles in OD pairs) is greater than the number of equa- tions (flow of vehicles in links). Other references point out different techniques to solve ODME as [Maher 1983] for Bayesian inference, [Cascetta 1984] and [Bell 1991] for gen- eral least squares, [Spiess 1987] for maximum likelihood and [Sherali and Hobeika 1994] for linear programming applications. 2.2. Network sensor location problem (NSLP) The network sensor location problem (NSLP) concerns how to define the location and number of sensors. In [Ye and Wen 2017], nine rules have been considered for sensor allocation as follows: 50 LADaS 2018 - Latin America Data Science Workshop • R1: choose links such that all OD pairs are covered by sensors; • R2: choose links with high rate qw /va ; • R3: choose links having all routes covered by sensors (route covering rule); • R4: choose links with high flow va (interception rule); • R5: choose links with high probability pwa (maximal OD demand fraction rule); • R6: choose links with big number of routes (maximal net route flow captured); • R7: choose links with big number of OD pairs (maximal network OD flow); • R8: avoid redundant links (independence rule); • R9: minimize the number of sensors (minimal observation rule); The article [Ye and Wen 2017] considers rules R1, R8 and R9 for all experiments. Tests are made with rule R2 or R4 and R7 simultaneously. The results show that using R4 and R7 simultaneously is better than R2 only. The considered network has 93 links and 210 OD pairs. In our work, R1 is considered by default. The other rules R2, R4, R7 are tested separately. Differently from [Ye and Wen 2017], R4 and R7 are applied separately, and there is also a combination between R4 and R7, indicated by R4R7 and done due the fact that preliminary results showed better results in these two rules. Another addition of this article is the focus on how the number of sensors interferes in different ways depending on sensor location choice rules. 2.3. Performance metrics The following metrics are used to evaluate ODME: mean absolute error (MAE), root mean square error (RMSE), total demand deviation (TDD), and determination coefficient (R2 ) according to (5), (6), (7), and (8), respectively. P |q qw | M AE = w2W w (5) n r 1P w2W (q w qw ) 2 n P RM SE = (6) w2W qw P P | w2W q w w2W qw | T DD = P (7) w2W qw P P P n( w2W qw q w ) ( w2W qw )( w2W q w ) 2 R = (p P 2 P 2 P 2 P 2 )2 (8) (n w2W qw ( w2W qw ) )(n w2W (q w ) ( w2W q w ) ) Both MAE and RMSE are error metrics used to evaluate differences between ob- served and real flows of OD pairs. On other hand, TDD is an error related to the sum of observed and real OD flows. All these metrics should be minimized for good results. The determination coefficient R2 evaluates the proportion of the variance in the dependent variable (estimated OD flows q w ) that is predictable from the independent vari- able (real OD flows qw ) variance, resulting in a output between 0 and 1. The dependence between observed and estimated values is greater as R2 ⇡ 1. in this case, R2 should be maximized. 51 LADaS 2018 - Latin America Data Science Workshop 3. MILP for ODME The ODM is composed by n decision variables representing the traffic flows of n OD pairs. The MILP model for ODME presented here is similar to the one of [Sherali and Hobeika 1994] but using GUSEK 0.2.24 to solve it [Bettoni 2018]. Variables are shown in Table 1. There are (np + 3 · ns) variables. Decision variables are qw , estimated flows in links are v z , and Fz+ and Fz are variables used to quantify matching errors. Table 1. Mixed integer linear programming variables Variable Type Symbol Amount Estimated flow per OD pair w R 0 qw np OD pairs Estimated flow in link z R 0 vz ns sensors Superior deviation in z R 0 Fz+ ns sensors Inferior deviation in z R 0 Fz ns sensors Total (np + 3 · ns) The cost function is represented by (9). X min (Fz+ + Fz ) (9) qw z2Z Equation (10) requires that a probability pwz of flow qw should use link z. Each observable link has a restriction of this kind. X vz = pwz · qw (10) w2W Equation (11) sets a matching between the observed and estimated traffic in link z. Fz is positive when the estimated values are larger than the observed ones and Fz+ is on the other way around. This restriction acts with (9) and (10) in order to find v z ⇡ vz by adjusting qw decision variables. vz = (v z + Fz+ Fz ) (11) Restriction (12) is based on the non-negativity constraint specified in (2). qw 0 (12) Restrictions (13) and (14) indicate that Fz+ and Fz+ are bound to non-negative values. Fz+ 0 (13) Fz 0 (14) For every link sensor, there is a restriction as in (11), (13) and (14), and for every qw , there is a restriction as in (12). 52 LADaS 2018 - Latin America Data Science Workshop 4. Results Section 4.1 describes the data used to the ODME problem formulation. Section 4.2 shows results for ODME using MILP and different numbers of sensors. 4.1. Data description In this work, data comes from the taxi trip database found in [UCI 2015] and the geo- graphic database found in [OpenStreetMap contributors 2018]. After density clustering city areas and identifying routes (map matching) as shown in [Pando and Luders 2017], 13 areas of both origins and destinations have been identified in Oporto yielding to 169 OD pairs. Moreover, the true origin-destination matrix and the probability of route choice have been determined by GPS data of taxi trips. The [UCI 2015] dataset has eight attributes and 1, 710, 670 instances. Attribute “polyline” is a sequence of lat-long locations which identifies a taxi trajectory in the network. The attribute “timestamp” is used to group taxi trips in hourly time intervals and to join all trips between 8-9 am allowing to get the true ODM. Attributes “Datatype” and “missing data” are used to select trips for workdays only and trips without missing data, respectively. The geographic database [OpenStreetMap contributors 2018] defines the city net- work as a directed graph which is composed by GPS-referenced nodes and edges. 4.2. Influence of the number and position of sensors The experiments have been made from 200 to 2000 virtual sensors with locations defined by rules R2, R4, R7 or R4R7. The number of sensors was increased by 100 between 200 and 500 sensors, and by 250 for more than 500 sensors. Figure 1. MAE for different rules and number of sensors. According to Fig. 1 (MAE), rule R7 yields better results for any number of sensors except 1500 sensors. Particularly, for less than 1500 sensors. The gap between R7 and R4R7 or R4 is reduced for a large number of sensors. R4R7 has the worst result for 200 sensors, but it is better than R4 until 1500 sensors, where R4 is better. R2 has the worst result for every number of sensors except for 200 sensors. The results for R2 are close to 53 LADaS 2018 - Latin America Data Science Workshop Figure 2. RMSE using different sets of rules and numbers of sensors. R4 and R4R7 for small number of sensors but R4 and R4R7 are better for a larger number of sensors than R2. In Fig. 2 (RMSE), R7 has better results for 200 sensors and between 400 and 1250 sensors. Results for R7 are worse for 300 sensors than for 200 sensors which is unexpected since the result was worse with more information added. R4R7 has better results for 300 sensors, and R4 has better results for 1500 sensors and above. R2 has the worst result for any number of sensors, showing a slightly decline in RMSE from above 400 to close to 330 for 200 and 2000 sensors, respectively. Figure 3. R2 using different sets of rules and numbers of sensors. Fig. 3 (R2 ) has a similar information of Fig. 2. R7 also has better results for 200 sensors and between 400 and 1250 sensors in this metric. Results for R7 are also worse for 300 sensors than for 200 sensors. R4R7 has better results for 300 sensors, and R4 has better results for 1500 sensors and above. R2 has the worst result except for 200 or 300 sensors. Fig. 4 (TDD) shows different behavior when compared to the three previous anal- ysis. R4R7 has better results most of the time. R4, R7, and R4R7 are close to zero 54 LADaS 2018 - Latin America Data Science Workshop Figure 4. TDD using different sets of rules and numbers of sensors. between 400 and 800 sensors, while the minimum TDD value for R2 occurs with 1000 sensors, which is exactly the number of sensors used in Section 4.2. R7 has the better results in RMSE and R2 metrics between 400 and 1250 sensors. It has the better results for any number of sensors except with 1500 sensors for MAE as well as the better results in TDD for 1500 sensors and above. R4 is better above 1250 sensors for RMSE and R2 . R4R7 is only better for TDD between 300 and 750 sensors, and with 300 sensors for RMSE. 5. Conclusion This work presented the effects of using different rules of distribution of sensors at roads of Oporto in Portugal in the quality of ODME. The quality of ODME was evaluated by four different metrics: MAE, RMSE, TDD and R2 . The better set of rules may vary depending on the number of sensors and metrics used. For MAE, RMSE and R2 metrics, experiments with more sensors produced better results using any set of rules. TDD is better mainly between 400 and 800 sensors, but it presents poor results after 1000 sensors. Rule R7 is better for two metrics and any number of sensors except for 300 sensors. For 1500 sensors and above, R4 has better results for metrics RMSE and R2 but R7 has better results for the other two metrics, i.e., a tie occurs between R4 and R7. For 300 sensors, the best performing set of rules was R4R7 in three of four metrics. For future work, the introduction of probe vehicles could add information to im- prove ODM and assignment matrix estimations. The influence of number and location of sensors can also be analyzed for other ODME techniques as for instance general least squares. References Bell, M. G. H. (1991). The estimation of origin-destination matrices by constrained gener- alised least squares. Transportation Research Part B: Methodological, 25B(1):13–22. Bera, S. and Rao, K. V. K. (2011). Estimation of origin-destination matrix from traffic counts : the state of the art. European Transport, 49:3–23. 55 LADaS 2018 - Latin America Data Science Workshop Bettoni, L. (2018). Gusek description. http://gusek.sourceforge.net/ gusek.html. Access in 2018-06-08. Bierlaire, M. (2002). The total demand scale : a new measure of quality for static and dynamic origin – destination trip tables. Transportation Research Part B, 36:837–850. Cascetta, E. (1984). Estimation of trip matrices from traffic counts and survey data. A generalized least squares estimator. Transportation Research Part B, I(415):289–299. Chen, A., Chootinan, P., and Recker, W. W. (2005). Examining the quality of synthetic Origin Destination Trip table estimated by path flow estimator. Journal of Transporta- tion Engineering, 131(July):506–513. Curitiba (2016). Pesquisa origem destino curitiba. Access in 2018, May. Maher, M. J. (1983). Inferences on trip matrices from observations on link volumes: A Bayesian statistical approach. Transportation Research Part B, 17(6). OpenStreetMap contributors (2018). Planet dump retrieved from https://planet.osm.org. Access in 2018, January. Pando, L. U. and Luders, R. (2017). Estimation of origin-destination matrix from traffic counts in the city of porto with pso and taxi trip data. volume 1, pages 901–911, Uberlandia, Brasil. Encontro Nacional de Inteligência Artificial. Sherali, H. D. and Hobeika, A. G. (1994). A linear programming approach for synthesiz- ing origin–destination trip tables from link traffic volumes. Transportation Research Part B, 28(3):213–233. Spiess, H. (1987). A maximum likelihood model for estimating origin-destination matri- ces. Transportation Research Part B, 21(5):395–412. Teknomo, K. and Fernandez, P. (2014). A theoretical foundation for the relationship between generalized origin-destination matrix and flow matrix based on ordinal graph trajectories. Journal of Advanced Transportation, 48(6):608–626. UCI (2015). Taxi trips dataset of porto, portugal. Access in 2016, November. Ye, P. and Wen, D. (2017). Optimal traffic sensor location for origin – destination esti- mation using a compressed sensing framework. Journal of Advanced Transportation, 18(7):1857–1866. 56