=Paper=
{{Paper
|id=Vol-1392/paper-02
|storemode=property
|title=Analyzing Open Data from the City of Montreal
|pdfUrl=https://ceur-ws.org/Vol-1392/paper-02.pdf
|volume=Vol-1392
|dblpUrl=https://dblp.org/rec/conf/icml/PineauB15
}}
==Analyzing Open Data from the City of Montreal==
Joelle Pineau (JPINEAU@CS.MCGILL.CA), McGill University, Montreal, Canada
Pierre-Luc Bacon (PBACON@CS.MCGILL.CA), McGill University, Montreal, Canada

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

Abstract

There is a significant effort towards moving much of the data from the city of Montreal into an Open Data format. In this short paper, we report on a recent initiative to analyze this data using machine learning techniques in the context of a graduate course project. We review the approach, summarize accomplishments, and provide several recommendations for improving the impact from such efforts.

1. Introduction

Many cities worldwide have started to devote significant efforts and resources to publicly releasing data relating to their operations and situations. There is an opportunity for machine learning practitioners to use this data to answer several questions of interest for citizens, administrators, businesses, and researchers.

A course project was assigned in the context of a graduate course of Applied Machine Learning at McGill University. The stated goal of the project was to use open data from the city of Montreal's website to identify an interesting prediction question that can be tackled using machine learning methods, and solve the problem using appropriate machine learning algorithms and methodology. Previously, students had received 2 months of instruction on machine learning methods (the course syllabus is available at http://www.cs.mcgill.ca/~jpineau/comp598/). The course involved 65 students at various levels of their studies, from advanced undergraduate to Masters and PhD, 1 course instructor and 2 graduate teaching assistants. Course participants came from a diverse set of backgrounds, including computer science, electrical, mechanical and biomedical engineering, mathematics and statistics, epidemiology, neuroscience, and environmental science. They worked in teams of 3 for this project.

1.1. Context and project instructions

According to instructions, participants were not restricted to using only the data from the city of Montreal website, though they needed to use some of it. In particular, when appropriate, students were encouraged to incorporate data from other sources (e.g. equivalent data from other cities), or collect additional data (e.g. a new test set) to deepen their investigation.

The choice of prediction task and dataset to use was open. The goal was to pick a prediction question that is relevant and important to the citizens or administrators of the city. Particular attention was given to designing a prediction task that was well suited to the choice of dataset; and vice versa, picking the right data for tackling the chosen prediction question. The choice of algorithms and software systems was left open, including allowing use of existing machine learning toolboxes. The emphasis was on proper scientific methodology for computational analysis of urban data, rather than on the implementation of machine learning algorithms.

1.2. Characteristics of the city of Montreal dataset

The city of Montreal's Open Data resource (accessible at http://donnees.ville.montreal.qc.ca/) currently contains 177 datasets, organized under different themes, as listed in Table 1. Some datasets are re-listed under several themes, for example a dataset on the location and dimensions of community gardens appears under both the Environment and the Housing and urban planning themes.

Table 1. Themes and number of datasets from the City of Montreal open data website.

Theme: Number of datasets
Organization and administration: 54
Sports, leisure, culture and development: 43
Infrastructures: 28
Environment: 27
Housing and urban planning: 21
Financial resources: 19
Election and referendum: 17
Information management: 16
Public safety and security: 13
Communication and public relations: 12
Material resources and services: 9
Buildings and land: 8
Economic development: 7
Human resources: 3
Legal affairs: 2
Property assessment: 1

Several of these datasets include descriptive data, for example the list of municipal buildings in a particular borough, with their respective addresses, or a document describing the yearly accomplishments in terms of universal accessibility of buildings (municipal and others). In many respects, the data is not systematically or uniformly available: the list of municipal buildings is available for only one of the 19 boroughs of the city. The data is available in several formats (PDF, TXT, XLS, ODT, CSV, DOC, XML, KML, KMZ, GML, SHP, DXF, JSON, 3DM, ZIP), though each dataset is provided in a (small) subset of these formats.

2. Overview of projects and results

A total of 22 projects were completed, across a range of topics. Project titles are listed in Table 2. The primary challenge for most teams was to identify a dataset that contained enough data to perform a substantial machine learning analysis. This proved harder than expected, and thus several teams converged on using similar datasets from the set of 177 available. The most popular datasets pertained to the usage of the Bixi bike-sharing service, and data on the location of bicycling accidents. In some cases, participants complemented the available data with similar data from other cities, for example a project doing a comparative analysis of bicycle accidents in Montreal and New York.

Table 2. List of projects.

Real estate: Montreal Real Estate Pricing; Prediction of Real Estate Property Prices in Montreal; Location, Location, Location!
Transportation: Estimating Traffic Levels in Montreal using Computer Vision and Machine Learning Techniques; Predicting STM Bus Intervals Using Vehicles, Bicycles and Pedestrian Traffic Data; Predicting Method of Transportation; Biking Lane Usage Prediction; BIXI Montreal; Modeling imbalance in Bike Share Networks; Predicting Bike Counts for BIXI Stations in Montreal; Prediction Problems on Bike Accident and Usage Data in Montreal; Prediction of Bicycle Accidents in Montreal; Prediction of Bike Accidents, a Comparison of New York and Montreal; Load Forecasting for Smart City with Possible Electrical Vehicle Penetration
Reconstruction/analysis of city images: Where am I? Predicting Montreal Neighbourhoods from Google Street View Images; Patch-Wide Classification of Historical Aerial Images of the Island of Montreal; Reviving Old Montreal; Object Recognition of Historical Datasets
Food safety: Smart System for Restaurant Rating; Predicting Severe Food Safety Violations in Toronto, Ontario
Library usage: Predicting Montreal Library Book Loans; Book Recommender Systems for Montreal Libraries

A second challenge for many teams was to identify an appropriate prediction question, which was both feasible (i.e. sufficient available data) and interesting (i.e. with impact for citizens or administrators of the city). In some cases, the prediction question arose naturally out of the data, for example predicting the loan rate of library books. On the other hand, some participants were particularly creative with their choice of task. Good examples of this were found in the analysis of city images, which included a project aiming at the automatic colouring of historical images (originally taken in black and white).

The choice of machine learning method to solve the chosen task was left open to the participants. In most cases, they needed to tackle the full pipeline, from feature extraction, to training the learner, to setting up a valid evaluation protocol. Many teams used common software libraries (e.g. scikit-learn (Pedregosa et al., 2011)) to assist with some portion of the work. We now highlight a few of the projects.

2.1. Sample project: Prediction of real estate property prices in Montreal

This project aimed to predict the price of houses in Montreal. A total of 25,000 records were extracted from online listings of real estate brokers. Complementary infrastructure and geographical information for each listing was acquired from additional open data sources from the city of Montreal and Statistics Canada. Pre-processing was applied, for example removing properties with an asking price less than $10,000. Principal components analysis was used to project the feature space to a lower-dimensional space. Several machine learning algorithms were considered: linear regression, support vector regression, k-nearest neighbours, and random forest regression. Algorithms were implemented using the scikit-learn package (Pedregosa et al., 2011). The most promising results were obtained by an ensemble of k-nearest neighbour and random forest, achieving a prediction error on par with previous literature on similar datasets for other cities. In the case where the asking price of a house is included, prediction error of the selling price can be further reduced. Such a tool could be used by citizens to get a more accurate estimate of a property's market value. It may also be used by municipalities to assess property value for tax purposes. Finally, it may be used to inform economic indices.

2.2. Sample project: Biking lane usage prediction

This project aimed to predict the number of cyclists passing through different streets in Montreal on a given day. The analysis focused on ten different streets, and learned from daily counts obtained from sensors installed on the streets, over a period of dates between 2009 and 2013, with a total of 1722 records. Several features were considered, including the day of the week, weather, air quality index, price of gas, and special events (festivals, football and hockey games), for a total of 47 features. This complementary data was extracted from various online sources. Several machine learning algorithms were considered: linear regression, k-nearest neighbours, boosted decision trees, and support vector regression. Prediction performance was assessed using the mean absolute error, as well as the ratio between the mean squared error for a given method and the mean squared error of a baseline (dummy) predictor. The boosted decision trees yielded the best performance. A complementary analysis of the feature impact using Lasso regression suggested that the day of the week was one of the most important features, possibly because the bicycle usage varies greatly between weekdays and weekends.

3. Discussion

In this section we discuss several opportunities and challenges that arose during the project.

3.1. Opportunities

From app design to data science. Many early open data efforts from large cities have focused on releasing descriptive data, amenable to app design, often used in the context of hackathon events. While such activities continue to be exciting and worthwhile endeavours, we believe that many communities have much to gain from also considering an open data strategy that leads to the release of urban data suitable for machine learning analysis. To meet this goal, the teams designing the open data platforms and controlling the information flow may need to acquire expertise about the goals and challenges of machine learning, in order to offer appropriate datasets. Computer scientists and statisticians have a role to play in informing these teams about the benefits that machine learning can bring to our society, and in providing convincing examples of cases where machine learning has enhanced the quality of life of citizens, and productivity of organizations.

Use of urban data to enhance transportation models. Several of the projects targeted the use of the city of Montreal data to predict various aspects of urban transportation, from the usage of the bike sharing service, to the expected timing of buses and automobiles. We observe that those datasets yielded some of the most interesting analysis because they were more extensive than other datasets, in terms of number of data points. The projects completed to date targeted specific aspects of the transportation network in isolation from others; however, there is significant potential to combine such results into a coherent model of urban transportation, and eventually to use this model to evaluate different transportation strategies (e.g. adding bicycle lanes, changing bus routes, etc.).

Use of machine learning to enhance delivery of goods and services. Several of the projects attempted to use the available data to predict usage of various services, from the above-mentioned Bixi bike sharing service, to the borrowing of library books. Such analysis can be useful to make more efficient use of available municipal resources. However, these cases pose particular challenges because the observed demand often depends on the availability of goods or services. For example, one will not observe any demand for a particular book if that book was not available at the library. Similarly, it is difficult to accurately predict the real demand for the shared Bixis at a particular location once that station has no more bicycles available, and it is difficult to accurately predict demand at a new location. Some of the technical recommendations below relate to this aspect.

Use of machine learning to enhance human perception of urban data. One of the most original projects targeted the automatic re-coloration of old grey-scale images of the city. While the results so far were not fully satisfying, there is potential, as the methods improve, to use this technology to allow people to gain a new perspective on historical material. Some of the other projects relating to analysis of images have similar potential to enhance human understanding of the urban landscape, past or present.

Use of urban data as complementary data. A frequent use of the city of Montreal open data in the projects listed above was as a supplement to other more extensive datasets. Examples are the three projects pertaining to real estate, where a large amount of data was first retrieved from real estate brokerage websites, and then complemented (via geo-location features) with city of Montreal data on local municipal infrastructure. Additional supplementary information was also considered, from sources such as Statistics Canada (for sociodemographic indicators), the YellowPages (for location of grocery stores, medical clinics, yoga studios, etc.) and public transit authorities (for bus and subway access locations).

3.2. Teaching challenges

Methods beyond the curriculum. Several of the projects required students to tackle machine learning methods that were beyond the basic course curriculum. The lectures for the course were not designed with the final project in mind, but rather to provide good coverage of basic algorithms and methods for applied machine learning in general. Fortunately, online resources are plentiful, and most students were able to acquire the necessary material in areas pertinent to their topic. In many cases however, understanding of that material seemed to be very superficial, and more opportunity for one-on-one learning would have improved the quality of the analysis.

Managing multiple projects. One of the familiar challenges with open-topic course projects is the load it creates in terms of supervision. The instructor and teaching assistants must have the time to provide individualized advice to each project team. We observed the most intense needs during the project definition phase, with some teams requiring up to 3-4 half-hour long meetings to properly define their scope and aims.

Scope of conclusions. We observed two challenges pertaining to the interpretation of the results. First, as with any data analysis, the urge can be strong to interpret the results in ways that are not warranted by the methodology used. One example is reporting results indicating that old aerial images of the city can be classified in terms of usage type (farmland, forest, residential, water) with 80% accuracy, but failing to state that the accuracy is in fact much lower for farmland and forests, and higher for water and residential areas. Second, while quantitative results are typically the preferred metric of performance, it is often the qualitative results that speak most to the human imagination. There is a tendency to pick a few select qualitative results to "tell a story"; this can be a powerful way of showing results, but it can easily be used to mis-characterize the expected performance of a system across the full range of events.

Presentation format. Two components were used for evaluation: an in-class 3-minute spotlight talk and a written report. The spotlight talks were preferred over long talks due to the number of projects. It proved difficult to provide accurate detailed evaluations from such short presentations, and so most of the feedback was qualitative. The spotlight talks were held roughly 2 weeks before the final report was due, and thus focused more on the problem definition and methods, with few results. The final report was formatted as a research paper, max. 8 pages in length, and provided a more accurate account of the project accomplishments. In previous years, a poster session was held, instead of the spotlights and written report. This format offers more opportunity for interaction between participants. The option was not retained this year due to scheduling constraints.

3.3. Practical challenges

Language of dataset. Most of the data available for the city of Montreal is in French. Few of the resources have been translated. Even in the case of quantitative data, the lack of English-language description posed an important problem for some of the young researchers.

Design of the prediction task. When working with previously used supervised machine learning benchmarks, the target problem (i.e. output variable) of interest has already been identified. When working with new datasets, it can be challenging to identify the right target variable. For example, in the case of the projects pertaining to transportation, it may at first seem useful to predict the number or location of bicycle accidents within the city. However, these events are relatively rare, and dealing with rare events is often challenging from a statistical and algorithmic perspective (especially in small datasets). An alternative may be to predict the number of close encounters between cyclists and vehicles, which are less rare, but such data is not typically available. Alternatively, predicting the flow of larger vehicles (cars, buses, trucks) may be more fruitful, since it can be reliably estimated, and can be used within a larger predictive model on urban transportation.

Lack of parallel datasets. Comparative analyses (between years, between neighbourhoods) can yield rich information. This can only be tackled if data from parallel settings is available. The well-known Boston housing dataset (Harrison & Rubinfeld, 1978) was used as a comparison for some of the projects pertaining to real estate. In general, it is useful to keep this in mind when planning for additional releases of urban open data.

3.4. Machine learning challenges

Small data. The typical ICML attendee may be tempted to believe that all the interesting tasks for machine learning deal with so-called big-data. Yet several important problems occur in the small data setting. The challenges in this case are different, possibly less computational and more statistical. There remain many opportunities to connect to the big-data community through the use of auxiliary datasets.

Sparse, incomplete, noisy datasets. As with most real-world datasets, a major problem with urban data remains the poor quality and uniformity of the data published. Often, the data is not curated by a person familiar with machine learning methods. There exist many statistical and machine learning methods to overcome problems of data quality, such as expectation maximization (Dempster et al., 1977) and multiple imputation (Rubin, 1987). However, the effective application of these approaches to complex datasets generally requires a good understanding of the methods (e.g. to construct a good model of imputation).

Feature coding for heterogeneous data. Several projects observed that the choice of coding method for the data had a significant impact on the performance of their machine learning algorithm. For example in the bike lane usage prediction, an important feature was the day of the week. Encoding this as 7 binary features reduced the error rate by more than 5%, compared to using a single 7-valued categorical feature. Another similar effect was seen in the real estate price prediction task, where a logarithmic function was used to re-scale prices. Typically, the choice of encoding can be validated using standard methods for feature selection.

Feature selection for complex data. For some domains, the set of features that can be considered is very large, thus an important problem is in selecting the right set of features. Furthermore, it is often possible to enhance the feature set by incorporating supplementary data sources. It can be difficult to select the sufficient and necessary set of features for a given prediction task. Cross-validation methods can be used to automatically compare different feature sets. But this can be problematic in the case of small datasets where only limited data is available for validation of the feature set. An effective method in those cases is usually to use domain knowledge and expert advice to narrow down the candidate features to a manageable set (or small number of candidate sets). Another possible approach to tackle this problem is to use data from another city to predict the right set of features. Considering the case of Food Safety analysis, while Montreal has released only 750 records of food inspections (Montreal food data), San Francisco has released 10,000 records (San Francisco food data). Therefore one could optimize the choice of features using the San Francisco data and then apply the model and learn a simple prediction strategy on the Montreal data. More sophisticated methods for transfer learning are also worth investigating. Finally, it is worth pointing out that the choice of features can be key not just for building a good predictor, but also for building a good model for missing data imputation.

Choice of machine learning algorithm. There is a tendency among novice machine learning practitioners to spend significant efforts on testing several machine learning algorithms, with the belief that the choice of algorithm is the dominant factor in achieving good prediction performance. Another tendency is to assume that the most advanced methods will necessarily outperform more naive methods. In practice, several algorithms may perform equivalently, or simple methods may outperform more complicated ones, for example when there is insufficient data to properly train a complex hypothesis space, or the hyper-parameters are not properly optimized. Similar to the choice of features, algorithms can be compared using an appropriate cross-validation methodology.

Interpretability of results. Methods such as linear regression, decision tree and naive Bayes classifiers are often preferred to more complex methods such as neural networks or kernel methods, in the case where interpretability of the results is necessary. In some applications, the knowledge of which features are most predictive of a particular outcome (e.g. finding which municipal amenities are best predictors of higher real estate prices) is of utmost interest. Several newer models have been proposed that combine rich hypothesis spaces with interpretability (Letham et al., To appear).

From supervised learning to decision-making. So far we have been mostly concerned with supervised learning, where the goal of the learner is to predict a given quantity (the output) from observed variables (the input). In some cases, the goal may be to use the analysis to change a decision strategy. For example, by correctly predicting which restaurants may be found in violation of the health and safety laws, it may be possible to more efficiently deploy food safety agents. It is important to be aware of the fact that such a change in policy may result in a shift in the observed data. In the case where one wants to optimize the decision strategy, it may be more appropriate to phrase the problem under the framework of reinforcement learning (Sutton & Barto, 1998).

Off-policy learning. A related case for concern arises when the data was acquired under a particular decision strategy, and the results of the analysis are used to change that decision strategy; in such cases it can be difficult to accurately predict what will happen under the new decision policy. This is known as the off-policy learning problem in the machine learning literature (Sutton & Barto, 1998). Consider for example analyzing the usage data from Montreal's Bixi bike sharing service, then using the predictions derived from this analysis to determine which stations have lower demand, and then reducing bicycle availability at those stations. If those stations had low demand because they were already subject to reduced availability, then the further shift to reduced availability likely would not result in more satisfied customers overall.

4. Conclusion

This paper presents a recent initiative to apply machine learning techniques to analyze open data from the City of Montreal, conducted in the context of a graduate course project. Several of the challenges and opportunities identified are commonly known in the machine learning community. Our goal in presenting this work is to illustrate how such challenges arise in the context of analyzing urban data, and in doing so, facilitate collaboration with interested parties from other communities. While the City of Montreal was not involved in the elaboration of the course project, we have since communicated results of the projects with them. We have also received inquiries from officials of other cities. There is clearly significant interest in the outcomes of such initiatives.

Acknowledgements

Much of the credit for this paper goes to the students of the Fall 2014 edition of the course COMP-598: Applied Machine Learning, at McGill University. The first sample project on the prediction of real estate property prices was realized by Nissan Pow, Emil Janulewicz, and Liu Liu. The second sample project on the biking lane usage prediction was realized by Robert Wenger, Haomin Zheng, and Stefan Dimitrov. Several of the issues highlighted in the discussion were extracted directly from those and other students' project reports. Additional thanks go to Angus Leigh who acted as a teaching assistant for the course, jointly with Pierre-Luc Bacon.

References

Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1977.

Harrison, D. and Rubinfeld, D.L. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 1978.

Letham, B., Rudin, C., McCormick, T., and Madigan, D. Building interpretable classifiers with rules using Bayesian analysis. Annals of Applied Statistics, To appear.

Montreal food data. http://donnees.ville.montreal.qc.ca/dataset/inspection-aliments-contrevenants.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2011.

Rubin, D.B. Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, 1987.

San Francisco food data. https://data.sfgov.org/health-and-social-services/restaurant-scores/stya-26eb?

Sutton, Richard S. and Barto, Andrew G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
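
As a supplementary illustration of the evaluation protocol described for the biking lane usage project (mean absolute error, plus the ratio of a method's mean squared error to that of a dummy predictor) and of the day-of-week coding as 7 binary features discussed under feature coding, the following Python fragment gives a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the student projects: the toy counts, the function names, the use of a per-day mean as a stand-in for the learned model, and the choice of the training mean as the dummy predictor.

```python
"""Sketch: one-hot day-of-week coding and two error measures (illustrative only)."""

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def one_hot_day(day):
    """Encode a day name as 7 binary features (exactly one of them is 1)."""
    return [1 if day == d else 0 for d in DAYS]


def average(values):
    values = list(values)
    return sum(values) / len(values)


def mean_absolute_error(y_true, y_pred):
    return average(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    return average((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mse_ratio(y_true, y_pred, y_train):
    """MSE of a model divided by the MSE of a dummy predictor.

    Assumption: the dummy predictor always outputs the mean of the
    training targets. A ratio below 1 means the model beats it.
    """
    dummy = [average(y_train)] * len(y_true)
    return mean_squared_error(y_true, y_pred) / mean_squared_error(y_true, dummy)


def fit_day_means(rows):
    """Stand-in model: one weight per binary feature, set to that day's mean count.

    Days absent from the training rows fall back to the overall mean.
    """
    fallback = average(count for _, count in rows)
    weights = []
    for d in DAYS:
        counts = [count for day, count in rows if day == d]
        weights.append(average(counts) if counts else fallback)
    return weights


def predict(weights, day):
    # Dot product of the weight vector with the one-hot encoding of the day.
    return sum(w * x for w, x in zip(weights, one_hot_day(day)))


# Hypothetical toy data: (day, daily cyclist count). Not real sensor data.
train = [("Mon", 900), ("Tue", 950), ("Sat", 400),
         ("Sun", 350), ("Mon", 1000), ("Sat", 500)]
test = [("Mon", 980), ("Sat", 420), ("Tue", 940)]

weights = fit_day_means(train)
y_train = [count for _, count in train]
y_test = [count for _, count in test]
y_pred = [predict(weights, day) for day, _ in test]

mae = mean_absolute_error(y_test, y_pred)
ratio = mse_ratio(y_test, y_pred, y_train)
print(f"MAE = {mae:.1f}, MSE ratio vs. dummy = {ratio:.4f}")
```

Reporting the ratio alongside the mean absolute error makes it immediately visible whether a method improves on a trivial predictor, which is particularly useful in the small data setting where differences between methods can be slight. In practice the stand-in model would be replaced by the regressors considered in the projects, evaluated under cross-validation.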