=Paper= {{Paper |id=Vol-1392/paper-02 |storemode=property |title=Analyzing Open Data from the City of Montreal |pdfUrl=https://ceur-ws.org/Vol-1392/paper-02.pdf |volume=Vol-1392 |dblpUrl=https://dblp.org/rec/conf/icml/PineauB15 }} ==Analyzing Open Data from the City of Montreal== https://ceur-ws.org/Vol-1392/paper-02.pdf
                          Analyzing Open Data from the City of Montreal


Joelle Pineau                                                                                       JPINEAU@CS.MCGILL.CA
McGill University, Montreal, CANADA
Pierre-Luc Bacon                                                                                    PBACON@CS.MCGILL.CA
McGill University, Montreal, CANADA



                          Abstract

There is a significant effort towards moving much of the data from the city of Montreal into an Open Data format. In this short paper, we report on a recent initiative to analyze this data using machine learning techniques in the context of a graduate course project. We review the approach, summarize accomplishments, and provide several recommendations for improving the impact from such efforts.

1. Introduction

Many cities worldwide have started to devote significant efforts and resources to publicly releasing data relating to their operations and situations. There is an opportunity for machine learning practitioners to use this data to answer several questions of interest for citizens, administrators, businesses, and researchers.

A course project was assigned in the context of a graduate course on Applied Machine Learning at McGill University. The stated goal of the project was to use open data from the city of Montreal's website to identify an interesting prediction question that can be tackled using machine learning methods, and solve the problem using appropriate machine learning algorithms and methodology. Previously, students had received 2 months of instruction on machine learning methods¹. The course involved 65 students at various levels of their studies, from advanced undergraduate to Masters and PhD, 1 course instructor and 2 graduate teaching assistants. Course participants came from a diverse set of backgrounds, including computer science, electrical, mechanical and biomedical engineering, mathematics and statistics, epidemiology, neuroscience, environmental science. They worked in teams of 3 for this project.

  ¹ The course syllabus: http://www.cs.mcgill.ca/~jpineau/comp598/

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

1.1. Context and project instructions

According to instructions, participants were not restricted to using only the data from the city of Montreal website, though they needed to use some of it. In particular, when appropriate, students were encouraged to incorporate data from other sources (e.g. equivalent data from other cities), or collect additional data (e.g. a new test set) to deepen their investigation.

The choice of prediction task and dataset to use was open. The goal was to pick a prediction question that is relevant and important to the citizens or administrators of the city. Particular attention was given to designing a prediction task that was well suited to the choice of dataset; and vice versa, picking the right data for tackling the chosen prediction question. The choice of algorithms and software systems was left open, including allowing use of existing machine learning toolboxes. The emphasis was on proper scientific methodology for computational analysis of urban data, rather than on the implementation of machine learning algorithms.

1.2. Characteristics of the city of Montreal dataset

The city of Montreal's Open Data resource² currently contains 177 datasets, organized under different themes, as listed in Table 1. Some datasets are re-listed under several themes, for example a dataset on the location and dimensions of community gardens appears under both the Environment and the Housing and urban planning themes.

  ² The data can be accessed here: http://donnees.ville.montreal.qc.ca/

Several of these datasets include descriptive data, for example the list of municipal buildings in a particular borough, with their respective addresses, or a document describing the yearly accomplishments in terms of universal accessibility of buildings (municipal and others). In many






respects, the data is not systematically or uniformly available: the list of municipal buildings is available for only one of the 19 boroughs of the city. The data is available in several formats (PDF, TXT, XLS, ODT, CSV, DOC, XML, KML, KMZ, GML, SHP, DXF, JSON, 3DM, ZIP), though each dataset is provided in a (small) subset of these formats.

Table 1. Themes and number of datasets from the City of Montreal open data website.

 Theme                                        Number of datasets
 Organization and administration                      54
 Sports, leisure, culture and development             43
 Infrastructures                                      28
 Environment                                          27
 Housing and urban planning                           21
 Financial resources                                  19
 Election and referendum                              17
 Information management                               16
 Public safety and security                           13
 Communication and public relations                   12
 Material resources and services                       9
 Buildings and land                                    8
 Economic development                                  7
 Human resources                                       3
 Legal affairs                                         2
 Property assessment                                   1

2. Overview of projects and results

A total of 22 projects were completed, across a range of topics. Project titles are listed in Table 2. The primary challenge for most teams was to identify a dataset that contained enough data to perform a substantial machine learning analysis. This proved harder than expected, and thus several teams converged on using similar datasets from the set of 177 available. The most popular datasets pertained to the usage of the Bixi bike-sharing service, and data on the location of bicycling accidents. In some cases, participants complemented the available data with similar data from other cities, for example a project doing a comparative analysis of bicycle accidents in Montreal and New York.

A second challenge for many teams was to identify an appropriate prediction question, which was both feasible (i.e. sufficient available data) and interesting (i.e. with impact for citizens or administrators of the city). In some cases, the prediction question arose naturally out of the data, for example predicting the loan rate of library books. On the other hand, some participants were particularly creative with their choice of task. Good examples of this were found in the analysis of city images, which included a project aiming at the automatic colouring of historical images (originally taken in black&white).

Table 2. List of projects

Real estate
 Montreal Real Estate Pricing
 Prediction of Real Estate Property Prices in Montreal
 Location, Location, Location!
Transportation
 Estimating Traffic Levels in Montreal using Computer Vision and Machine Learning Techniques
 Predicting STM Bus Intervals Using Vehicles, Bicycles and Pedestrian Traffic Data
 Predicting Method of Transportation
 Biking Lane Usage Prediction
 BIXI Montreal
 Modeling imbalance in Bike Share Networks
 Predicting Bike Counts for BIXI Stations in Montreal
 Prediction Problems on Bike Accident and Usage Data in Montreal
 Prediction of Bicycle Accidents in Montreal
 Prediction of Bike Accidents, a Comparison of New York and Montreal
 Load Forecasting for Smart City with Possible Electrical Vehicle Penetration
Reconstruction/analysis of city images
 Where am I? Predicting Montreal Neighbourhoods from Google Street View Images
 Patch-Wide Classification of Historical Aerial Images of the Island of Montreal
 Reviving Old Montreal
 Object Recognition of Historical Datasets
Food safety
 Smart System for Restaurant Rating
 Predicting Severe Food Safety Violations in Toronto, Ontario
Library usage
 Predicting Montreal Library Book Loans
 Book Recommender Systems for Montreal Libraries

The choice of machine learning method to solve the chosen task was left open to the participants. In most cases, they needed to tackle the full pipeline, from feature extraction, to training the learner, to setting up a valid evaluation protocol. Many teams used common software libraries (e.g. scikit-learn (Pedregosa et al., 2011)) to assist with some portion of the work.

We now highlight a few of the projects.

2.1. Sample project: Prediction of real estate property prices in Montreal

This project aimed to predict the price of houses in Montreal. A total of 25,000 records were extracted from online listings of real estate brokers. Complementary infrastructure and geographical information for each listing was acquired from additional open data sources from the city of Montreal and Statistics Canada. Pre-processing was ap-





plied, for example removing properties with an asking price       open data strategy that leads to the release of urban data
less than $10,000. Principal components analysis was used         suitable for machine learning analysis. To meet this goal,
to project the feature space to a lower-dimensional space.        the teams designing the open data platforms and controlling
Several machine learning algorithms were considered: lin-         the information flow may need to acquire expertise about
ear regression, support vector regression, k-nearest neigh-       the goals and challenges of machine learning, in order to
bours, and random forest regression. Algorithms were im-          offer appropriate datasets. Computer scientists and statis-
plemented using the scikit-learn package (Pedregosa et al.,       ticians have a role to play in informing these teams about
2011). The most promising results were obtained by an en-         the benefits that machine learning can bring to our society,
semble of k-nearest neighbour and random forest, achiev-          and in providing convincing examples of cases where ma-
ing a prediction error on par with previous literature on sim-    chine learning has enhanced the quality of life of citizens,
ilar datasets for other cities. In the case where the asking      and productivity of organizations.
price of a house is included, prediction error of the selling
                                                                  Use of urban data to enhance transportation models.
price can be further reduced. Such a tool could be used by
                                                                  Several of the projects targeted the use of the city of Mon-
citizens to get a more accurate estimate of a property’s mar-
                                                                  treal data to predict various aspects of urban transporta-
ket value. It may also be used by municipalities to assess
                                                                  tion, from the usage of the bike sharing service, to the ex-
property value for tax purposes. Finally, it may be used to
                                                                  pected timing of buses and automobiles. We observe that
inform economic indices.
                                                                  those datasets yielded some of the most interesting analy-
                                                                  sis because they were more extensive than other datasets, in
2.2. Sample project: Biking lane usage prediction                 terms of number of data points. The projects completed to
This project aimed to predict the number of cyclists pass-        date targeted specific aspects of the transportation network
ing through different streets in Montreal on a given day.         in isolation of others, however there is significant poten-
The analysis focused on ten different streets, and learned        tial to combine such results into a coherent model of urban
from daily counts obtained from sensors installed on the          transportation, and eventually to use this model to evalu-
streets, over a period of dates between 2009 and 2013, with       ate different transportation strategies (e.g. adding bicycle
a total of 1722 records. Several features were considered,        lanes, changing bus routes, etc.).
including the day of the week, weather, air quality index,        Use of machine learning to enhance delivery of goods
price of gas, special events (festivals, football and hockey      and services. Several of the projects attempted to use the
games), for a total of 47 features. This complementary            available data to predict usage of various services, from the
data was extracted from various online sources. Several           above-mentioned Bixi bike sharing service, to the borrow-
machine learning algorithms were considered: linear re-           ing of library books. Such analysis can be useful to make
gression, k-nearest neighbours, boosted decision trees, and       more efficient use of available municipal resources. How-
support vector regression. Prediction performance was as-         ever these cases pose particular challenges because the ob-
sessed using the mean absolute error, as well as the ratio        served demand often depends on the availability of goods
between the mean squared error for a given method and             or services. So for example, one will not observe any de-
the mean squared error of a baseline (dummy) predictor.           mand for a particular book if that book was not available
The boosted decision trees yielded the best performance. A        at the library. Similarly, it is difficult to accurately predict
complementary analysis of the feature impact using Lasso          the real demand for the shared Bixis at a particular loca-
regression suggested that the day of the week was one of          tion once that station has no more bicycles available, and
the most important features, possibly because the bicycle         it is difficult to accurately predict demand at a new loca-
usage varies greatly between weekdays and weekends.               tion. Some of the technical recommendations below relate
                                                                  to this aspect.
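
The two error measures reported for the biking lane usage project (Section 2.2) can be illustrated with a minimal sketch. It assumes the dummy baseline always predicts the mean of the training targets, which the project description does not specify; the counts and the function names are hypothetical and are not taken from the project.

```python
from statistics import mean

def mean_absolute_error(y_true, y_pred):
    """Average of |y - y_hat| over all records."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def mean_squared_error(y_true, y_pred):
    """Average of (y - y_hat)^2 over all records."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def mse_ratio_to_baseline(y_train, y_test, y_pred):
    """MSE of a model divided by the MSE of a dummy predictor.

    Assumption: the dummy predictor always outputs the training-set mean.
    A ratio below 1 means the model beats this baseline.
    """
    baseline_pred = [mean(y_train)] * len(y_test)
    return (mean_squared_error(y_test, y_pred)
            / mean_squared_error(y_test, baseline_pred))

# Hypothetical daily cyclist counts, for illustration only.
y_train = [100, 120, 80, 100]
y_test = [110, 90, 130]
y_pred = [105, 95, 120]  # output of some trained regressor

mae = mean_absolute_error(y_test, y_pred)
ratio = mse_ratio_to_baseline(y_train, y_test, y_pred)
print(f"MAE = {mae:.2f}, MSE ratio to baseline = {ratio:.3f}")
```

Reporting the ratio alongside the absolute error makes it easier to judge whether a method is learning anything beyond the average count. The ratio is undefined when the baseline error is zero.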
3. Discussion                                                     Use of machine learning to enhance human perception
In this section we discuss several opportunities and chal-        of urban data. One of the most original projects targeted
lenges that arose during the project.                             the automatic re-coloration of old grey-scale images of the
                                                                  city. While the results so far were not fully satisfying, there
3.1. Opportunities                                                is potential, as the methods improve, to use this technol-
                                                                  ogy to allow people to gain a new perspective on historical
From app design to data science. Many early open data             material. Some of the other projects relating to analysis
efforts from large cities have focused on releasing descrip-      of images have similar potential to enhance human under-
tive data, amenable to app design, often used in the context      standing of the urban landscape, past or present.
of hackathon events. While such activities continue to be
exciting and worthwhile endeavours, we believe that many          Use of urban data as complementary data. A fre-
communities have much to gain from also considering an            quent use of the city of Montreal open data in the projects





listed above was as a supplement to other more extensive datasets. Examples of this are the three projects pertaining to Real estate, where a large amount of data was first retrieved from real estate brokerage websites, and then complemented (via geo-location features) with city of Montreal data on local municipal infrastructure. Additional supplementary information was also considered, from sources such as Statistics Canada (for sociodemographic indicators), the YellowPages (for location of grocery stores, medical clinics, yoga studios, etc.) and public transit authorities (for bus and subway access locations).

3.2. Teaching challenges

Methods beyond the curriculum. Several of the projects required students to tackle machine learning methods that were beyond the basic course curriculum. The lectures for the course were not designed with the final project in mind, but rather to provide good coverage of basic algorithms and methods for applied machine learning in general. Fortunately, online resources are plentiful, and most students were able to acquire the necessary material in areas pertinent to their topic. In many cases however, understanding of that material seemed to be very superficial, and more opportunity for one-on-one learning would have improved the quality of the analysis.

Managing multiple projects. One of the familiar challenges with open-topic course projects is the load it creates in terms of supervision. The instructor and teaching assistants must have the time to provide individualized advice to each project team. We observed the most intense needs during the project definition phase, with some teams requiring up to 3-4 half-hour long meetings to properly define their scope and aims.

Scope of conclusions. We observed two challenges pertaining to the interpretation of the results. First, as with any data analysis, the urge can be strong to interpret the results in ways that are not warranted by the methodology used. For example, reporting results indicating that old aerial images of the city can be classified in terms of usage type (farmland, forest, residential, water) with 80% accuracy, but failing to state that the accuracy is in fact much lower for farmland and forests, but higher for water and residential areas. Second, while quantitative results are typically the preferred metric of performance, it is often the qualitative results that speak most to the human imagination. There is a tendency to pick a few select qualitative results to "tell a story"; this can be a powerful way of showing results, but it can easily be used to mis-characterize the expected performance of a system across the full range of events.

Presentation format. Two components were used for evaluation: an in-class 3-minute spotlight talk and a written report. The spotlight talks were preferred over long talks due to the number of projects. It proved difficult to provide accurate detailed evaluations from such short presentations, and so most of the feedback was qualitative. The spotlight talks were held roughly 2 weeks before the final report was due, and thus focused more on the problem definition and methods, with few results. The final report was formatted as a research paper, max. 8 pages in length, and provided a more accurate account of the project accomplishments. In previous years, a poster session was held, instead of the spotlights and written report. This format offers more opportunity for interaction between participants. The option was not retained this year due to scheduling constraints.

3.3. Practical challenges

Language of dataset. Most of the data available for the city of Montreal is in French. Few of the resources have been translated. Even in the case of quantitative data, the lack of English-language description posed an important problem for some of the young researchers.

Design of the prediction task. When working with previously used supervised machine learning benchmarks, the target problem (i.e. output variable) of interest has already been identified. When working with new datasets, it can be challenging to identify the right target variable. For example in the case of the projects pertaining to transportation, it may at first seem useful to predict the number or location of bicycle accidents within the city. However these events are relatively rare, and dealing with rare events is often challenging from a statistical and algorithmic perspective (especially in small datasets). An alternative may be to predict the number of close encounters between cyclists and vehicles, which are less rare, but such data is not typically available. Alternatively, predicting the flow of larger vehicles (cars, buses, trucks) may be more fruitful, since it can be reliably estimated, and can be used within a larger predictive model on urban transportation.

Lack of parallel datasets. Comparative analyses (between years, between neighbourhoods) can yield rich information. This can only be tackled if data from parallel settings is available. The well-known Boston housing dataset (Harrison & Rubinfeld, 1978) was used as a comparison for some of the projects pertaining to real estate. In general, it is useful to keep this in mind when planning for additional releases of urban open data.

3.4. Machine learning challenges

Small data. The typical ICML attendee may be tempted to believe that all the interesting tasks for machine learning deal with so-called big-data. Yet several important problems occur in the small data setting. The challenges in this case are different, possibly less computational and





more statistical. There remain many opportunities to con-          tation.
nect to the big-data community through the use of auxiliary
                                                                   Choice of machine learning algorithm. There is a ten-
datasets.
                                                                   dency among novice machine learning practitioners to
Sparse, incomplete, noisy datasets. As with most real-             spend significant efforts on testing several machine learn-
world datasets, a major problem with urban data remains            ing algorithms, with the belief that the choice of algo-
the poor quality and uniformity of the data published.             rithm is the dominant factor in achieving good predic-
Often, the data is not curated by a person familiar with           tion performance. Another tendency is to assume that
machine learning methods. There exist many statistical             the most advanced methods will necessarily outperform
and machine learning methods to overcome problems of               more naive methods. In practice, several algorithms may
data quality, such as expectation maximization (Dempster           perform equivalently, or simple methods may outperform
et al., 1977), multiple imputation (Rubin, 1987). How-             more complicated ones, for example when there is insuf-
ever the effective application of these approaches to com-         ficient data to properly train a complex hypothesis space,
plex datasets generally requires a good understanding of           or the hyper-parameters are not properly optimized. Sim-
the methods (e.g. to construct a good model of imputation).        ilar to the choice of features, algorithms can be compared
                                                                   using an appropriate cross-validation methodology.
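
As an illustration of comparing algorithms by cross-validation, the following minimal sketch scores a dummy mean predictor against a one-feature least-squares line using k-fold held-out error. The data is synthetic and noise-free, and all function names are hypothetical; it is not the protocol used by any of the projects.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold j holds indices j, j+k, j+2k, ..."""
    for j in range(k):
        test = list(range(j, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

def fit_mean(xs, ys):
    """Dummy model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Ordinary least squares with a single input feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean((ys[i] - model(xs[i])) ** 2 for i in test))
    return mean(fold_errors)

# Synthetic, noise-free data, for illustration only.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

scores = {name: cv_mse(fit, xs, ys)
          for name, fit in [("mean", fit_mean), ("line", fit_line)]}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

Every candidate is scored on the same folds, so the comparison reflects held-out error and not training fit. For time-ordered urban data, shuffled or blocked folds may be more appropriate than the interleaved folds assumed here.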
Feature coding for heterogeneous data. Several projects
observed that the choice of coding method for the data had         Interpretability of results. Methods such as linear regres-
a significant impact on the performance of their machine           sion, decision trees and naive Bayes classifiers are often pre-
learning algorithm. For example in the bike lane usage pre-        ferred to more complex methods such as neural networks or
diction, an important feature was the day of the week. En-         kernel methods, in the case where interpretability of the re-
coding this as 7 binary features reduced the error rate by         sults is necessary. In some applications, the knowledge of
more than 5%, compared to using a single 7-valued cate-            which features are most predictive of a particular outcome
gorical feature. Another similar effect was seen in the real       (e.g. finding which municipal amenities are best predic-
estate price prediction task, where a logarithmic function         tors of higher real estate prices) is of utmost interest. Sev-
was used to re-scale prices. Typically, the choice of en-          eral newer models have been proposed that combine rich
coding can be validated using standard methods for feature         hypothesis spaces with interpretability (Letham et al., To
selection.                                                         appear).
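
The two coding choices mentioned under feature coding, 7 binary day-of-week indicators and a logarithmic re-scaling of prices, can be sketched as follows. The day labels, the function names and the example price are hypothetical, and the natural logarithm is an assumption, since the projects' exact encodings are not described.

```python
import math

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def one_hot_day(day):
    """Encode a day name as 7 binary indicator features (one per weekday)."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    return [1 if day == d else 0 for d in DAYS]

def log_price(price):
    """Re-scale a positive price with the natural logarithm (assumed base)."""
    return math.log(price)

def price_from_log(value):
    """Map a prediction made in log space back to a price."""
    return math.exp(value)

features = one_hot_day("Sat")   # one indicator set, six cleared
target = log_price(250_000)     # hypothetical asking price
print(features, round(target, 3))
```

With indicator features, a linear model can learn a separate effect for each day, whereas a single integer code from 1 to 7 would impose an artificial ordering on the days. Predictions made on the log scale have to be mapped back with the inverse transform before errors are reported in dollars.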
Feature selection for complex data. For some domains,              From supervised learning to decision-making. So far
the set of features that can be considered is very large, thus     we have been mostly concerned with supervised learning,
an important problem is in selecting the right set of fea-         where the goal of the learner is to predict a given quan-
tures. Furthermore, it is often possible to enhance the fea-       tity (the output) from observed variables (the input). In
ture set by incorporating supplementary data sources. It can       some cases, the goal may be to use the analysis to change
be difficult to select the sufficient and necessary set of fea-    a decision strategy. For example, by correctly predicting
tures for a given prediction task. Cross-validation methods        which restaurants may be found in violation of the health
can be used to automatically compare different feature sets.       and safety laws, it may be possible to more efficiently de-
But this can be problematic in the case of small datasets          ploy food safety agents. It is important to be aware of the
where only limited data is available for validation of the         fact that such a change in policy may result in a shift in
feature set. An effective method in those cases is usually to      the observed data. In the case where one wants to optimize
use domain knowledge and expert advice to narrow down              the decision strategy, it may be more appropriate to phrase
the candidate features to a manageable set (or small num-          the problem under the framework of reinforcement learn-
ber of candidate sets). Another possible approach to tackle        ing (Sutton & Barto, 1998).
this problem is to use data from another city to predict the
                                                                   Off-policy learning. A related case for concern arises when
right set of features. Considering the case of Food Safety
                                                                   the data was acquired under a particular decision strategy,
analysis, while Montreal has released only 750 records of
                                                                   and the results of the analysis are used to change that deci-
food inspections (Montreal food data), San Francisco has
                                                                   sion strategy; in such case it can be difficult to accurately
released 10,000 records (San Francisco food data). There-
                                                                   predict what will happen under the new decision policy.
fore one could optimize the choice of features using the San
                                                                   This is known as the off-policy learning problem in the
Francisco data and then apply the model and learn a simple
                                                                   machine learning literature (Sutton & Barto, 1998). Con-
prediction strategy on the Montreal data. More sophisti-
                                                                   sider for example analyzing the usage data from Montreal’s
cated methods for transfer learning are also worth investi-
                                                                   Bixi bike sharing service, then using the predictions de-
gating. Finally, it is worth pointing out that the choice of
                                                                   rived from this analysis to determine which stations have
features can be key not just for building a good predictor,
                                                                   lower demand, and then reducing bicycle availability at
but also for building a good model for missing data impu-
                                                                   those stations. If those stations had low demand because





they were already subject to reduced availability, then the further
shift to reduced availability likely would not result in more satisfied
customers overall.
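
To make the off-policy problem concrete, the following is a minimal sketch of one standard correction, inverse propensity scoring (importance sampling). It is not a method used in the course projects. It assumes that the probability with which the old decision strategy chose each action was recorded; the station types, actions, reward values, and probabilities are all hypothetical numbers chosen for illustration.

```python
def ips_value(logged, target_prob):
    """Inverse-propensity-scored estimate of the mean reward under a new policy.

    logged: iterable of (context, action, reward, behavior_prob) records, where
        behavior_prob is the probability the logging policy gave to `action`.
    target_prob: function (context, action) -> probability under the new policy.
    """
    records = list(logged)
    total = 0.0
    for context, action, reward, behavior_prob in records:
        # Reweight each logged outcome by how much more (or less) often
        # the new policy would have taken the logged action.
        total += reward * target_prob(context, action) / behavior_prob
    return total / len(records)


# Hypothetical toy numbers: fraction of ride requests served at a station.
REWARD = {("busy", "many"): 1.0, ("busy", "few"): 0.4,
          ("quiet", "many"): 0.6, ("quiet", "few"): 0.3}
# Hypothetical logging policy: probability of stocking "many" bikes.
P_MANY = {"busy": 0.8, "quiet": 0.2}


def make_log():
    """Build 100 records whose action counts match the logging policy exactly."""
    log = []
    for context, n_many in (("busy", 40), ("quiet", 10)):
        for i in range(50):
            action = "many" if i < n_many else "few"
            prob = P_MANY[context] if action == "many" else 1.0 - P_MANY[context]
            log.append((context, action, REWARD[(context, action)], prob))
    return log


def always_many(context, action):
    """New policy to evaluate: stock many bikes at every station."""
    return 1.0 if action == "many" else 0.0


log = make_log()
naive = sum(r for _, _, r, _ in log) / len(log)  # average under the old policy
estimate = ips_value(log, always_many)           # estimate under the new policy
print(f"logged average: {naive:.2f}, IPS estimate for new policy: {estimate:.2f}")
```

With these made-up numbers the logged average is 0.62, while the reweighted estimate for the new policy is 0.80, which matches the value obtained by averaging the "many" rewards over the two station types. The correction only works when the old strategy gave non-zero probability to every action the new strategy would take; if a station was always kept under reduced availability, the log contains no evidence about what higher availability would have produced.
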

4. Conclusion

This paper presents a recent initiative to apply machine learning
techniques to analyze open data from the City of Montreal, conducted in
the context of a graduate course project. Several of the challenges and
opportunities identified are commonly known in the machine learning
community. Our goal in presenting this work is to illustrate how such
challenges arise in the context of analyzing urban data, and in doing
so, facilitate collaboration with interested parties from other
communities. While the City of Montreal was not involved in the
elaboration of the course project, we have since communicated the
results of the projects to them. We have also received inquiries from
officials of other cities. There is clearly significant interest in the
outcomes of such initiatives.

Acknowledgements
Much of the credit for this paper goes to the students of the Fall 2014
edition of the course COMP-598: Applied Machine Learning, at McGill
University. The first sample project on the prediction of real estate
property prices was realized by Nissan Pow, Emil Janulewicz, and Liu
Liu. The second sample project on the biking lane usage prediction was
realized by Robert Wenger, Haomin Zheng, and Stefan Dimitrov. Several of
the issues highlighted in the discussion were extracted directly from
those and other students’ project reports. Additional thanks go to Angus
Leigh who acted as a teaching assistant for the course, jointly with
Pierre-Luc Bacon.

References
Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from
  incomplete data via the EM algorithm. Journal of the Royal Statistical
  Society, Series B, 39, 1977.
Harrison, D. and Rubinfeld, D.L. Hedonic housing prices and the demand
  for clean air. Journal of Environmental Economics and Management,
  1978.
Letham, B., Rudin, C., McCormick, T., and Madigan, D. Building
  interpretable classifiers with rules using Bayesian analysis. Annals
  of Applied Statistics, To appear.
Montreal food data.
  http://donnees.ville.montreal.qc.ca/dataset/inspection-aliments-contrevenants.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B.,
  Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V.,
  Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M.,
  and Duchesnay, E. Scikit-learn: machine learning in Python. Journal of
  Machine Learning Research, 12, 2011.
Rubin, D.B. Multiple Imputation for Nonresponse in Surveys. J. Wiley &
  Sons, 1987.
San Francisco food data.
  https://data.sfgov.org/health-and-social-services/restaurant-scores/stya-26eb?
Sutton, Richard S. and Barto, Andrew G. Introduction to Reinforcement
  Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN
  0262193981.



