Predicting Land Use of Italian Cities using Structural Semantic Models

Gianni Barlacchi¹,², Bruno Lepri³, Alessandro Moschitti¹,⁴
¹ Department of Information Engineering and Computer Science, University of Trento
² TIM Semantics and Knowledge Innovation Lab, Trento
³ Fondazione Bruno Kessler, Trento
⁴ Qatar Computing Research Institute, HBKU
{gianni.barlacchi,amoschitti}@gmail.com lepri@fbk.eu

Abstract

English. We propose a hierarchical semantic representation of urban areas extracted from a social network to classify the most predominant land use, which is a very common task in urban computing. We encode geo-social data from Location-Based Social Networks with standard feature vectors and a conceptual tree structure that we call Geo-Tree. We use the latter in kernel machines, which can thus perform accurate classification, exploiting hierarchical substructures of concepts as features. Our comparative study on three datasets extracted from Milan, Rome and Naples shows that Tree Kernels applied to Geo-Trees are very effective, improving the state of the art.

Italian abstract (translated). In this work, we propose a new semantic model for representing urban areas using social media data. In particular, we model this information with a tree structure that we call Geo-Tree. This structure is used, in combination with a classic feature vector, in kernel machines to classify the land use of urban areas. We evaluated our approach on three large Italian cities: Milan, Rome and Naples. The results show that Geo-Trees, used with Tree Kernels, achieve results far superior to other current state-of-the-art models.

1 Introduction

The growing availability of data from cities (Barlacchi et al., 2015a) (e.g., traffic flow, human mobility and geographical data) opens new opportunities for predicting and thus optimizing human activities. For example, the automatic analysis of land use makes it possible to better administer a city in terms of resources and provided services. However, such analysis requires specific information, which is often not available because of privacy concerns. In this paper we follow the approach proposed in (Barlacchi et al., 2017) and use public textual descriptions of urban areas to design a novel machine learning representation. We represent urban areas as: (i) a bag-of-concepts (BOC), e.g., the terms Arts and Entertainment, College and University, Event, Food extracted from the Foursquare description of the area; and (ii) the same concepts organized in a tree, which reflects the hierarchical organization of Foursquare activities. We combine BOC vectors with Tree Kernels (TKs) (Collins and Duffy, 2002; Moschitti, 2006) applied to concept trees (Geo-Trees) and use them in Support Vector Machines (SVMs). The Geo-Tree allows the model to learn complex structural and semantic patterns from the hierarchical conceptualization of an area. We show that TKs can not only capture semantic information from natural language text, e.g., as shown for semantic role labeling (Moschitti et al., 2008) and question answering (Severyn and Moschitti, 2013; Barlacchi et al., 2015b), but can also learn from the hierarchy above to perform semantic inference, such as deciding the main activity of an area.

We carried out a study on land use prediction for three Italian cities, Milan, Rome and Naples, as follows: (i) we divided each city into squares of 200x200 meters; (ii) then, we classified the most predominant land use class (e.g., High Density Urban Fabric or Open Space and Outdoor) assigned by the city administration. The results show that GeoTKs achieve a substantial improvement over state-of-the-art classification approaches based on BOC, i.e., relative improvements of 21.2%, 13.6% and 54.3% in Macro-F1 on the Milan, Rome and Naples datasets, respectively.

2 Related Work

Previous work has modeled land use classification by means of different sources of information. For example, Yuan et al. (2012) built a framework that classifies the functionality of an area in the city of Beijing using human mobility patterns derived from taxicab trajectories and Points of Interest (POIs). Assem et al. (2016) proposed a spatio-temporal approach based on three different clustering algorithms to model the change of functionality of a city's region over time. They extracted features from Foursquare's POIs and check-in activities in Manhattan. Yao et al. (2017) built sequences of POI concepts reflecting their spatial distance. Then, they applied Word2Vec (Mikolov et al., 2013) to these sequences to derive vectors representing each area, which were used to train a land use classifier. In general, most previous work applies extensive feature engineering, which is typically costly as it requires a full understanding of the target domain. Our approach alleviates this problem with automatic feature engineering applied to an abstract land representation.

3 Land Description Data

Geospatial city areas are described with the popular shape file format, where each shape is a collection of points geo-localized using their coordinates. The latter are provided in the well-known Coordinate Reference System (CRS) WGS84, adopted for common latitude/longitude geolocation. We use (i) shape files provided by Urban Atlas¹, a website providing data for large urban areas (more than 100,000 inhabitants) and (ii) POIs from Foursquare².

¹ https://www.eea.europa.eu/data-and-maps/data/urban-atlas
² https://foursquare.com/

3.1 Land Use

Cities are divided into small areas associated with a main land use. In total, there are 17 different land use classes defined in the open dataset Urban Atlas³. We focused on those related to city centers, discarding those less interesting from a social viewpoint, i.e., those associated with rural areas such as forests, agricultural, semi-natural and wetland areas, and mineral extraction and dump sites. Thus, we selected the following categories: (i) High Density Urban Fabric, (ii) Medium Density Urban Fabric, (iii) Low Density Urban Fabric, (iv) Industrial, commercial, public, military and private units, (v) Open Space & Recreation, (vi) Transportation. We collapsed Medium and Low Density Urban Fabric into one single category, ML-Density Urban Fabric, as they only have a few samples. Land use distribution is very fine-grained, making its classification based on POI information very difficult. A trade-off between classification accuracy and the desired area granularity consists in segmenting the regions into square cells. As each cell can contain more than one land use label, we consider the predominant label as its primary use.

³ https://www.eea.europa.eu/data-and-maps/data/urban-atlas#tab-additional-information

3.2 Point-Of-Interest

A POI is usually characterized by a location (i.e., latitude and longitude), textual information (e.g., a description of the activity in that place) and a hierarchical categorization that provides different levels of detail about the activity of the place (e.g., Food, Asian Restaurant, Chinese Restaurant). We used POIs extracted from Foursquare, a geolocation-based social network supported with web search facilities for places and a recommendation system. In particular, we extracted 46,731, 43,389 and 7,219 POIs from Milan, Rome and Naples⁴, respectively. We focused on the ten macro-categories of such POIs⁵, each one specialized in at most four levels of detail.

⁴ For some reason, Foursquare is less popular in Naples.
⁵ https://developer.foursquare.com/categorytree

4 Structural Models

In most machine learning algorithms, data examples are transformed into feature vectors, which in turn are used in dot products to carry out both learning and classification. Kernel Machines (KMs) allow for replacing the dot product with kernel functions, which compute it directly on the examples, i.e., they avoid the transformation of examples into vectors. The main advantage of KMs is a much lower computational complexity, as it does not directly depend on the feature space size.

4.1 Point-of-interest Features

The most straightforward way to represent an area by means of Foursquare data is to use its POIs. Every venue is hierarchically categorized (e.g., Professional and Other Places → Medical Center → Doctor's office) and the categories are used to produce an aggregated representation of the area. We define a feature vector for a grid cell by counting the macro-level categories (e.g., Food) of all the POIs that we found in that cell.

4.2 Geographical Tree Kernel

Foursquare has its own hierarchy of categories, which is used to characterize each location and activity (e.g., restaurants or shops) in the database. Thus, each Foursquare POI is associated with a hierarchical path, which semantically describes the type of location/activity (e.g., for Chinese Restaurant, we have the path Food → Asian Restaurant → Chinese Restaurant). The path is much more informative than just the target POI name, as it provides feature combinations following the structure and the node proximity information, e.g., Food & Asian Restaurant or Asian Restaurant & Chinese Restaurant are valid features whereas Food & Chinese Restaurant is not.

Figure 1: Example of a Geo-Tree built from a collection of POIs in a cell.

Geo-Tree: we propose a new tree structure, i.e., the Geo-Tree, whose nodes and the edges among them are subsets of the Foursquare hierarchy (FH). The Geo-Tree of a grid cell consists of a new root node connecting the subtrees of FH rooted in the concepts present in the cell. In other words, we connect all the paths of FH starting from grid concepts. Figure 1 shows an example of the FH paths of a cell and the resulting Geo-Tree. This way, the nodes of the first level, i.e., the root children, correspond to the most general FH categories, e.g., Arts & Entertainment, Event, Food, etc., the second level of our tree corresponds to the second level of the hierarchical tree of Foursquare, and so on. The terminal nodes are the finest-grained category descriptions of the area, e.g., College Baseball Diamond or Southwestern French Restaurant. For example, Fig. 2 illustrates the semantic structure of a grid cell obtained by combining the category chains of each venue.

Figure 2: Example of a Geo-Tree in Milan for an area labeled as Open Space & Recreation.

GeoTK: given a Geo-Tree, we can encode all its substructures in kernel machines using TKs. In particular, we used the Syntactic Tree Kernel with Bag-Of-Words (STK_b) and the Partial Tree Kernel (PTK) (Moschitti, 2006). By construction, our TKs do not consider the frequency⁶ of the POIs present in a given grid cell.

BOC kernel: to complement GeoTK, we also represent a cell with a BOC representation, namely we count the macro-level categories (e.g., Food) of all the POIs that we found in the grid cell. This way, we generate feature vectors by counting the number of activities under each macro-category. In order to take the popularity of the area into consideration, we included (i) the total number of unique users that did at least one check-in in the cell, and (ii) the total number of check-ins done in the cell. Note that, given an area, the number of unique users provides an idea of how many people visited it, while the number of check-ins can be used to represent its popularity.

Kernel combination: finally, given two geographical areas, x_a and x_b, we define a kernel combining Geo-Tree and BOC as:

K(x_a, x_b) = TK(t_a, t_b) + KV(v_a, v_b),

where TK is any structural kernel function applied to the tree representations t_a and t_b of the geographical areas, and KV is a kernel applied to the feature vectors v_a and v_b, extracted from x_a and x_b using any data source available (e.g., text, social media, mobile phone and census data).

⁶ It is possible to add the frequency in the kernel computation, but for our study we preferred to have a completely different representation from previous typical frequency-based approaches.

5 Experiments and Results

We performed our experiments on the data from Milan, Rome and Naples. We used a grid of 200x200 meters, as it is indicated as the best size by similar previous work on land use classification (Toole et al., 2012; Zhan et al., 2014; Barlacchi et al., 2017). We applied a pre-processing step in order to filter out cells for which land use classification cannot be performed. In particular, for Milan and Rome, we selected the central point of the shape and included those cells that have their centroid within a radius of 15 and 8 kilometers, respectively. For Naples, we kept all the cells due to the smaller size of the city. Then, for all three cities, we removed the cells that (i) cover areas without a specified land use (e.g., the cells in the sea) and (ii) do not have POIs (e.g., the countryside cells). After this step, we obtained a grid with 2,581, 5,657 and 1,314 cells for Milan, Rome and Naples, respectively. We created, separately for each city, the training and test sets by randomly sampling 80% vs. 20% of the cells. We labelled the dataset following the same category aggregation strategy proposed by Zhan et al. (2014), who assigned the predominant land use class to each grid cell.

To train our models, we applied SVM-Light-TK⁷, which enables the use of structural kernels (Moschitti, 2006) in SVM-Light⁸. In particular, due to the nature of the task, we used the Python wrapper around SVM-Light-TK to perform multiclass classification⁹. We experimented with linear, polynomial and radial basis function kernels applied to standard feature vectors. We measured the performance of our classifier by averaging Precision, Recall and F1 over all land use categories.

⁷ http://disi.unitn.it/moschitti/Tree-Kernel.htm
⁸ http://svmlight.joachims.org/
⁹ https://github.com/aseveryn/SVMTK-Multiclass-Classifier

5.1 Results for Land Use Classification

We trained multi-class classifiers using common learning algorithms such as XGBoost (Chen and Guestrin, 2016) and SVMs using linear, polynomial and radial basis function kernels, named SVM-{Lin, Poly, Rbf}, respectively, and our structural semantic models, indicated with STK_b and PTK. We also combined kernels with a simple summation, e.g., PTK+Lin indicates an SVM using such a kernel combination.

City    Model       Prec.   Rec.    F1
Milan   baseline    0.200   0.119   0.149
Milan   XGBoost     0.294   0.317   0.297
Milan   STK_b+Rbf   0.368   0.364   0.360
Milan   PTK+Rbf     0.430   0.350   0.345
Milan   STK_b       0.448   0.307   0.320
Milan   PTK         0.364   0.302   0.309
Rome    baseline    0.200   0.089   0.124
Rome    XGBoost     0.291   0.306   0.279
Rome    STK_b+Lin   0.359   0.314   0.317
Rome    STK         0.338   0.300   0.302
Rome    PTK         0.340   0.300   0.299
Rome    PTK+Lin     0.359   0.297   0.291
Naples  baseline    0.200   0.100   0.133
Naples  XGBoost     0.236   0.272   0.219
Naples  STK_b+Rbf   0.361   0.331   0.338
Naples  STK_b+Lin   0.338   0.302   0.300
Naples  STK_b       0.409   0.290   0.299
Naples  PTK         0.318   0.298   0.297

Table 1: Classification results on Rome, Milan and Naples. Prec., Rec. and F1 are averaged over all categories.

Table 1 shows the average of F1, Precision and Recall over the different categories. The baseline model is obtained by always classifying an example with the label High Density Urban Fabric, which is the most frequent. Due to space constraints, we only report six models, namely: the baseline, XGBoost and the top four kernel models.

We note that: (i) GeoTK always outperforms XGBoost and the baseline in F1. This is an interesting finding, as XGBoost is the current state of the art for land use classification. (ii) STK_b combined with a feature vector always produces the best results, improving the F1-score over XGBoost by 6.3, 3.8 and 11.9 absolute points for Milan, Rome and Naples, respectively. (iii) Kernel combinations always provide the best results.

6 Conclusions

In this paper, we have introduced Geo-Trees, a novel semantic representation based on a hierarchical classification of POIs, to better exploit geo-social data for the classification of the primary land use of an urban area. This is an important task, as it gives urban planners and policy makers the possibility to better administer and renew a city in terms of infrastructure, resources and services. In more detail, we have built our classifiers with combinations of a kernel over BOC and TKs applied to Geo-Trees, thus exploiting hierarchical substructures of concepts as features. Our comparative study on three large Italian cities, Milan, Rome and Naples, shows that our models can improve the state of the art by up to 11.9 absolute points in F1-score.

Acknowledgments

This work has been partially supported by the EC project CogNet, 671625 (H2020-ICT-2014-2, Research and Innovation action).

References

Haytham Assem, Lei Xu, Teodora Sandra Buda, and Declan O'Sullivan. 2016. Spatio-temporal clustering approach for detecting functional regions in cities. In ICTAI, pages 370–377. IEEE.

Gianni Barlacchi, Marco De Nadai, Roberto Larcher, Antonio Casella, Cristiana Chitic, Giovanni Torrisi, Fabrizio Antonelli, Alessandro Vespignani, Alex Pentland, and Bruno Lepri. 2015a. A multi-source dataset of urban life in the city of Milan and the province of Trentino. Scientific Data, 2:150055.

Gianni Barlacchi, Massimo Nicosia, and Alessandro Moschitti. 2015b. SACRY: Syntax-based automatic crossword puzzle resolution system. ACL-IJCNLP 2015, page 79.

G. Barlacchi, A. Rossi, B. Lepri, and A. Moschitti. 2017. Structural semantic models for automatic analysis of land use.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In KDD, pages 785–794, New York, NY, USA. ACM.

Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In ACL.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Alessandro Moschitti, Daniele Pighin, and Roberto Basili. 2008. Tree kernels for semantic role labeling. Computational Linguistics, 34(2):193–224.

Alessandro Moschitti. 2006. Efficient convolution kernels for dependency and constituent syntactic trees. In ECML, pages 318–329. Springer.

Aliaksei Severyn and Alessandro Moschitti. 2013. Automatic feature engineering for answer selection and extraction. In EMNLP, volume 13, pages 458–467.

Jameson L. Toole, Michael Ulm, Marta C. González, and Dietmar Bauer. 2012. Inferring land use from mobile phone activity. In SIGKDD International Workshop on Urban Computing, pages 1–8. ACM.

Yao Yao, Xia Li, Xiaoping Liu, Penghua Liu, Zhaotang Liang, Jinbao Zhang, and Ke Mai. 2017. Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model. International Journal of Geographical Information Science, 31(4):825–848.

Jing Yuan, Yu Zheng, and Xing Xie. 2012. Discovering regions of different functions in a city using human mobility and POIs. In KDD, pages 186–194. ACM.

Xianyuan Zhan, Satish V. Ukkusuri, and Feng Zhu. 2014. Inferring urban land use using large-scale social media check-in data. Networks and Spatial Economics, 14(3-4):647–667.
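
As an illustration of the Geo-Tree construction described in Section 4.2, the following minimal Python sketch merges the Foursquare category paths of the POIs in one grid cell under a new root node. The function names, the example paths and the bracketed output string are assumptions made for illustration only; in particular, the string is not claimed to be the input format expected by SVM-Light-TK.

```python
def build_geo_tree(paths):
    """Merge category paths (general -> specific) into one nested dict.

    Repeated paths collapse into a single branch, so POI frequency is
    ignored, in line with the frequency-free trees of Section 4.2.
    """
    tree = {}
    for path in paths:
        node = tree
        for category in path:
            node = node.setdefault(category, {})
    return tree


def to_bracketed(label, children):
    """Serialize a (sub)tree as a bracketed string (illustrative format)."""
    name = label.replace(" ", "_")  # keep every label a single token
    if not children:
        return f"({name})"
    inner = " ".join(to_bracketed(k, v) for k, v in sorted(children.items()))
    return f"({name} {inner})"


# Hypothetical POI category paths found in one grid cell.
cell_paths = [
    ("Food", "Asian Restaurant", "Chinese Restaurant"),
    ("Food", "Asian Restaurant", "Chinese Restaurant"),  # second POI, same path
    ("Food", "Pizza Place"),
    ("Arts & Entertainment", "Museum"),
]

geo_tree = build_geo_tree(cell_paths)
geo_tree_string = to_bracketed("ROOT", geo_tree)
print(geo_tree_string)
```

The children of ROOT are the macro-categories present in the cell, and the leaves are the finest-grained categories, mirroring the level structure described for the Geo-Tree.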
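
The BOC representation of Section 4.2 can be sketched in a similar way: one count per macro-category, followed by the two popularity features (unique users and total check-ins). The category list below is only a three-item subset chosen for illustration, and the input layout (one category path per POI, one user id per check-in) is an assumption rather than the format of the original data.

```python
from collections import Counter

# Illustrative subset; the paper uses the ten Foursquare macro-categories.
MACRO_CATEGORIES = ["Arts & Entertainment", "Food", "Shop & Service"]


def boc_features(poi_paths, checkin_user_ids):
    """Build a BOC vector for one grid cell.

    poi_paths: one category path per POI; path[0] is the macro-category.
    checkin_user_ids: one user id per check-in recorded in the cell.
    """
    counts = Counter(path[0] for path in poi_paths)
    vector = [counts.get(category, 0) for category in MACRO_CATEGORIES]
    vector.append(len(set(checkin_user_ids)))  # unique users with >= 1 check-in
    vector.append(len(checkin_user_ids))       # total number of check-ins
    return vector


# Hypothetical cell: three Food POIs, one Arts POI, three check-ins by two users.
example_vector = boc_features(
    [
        ("Food", "Asian Restaurant", "Chinese Restaurant"),
        ("Food", "Asian Restaurant", "Chinese Restaurant"),
        ("Food", "Pizza Place"),
        ("Arts & Entertainment", "Museum"),
    ],
    ["u1", "u2", "u1"],
)
print(example_vector)
```

Unlike the Geo-Tree, this vector keeps POI frequencies, which is why the two representations are complementary.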
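
The kernel combination of Section 4.2 sums a structural kernel over trees and a kernel over feature vectors. The sketch below shows only the shape of that sum. As a loudly labeled simplification, the tree kernel used here counts the root-to-node category chains shared by two trees; it is a stand-in and is neither STK_b nor PTK, which are computed by SVM-Light-TK in the paper. The vector kernel is a plain dot product. All names and example data are hypothetical.

```python
def tree_chains(tree, prefix=()):
    """Return the set of root-to-node category chains of a nested-dict tree."""
    chains = set()
    for label, children in tree.items():
        chain = prefix + (label,)
        chains.add(chain)
        chains |= tree_chains(children, chain)
    return chains


def chain_overlap_kernel(tree_a, tree_b):
    """Stand-in tree kernel: number of shared chains.

    This is the dot product of binary chain-indicator vectors, hence a
    valid kernel, but far simpler than STK_b or PTK.
    """
    return len(tree_chains(tree_a) & tree_chains(tree_b))


def linear_kernel(v_a, v_b):
    """Dot product of two equally sized feature vectors."""
    return sum(a * b for a, b in zip(v_a, v_b))


def combined_kernel(x_a, x_b):
    """K(x_a, x_b) = TK(t_a, t_b) + KV(v_a, v_b), with x = (tree, vector)."""
    (t_a, v_a), (t_b, v_b) = x_a, x_b
    return chain_overlap_kernel(t_a, t_b) + linear_kernel(v_a, v_b)


# Two hypothetical cells: (Geo-Tree as nested dict, BOC vector).
cell_a = ({"Food": {"Asian Restaurant": {"Chinese Restaurant": {}}}}, [0, 1])
cell_b = (
    {"Food": {"Asian Restaurant": {"Sushi Restaurant": {}}, "Pizza Place": {}}},
    [0, 2],
)
print(combined_kernel(cell_a, cell_b))
```

Because the sum of two valid kernels is itself a valid kernel, either term can be swapped for a richer one without changing the rest of the learning procedure.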
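
The improvement figures quoted in the text can be checked against Table 1. The short sketch below assumes that both the absolute and the relative improvements compare the F1 of the best kernel model in each city with the F1 of XGBoost; the F1 values are copied from Table 1, while the function name is hypothetical.

```python
# (XGBoost F1, best kernel-model F1) per city, copied from Table 1.
F1_SCORES = {
    "Milan": (0.297, 0.360),
    "Rome": (0.279, 0.317),
    "Naples": (0.219, 0.338),
}


def improvements(reference_f1, model_f1):
    """Return (absolute points, relative percent), each rounded to one decimal."""
    absolute_points = (model_f1 - reference_f1) * 100
    relative_percent = (model_f1 - reference_f1) / reference_f1 * 100
    return round(absolute_points, 1), round(relative_percent, 1)


results = {city: improvements(*scores) for city, scores in F1_SCORES.items()}
for city, (absolute_points, relative_percent) in results.items():
    print(f"{city}: +{absolute_points} points, +{relative_percent}% relative")
```

Under this assumption the computation gives 6.3, 3.8 and 11.9 absolute points and 21.2%, 13.6% and 54.3% relative improvement for Milan, Rome and Naples, matching the figures reported in Sections 1 and 5.1.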