Semantic Enhancement of Volunteered Geographic Information Laura Di Rocco DIBRIS - Università degli Studi di Genova Abstract. The continue use of social media has created a revolution in the production of data in terms of heterogeneity and volume. This huge amount of data is often georeferenced, and the geospatial position of data is a very relevant dimension for analysis. The considerable hetero- geneity and the variable quality of user-generated data prevent however the full exploitation of this information. In this PhD project, we focus on user-generated geospatial data, commonly referred to as Volunteered Geographic Information, where geographical objects are annotated by using tags. The simplicity to use tags allows users to put a lot of tags on objects. This creates a big and noisy tagspace in which it is hard to find data tagged by other people. To overcome this heterogeneity and quality problem, a solution is to rely on ontologies to classify spatial entities tags and names. This problem is addressed in the PhD project with specific reference to the tags used in OpenStreetMap for describ- ing georeferenced objects relevant in urban context, producing as result an ontology for classifying such objects, and a corresponding ontology population approach. Keywords: Geographic Ontology, Implicit/Explicit Geographical Information, Volunteered Geographical Information, Semantic Classification 1 Introduction In recent years, we are witnessing a revolution in the production of data, resulting in a significant increase in data complexity, in terms of volume, heterogeneity, and distribution. Data are increasingly being gathered by ubiquitous information- sensing mobile devices, sensor networks, Web applications, software logs. Such a huge amount and variety of data is a valuable source from which to extract information and knowledge. This huge amount of data is often georeferenced, and the geospatial position of data is a very relevant dimension for analysis. The advent of new pervasive communication tools like social media plays a relevant role in such a revolution in data production. Consistent user-generated data represents indeed a valuable source for the extraction of new types of information patterns and knowledge. The multifaceted nature of user-generated data, along with its geographic component, can be exploited to better understand social dynamics and propagation of information. Georeferencing information can be explicit, if the user-generated data is ex- plicitly associated with a geographic position –consider for instance the case of content generated from a mobile phone which GPS coordinates are known to the application gathering the content– or implicit, that is it can be deduced, for instance, from the content itself. A typical example is the bulk of geospatial information that can be extracted from short text messages exchanged by users on Twitter. In this case, explicit geographic information can be available in the metadata associated with the tweet (user profile location and GPS coordinates of the device) or can be inferred, with variable degree of confidence, by the message content itself, which may contain images, names of entities with known spatial location, or by the users social relationships and activities. Users play an important role as information producers also for what concerns geospatial information itself, in Volunteered Geographic Information (VGI). Good- child [5] defines VGI as “[...] a special case of the more general Web phenomenon of user-generated content[...] ”. Crowdsourced geospatial data is becoming very popular mainly due to its free availability and its constant updating. Among all projects for spatial data crowdsourcing, OpenStreetMap (OSM) is by far the most popular. OSM is a collaborative project to create a free editable map of the world via crowdsourced data. The data uploaded to OSM is continuously increasing. However, the considerable heterogeneity and the variable quality of user-generated data prevent the full exploitation of this information. Specifi- cally, in crowdsourced geospatial data, geographical objects are annotated by tags. Tagging seems to be the natural way for people to classify objects as well as an attractive way to discover new material. The simplicity in using tags allows users to put a lot of tags on objects [2]. This results in a big and noisy tagspace in which it is hard to find data tagged by other people. This is due also to the subjectivity of tagging. To overcome these heterogeneity and quality problems, a solution is to lift from the syntactic to the semantic level both for what concerns spatial entities tags and names, relying on ontologies. The paper is structured as follow: Section 1 presents the problem, Section 2 presents a comparison to state of the art and methodology of our solution and Section 3 presents the case of study and application of our ontology. 2 Problem Statement The PhD project addresses the problem of classification of explicit geographic in- formation coming from Volunteered Geographic Information. We want to seman- tically enhance explicit geospatial information available in open source format (OpenStreetMap), thus overcoming the heterogeneity and quality issues inherent in crowdsourced data and tagging discussed above. In order to exploit OSM for georeferencing, OSM tags need to be used. An alternative approach is proposed by OSMOnto1 [3]. This is a relevant related project proposing an ontology for tags. The purpose of the ontology of tags is to stay as close as possible to the structure of the OSM files in order to facilitate database querying. However this work is not relevant for our research as it does not try to correct any possible conceptual mistakes in the taxonomy of OSM tags, but rather to reflect it faithfully in the structure of the ontology. 1 It is possible to find more information here: http://wiki.openstreetmap.org/wiki/ OSMonto The semantics of OSM tags is quite poor, this work aims at enhancing it through the definition of a properly structured ontology to classify OSM tags. This choice is due to limitation of existing semantic gazetteer. Indeed, classical gazetteer, like Geonames2 , are not very useful as they are too coarse-grained. The proposed approach is thus aimed at extracting accurate and complete georefer- enced data (in terms of types of spatial objects) from crowdsourced information, through an appropriate classification. The developed ontology allows to use the data present in OSM as instances of an ontology. We will extract non-spatial information from geospatial data in order to create a classification hierarchy. In this way, OSM users will be able to choose the most suitable tag in OSM for describing a geospatial object. Since the goal of developing a general-purpose ontology for describing geospatial entities is very ambitious and different efforts have been made with different specific goals [7,3,1], we restrict our goal as follows. Our approach aims at classifying geospatial objects that are relevant in urban contexts, thus, that may appear in a generic city, trying to avoid, whenever pos- sible, to focus on the specificities of a particular city. To define our ontology, we keep into account that we mainly want to “semantify” OSM tags, thus we start from the analysis of OSM tags employed in the context of a specific urban context: the city of “London, UK”. We start from London (UK) because there is a big amount of OSM data, thus London (UK) provides a good dataset, likely covering most of the relevant concepts in a urban context. Our proposed solution will utilize a classification technique based on “facets”. We exploit the possibility to model the domain in three mutual exclusive facets in order to obtain a total generalization. 3 OpenStreetMap Faceted Ontology Ontologies have been used for a set of tasks: improving communication between agents (human or software), reusing data models, developing knowledge schemas, etc. All these tasks deal with interoperability issues and can be applied in dif- ferent domains. State of the art. Several researchers try to extract other kind of information from tweets, for example, to perform sentiment analysis. The main approaches in this field rely on: (a) machine learning techniques, (b) AI techniques, agents and ontologies. For example, the work of Kontopoulos et al. [6] describes two possible approaches to create a domain ontology for sentiment analysis: Formal Concept Analysis and Ontology Learning. In order to create an ontology, we need a collection of objects together with some properties. For sentiment analysis applications, it is possible to extract this information from a collection of data, i.e., a training dataset. These two techniques are not suitable for our goal for the following reasons. First of all, because we want to use our ontologies for georeferencing microblogs information (we do not have text to extract information) and secondly because the geographic domain related to a city is not generalizable. Indeed, these two techniques are very good in sentiment analysis domain and in natural language processing field, as we have seen. 2 http://www.geonames.org/ Methodology. In our work we need to create an ontology that allows us to search explicit geographic information only. There are multiple ontology engineering methodologies that facilitate the process of developing, maintaining and in overall handling complete life-cycle management. Examples include KACTUS, METHONTOLOGY, SENSUS and DILIGENT. In this work we do not focus on the methodological aspects of ontology development, using rather a simple approach to develop our ontology. This choice is due to the limited complexity of the domain and also because of peculiarity of the project proposal, which requires a manual process to be used. We do not want to put an over-structure on OSM tags. Ontologies exist that want to put an over-structure on OSM tags. In particular, LinkedGeoData [9] is a project that aims at linking OSM data to other LinkedData repositories (such as GeoNames and/or other online ontologies) by converting it to RDF so that it can be queried from a SPARQL endpoint. However, LinkedGeoData does not include all OSM entities and therefore it is not very useful for our purposes. Our target ontology aims at classifying geospatial objects that are relevant in urban contexts. We cannot exploit existing Gazetteers, like GeoNames, that have a low level of detail respect to internal subdivision of states and are thus too coarse-grained for our purposes. We rather need a semantic gazetteer providing a high level of detail (e.g., inside a city). We then generalize the ontology classes to encompass concepts that may occur in a generic city, trying to avoid, whenever possible, to focus on the speci- ficities of a particular city. To this aim, we selected the faceted approach to develop the ontology. Facet ontology [8] classifies objects using multiple taxonomies. A facet is a hierarchy of homogeneous concepts describing an aspect of the domain, where each term denotes an atomic concept. Each facets is designed separately, and models a distinct aspect of the domain. Each facet consists of a terminology, i.e., a finite set of names or terms, structured by a subsumption relation. In our ontology, facets correspond to geophysical, geopolitical, and Point of Interest aspects. The developed ontology allows data present in OSM to be used as instances of an ontology. A significant semantic support is brought to volun- teered geographic data allowing the search at a conceptual level. Conscious of the heterogeneous nature of geospatial data, we do not provide any contribution to the spatial component, already well structured. Instead we aim at improv- ing the non-spatial content which is per se heterogeneous and only syntactically semi-structured. The non-spatial content is now accessible through a semantic structure which allows for the conceptual search. The use of “facet” takes into account the different aspects involved (i.e., natural area, political area and Point of Interest) thus obtaining a complete characterization of the domain of interest. Population Approach. Differently from what happens for standard ontologies, we decide to manage individuals of our ontology using a different approach. The idea is to have individuals in form of rows of views on a relational database. We will use a new technology able to manage unstructured data with a paral- lel support. This technology is similar to NoSQL databases but it maintains a relational structure. The population approach is chosen relying on how our ontology will be used by the future applications. In cases where individuals exist in ontologies, they represent “concrete” objects that have to be classified. However, an ontology need not include any individuals, but one of the general purposes of an ontology is to provide a means of classifying individuals, even if those individuals are not explicitly part of the ontology. For these reasons, we will use database views to manage our individuals. In this way, we are able to directly manage the string type returned by databases. This choice is made to simplify future usage. Indeed, if we have an individual, we have an URI to represent the corresponding object. With our approach, we perform a matching between OSM objects and our concepts, using simple standard queries on a relational database. With matching we mean the relationship between particular tags and concepts that allow to automatise the population process. 4 Application & Use of Ontology The developed approach finds a relevant application in the context of our ongo- ing research project aimed at realtime integration of textual data coming from microblogs and crowdsourced geospatial data [4]. Specifically, the target appli- cation first gathers in realtime from Twitter, through the use of the suitable streaming API, both explicitly georeferenced tweets and tweets missing an ex- plicit geospatial information. Then, tweet contents are geoparsed relying on the textual descriptions of OSM objects. Social relationships among users and their activities (such as mentions and retweets) are then exploited to further refine tweet (and corresponding user) geopositioning, since some social relations are strong indicators of spatial proximity. Georeferencing information belonging from different sources (content vs social interaction analysis), appropriately weighted according to the respective confidence, are then merged. Since explicit tagging is used only in a small percentage of tweets, we will use the geospatial information implicit in the messages to improve the resolution of the georeferencing process. This is useful for several applications. For example we could create heat maps to highlight areas from which tweets are generated or areas which tweets refer to. This could be very useful in application such as emergency response. In our ongoing project [4], we proposed to extract implicit geoinformation contained in tweet contents using a semantic gazetteer. The use of external knowledge like a gazetteer can help us to detect toponyms in the tweet contents. In order to extract toponyms from text, we plan to consider three different approaches: 1. Perform a simple string matching with toponyms extracted from OSM. 2. Improve this matching by relying on a geospatial ontology. A semantic sup- port can help us to find toponyms using also the context of the tweet (e.g. the presence of word "cinema" allow us to search a toponym in the proper class of the ontology), improving the precision of matching. 3. Rely on NLP classifiers in order to identify reference to geographical lo- cations from texts in order to identify toponyms by the context of terms (prepositions, verbs, ...). These three approaches can help us to understand and evaluate the extend to which semantic support improves geotagging of non-geotagged tweets. The semantic enrichment allows us to obtain new geographical knowledge that can be exploited to improve the set of extracted geonames and thus the quality of geotagging. To evaluate the use of ontology in this type of application, we have a clear plan for a two-fold evaluation. More specifically, we will perform the following two types of evaluation: 1. comparison between manually geotagged tweets and automatically geotagged tweets; 2. for geotagged tweets, comparison between the known (exact) position and the location inferred by our approach (only in case we have tweets with coordinates and containing geonames). The first aim of our evaluation is to understand if our approach works correctly on a specific dataset. References 1. A. Ballatore and M. Bertolotto. Semantically enriching vgi in support of implicit feedback analysis. In Web and Wireless Geographical Information Systems, volume 6574 of LNCS, pages 78–93. Springer, 2011. 2. G. Begelman, P. Keller, F. Smadja, et al. Automated tag clustering: Improving search and exploration in the tag space. In Collaborative Web Tagging Workshop, 2006. 3. M. Codescu, G. Horsinka, O. Kutz, T. Mossakowski, and R. Rau. Osmonto-an ontology of openstreetmap tags. State of the map Europe (SOTM-EU) 2011, 2011. 4. L. Di Rocco, M. Bertolotto, B. Catania, G. Guerrini, and T. Cosso. Extracting fine-grained implicit georeferencing information from microblogs exploiting crowd- sourced gazetteers and social interactions. In AGILE International Conference on Geographic Information Science, 2016. 5. M. F. Goodchild. Citizens as sensors: the world of volunteered geography. GeoJour- nal, 69(4):211–221, 2007. 6. E. Kontopoulos, C. Berberidis, T. Dergiades, and N. Bassiliades. Ontology-based sentiment analysis of twitter posts. Expert systems with applications, 40(10):4065– 4074, 2013. 7. C. Masolo, S. Borgo, A. Gangemi, N. Guarino, A. Oltramari, R. Oltramari, L. Schneider, L. P. Istc-cnr, and I. Horrocks. Wonderweb deliverable d17. the won- derweb library of foundational ontologies and the dolce ontology, 2002. 8. S. R. Ranganathan. Prolegomena to library classification. The Five Laws of Library Science, 1967. 9. C. Stadler, J. Lehmann, K. Höffner, and S. Auer. Linkedgeodata: A core for a web of spatial open data. Semantic Web, 3(4):333–354, 2012.