<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automatic recommendations of categories for geospatial entities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgos Giannopoulos</string-name>
          <email>giann@imis.athena-innovation.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikos Karagiannakis</string-name>
          <email>nkaragiannakis@imis.athena-innovation.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Spiros Athanasiou</string-name>
          <email>spathan@imis.athena-innovation.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IMIS Institute</institution>, <addr-line>R.C. ATHENA</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>19</volume>
      <issue>2015</issue>
      <fpage>3</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>Over the last years, thanks to the Open Data initiative and the Semantic Web, there has been a vast increase in user contributed data. In several cases (e.g. OpenStreetMap, Geonames), the respective data include geospatial information, that is, the coordinates and/or the precise geometries of buildings, roads, areas, etc. In such cases, proper schemas are defined to allow users to annotate the entities they contribute. However, browsing through a large and unstructured list of categories in order to select the most fitting one might be time consuming for the end users. In this paper, we present an approach for recommending categories for geospatial entities, based on previously annotated entities. Specifically, we define and implement a series of training features in order to represent the geospatial entities and capture their relation with the categories they are annotated with. These features involve spatial and textual properties of the entities. We evaluate two different approaches (SVM and kNN) on several combinations of the defined training features and we demonstrate that the best algorithm (SVM) can provide recommendations with high precision, utilizing the defined features. The aforementioned work is deployed in OSMRec, a plugin for the JOSM tool for editing OpenStreetMap.</p>
      </abstract>
      <kwd-group>
        <kwd>Category recommendation</kwd>
        <kwd>spatial entities</kwd>
        <kwd>OSM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>[Information systems]: [Recommender systems]</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>OpenStreetMap (OSM, https://www.openstreetmap.org/) is an initiative for crowdsourcing map information from users. It is based on a large and active community that contributes both data and tools that facilitate the constant enrichment and enhancement of OSM maps. One of the most prominent tools of OSM is JOSM (https://josm.openstreetmap.de/), a graphical tool that allows users to (a) download all the spatial features (i.e. roads, buildings, stations, areas, etc.) of a geographic area from OSM, (b) visualize these features on a map overlay, (c) add, change or delete spatial features and (d) reload the changed dataset into the publicly accessible OSM map dataset. Specifically, through its interface, the user can draw the geometry of a spatial feature and then annotate it with categories (classes, tags), in the form of key-value pairs, that assign semantics to it. Each entity may belong to multiple classes; for example, a building can be further characterized as a school or a bar. The system allows the selection of such categories from an existing list (http://wiki.openstreetmap.org/wiki/Map_features) or the definition of new categories by the user.
At this point lies the problem we handle, which is expected in such crowdsourcing approaches: the user can define an arbitrary category which (a) might already exist in a slightly different form or (b) might have no actual semantic meaning in the context of OSM and, thus, degrade the quality of the inserted information. Besides, the user might be discouraged by the large number of available categories and quit the task of annotating the inserted spatial entity. Trying to avoid (a) and (b) by restricting the category selection only to categories that already exist in the OSM dataset has important disadvantages. First, it would cancel a large part of the crowdsourcing character of OSM and, second, it is not guaranteed that the OSM categories, in their current form, cover every aspect of characterizing spatial features. Instead, the most suitable course of action is to guide the users during their annotation process, recommending to them already existing categories to annotate the new spatial entities.
In this work, we propose such a process that trains recommendation models on existing, annotated geospatial datasets and is able to recommend categories for newly created entities. The main contribution of our work lies in the problem-specific training features we define in order to capture the relations between the spatial and textual properties of each spatial entity and the categories-classes that characterize it. Essentially, this way, the proposed framework takes into account the similarity of the new spatial entities to already existing (and annotated with categories) ones on several levels: spatial similarity, e.g. the number of nodes of the feature's geometry, and textual similarity, e.g. common important keywords in the names of the features.</p>
      <p>
        We perform an extensive evaluation that assesses the effectiveness of several feature subsets deployed in the frame of two classification algorithms: (i) Support Vector Machines (SVM) and (ii) k Nearest Neighbour (kNN). The experimental results show that a proper combination of classification algorithm and training features can achieve a recommendation precision of more than 90%, rendering the proposed approach suitable for deployment on real world use cases. To this end, the proposed framework is implemented as a JOSM plugin (http://wiki.openstreetmap.org/wiki/JOSM/Plugins/OSMRec), allowing the real-time and effective annotation of newly created spatial entities in OSM.
To the best of our knowledge, this is the first approach on recommending annotation categories for spatial entities in OSM. Another existing work on a similar problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focuses on classifying road segments in OSM; thus, it specializes only in geometrical and topological features of the specific entities and reduces the space of recommendation categories from more than 1000 to only 21.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 describes our proposed method, including the defined training features and the assessed algorithms. Section 3 presents the evaluation of training features and algorithms in terms of recommendation performance and Section 4 concludes.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RECOMMENDATION MODELS</title>
      <p>Next, we describe the model training and recommendation process we follow. In general, our goal is, given a training set, that is, a set of spatial entities that are already annotated with categories, to be able to produce category recommendations for each new, unannotated entity. Thus, we aim at (a) learning correlations between attributes of the spatial entities and the respective categories and (b) "matching", through these correlations, new entities with already annotated ones, in order to produce recommendations.
The first step to this end is to define proper, problem-specific training features that describe each entity and capture its latent relation with the category that annotates it. In Section 2.1 we describe the defined features that correspond to geometric and textual properties of the entities. The next step is to feed training entities, expressed through their features, into a classification algorithm that utilizes them to classify new entities to categories. We applied both model based (Support Vector Machines) and memory based (k Nearest Neighbour) state of the art classification methods. Each of the algorithms is described in Section 2.2.</p>
    </sec>
    <sec id="sec-4">
      <title>2.1 Training features</title>
      <p>In our scenario, the training items are the geospatial entities that have already been annotated with categories. However, since the same entity might be annotated with more than one category, we consider, for each entity, as many training items as its categories. The items are represented in exactly the same way, w.r.t. the values of their training features, with the exception of the different label that corresponds to the different class. Each item is represented as a feature vector, with each feature corresponding to a property of the item. Next, we present the implemented features:</p>
      <p>Geometry type. Six distinct geometry types are identified: Point, LineString, Polygon, LinearRing, Circle and Rectangle. This is a boolean feature, so six positions in the feature vector are assigned to it.</p>
      <p>Number of geometry points. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector that are used to represent it. In the latter case, each position represents a different range. In total, we define (based on a statistical analysis of the frequencies of entities having certain numbers of points) 13 ranges, thus mapping this integer characteristic into 13 boolean features: [1-10], (10-20], ..., (200-300], (300-500], (500-1000], (1000-...). So, according to the number of points of an entity's geometry, the proper position is set to 1, while the rest of the positions are set to 0.</p>
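      <p>The range-based boolean encoding described above can be sketched as follows. The first and last boundaries follow the ranges listed in the text ([1-10], (10-20], (200-300], (300-500], (500-1000], (1000-...)); the intermediate boundaries (30 through 150) are assumptions, since the full list is not given.</p>
      <preformat>
```python
# One-hot range encoding of an integer feature, as described above.
# Boundaries 30, 40, 50, 75, 100 and 150 are assumptions filling the
# elided "..." between (10-20] and (200-300].
import bisect

# Upper bounds of the first 12 ranges; the 13th range is open-ended.
BOUNDS = [10, 20, 30, 40, 50, 75, 100, 150, 200, 300, 500, 1000]

def encode_num_points(n):
    """Return a 13-position one-hot vector for the number of geometry points."""
    index = bisect.bisect_left(BOUNDS, n)  # first upper bound that is at least n
    vector = [0] * (len(BOUNDS) + 1)
    vector[index] = 1
    return vector

print(encode_num_points(7))    # sets position 0, i.e. range [1-10]
print(encode_num_points(1200)) # sets the open-ended last position
```
      </preformat>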
      <p>Area of geometry. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector that are used to represent the various ranges of areas we consider. In the latter case, we define intuitively (and considering that in this problem setting we are mainly interested in entities that are buildings) 24 ranges of increasing length in square meters, up to an area of 4000 square meters, beyond which we consider the 25th range of (4000-...).</p>
      <p>Mean edge length. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector representing different ranges of the mean length of the geometry's edges. In our case, we define 23 ranges, starting from lengths of less than 2 meters and ending with lengths of more than 200 meters.</p>
      <sec id="sec-4-1">
        <title>Variance of edge lengths. Depending on the algo</title>
        <p>rithm this might be a double precision number, or a
set of boolean positions in the feature vector
representing di erent variations of a geometry's edges from the
mean value. Likewise, we de ne 36 ranges, based on
statistics on our training dataset.</p>
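        <p>A minimal sketch of computing the mean edge length and edge-length variance of a geometry, assuming a closed ring of (x, y) points in a projected, meter-based coordinate system (the paper does not specify the projection):</p>
        <preformat>
```python
# Mean and variance of the edge lengths of a closed polygon ring.
# Assumes planar (x, y) coordinates in meters.
import math

def edge_lengths(points):
    """Lengths of the edges of a closed ring of (x, y) points."""
    return [math.dist(points[i], points[(i + 1) % len(points)])
            for i in range(len(points))]

def mean_and_variance(points):
    lengths = edge_lengths(points)
    mean = sum(lengths) / len(lengths)
    variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return mean, variance

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(mean_and_variance(square))  # (10.0, 0.0): all four edges are equal
```
        </preformat>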
        <p>Textual features. For each entity of the training set, we extract the textual description of its name. We consider each word separately and count their frequencies within the dataset. Then, we sort the list of words in descending frequency and filter out words with frequency less than 20. Finally, we apply a stopword list and remove words without any particular meaning. What remains are special meaning identifiers, such as avenue, church, school, park, etc. Each of these special keywords is used as a separate boolean feature.</p>
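        <p>The keyword selection step above can be sketched as follows; the frequency threshold (20) comes from the text, while the stopword list and the example names are illustrative:</p>
        <preformat>
```python
# Select special-meaning keywords from entity names: count word
# frequencies, keep words occurring at least MIN_FREQUENCY times,
# and drop stopwords. Stopword list and input names are illustrative.
from collections import Counter

STOPWORDS = {"the", "of", "and"}
MIN_FREQUENCY = 20

def select_keywords(names, min_freq=MIN_FREQUENCY):
    counts = Counter(word for name in names for word in name.lower().split())
    return sorted(word for word, freq in counts.items()
                  if freq >= min_freq and word not in STOPWORDS)

names = (["Central Park"] * 25 + ["Park Avenue"] * 10
         + ["The School of Athens"] * 30)
print(select_keywords(names))  # ['athens', 'central', 'park', 'school']
```
        </preformat>
        <p>Each selected keyword then becomes one boolean position in the feature vector, set to 1 when the entity's name contains it.</p>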
        <p>
          We should note here that we select different representations (double precision number or set of boolean features) in order to favour the functionality of the different models and similarity functions we apply. Namely, we consider boolean features in the case of SVM to facilitate the effectiveness of the training model, since it is well known that such models perform better when the input feature vectors are vertically normalized within (or at least close to) the interval [0,1]. That is, by defining several area ranges corresponding to separate feature positions, we allow the model to correlate these different areas (feature positions) with different training classes. On the other hand, in the case of k-NN, where similarity measures such as the Euclidean distance or the Cosine similarity are applied, using the exact value of a feature (e.g. area of geometry) is preferable in order to better quantify the difference between two entities.
        </p>
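        <p>To make the contrast concrete, here is a sketch of the same raw area value under the two representations (boolean ranges for SVM, exact value for kNN); the three range boundaries are illustrative, not the 25 ranges actually used:</p>
        <preformat>
```python
# The same area feature in the two representations discussed above.
# AREA_BOUNDS is illustrative, not the paper's actual 25 ranges.
import bisect

AREA_BOUNDS = [100.0, 500.0, 4000.0]  # upper bounds; last range is open-ended

def area_as_boolean_features(area):
    """One-hot range encoding, suited to SVM training."""
    vector = [0] * (len(AREA_BOUNDS) + 1)
    vector[bisect.bisect_left(AREA_BOUNDS, area)] = 1
    return vector

def area_as_scalar(area):
    """Exact value, suited to kNN distance computations."""
    return [area]

print(area_as_boolean_features(250.0))  # [0, 1, 0, 0]
print(area_as_scalar(250.0))            # [250.0]
```
        </preformat>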
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.2 Algorithms</title>
      <p>SVM. The first algorithm applies multiclass SVM classification, considering as training items the geospatial entities themselves and as labels the categories that characterize them. The method maps the training entities into a multidimensional feature space and aims at finding the optimal hyperplanes that discriminate the entities belonging to different categories. The optimality of the hyperplane depends on the selected parameter C, which adjusts the trade-off between misclassified training entities and optimal discrimination of correctly classified entities. The output of the training process is a model (essentially a weight vector) that is able to map the feature vector of a new, unannotated entity to a set of categories, providing, also, a matching score for each category. This way, one can obtain the top-n most fitting categories for a new spatial entity.
kNN. The second algorithm is the well known approach of performing kNN on the initial set of training entities. That is, the algorithm compares the new entity with each one of the training entities and recommends the categories that characterize the most similar training entities. As similarity measures we consider cosine similarity and euclidean distance. The two similarity functions are initially applied to the feature vectors of the respective entities. However, we empirically observed that, in the specific setting, boosting the vector-calculated similarity score with the area-based and point-based similarity scores between two entities improves the precision of the algorithms. Specifically, we use the following formula in our experiments to calculate the similarity score S between two entities u, v:</p>
      <p>S_cos = cosSim + 2 * (1 - areaDist) + 2 * (1 - pointDist), where areaDist = |area_u - area_v| / max(area_u, area_v) and pointDist = |points_u - points_v| / max(points_u, points_v) (1)</p>
      <p>Here cosSim is the cosine similarity on the whole feature vectors of the two entities, and it is interchanged in our experiments with the similarity that is based on the euclidean distance: euSim = 1 - euDist / max(euDist) (2).</p>
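      <p>The boosted similarity of Eq. (1) can be sketched directly in code; function and variable names mirror the formula, and the example values are illustrative:</p>
      <preformat>
```python
# Boosted similarity of Eq. (1): cosine similarity on the feature
# vectors, plus area-based and point-based boosting terms.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def boosted_similarity(vec_u, vec_v, area_u, area_v, points_u, points_v):
    # areaDist and pointDist are normalized absolute differences, as in Eq. (1)
    area_dist = abs(area_u - area_v) / max(area_u, area_v)
    point_dist = abs(points_u - points_v) / max(points_u, points_v)
    return (cosine_similarity(vec_u, vec_v)
            + 2 * (1 - area_dist)
            + 2 * (1 - point_dist))

s = boosted_similarity([1, 0, 1], [1, 0, 1],
                       area_u=100.0, area_v=80.0, points_u=5, points_v=5)
print(s)  # 1.0 (cosine) + 1.6 (area term) + 2.0 (point term)
```
      </preformat>
      <p>Following Eq. (2), cosine_similarity can be swapped for the euclidean-distance-based similarity without changing the boosting terms.</p>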
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTAL EVALUATION</title>
      <p>Next, we present the evaluation of the proposed methods, w.r.t. the recommendation precision they achieve. First, we describe the dataset and the evaluation methodology. Then, we compare the two algorithms and, finally, we discuss the contribution of individual groups of training features.</p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Dataset and Evaluation methodology</title>
      <p>We performed our evaluation on a subset of OSM data covering Athens, Greece, which we exported through the Overpass API from OSM's site. The dataset contains overall 111,491 geospatial entities, which were properly divided into training, validation and test sets, as will be demonstrated next. The total number of distinct OSM categories/classes was 1,421. The dataset is partitioned into five subsets of similar sizes. Then, combining different subsets each time to make (a) the training, (b) the validation and (c) the test set, we create five different arrangements for five-fold cross-validation. In every fold, the validation set was used to tune the parameters of the classification model and the test set was the one where the actual evaluation of the method was performed. Of course, the validation part was applied only with SVM, since kNN does not train any model to be tuned. As our evaluation measure, we consider the precision of category recommendations, i.e. the ratio of correct recommendations to the total number of recommendations:</p>
      <p>P = #correct category recommendations / #total category recommendations (3)
We consider three variations of the measure, depending on how strictly we define recommendation correctness:
P1: In this case, a recommendation is considered correct if the recommended category with the highest rank from the recommendation model is indeed a category that characterizes the test geospatial entity.</p>
      <p>P5: In this case, a recommendation is considered correct if one of the five recommended categories with the highest rank from the recommendation model is indeed a category that characterizes the test geospatial entity.</p>
      <p>P10: Similarly, a recommendation is considered correct if one of the ten recommended categories with the highest rank from the recommendation model is indeed a category that characterizes the test geospatial entity.</p>
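      <p>The three measures are instances of precision at k: a recommendation counts as correct if any of the top-k recommended categories is among the entity's true categories. A minimal sketch with illustrative data:</p>
      <preformat>
```python
# Precision at k over a batch of entities, matching the P1/P5/P10
# definitions above. Recommendation and truth data are illustrative.

def precision_at_k(recommendations, true_categories, k):
    """recommendations: per-entity ranked category lists;
    true_categories: per-entity sets of correct categories."""
    correct = sum(1 for ranked, truth in zip(recommendations, true_categories)
                  if any(cat in truth for cat in ranked[:k]))
    return correct / len(recommendations)

recs = [["school", "bar", "cafe"], ["park", "garden", "forest"]]
truth = [{"bar"}, {"playground"}]
print(precision_at_k(recs, truth, 1))  # 0.0: neither top-1 category is correct
print(precision_at_k(recs, truth, 3))  # 0.5: "bar" appears in the first top-3
```
      </preformat>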
    </sec>
    <sec id="sec-8">
      <title>3.2 Algorithm Comparison</title>
      <p>We performed two rounds of experiments for the SVM method. The first one regarded optimizing the classification model by training it with different parameterizations of C (the trade-off parameter between the margin size and the training error of the algorithm). After obtaining, through the 5-fold validation, the optimal parameters on the validation set, we ran the actual evaluation of the recommendation model, using the three precision measure variations defined above. Note that the plain k-NN algorithm is a purely memory based algorithm, so it requires no training. The results for the best performing configuration, for each algorithm, are given in Table 1.</p>
      <sec id="sec-8-1">
        <title>Features</title>
      </sec>
      <sec id="sec-8-2">
        <title>Geometry Type</title>
      </sec>
      <sec id="sec-8-3">
        <title>Points and Area</title>
      </sec>
      <sec id="sec-8-4">
        <title>Mean and Variance</title>
      </sec>
      <sec id="sec-8-5">
        <title>Simple Geometry</title>
      </sec>
      <sec id="sec-8-6">
        <title>Only Textual</title>
      </sec>
      <sec id="sec-8-7">
        <title>Simple Geom. &amp; Text. All C</title>
        <p>5
200000
10000
170000
100000
200000
650</p>
      </sec>
      <sec id="sec-8-8">
        <title>Valid. Set</title>
        <p>P 1</p>
      </sec>
      <sec id="sec-8-9">
        <title>Test Set</title>
        <p>P 5</p>
        <p>It is obvious that SVM outperforms kNN by far. Regarding the strictest measure (P1), the recommendation model achieves a precision of 79%, which is a very good result, considering that it is measured on real-world OSM data without any restrictions that might favour the method. Further, when we consider the other two measures, the precision becomes even higher, reaching 88% and 92% for P5 and P10 respectively. Given that recommending 5 or 10 categories to the user is a realistic option, P5 and P10 are suitable for evaluation in a real-world application, and thus the effectiveness of the system proves to be very high.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>3.3 Training Features Analysis</title>
      <p>The evaluation presented in Section 3.2 indicated that the SVM algorithm produces by far the most precise recommendations in our setting. In this section, we analyze which training feature combinations achieve the highest precision results when applied with SVM. To this end, we ran several experiments, following the same five-fold training-validation-testing process, selecting different groups of training features each time. Next, we report on these results (Table 2).
Geometry Type. This configuration consists in using vectors that contain only the features regarding the geometry type of the OSM entities. The geometry types are: polygon, linestring, linear ring, point, circle and rectangle.
Points and Area. This configuration consists in using vectors that contain boolean features that correspond to the number of points and the area of the entities.</p>
      <p>Mean and Variance. This configuration consists in using only the mean and variance features.</p>
      <p>Simple Geometry. This configuration consists in using vectors containing only the geometry information about the OSM entities, without any class, relation or textual features in the vectors. In this experiment we also excluded the mean and variance geometry features from the vector.</p>
      <p>Only Textual. This configuration consists in using vectors with features regarding only textual information. The precision came out very low with this configuration because only a small fraction of the OSM entities contain information about their names.</p>
      <sec id="sec-9-1">
        <title>Simple Geometry and Textual. This con guration con</title>
        <p>sists in using the geometry features (without the features
regarding mean and variance values), plus the textual
information extracted from the names of the entities. This is the
best achieving con guration in the whole set of experiments,
on all evaluated methods.</p>
        <p>All. This configuration consists in using all available features. As mentioned above, the class and relation features exist only in the training sets. The results are slightly worse than those of the best performing configuration above.</p>
        <p>In brief, the above analysis indicated that, in the specific scenario, the most highly contributing training features are the ones considering simple geometric properties of the entities. The textual features, although they perform very poorly on their own (probably due to their sparseness), can boost the performance of the simple geometric features.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>In this paper, we presented a framework for producing category recommendations for spatial entities, based on previously annotated entities. We defined a set of problem-specific training features that, combined with SVM classification, result in recommendations of high precision. Although we base our work on the OSM use case, the proposed framework is general enough to be used in any other scenario that involves spatial entities and an initial set of annotations on a few of them. Further, our method is already implemented as a plugin for the JOSM OSM editing tool. Our next steps involve defining further meaningful training features and performing a more extended evaluation of the effectiveness of the algorithms, e.g. by excluding very frequent categories from the experimental data.</p>
    </sec>
    <sec id="sec-11">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>This research is conducted as part of the EU project GeoKnow (FP7-ICT-2011-8, 318159).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Jilani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Corcoran</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bertolotto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Automated Highway Tag Assessment of OpenStreetMap Road Networks</article-title>
          .
          <source>In Proceedings of SIGSPATIAL'14</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>