=Paper=
{{Paper
|id=Vol-1405/paper-02
|storemode=property
|title=Automatic Recommendations of Categories for Geospatial Entities
|pdfUrl=https://ceur-ws.org/Vol-1405/paper-02.pdf
|volume=Vol-1405
|dblpUrl=https://dblp.org/rec/conf/recsys/GiannopoulosKSA15
}}
==Automatic Recommendations of Categories for Geospatial Entities==
Giorgos Giannopoulos, Nikos Karagiannakis, Dimitrios Skoutas, Spiros Athanasiou (IMIS Institute, R.C. ATHENA)
giann@imis.athena-innovation.gr, nkaragiannakis@imis.athena-innovation.gr, dskoutas@imis.athena-innovation.gr, spathan@imis.athena-innovation.gr

Copyright held by the author(s). LocalRec'15, September 19, 2015, Vienna, Austria.

===ABSTRACT===
Over the last years, thanks to Open Data initiatives and the Semantic Web, there has been a vast increase in user contributed data. In several cases (e.g. OpenStreetMap, Geonames), the respective data include geospatial information, that is the coordinates and/or the precise geometries of buildings, roads, areas, etc. In such cases, proper schemas are defined to allow users to annotate the entities they contribute. However, browsing through a large and unstructured list of categories in order to select the most fitting one might be time consuming for the end users. In this paper, we present an approach for recommending categories for geospatial entities, based on previously annotated entities. Specifically, we define and implement a series of training features in order to represent the geospatial entities and capture their relation with the categories they are annotated with. These features involve spatial and textual properties of the entities. We evaluate two different approaches (SVM and kNN) on several combinations of the defined training features and we demonstrate that the best algorithm (SVM) can provide recommendations with high precision, utilizing the defined features. The aforementioned work is deployed in OSMRec, a plugin for the JOSM tool for editing OpenStreetMap.

===Categories and Subject Descriptors===
[Information systems]: [Recommender systems]

===Keywords===
Category recommendation, spatial entities, OSM

===1. INTRODUCTION===
OpenStreetMap (OSM, https://www.openstreetmap.org/) is an initiative for crowdsourcing map information from users. It is based on a large and active community that contributes both data and tools that facilitate the constant enrichment and enhancement of OSM maps. One of the most prominent tools of OSM is JOSM (https://josm.openstreetmap.de/), a graphical tool that allows users to (a) download all the spatial features (i.e. roads, buildings, stations, areas, etc.) of a geographic area from OSM, (b) visualize these features on a map overlay, (c) add, change or delete spatial features and (d) reload the changed dataset into the publicly accessible OSM map dataset. Specifically, through its interface, the user can draw the geometry of a spatial feature. Then, she can annotate the feature with categories, in the form of key-value pairs. These spatial entities can also be annotated with categories (classes, tags) that assign semantics to them. Each entity may belong to multiple classes; for example, a building can be further characterized as a school or a bar.

The system allows the selection of such categories from an existing list (http://wiki.openstreetmap.org/wiki/Map_features) or the definition of new categories by the user. At this point lies the problem we handle, which is expected in such crowdsourcing approaches: the user can define an arbitrary category which (a) might already exist in a slightly different form or (b) might have no actual semantic meaning in the context of OSM and, thus, degrade the quality of the inserted information. Besides, the user might be discouraged by the large amount of available categories and quit the task of annotating the inserted spatial entity. Trying to avoid (a) and (b) by restricting the category selection only to categories that already exist in the OSM dataset has important disadvantages. First, it would cancel a large part of the crowdsourcing character of OSM and, second, it is not guaranteed that the OSM categories, in their current form, cover every aspect of characterizing spatial features. Instead, the most suitable course of action is to guide the users during their annotation process, recommending to them already existing categories to annotate the new spatial entities.

In this work, we propose such a process, which trains recommendation models on existing, annotated geospatial datasets and is able to recommend categories for newly created entities. The main contribution of our work lies in the problem-specific training features we define in order to capture the relations between the spatial and textual properties of each spatial entity and the categories (classes) that characterize it. Essentially, this way, the proposed framework takes into account the similarity of the new spatial entities to already existing (and annotated) ones on several levels: spatial similarity, e.g. the number of nodes of the feature's geometry, and textual similarity, e.g. common important keywords in the names of the features.

We perform an extensive evaluation that assesses the effectiveness of several feature subsets deployed in the frame of two classification algorithms: (i) Support Vector Machines (SVM) and (ii) k Nearest Neighbours (kNN). The experimental results show that a proper combination of classification algorithm and training features can achieve recommendation precision of more than 90%, rendering the proposed approach suitable for deployment in real world use cases. To this end, the proposed framework is implemented as a JOSM plugin (http://wiki.openstreetmap.org/wiki/JOSM/Plugins/OSMRec), allowing the real-time and effective annotation of newly created spatial entities in OSM.

To the best of our knowledge, this is the first approach to recommending annotation categories for spatial entities in OSM. Another existing work on a similar problem [1] focuses on classifying road segments in OSM, thus it specializes only in geometrical and topological features of the specific entities and reduces the space of recommendation categories from more than 1000 to only 21.

The rest of the paper is organized as follows. Section 2 describes our proposed method, including the defined training features and the assessed algorithms. Section 3 presents the evaluation of training features and algorithms in terms of recommendation performance and Section 4 concludes.
===2. RECOMMENDATION MODELS===
Next, we describe the model training and recommendation process we follow. In general, our goal is, given a training set, that is a set of spatial entities that are already annotated with categories, to be able to produce category recommendations for each new, unannotated entity. Thus, we aim at (a) learning correlations between attributes of the spatial entities and the respective categories and (b) "matching", through these correlations, new entities with already annotated ones, in order to produce recommendations.

The first step to this end is to define proper, problem-specific training features that describe each entity and capture its latent relation with the category that annotates it. In Section 2.1 we describe the defined features, which correspond to geometric and textual properties of the entities. The next step is to feed the training entities, expressed through their features, into a classification algorithm that utilizes them to classify new entities into categories. We applied both model based (Support Vector Machines) and memory based (k Nearest Neighbours) state of the art classification methods. Each of the algorithms is described in Section 2.2.

===2.1 Training features===
In our scenario, the training items are the geospatial entities that have already been annotated with categories. However, since the same entity might be annotated with more than one category, we consider, for each entity, as many training items as its categories. These items are represented in exactly the same way, w.r.t. the values of their training features, with the exception of the different label that corresponds to the different class. Each item is represented as a feature vector, with each feature corresponding to a property of the item. Next, we present the implemented features:

* Geometry type. Six distinct geometry types are identified: Point, LineString, Polygon, LinearRing, Circle and Rectangle. This is a boolean feature, so six positions in the feature vector are assigned to it.
* Number of geometry points. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector that are used to represent it. In the latter case, each position represents a different range. In total, we define (based on a statistical analysis of the frequencies of entities having certain numbers of points) 13 ranges, thus mapping this integer characteristic into 13 boolean features: [1-10], (10-20], ..., (200-300], (300-500], (500-1000], (1000-...). So, according to the number of points of an entity's geometry, the proper position is set to 1, while the rest of the positions are set to 0.
* Area of geometry. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector that are used to represent the various ranges of areas we consider. In the latter case, we define intuitively (and considering that in this problem setting we are mainly interested in entities that are buildings) 24 ranges of increasing length in square meters, up to the area of 4000 m2, where we consider the 25th range of (4000-...).
* Mean edge length. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector representing different ranges of the mean length of the geometry's edges. In our case, we define 23 ranges, starting from lengths less than 2 meters and ending with lengths more than 200 meters.
* Variance of edge lengths. Depending on the algorithm, this might be a double precision number, or a set of boolean positions in the feature vector representing different variations of a geometry's edges from the mean value. Likewise, we define 36 ranges, based on statistics on our training dataset.
* Textual features. For each entity of the training set, we extract the textual description of its name. We consider each word separately and count their frequency within the dataset. Then, we sort the list of words in descending frequency and filter out words with frequency less than 20. Finally, we apply a stopword list and remove words without any particular meaning. What remains are special meaning identifiers, such as avenue, church, school, park, etc. Each of these special keywords is used as a separate boolean feature (a sketch of this extraction step is given after this list).
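The textual-feature step above can be made concrete with a short sketch. The following Python snippet is only an illustration of the described pipeline (word frequency counting, a frequency threshold of 20, stopword removal, boolean keyword features); the particular stopword list, the function names and the assumed list-of-names input are not specified in the paper.

<pre>
from collections import Counter

# Hypothetical stopword list; the paper does not specify which one is used.
STOPWORDS = {"the", "of", "and", "a", "to", "in"}

def build_textual_vocabulary(entity_names, min_frequency=20):
    """Select special-meaning keywords from entity names (illustration of Section 2.1).

    entity_names: list of name strings of the training entities (assumed input format).
    Returns the keywords that become boolean textual features.
    """
    counts = Counter()
    for name in entity_names:
        counts.update(word.lower() for word in name.split())
    # Keep words appearing at least `min_frequency` times, then drop stopwords.
    frequent = [w for w, c in counts.most_common() if c >= min_frequency]
    return [w for w in frequent if w not in STOPWORDS]

def textual_feature_vector(name, vocabulary):
    """One boolean position per selected keyword."""
    words = {w.lower() for w in name.split()}
    return [1 if keyword in words else 0 for keyword in vocabulary]
</pre>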
We should note here that we select different representations (double precision number or set of boolean features) in order to favour the functionality of the different models and similarity functions we apply. Namely, we consider boolean features in the case of SVM to facilitate the effectiveness of the training model, since it is well known that such models perform better when the input feature vectors are vertically normalized within (or at least close to) the interval [0,1]. That is, by defining several area ranges corresponding to separate feature positions, we allow the model to correlate these different areas (feature positions) with different training classes. On the other hand, in the case of kNN, where similarity measures such as the Euclidean distance or the cosine similarity are applied, using the exact value of a feature (e.g. area of geometry) is preferable in order to better quantify the difference between two entities.
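As an illustration of the two representations, the following Python sketch encodes a numeric geometric property (here, the number of geometry points) either as a one-hot range vector (SVM case) or as its raw value (kNN case). The intermediate bin boundaries below are hypothetical; the paper derives its 13/24/23/36 ranges from dataset statistics without listing them all.

<pre>
import bisect

# Hypothetical upper bounds producing 13 ranges ([1-10], (10-20], ..., (1000-...));
# only the first and last boundaries are given explicitly in the paper.
POINT_BINS = [10, 20, 30, 40, 50, 75, 100, 150, 200, 300, 500, 1000]

def one_hot_range(value, upper_bounds):
    """Boolean (one-hot) encoding of a numeric value into range positions,
    as used for the SVM feature vectors."""
    vector = [0] * (len(upper_bounds) + 1)           # +1 for the open-ended last range
    vector[bisect.bisect_left(upper_bounds, value)] = 1
    return vector

# SVM representation: one-hot range features.
svm_points_features = one_hot_range(37, POINT_BINS)

# kNN representation: keep the raw value, so that distance/similarity measures
# can quantify how different two geometries actually are.
knn_points_feature = 37.0
</pre>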
===2.2 Algorithms===
SVM. The first algorithm applies multiclass SVM classification, considering as training items the geospatial entities themselves and as labels the categories that characterize them. The method maps the training entities into a multi-dimensional feature space and aims at finding the optimal hyperplanes that discriminate the entities belonging to different categories. The optimality of a hyperplane depends on the selected parameter C, which adjusts the trade-off between misclassified training entities and optimal discrimination of correctly classified entities. The output of the training process is a model (essentially a weight vector) that is able to map the feature vector of a new, unannotated entity to a set of categories, providing, also, a matching score for each category. This way, one can obtain the top-n most fitting categories for a new spatial entity.
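The paper does not name a particular SVM implementation; as a rough sketch of the described usage (multiclass training with a tunable C, then ranking categories by matching score to obtain the top-n), a scikit-learn based version could look as follows. Function names and the data layout are illustrative assumptions.

<pre>
import numpy as np
from sklearn.svm import LinearSVC

def train_category_model(X_train, y_train, C=1.0):
    """Multiclass linear SVM over entity feature vectors; y_train holds category labels.
    C is the misclassification/margin trade-off parameter tuned on the validation set."""
    model = LinearSVC(C=C)
    model.fit(X_train, y_train)
    return model

def recommend_categories(model, entity_vector, top_n=10):
    """Rank all categories by SVM decision score and return the top-n as recommendations."""
    scores = model.decision_function(np.asarray(entity_vector).reshape(1, -1))[0]
    ranked = np.argsort(scores)[::-1][:top_n]
    return [model.classes_[i] for i in ranked]
</pre>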
kNN. The second algorithm is the well known approach of performing kNN on the initial set of training entities. That is, the algorithm compares the new entity with each one of the training entities and recommends the categories that characterize the most similar training entities. As similarity measures we consider cosine similarity and Euclidean distance. The two similarity functions are initially applied to the feature vectors of the respective entities. However, we empirically observed that, in the specific setting, boosting the vector-calculated similarity score with the area-based and point-based similarity scores between two entities improves the precision of the algorithm. Specifically, we use the following formula in our experiments to calculate the similarity score S_cos between two entities u, v:

<math>S_{cos} = cosSim + 2(1 - areaDist) + 2(1 - pointDist)</math>

<math>areaDist = \frac{area_u - area_v}{\max(area_u, area_v)}, \quad pointDist = \frac{points_u - points_v}{\max(points_u, points_v)}</math> (1)

where cosSim is the cosine similarity on the whole feature vectors of the two entities and is interchanged in our experiments with the similarity that is based on the Euclidean distance:

<math>euSim = 1 - euDist/\max(euDist)</math> (2)
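A minimal Python sketch of the boosted similarity of Equation (1), assuming each entity comes with its feature vector, raw area and point count; absolute differences are used for the distance terms, which is one reading of the formula.

<pre>
import math

def cosine_similarity(u, v):
    """Plain cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def boosted_similarity(vec_u, vec_v, area_u, area_v, points_u, points_v):
    """Equation (1): cosine similarity boosted by area- and point-based closeness."""
    area_dist = abs(area_u - area_v) / max(area_u, area_v) if max(area_u, area_v) else 0.0
    point_dist = abs(points_u - points_v) / max(points_u, points_v)
    return cosine_similarity(vec_u, vec_v) + 2 * (1 - area_dist) + 2 * (1 - point_dist)
</pre>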
===3. EXPERIMENTAL EVALUATION===
Next, we present the evaluation of the proposed methods, w.r.t. the recommendation precision they achieve. First, we describe the dataset and the evaluation methodology. Then, we compare the two algorithms and, finally, we discuss the contribution of individual groups of training features.

===3.1 Dataset and Evaluation methodology===
We performed our evaluation on a subset of OSM data covering Athens, Greece, which we exported through the Overpass API (http://overpass-api.de/) from OSM's site. The dataset contains overall 111,491 geospatial entities, which were properly divided into training, validation and test sets, as will be demonstrated next. The total number of distinct OSM categories/classes was 1,421. The dataset is partitioned into five subsets of similar sizes. Then, combining each time different subsets to make (a) the training, (b) the validation and (c) the test set, we create five different arrangements for five-fold cross-validation. In every fold, the validation set was used to tune the parameters of the classification model and the test set was the one where the actual evaluation of the method was performed. Of course, the validation part was applied only with SVM, since kNN does not train any model to be tuned.

As our evaluation measure, we consider the precision of category recommendations, i.e. the ratio of correct recommendations to the total recommendations:

<math>P = \frac{\#\text{correct category recommendations}}{\#\text{total category recommendations}}</math> (3)

We consider three variations of the measure, depending on how strictly we define the recommendation correctness:

P1: A recommendation is considered correct if the recommended category with the highest rank from the recommendation model is indeed a category that characterizes the test geospatial entity.

P5: A recommendation is considered correct if one of the five recommended categories with the highest rank from the recommendation model is indeed a category that characterizes the test geospatial entity.

P10: Similarly, a recommendation is considered correct if one of the ten recommended categories with the highest rank from the recommendation model is indeed a category that characterizes the test geospatial entity.
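One way to read these definitions operationally: for each test entity, check whether any of its top-k recommended categories is among the entity's true categories, and average over the test set. The sketch below (illustrative names, not the authors' code) computes P1/P5/P10 under this reading.

<pre>
def precision_at_k(ranked_recommendations, true_categories, k):
    """Fraction of test entities for which at least one of the top-k recommended
    categories is among the entity's true categories (P1/P5/P10 for k = 1, 5, 10)."""
    correct = sum(
        1 for recs, truth in zip(ranked_recommendations, true_categories)
        if any(cat in truth for cat in recs[:k])
    )
    return correct / len(ranked_recommendations)

# Example: two test entities, each with a ranked list of recommended categories.
recs = [["school", "building", "bar"], ["park", "garden", "playground"]]
truth = [{"building"}, {"playground"}]
p1 = precision_at_k(recs, truth, 1)   # 0.0: no top-1 hit
p5 = precision_at_k(recs, truth, 5)   # 1.0: both entities hit within the top 5
</pre>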
===3.2 Algorithm Comparison===
We performed two rounds of experiments for the SVM method. The first one regarded optimizing the classification model by training it with different parameterizations of C (the trade-off parameter between the margin size and the training error of the algorithm). After obtaining, through the 5-fold validation, the optimal parameters on the validation set, we ran the actual evaluation of the recommendation model, using the three precision measure variations defined above. Note that the plain kNN algorithm is a purely memory based algorithm, so it requires no training. The results for the best performing configuration of each algorithm are given in Table 1.

{| class="wikitable"
|+ Table 1: Recommendation precision.
! Algorithm !! Valid. Set P1 !! Test Set P1 !! Test Set P5 !! Test Set P10
|-
| SVM || 79.54 || 79.04 || 87.90 || 92.07
|-
| kNN || - || 65.53 || 73.03 || 78.05
|}

It is obvious that SVM outperforms kNN by far. Regarding the strictest measure (P1), the recommendation model achieves precision of 79%, which is a very good result, considering that it is measured on real-world OSM data without any restrictions that might favour the method. Further, when we consider the other two measures, the precision becomes even higher, reaching 88% and 92% for P5 and P10 respectively. Given that recommending 5 or 10 categories to the user is a realistic option, P5 and P10 are suitable for evaluation in a real-world application, and thus the effectiveness of the system proves to be very high.

===3.3 Training Features Analysis===
The evaluation presented in Section 3.2 indicated that the SVM algorithm produces by far more precise recommendations in our setting. In this section, we analyze which training feature combinations achieve the highest precision results when applied with SVM. To this end, we ran several experiments, following the same five-fold, training-validation-testing process, selecting each time different groups of training features. Next we report on these results (Table 2).

{| class="wikitable"
|+ Table 2: SVM feature combinations.
! Features !! C !! Valid. Set P1 !! Test Set P1 !! Test Set P5 !! Test Set P10
|-
| Geometry Type || 5 || 66.95 || 67.86 || 69.45 || 71.03
|-
| Points and Area || 200000 || 64.44 || 64.66 || 69.50 || 74.60
|-
| Mean and Variance || 10000 || 65.087 || 65.17 || 78.69 || 85.63
|-
| Simple Geometry || 170000 || 63.05 || 61.43 || 67.19 || 71.39
|-
| Only Textual || 100000 || 6.71 || 6.01 || 7.83 || 8.261
|-
| Simple Geom. & Text. || 200000 || 79.54 || 79.04 || 87.90 || 92.07
|-
| All || 650 || 63.56 || 63.92 || 83.11 || 89.82
|}

Geometry Type. This configuration consists in using vectors that contain only the features regarding the geometry type of the OSM entities. The geometry types are: polygon, linestring, linear ring, point, circle and rectangle.

Points and Area. This configuration consists in using vectors that contain the boolean features that correspond to the number of points and the area of the entities.

Mean and Variance. This configuration consists in using only the mean and variance features.

Simple Geometry. This configuration consists in using vectors containing only the geometry information about the OSM entities, without any class, relation or textual features in the vectors. In this experiment we also excluded the mean and variance geometry features from the vector.

Only Textual. This configuration consists in using vectors with features regarding only textual information. The precision came out very low with this configuration because only a small fraction of the OSM entities contain information about their names.

Simple Geometry and Textual. This configuration consists in using the geometry features (without the features regarding mean and variance values), plus the textual information extracted from the names of the entities. This is the best achieving configuration in the whole set of experiments, on all evaluated methods.

All. This configuration consists in using all available features. As mentioned above, the class and relation features exist only in the training sets. The results are slightly worse than those of the best performing configuration above.

In brief, the above analysis indicated that, in the specific scenario, the most highly contributing training features are the ones considering simple geometric properties of the entities. The textual features, although they perform very poorly on their own (probably due to their sparseness), can boost the performance of the simple geometric features.

===4. CONCLUSIONS===
In this paper, we presented a framework for producing category recommendations for spatial entities, based on previously annotated entities. We defined a set of problem-specific training features that, combined with SVM classification, result in recommendations of high precision. Although we base our work on the OSM use case, the proposed framework is general enough to be used in any other scenario that involves spatial entities and an initial set of annotations on a few of them. Further, our method is already implemented as a plugin for the JOSM OSM editing tool. Our next steps involve defining further meaningful training features and performing a more extended evaluation of the effectiveness of the algorithms, e.g. by excluding very frequent categories from the experimental data.

===5. ACKNOWLEDGEMENTS===
This research is conducted as part of the EU project GeoKnow (http://geoknow.eu/), FP7-ICT-2011-8, 318159.

===6. REFERENCES===
[1] Jilani, M., Corcoran, P. and Bertolotto, M. Automated Highway Tag Assessment of OpenStreetMap Road Networks. In Proceedings of SIGSPATIAL'14, 2014.