=Paper= {{Paper |id=Vol-1405/paper-02 |storemode=property |title=Automatic Recommendations of Categories for Geospatial Entities |pdfUrl=https://ceur-ws.org/Vol-1405/paper-02.pdf |volume=Vol-1405 |dblpUrl=https://dblp.org/rec/conf/recsys/GiannopoulosKSA15 }} ==Automatic Recommendations of Categories for Geospatial Entities== https://ceur-ws.org/Vol-1405/paper-02.pdf
  Automatic recommendations of categories for geospatial
                       entities

               Giorgos Giannopoulos                  Nikos Karagiannakis                   Dimitrios Skoutas
              IMIS Institute, R.C. ATHENA           IMIS Institute, R.C. ATHENA       IMIS Institute, R.C. ATHENA
                 giann@imis.athena-                 nkaragiannakis@imis.              dskoutas@imis.athena-
                    innovation.gr                    athena-innovation.gr                  innovation.gr
                                                      Spiros Athanasiou
                                                    IMIS Institute, R.C. ATHENA
                                                    spathan@imis.athena-
                                                        innovation.gr

ABSTRACT                                                            OpenStreetMap (OSM)1 is an initiative for crowdsourcing
Over the last years, thanks to Open Data initiative and             map information from users. It is based on a large and ac-
the Semantic Web, there has been a vast increase on user            tive community that contributes both data and tools that
contributed data. In several cases (e.g. OpenStreetMap,             facilitate the constant enrichment and enhancement of OSM
Geonames), the respective data include geospatial informa-          maps. One of the most prominent tools of OSM is JOSM2 ,
tion, that is the coordinates and/or the precise geometries         a graphical tool that allows users to (a) download all the
of buildings, roads, areas, etc. In such cases, proper schemas      spatial features (i.e. roads, buildings, stations, areas, etc.)
are defined to allow users to annotate the entities they con-       of a geographic area from OSM, (b) visualize these features
tribute. However, browsing through a large and unstruc-             on a map overlay, (c) add, change or delete spatial features
tured list of categories in order to select the most fitting one    and (d) reload the changed dataset into the publicly acces-
might be time consuming for the end users. In this paper, we        sible OSM map dataset. Specifically, through its interface,
present an approach for recommending categories for geospa-         the user can draw the geometry of a spatial feature. Then,
tial entities, based on previously annotated entities. Specifi-     she can annotate the feature with categories, in the form of
cally, we define and implement a series of training features in     key-value pairs. These spatial entities can also be annotated
order to represent the geospatial entities and capture their        by categories (classes, tags) that assign semantics to them.
relation with the categories they are annotated with. These         Each entity may belong to multiple classes; for example, a
features involve spatial and textual properties of the enti-        building can be further characterized as school or bar.
ties. We evaluate two different approaches (SVM and kNN)
on several combinations of the defined training features and        The system allows the selection of such categories from an
we demonstrate that the best algorithm (SVM) can provide            existing list3 or the definition of new categories by the user.
recommendations with high precision, utilizing the defined          At this point lies the problem we handle, which is expected
features. The aforementioned work is deployed in OSMRec,            in such crowdsourcing approaches: the user can define an ar-
a plugin for JOSM tool for editing OpenStreetMap.                   bitrary category which (a) might already exist in a slightly
                                                                    different form or (b) might have no actual semantic meaning
Categories and Subject Descriptors                                  in the context of OSM and, thus, degrade the quality of the
[Information systems]: [Recommender systems]                        inserted information. Besides, the user might be discour-
                                                                    aged by the large amount of available categories and quit
                                                                    the task of annotating the inserted spatial entity. Trying to
Keywords                                                            avoid (a) and (b) by restricting the category selection only to
Category recommendation, spatial entities, OSM                      categories that already exist in the OSM dataset has impor-
                                                                    tant disadvantages. First, it would cancel a large part of the
                                                                    crowdsourcing character of OSM and, second, it is not guar-
                                                                    anteed that the OSM categories, in their current form, cover
                                                                    every aspect of characterizing spatial features. Instead, the
                                                                    most suitable course of action is to guide the users during
1.     INTRODUCTION                                                 their annotation process, recommending to them already ex-
                                                                    isting categories to annotate the new spatial entities.

                                                                    In this work, we propose such a process, that trains recom-
                                                                    mendation models on existing, annotated geospatial datasets
                                                                    and is able to recommend categories for newly created enti-
                                                                    1
                                                                      https://www.openstreetmap.org/
                                                                    2
Copyright held by the author(s).
                                                                      https://josm.openstreetmap.de/
                                                                    3
LocalRec’15, September 19, 2015, Vienna, Austria.                     http://wiki.openstreetmap.org/wiki/Map features
ties. The main contribution of our work lies on the problem         In our scenario, the training items are the geospatial entities
specific training features we define in order to capture the        that have already been annotated with categories. However,
relations between the spatial and textual properties of each        since the same entity might be annotated with more than
spatial entity with the categories-classes that characterize it.    one category, we consider, for each entity, as many training
Essentially, this way, the proposed framework takes into ac-        items as its categories. The items are respesented in exactly
count the similarity of the new spatial entities to already ex-     the same way, w.r.t. the values of their training features,
isting (and annotated with categories) ones in several levels:      with the exception of the different label that corresponds
spatial similarity, e.g. the number of nodes of the feature’s       to the different class. Each item is represented as a feature
geometry and textual similarity, e.g. common important              vector, with each feature corresponding to a property of the
keywords in the names of the features.                              item. Next, we present the implemented features:

We perform an extensive evaluation that assesses the effec-
tiveness of several feature subsets deployed in the frame of           • Geometry type. Six distinct geometry types are
two classification algorithms: (i) Support Vector Machines               identified: Point, LineString, Polygon, LinearRing, Cir-
(SVM) and (ii) k Nearest Neighbour (kNN). The experimen-                 cle and Rectangle. This is a boolean feature, so six
tal results show that a proper combination of classification             positions in the feature vector are assigned to it.
algorithm and training features can achieve recommenda-
tion precision of more than 90%, rendering the proposed                • Number of geometry points. Depending on the
approach suitable for deployment on real world use cases.                algorithm this might be a double precision number, or
To this end, the proposed framework is implemented as a                  a set of boolean positions in the feature vector that
JOSM plugin4 , allowing the real-time and effective annota-              are used to represent it. For the latter case, each po-
tion of newly created spatial entities into OSM.                         sition represents a different range. In total, we define
                                                                         (based on a statistical analysis of the frequencies of en-
To the best of our knowledge, this is the first approach on              tities having certain numbers of points) 13 ranges, thus
recommending annotation categories for spatial entities in               mapping this integer characteristic into 13 boolean fea-
OSM. Another existing work on a similar problem [1] focuses              tures: [1-10], (10-20], ..., (200-300], (300-500], (500-
on classifying road segments in OSM, thus it specializes only            1000], (1000-...). So, according to the number of points
on geometrical and topological features of the specific en-              of an entity’s geometry, the proper position is set to 1,
tities and reduces the space of recommendation categories                while the rest positions are set to 0.
from more than 1000 to only 21.                                        • Area of geometry. Depending on the algorithm
                                                                         this might be a double precision number, or a set of
The rest of the paper is organized as follows. Section 2 de-             boolean positions in the feature vector that are used to
scribes our proposed method, including the defined training              represent the various ranges of areas we consider. In
features and the assessed algorithms. Section 3 presents the             the latter case, we define intuitively (and considering
evaluation of training features and algorithms in terms of               that in this problem setting we are mainly interested
recommendation performance and Section 4 concludes.                      in entities that are buildings) 24 ranges with increas-
                                                                         ing length in square meters, until the area of 4000m2,
2.     RECOMMENDATION MODELS                                             where we consider the 25th range of (4000-...).
Next, we describe the model training and recommendation
process we follow. In general, our goal is, given a training           • Mean edge length. Depending on the algorithm this
set, that is a set of spatial entities that are already annotated        might be a double precision number, or a set of boolean
with categories, to be able to produce category recommen-                positions in the feature vector representing different
dations for each new, unannotated entity. Thus, we aim at                ranges of the mean length of the geometry’s edges.
(a) learning correlations between attributes of the spatial              In our case, we define 23 ranges, starting from length
entities and the respective categories and (b) “matching”,               less than 2meters and ending with length more than
through these correlations, new entities with already anno-              200meters.
tated ones, in order to produce recommendations.                       • Variance of edge lengths. Depending on the algo-
                                                                         rithm this might be a double precision number, or a
The first step to this end is to define proper, problem spe-             set of boolean positions in the feature vector represent-
cific training features, that describe each entity and cap-              ing different variations of a geometry’s edges from the
ture its latent relation with the category that annotates it.            mean value. Likewise, we define 36 ranges, based on
In Section 2.1 we describe the defined features that corre-              statistics on our training dataset.
spond to geometric and textual properties of the entities.
The next step is to feed training entities, expressed through          • Textual features. For each entity of the training set,
their features, into a classification algorithm that utilizes            we extract the textual description of its name. We con-
them to classify new entities to categories. We applied both             sider each word separately and count their frequency
model based (Support Vector Machines) and memory based                   within the dataset. Then, we sort the list of words
(k Nearest Neighbour) state of art classification methods.               in descending frequency and filter out words with fre-
Each of the algorithms is described in Section 2.2.                      quency less than 20. Finally, we apply a stopword
                                                                         list and remove words without any particular meaning.
2.1     Training features                                                What remains are special meaning identifiers, such as
                                                                         avenue, church, school, park, etc. Each of these special
4
    http://wiki.openstreetmap.org/wiki/JOSM/Plugins/OSMRec               keywords is used as a separate boolean feature.
We should note here that we select different representations      3.     EXPERIMENTAL EVALUATION
(double precision number or set of boolean features) in or-       Next, we present the evaluation of the proposed methods,
der to favour the functionality of the different models and       w.r.t. the recommendation precision they achieve. First, we
similarity functions we apply. Namely, we consider boolean        describe the dataset and the evaluation methodology. Then,
features in the case of SVM to facilitate the effectiveness of    we compare the two algorithms and, finally, we discuss the
the training model, since it is well known that such mod-         contribution of individual groups of training features.
els perform better when the input feature vectors are ver-
tically normalized within (or at least close) to the interval
[0,1]. That is, defining several area ranges corresponding to
                                                                  3.1      Dataset and Evaluation methodology
                                                                  We performed our evaluation on a subset of OSM data cov-
separate feature positions, we allow the model to correlate
                                                                  ering Athens, Greece, which we exported through the Over-
these different areas (feature positions) with different train-
                                                                  pass API5 from OSM’s site. The dataset contains overall
ing classes. On the other hand, in the case of k-NN, where
                                                                  111, 491 geospatial entities which were properly divided into
similarity measures such as the Euclidean distance or the
                                                                  training, validation and test sets, as will be demonstrated
Cosine similarity are applied, using the exact value of a fea-
                                                                  next. The total number of distinct OSM categories/classes
ture (e.g. area of geometry) is preferable in order to better
                                                                  where 1, 421. The dataset is partitioned into five subsets of
quantify the difference between two entities.
                                                                  similar sizes. Then, combining each time different subsets
                                                                  to make (a) the training, (b) the validation and (c) the test
2.2      Algorithms                                               set, we create five different arrangements for five-fold cross-
SVM. The first algorithm applies multiclass SVM classi-           validation. In every fold, the validation set was used to tune
fication considering as training items the geospatial enti-       the parameters of the classification model and the test set
ties themselves and as labels the categories that characterize    was the one where the actual evaluation of the method was
them. The method maps the training entities into a multi-         performed. Of course, the validation part was applied only
dimensional feature space and aims at finding the optimal         with SVM, since kNN does not train any model to be tuned.
hyperplanes that discriminated the entities belonging to dif-
ferent categories. The optimality of the hyperplane depends       As our evaluation measure, we consider the precision of cat-
on the selected parameter C, which adjusts the trade-off          egory recommendations, i.e. the ratio of correct recommen-
between misclassified training entities and optimal discrim-      dations to the total recommendations:
ination of correctly classified entities. The output of the                       #correct category recommendations
training process is a model (essentially a weight vector) that              P =                                              (3)
                                                                                   #total category recommendations
is able to map the feature vector of a new, unannotated en-
tity to a set of categories, providing, also, a matching score    We consider three variations of the measure, depending on
for each category, This way, one can obtain the top-n most        how strictly we define the recommendation correctness:
fitting categories for a new spatial entity.
                                                                  P 1 : In this case, a recommendation is considered correct if
kNN. The second algorithm is the well known approach of           the recommended category with the highest rank from the
performing kNN on the initial set of training entities. That      recommendation model is indeed a category that character-
is, the algorithm compares the new entity with each one           izes the test geospatial entity.
of the training entities and recommends the categories that
characterize the most similar training entities. As similarity    P 5 : In this case, a recommendation is considered correct
measures we consider cosine similarity and euclidean dis-         if one of the five recommended categories with the highest
tance. The two similarity functions are initially applied to      rank from the recommendation model is indeed a category
the feature vectors of the respective entities. However, we       that characterizes the test geospatial entity.
empirically observed that, in the specific setting, boosting
the vector-calculated similarity score with the area-based        P 10 : Similarly, a recommendation is considered correct if
and point-based similarity scores between two entities im-        one of the ten recommended categories with the highest rank
proves the precision of the algorithms. Specifically, we use      from the recommendation model is indeed a category that
the following formula in our experiments to calculate the         characterizes the test geospatial entity.
similarity score S between two entities u, v:
    Scos = cosSim + 2 ∗ (1 − areaDist) + 2 ∗ (1 − pointDist)      3.2      Algorithm Comparison
                                                                  We performed two rounds of experiments for the SVM method.
                                            areau − areav
                               areaDist =                         The first one regarded optimizing the classification model by
                                           max areau , areav      training it with different parameterizations on C (trade-off
                                        pointsu − pointsv         parameter between the margin size and training error of the
                           pointDist =
                                       max pointsu , pointsv      algorithm). After obtaining, through the 5-fold validation,
                                                           (1)    the optimal parameters on the validation set, we run the
                                                                  actual evaluation of the recommendation model, using the
where cosSim is the cosine similarity on the whole feature
                                                                  three precision measure variations defined above. Note that
vector of the two entities and is interchanged in our exper-
                                                                  the plain k-NN algorithm is a purely memory based algo-
iments with the similarity that is based on the euclidean
                                                                  rithms, so it requires no training. The results for the best
distance
                                                                  performing configuration, for each algorithm are given in
               euSim = 1 − euDist/ max euDist              (2)    Table 1.
                                                                  5
.                                                                     http://overpass-api.de/
                                  Features                  C      Valid. Set             Test Set
                                                                       P1           P1       P5     P 10
                            Geometry Type                  5         66.95        67.86    69.45   71.03
                            Points and Area             200000       64.44        64.66    69.50   74.60
                           Mean and Variance             10000       65.087       65.17    78.69   85.63
                           Simple Geometry              170000       63.05        61.43    67.19   71.39
                             Only Textual               100000        6.71         6.01     7.83   8.261
                         Simple Geom. & Text.           200000       79.54        79.04    87.90 92.07
                                  All                     650        63.56        63.92    83.11   89.82

                                             Table 2: SVM feature combinations.

      Algorithms     Valid. Set             Test Set                precision came out very low with this configuration because
                         P1          P1       P5     P 10           only a little fraction of the OSM entities contain information
         SVM           79.54        79.04    87.90 92.07            about their names.
         kNN              -         65.53    73.03 78.05
                                                                    Simple Geometry and Textual. This configuration con-
         Table 1: Recommendation precision.                         sists in using the geometry features (without the features
                                                                    regarding mean and variance values), plus the textual infor-
                                                                    mation extracted from the names of the entities. This is the
It is obvious that SVM outperforms kNN by far. Regard-              best achieving configuration in the whole set of experiments,
ing the strictest measure (P 1 ), the recommendation model          on all evaluated methods.
achieves precision of 79%, which is a very good result, con-
sidering that it is measured on real-world OSM data with-           All. This configuration consists in using all available fea-
out any restrictions that might favour the method. Further,         tures. As mentioned above the class and relation features
when we consider the other two measures, the precision be-          exist only in the train sets. The results are slightly worst
comes even higher, reaching 88% and 92% for P 5 and P 1 0           that the best performing configuration above.
respectively. Given that recommending 5 or 10 categories
to the user is a realistic option, P 5 and P 10 are suitable for    In brief, the above analysis indicated that, in the specific sce-
evaluating in a real-world application, and thus the effec-         nario, the most highly contributing training features are the
tiveness of the system proves to be very high.                      ones considering simple geometric properties of the entities.
                                                                    The textual features, although they perform very poorly on
                                                                    their own (probably due to their sparseness) they can boost
3.3    Training Features Analysis                                   the performance of the simple geometric features.
The evaluation presented in Section 3.2 indicated that the
SVM algorithm produces by far more precise recommen-
dations in our setting. In this section, we analyze which           4.     CONCLUSIONS
training feature combinations achieve the highest precision         In this paper, we presented a framework for producing cat-
results, when applied with SVM. To this end, we ran several         egory recommendations on spatial entities, based on previ-
experiments, following the same five-fold, training-validation-     ously annotated entities. We defined a set of problem spe-
testing process, selecting each time, different groups of traing    cific training features that, combined with SVM classifica-
features. Next we report on these results (Table 2).                tion, result to recommendations of high precision. Although
                                                                    we base our work on the OSM use case, the proposed frame-
Geometry Type. This configuration consists in using vec-            work is general enough to be used in any other scenario that
tors that contained only the features regarding the geometry        involves spatial entities and an initial set of annotations on
type of the OSM entities. The geometry types are: polygon,          a few of them. Further, our method is already implemented
linestring, linear ring, point, circle and rectangle                as a plugin for JOSM OSM editing tool. Our next steps in-
                                                                    volve defining further meaningful training features and per-
Points and Area. This configuration consists in using vec-          forming more extended evaluation on the effectiveness of the
tors that contain boolean features that correspond to the           algorithms, e.g. by excluding very frequent categories from
number of points and the area of the entities.                      the experimental data.

Mean and Variance. This configuration consists in using             5.     ACKNOWLEDGEMENTS.
only the mean and variance features.                                This research is conducted as part of the EU project Geo-
                                                                    Know6 FP7-ICT-2011-8, 318159.
Simple Geometry. This configuration consists in using
vectors containing only the geometry information about the          6.     REFERENCES
OSM entities, without any class, relation or textual features           [1] Jilani, M. and Corcoran, P. and Bertolotto, M.
in the vectors. In this experiment we also excluded the mean                Automated Highway Tag Assessment of
and variance geometry features from the vector.                             OpenStreetMap Road Networks. In Proceedings of
                                                                            SIGSPATIAL’14, 2014.
Only Textual. This configuration consists in using vec-
                                                                    6
tors with features regarding only textual information. The              http://geoknow.eu/