=Paper= {{Paper |id=Vol-1392/paper-04 |storemode=property |title=Automatic Extrapolation of Missing Road Network Data in OpenStreetMap |pdfUrl=https://ceur-ws.org/Vol-1392/paper-04.pdf |volume=Vol-1392 |dblpUrl=https://dblp.org/rec/conf/icml/FunkeSS15 }} ==Automatic Extrapolation of Missing Road Network Data in OpenStreetMap== https://ceur-ws.org/Vol-1392/paper-04.pdf
   Automatic Extrapolation of Missing Road Network Data in OpenStreetMap


Stefan Funke                                                                               FUNKE @ FMI . UNI - STUTTGART. DE
University of Stuttgart, 70569 Stuttgart, Germany
Robin Schirrmeister                                                           SCHIRRMR @ INFORMATIK . UNI - FREIBURG . DE
University of Freiburg, 79110 Freiburg, Germany
Sabine Storandt                                                               STORANDT @ INFORMATIK . UNI - FREIBURG . DE
University of Freiburg, 79110 Freiburg, Germany



                          Abstract                                   gation (Holone et al., 2007; Vetter, 2010), location-based-
     Road network data from OpenStreetMap (OSM)                      services (Mooney & Corcoran, 2012), disaster warning
     is the basis of various real-world applications                 (Rahman et al., 2012), fleet management (Efentakis et al.,
     such as fleet management or traffic flow estima-                2014), traffic estimation (Tao et al., 2012) and many other
     tion, and has become a standard dataset for re-                 related topics1 . Completeness of the road network data
     search on route planning and related subjects.                  is mandatory for these applications to guarantee usability
     The quality of such applications and conclusive-                in practice. Road networks extracted from OSM (mainly
     ness of research crucially relies on correctness                Japan, Germany, North and South America and Australia)
     and completeness of the underlying road network                 have also become standard benchmarks in route planning
     data. We introduce methods for automatic de-                    papers, see (Delling & Werneck, 2013; Baum et al., 2013;
     tection of gaps in the road network and extrapo-                Funke et al., 2014). For research on route planning the
     lation of missing street names by learning topo-                completeness of the data plays an important role, as the
     logical and semantic characteristics of road net-               developed algorithms are designed to take typical connec-
     works. Our experiments show that with the help                  tivity characteristics of road networks into account.
     of the learned data, the quality of the OSM road                But while the OSM data is of very high quality already,
     network data can indeed be improved.                            there is still structural information missing, as e.g. road
                                                                     or path sections (see Figure 1). Moreover street names
                                                                     are far from being complete. The correct street name is
1. Introduction                                                      necessary for specifying start and destination in a route
                                                                     planning query, for location-based services (’all shops on
OpenStreetMap (OSM) is a huge collection of crowd-
                                                                     Norris Street’) and for answering complex route planning
sourced spatial information. The goal of the OSM project
                                                                     queries like ’from A to B avoiding Park Street’ accurately.
is to map the whole world with all its road networks, build-
ings, regions and other kinds of natural and man-made en-            In this paper we design a classifier based on learning topo-
tities. The OSM data set size increases significantly ev-            logical and semantic characteristics of road networks which
ery year as more and more parts of the world are covered,            can then be used to identify pairs of candidate locations
and information becomes more detailed. For example, the              where road segments are likely to be missing inbetween.
world-wide road network in OSM contained at the begin-               We refer to missing structural data as holes in the following.
ning of 2007 less than 30 million data points whereas in             Furthermore we show how to instrument machine learning
2013 this number has grown to more than two billions.                techniques to identify road segments where the name tag
Nowadays, the quality of OSM data often even exceeds the             can be extrapolated with high confidence. Our experimen-
quality of proprietary data.                                         tal results prove the ability of our methods to enhance the
                                                                     quality of OSM road network data considerably.
OSM is the basis for numerous applications and research
projects, concerned e.g. with pedestrian and vehicle navi-               1
                                                                        http://wiki.openstreetmap.org/wiki/List_
                                                                     of_OSM-based_services
Proceedings of the 2 nd International Workshop on Mining Urban
Data, Lille, France, 2015. Copyright c 2015 for this paper by its
authors. Copying permitted for private and academic purposes.




                                                                    kd
                                     Automatic Extrapolation of Missing Road Network Data




Figure 1. OSM based map (left) and GoogleMaps on a small cut-out of Poland. In the OSM map, the roundabout in the lower left corner
is much more detailed and more building footprints and house numbers are available. For the coloured points in the left image, though,
there are road segments missing as observable on the right.


2. Related Work                                                      tomatically (not necessarily using machine learning tech-
                                                                     niques) include the deduction of turn restrictions from GPS
Studies on completeness and correctness of the OSM data              tracks (Efentakis et al., 2014), the detection of vandalism
are numerous, see e.g. (Girres & Touya, 2010) or (Hak-               using a rule-based approach (Neis et al., 2012), or the iden-
lay, 2010). The problem is that there either needs to be a           tification of basic spatial units (so called parcels) for fine-
ground truth one can compare to in an automated way (as              scale urban modeling (Long & Liu, 2013).
investigated in (Fan et al., 2014) for building footprints in
Munich), or data has to be manually compared to propri-              To the best of our knowledge, tools for detecting missing
etary data. In both cases, the ground truth sample sizes are         parts of the OSM road network automatically were not in-
typically limited. Machine learning was applied to OSM               vestigated before, the quality assurance tools in the OSM
data in order to automatically assess the quality of the road        project2 mostly focus on detecting syntactic errors in the
network data (Jilani et al., 2013b). Here, characteristics           map specification. Hole detection in other kind of net-
of certain street types (as e.g. motorways) are learned,             works, as e.g. sensor networks, is an important and well
including features such as total street length, number of            established problem, though. Here also topological char-
dead-ends, number of intersection points and connectivity            acteristics of the networks were taken into account (Funke,
to other street types. In the quality analysis, feature vectors      2005). But as such networks differ significantly from road
of streets are compared to the learned feature vector for the        networks in many aspects results are hardly transferable.
respective street type. If they do not resemble each other it
is assumed that the data quality of the considered street is         3. Basics
poor. The authors state that these learned features are also
useful to classify streets with unknown type.                        OSM data comes in form of nodes, ways and relations.
                                                                     Nodes are single locations with latitude, longitude and ad-
Preliminary results on automated quality improvement of              ditional tags (like the name of the location). Ways are or-
OSM data were also reported in (Jilani et al., 2013a). Here,         dered sets of nodes, describing e.g. a road or a building
Artificial Neural Networks (ANN) are applied to distin-              footprint. Relations are compositions of multiple nodes or
guish residential and pedestrian streets; features include           ways, e.g. to aggregate all buildings and roads within an
node count within a bounding box and betweenness cen-                industrial area. Ways and relations are typically also aug-
trality. In (Jilani et al., 2014), the automated street type         mented with tags that provide various information (e.g. the
classification for OSM was considered in more detail. In             street or region name).
addition to the above mentioned features, the shape of the
street is considered. About 20 different OSM street types            To be able to identify meaningful features for hole classifi-
were used in the experimental evaluation. The classifica-            cation and street name extrapolation later on, we extracted
tion accuracy varies widely but for some types even an ac-           all nodes and ways that describe roads in OSM and mod-
curacy of 100% is achieved.                                             2
                                                                        http://wiki.openstreetmap.org/wiki/
Further research on enhancing or correcting OSM data au-             Quality_assurance




                                                                   k3
                                   Automatic Extrapolation of Missing Road Network Data

                                G(V, E) where V is the
eled them into a directed !graph                                        street type is not available for one or more of these
set of vertices and E ⊆ V2 is the set of edges. Addi-                   edge, we set the feature value to 0.
tionally, we define a weight function w : E → R+ which
                                                                      • Node degree. In typical (OSM) road networks, the
provides the Euclidean length (computed on the sphere) of
                                                                        average node out-degree and in-degree (i.e. number
each edge. For given vertices s, t ∈ V we also define the
                                                                        of outgoing/incoming adjacent edges) is about 2, the
shortest path π(s, t) as the path
                               P from s to t minimizing the             maximum rarely exceeds 9. Nodes with a high de-
summed weight of the edges e∈π w(e).
                                                                        gree are typically important intersections and there-
Moreover we break down name tags associated with ways                   fore have a better chance to be in a well mapped area
by assigning the respective name n(e) to every edge e                   in OSM. On the other hand, nodes with a low degree
making up the way. Edges without names are labeled                      and especially dead-ends might be indicators for poor
n(e) = null. Similarly we associate with every edge e                   data coverage.
a type t(e) as inherited from the respective way it is part
of. Here, t(e) ∈ {0, 15} and reflects the hierarchy of the
                                                                  While street type difference and node degree can be com-
network (small numbers indicate important streets as mo-
                                                                  puted quite easily, coming up with a reasonable measure
torways, while high numbers refer to living streets).
                                                                  for connectivity is more difficult. In the following, we de-
                                                                  sign a measure based on the notion of local stretch that fits
4. Hole Detection in Road Networks                                our purpose of hole detection well.
The basic question is how to identify pairs of locations
                                                                  4.2. Measuring Connectivity
in the network where road segments are very likely to be
missing inbetween. To be able to deal with the enormous           Intuitively, two locations in a road network that are in close
amount of OSM data, we aim for methods which work                 proximity of each other should also have a short path within
without the need for manually checking large portions of          the street network. If the shortest path in the street network
the data set. Therefore, we will apply machine learning to        is much longer than the straight line distance, it might well
design a hole classifier which can be used to automatically       be that parts of a street are missing.
check location pair candidates.
                                                                  We will first formalize this condition and provide empirical
In the following we discuss several features that are rel-        evidence for the assumption that the shortest path distance
evant for hole detection. Obviously, connectivity charac-         and the straight line distance are highly correlated in road
teristics of the network play an important role. Therefore,       networks. Subsequently, we describe common exceptions
we will describe thoroughly how to measure connectivity           from this observation and introduce methods to deal with
between two nodes in a road network reasonably. Finally,          those.
we sketch the complete pipeline for hole detection, includ-
ing the generation of suitable training data for the machine      4.2.1. L OCAL S TRETCH C OMPUTATION
learning approach as well as a redundancy filter.
                                                                  Our goal is to automatically identify pairs of vertices s, t ∈
                                                                  V for which we assume that there exists a shorter path in
4.1. Road Network Characteristics
                                                                  reality than the one derived from the OSM network. We al-
To identify holes, we make use of several characteristics         ready outlined that the ratio of the shortest path distance
of road networks. To be specific, we are interested in the        between s and t and the straight line distance might be
following features:                                               a good indicator. This ratio is called local stretch. In
                                                                  the following we refer to the length of a shortest path by
  • Connectivity. The most important feature is how well          l(s, t) = |π(s, t)|, and to the straight line or Euclidean dis-
    two locations are connected via the street network.           tance by d(s, t). Then the local stretch can be formally de-
                                                                                         l(s,t)
    Measuring connectivity is non-trivial and therefore           fined as LS(s, t) = d(s,t)    . As d(s, t) is a lower bound for
    discussed in detail in the next section.                      the shortest path length, the local stretch is always greater
                                                                  or equal to 1. The closer it is to 1 the better the connectivity
  • Street type difference. Missing links between loca-
                                                                  between s and t in the road network. To compute π(s, t) a
    tions exhibit most likely the same street type as the
                                                                  Dijkstra run from s to t is the method of choice (or some
    mapped streets adjacent to those locations. So a
                                                                  accelerated variant).
    hole/missing link between an interstate and a dirt road
    is very unlikely. Therefore we compute the minimal            In Figure 2, the correlation of straight line and shortest path
    street type difference for two locations v, w by iter-        distance is visualized via the local stretch value. We ob-
    ating over all their adjacent edges E(v), E(w) and            serve that for large shortest path distances the local stretch
    calculating mine1 ∈E(v),e2 ∈E(w) |t(e1 ) − t(e2 )|. If the    is remarkable small, in fact it converges to about 1.25 (so




                                                                 kN
                                                Automatic Extrapolation of Missing Road Network Data

                3.75                                                        lakes or interstates.
                 3.5
                                                                            One way to overcome this problem would be to add a post-
                3.25
                                                                            processing phase in which for each identified pair of nodes
                  3
                                                                            with high LS it is automatically checked whether there is
                2.75
                                                                            some obstacle between them. But this approach imposes
local stretch




                 2.5
                                                                            several problems:
                2.25
                  2
                1.75                                                            • How to decide if an obstacle blocks the hole enough.
                 1.5                                                              Consider e.g. the two green points in Figure 1 (left):
                1.25                                                              There is a building on the straight line between them.
                  1                                                               Nevertheless, these two points indicate a real hole.
                       0   50        100          150          200   250
                                shortest path distance in km                    • Increased Runtime. If the number of pairs is large
Figure 2. Local stretch in dependency of the shortest path length                 (e.g. along every river, we expect a multitude of can-
for 8,000 random point-to-point queries in Southern Germany.                      didates), then to check for every single one if there is
                                                                                  a blockage inbetween is very time-consuming even if
                                                                                  a suitable spatial data structure for managing the ob-
                                                                                  stacles is used.
                                                                                • Distorted Learning. In the end, we want to use con-
                                                                                  nectivity as a feature in our machine learning ap-
                                                                                  proach for hole detection. If we consider these ob-
                                                                                  stacle induced high LS values in the learning process,
                                                                                  it might affect the ability to identify real holes later on.


                                                                            To overcome these problems, we introduce an approach
                                                                            that avoids reporting such obstacle induced high LS values
                                                                            in the first place. The basic idea is to incorporate obstacles
Figure 3. The straight line distance between the orange and the
                                                                            already in the local stretch computation phase. Compar-
green marker is about 500 meters, but the shortest path distance is
almost 8 kilometers, as the next bridge over the river is not close-
                                                                            ing the shortest path distance to the straight line distance is
by. This results in a local stretch value of 16.                            somewhat unfair if the straight line is blocked with obsta-
                                                                            cles. Therefore we should rather compare the shortest path
the shortest path is only 25% longer than the straight line                 between two locations in the street network to the shortest
distance). For smaller shortest path distances (< 50 km),                   path in the plane with movement-blocking obstacles, see
the local stretch values vary more and exceed 2 for some                    Figure 4 (left) for an illustration. The exact computation of
of the queries. Such point pairs with a higher local stretch                the shortest path with obstacles is rather complicated and
than the average are good candidates for hole indication as                 expensive (see e.g. (Mitchell, 1996)), therefore we sug-
they imply poor connectivity of the road network. Also it                   gest an easy way to get the approximative distance: We
makes only sense to search for holes on a very local level                  construct a two-dimensional grid graph covering the whole
anyhow. Holes between two far away locations are likely to                  area with a cell width of e.g. 10 meter. For every obstacle,
be caused by (several) local holes, i.e. many missing road                  we determine all grid points that are blocked by this obsta-
segments. Furthermore holes between two locations that                      cle and remove them and all adjacent edges from the grid
are many kilometers apart are at least equally likely to re-                graph. Then a conventional Dijkstra computation in the re-
sult from poor infrastructure in the area than from missing                 sulting grid graph provides a feasible path, see again Figure
road network data.                                                          4 (right). We refer to the length of this path as g(s, t) in the
                                                                            following.
4.2.2. I NCORPORATING O BSTACLES
                                                                            On this basis, we redefine local stretch as the ratio of l(s, t)
Unfortunately, local stretch alone is not a sufficient mea-                 and g(s, t) – abbreviated by LS ′ (s, t). Note that due to
sure for connectivity in road networks. Consider e.g. a                     the approximative nature of our shortest path length in the
river which can only be crossed via few bridges, then the                   plane and the fact that bridges etc. are not incorporated in
local stretch for two points on opposite sides of the river is              this calculation, LS ′ might be smaller than 1 (while LS
high as well (see Figure 3 for an illustration). Other kinds                always is ≥ 1). Still, the smaller LS ′ , the better the con-
of natural or artificial obstacles have the same effect, as e.g.            nectivity between two locations in the road network.




                                                                           jy
                                       Automatic Extrapolation of Missing Road Network Data




                                                                         Figure 5. The left image shows a small cutout of a road network.
                                                                         In the middle image, hole candidates are indicated by red lines.
Figure 4. Left image: The shortest path (green) between the two          The right image shows the single remaining hole after applying
black locations is much longer than the straight line distance (red).    the extremeness check.
But it is not much longer than the shortest path in the plane with
the lake considered as obstacle (orange). Right image: Approx-           check whether there might be more complex relationships
imative shortest path (purple) in the plane with obstacles using a       (e.g. considering node degrees). Therefore, we also used
grid approach.                                                           Random Forest as it might be able to exploit these more
                                                                         complex relationships.
4.3. Learning a Hole Classifier
                                                                         Both of these classifiers often work well with default pa-
With our newly designed connectivity measure LS ′ , we are               rameters3 and their learned models are fairly easily inter-
now able to compute all described road network features                  pretable. This makes them more suitable for our task than
for hole detection. As already outlined above, we are go-                e.g. Artificial Neural Networks.
ing to search for holes only between locations with a small
straight line distance as otherwise we cannot hope for good              4.3.3. E XTREMENESS C HECK
accuracy – furthermore, considering every pair of locations
                                                                         Feeding all reasonable node pairs into the learned classifier
in a large road network is computationally infeasible.
                                                                         provides us with the set of potential holes in the network.
                                                                         Unfortunately, it is very likely that a single missing road
4.3.1. G ENERATING T RAINING DATA
                                                                         segment leads to a multitude of reported candidate node
Manual creation of a ground truth data set large enough for              pairs. If between two vertices s, t ∈ V a segment is miss-
training the classifier is very time-consuming. Moreover,                ing, the classifier might not only declare s, t a hole but also
one needs to rely on the correctness/completeness of other               s′ , t′ with s′ in close proximity of s and t′ in close prox-
data (e.g. GoogleMaps) for this purpose. Hence to con-                   imity of t. In the example in Figure 5 the problem is il-
struct a large ground truth set of classified node pairs, we             lustrated. This unnecessarily decreases the accuracy of our
used the following method: Nodes in the network that are                 method and leads to more candidate locations that have to
directly connected with an edge (LS = 1) are no holes                    be manually checked in the end.
for sure. Also node pairs with a small local stretch do not
                                                                         To avoid this overhead, we introduce a filter in form of an
indicate a hole with high confidence (we used 2 as a thresh-
                                                                         extremeness check: For every pair s, t classified as hole,
old in the experiments). We repeatedly selected a node in
                                                                         we inspect LS ′ for all vertex pairs s′ , t′ with (s, s′ ) ∈ E
the network randomly and then searched for other nodes
                                                                         and (t′ , t) ∈ E, i.e. all neighbors of s and t in G. If for
in close proximity with small LS value. Among those we
                                                                         one of those pairs LS ′ (s′ , t′ ) is larger than LS ′ (s, t), we
randomly picked one to form a respective pair. For each
                                                                         prune s, t from the candidate list. The image in Figure 5
such pair we computed the feature vector and added it to
                                                                         on the right shows the result of applying the extremeness
the training data set. To generate training data for actual
                                                                         procedure for the considered example.
holes, we used a similar approach but removed all edges
on the shortest path between the two selected nodes before               The remaining candidate location pairs are then reported as
computing the feature vector. In this way, we created arti-              the result of the automatic hole detection procedure.
ficial holes. For the final evaluation of the accuracy of our
method, real holes will be used.                                         5. Extrapolation of Missing Street Names
4.3.2. C LASSIFIER C HOICE                                               Another important part of the road network data in OSM
                                                                         are the street name tags. If a user issues a query to a route
Using the described feature vectors, the goal is to learn a              planning service, start and destination are often specified
good classifier which distinguishes between holes and non-               by their respective street names. This only works well if
holes. We expect the relationships between our features                  street names are complete. In OSM, though, unlabeled
and the existence of a hole to be rather simple; for exam-               or only partially labeled streets are quite frequent. Of-
ple, we expect the higher the local stretch the more likely              ten, there are multiple ways in OSM with the same street
there is a hole between two nodes. Due to these expected                 name tag but these ways are not connected (as the ways
feature-target correlations, one suitable method for learn-
                                                                             3
ing is Logistic Regression. Nevertheless, we also want to                        using e.g. scikit-learn (Pedregosa et al., 2011)




                                                                        jR
                                       Automatic Extrapolation of Missing Road Network Data

                                                                     ple). We initially set this feature to 0 for all edges. Then
                                                                     we extract all name components in the network and iden-
                                                                     tify street names which exhibit multiple name components
                                                                     in close proximity of each other (as the same street name
                                                                     might also occur in many villages/cities, as e.g. ’Main
                                                                     Street’). For every such street name we run Dijkstra com-
                                                                     putations between all pairs of nodes in different compo-
                                                                     nents. For all untagged edges on one of the resulting short-
                                                                     est paths we set the feature value to 1.
                                                                     As a second feature we consider the number of close-by
                                                                     name components. So we run a Dijkstra computation from
                                                                     each of the two endpoints of an untagged edge until all
                                                                     nodes in the Dijkstra search tree are either dead-ends or
 (a)                                                                 are only adjacent to unrelaxed edges with n(e) 6= null. For
                                                                     all nodes in the Dijkstra search trees we compute the set
                                                                     of name tags of adjacent edges. Figure 6(c) shows a small
                   (b)           (c)                                 example where the feature vector entry equals 1. Being
                                                                     connected to a single name component might be a strong
Figure 6. (a) Small map section based on OSM data. For every
street name a random color was chosen and all segments with the
                                                                     indicator for the segment to belong to this component.
same name share the same colour. Thin gray road segments are         But as connectivity to a single name component could also
not tagged in OSM. In the second row, close-ups of (a) are shown:    mean that only one street in the area is tagged, we also con-
(b) illustrates a set of disconnected road segments with the same    sider the shortest path distance to the closest name com-
name (orange), and (c) shows small untagged side roads which
                                                                     ponent (retrievable from the two Dijkstra runs described
are likely to have the same name as the red street.
                                                                     above) and the number of intersections on the shortest path
                                                                     from the edge to the nearest name component. The higher
might be contributed by different volunteers, but none of            those two values the less likely it is that the edge belongs to
them mapped the complete course). If a user searches for             that name component. Finally, we again consider the street
a specific street (e.g. to see which shops are close-by),            type. Typically, a name component consists only of edges
he expects a single entity to be returned and not multiple           of the same street type. Hence the feature value is com-
ways. Also if in a route planning query a user prefers cer-          puted as the absolute street type difference of the edge and
tain streets or wants to avoid them, their names have to be          the most frequent street type in the closest name compo-
fully contained in the data to account for that.                     nent.

In the following we try to extrapolate missing street name           5.2. Training Data and Machine Learning
tags from given data. We want to connect multiple ways
with the same street name and extend partially tagged roads          Again, we generated a large training data set automati-
to completely tagged roads where possible. We describe se-           cally. For that purpose we first extracted completely tagged
mantic and topological characteristics of the road network           streets, i.e. we searched for name components with all
that used as features in machine learning help to decide             nodes in that component only being adjacent to tagged
whether an untagged road segment can be labeled with high            streets, so no surrounding street name data is missing. Then
confidence.                                                          we randomly deleted less than half of the name tags from
                                                                     edges on this street and also from edges inside a certain
5.1. Feature Extraction                                              radius around the street. Afterwards, we computed the fea-
                                                                     ture vector for each now untagged edge on the selected
We primarily rely on the assumption that all the road seg-           street and added the result to the training data set. Fur-
ments that belong to a street with one name are connected.           thermore we selected completely tagged streets in the same
According to our study in completely tagged areas this as-           way, but removed all of its tags and some tags on edges in
sumption is true for almost 99% of all streets. The visual-          the neighborhood. These are examples where extrapolation
ization in Figure 6 (a) also shows typical connectivity char-        is not possible. Again, we computed the feature vectors and
acteristics of road segments with the same name. We refer            added them to the training data.
to a connected set of edges with the same name as name
component. A first feature we consider is whether an edge            Like for hole classification, we deem Logistic Regression
is on a shortest path between two disconnected name com-             and Random Forest as suitable learning methods to infer
ponents with the same name (see Figure 6(b) for an exam-             which street segments can be extrapolated.




                                                                    jk
                                    Automatic Extrapolation of Missing Road Network Data




Figure 7. Accuracy of learned classifier on generated data using
Logistic Regression or Random Forest.




                                                                    Figure 9. Left: Location pair falsely identified as hole by our
                                                                    classifier as observable when marked as start and endpoint in
                                                                    GoogleMaps. Right: Correctly identified hole indicated in the
                                                                    lower image by the two red markers on the OSM based map. The
                                                                    upper image shows the same cut-out on GoogleMaps with the two
                                                                    points being directly connected.

                                                                    0.97 which underpins its importance for classification. The
Figure 8. Feature importance when using Random Forest for clas-
                                                                    other features achieved AuC scores below 0.6. Neverthe-
sification.
                                                                    less the combination of all features led to a 1-2% higher
                                                                    accuracy than considering only local stretch. The hierar-
6. Experimental Evaluation                                          chy difference resulting from the street types adjacent to
                                                                    s, t does not really contribute to the classification process.
We implemented the described feature extraction methods             One reason might be the way we generated the training
in C++. For machine learning, we used the scikit-learn              data. Non-holes between e.g. parallel running motorways
package for Python (Pedregosa et al., 2011). Experiments            and federal streets which exhibit rather high local stretch
were conducted on a single core of an Intel i5-4300U CPU            but also high hierarchy difference are not likely to be in-
with 1.90GHz and 12GB RAM. We used the OSM road                     cluded in our data set. Manually selecting such examples
network data of Germany (22.3 million nodes) and Poland             and adding them to the training data might increase the im-
(6.3 million nodes) for learning and evaluation.                    portance of the hierarchy feature. Another problem is that
                                                                    there are road segments without a type which distorts the
6.1. Hole Classification                                            learning process.
We created a data set containing 10,000 feature vectors of          Finally, we used our complete pipeline for real hole de-
holes and 10,000 feature vectors of non-holes with the pro-         tection on Germany and Poland. For evaluation, we firstly
cedure described in Section 4.3.1 on the Germany data set.          selected 2000 nodes randomly in Germany and Poland. For
                                                                    each such node s, we extracted all nodes t within a radius of
For evaluating our machine learning pipeline on this data,
                                                                    500m (straight line distance) to form candidate pairs (s, t).
we used 10-fold stratified cross-validation applying Logis-
                                                                    For each candidate pair we computed the respective fea-
tic Regression and Random Forest to our training data.
                                                                    ture vector. We extracted rivers and lakes from OSM and
The outcomes are summarized in Figure 7. We observe
                                                                    treated them as obstacles for the LS ′ computation. Due to
that both Logistic Regression and Random Forest work re-
                                                                    our efficient implementation of the local stretch computa-
markably well; both predict correctly in over 99% of the
                                                                    tion, it took less than 5 minutes to process all nodes. Then
cases. While Logistic Regression achieves a better overall
                                                                    we applied our classifier (Random Forest) to decide which
accuracy, Random Forest misclassifies slightly less holes
                                                                    of the candidate pairs are likely to be holes. Afterwards,
as non-holes.
                                                                    we used the extremeness check to filter superfluous candi-
Having a closer look at the importance of the considered            dates by classifying also node pairs close to the identified
features (see Figure 8) for Random Forest, we see that lo-          holes and selecting those with the highest LS ′ value (there-
cal stretch is most important followed by the out-degree            fore final node pairs not necessarily include one of the ran-
of s and the in-degree of t. We also evaluated the AuC              domly selected nodes in the beginning). Table 1 shows an
score of the features. Local stretch achieved a score of            overview of the number of pairs resulting from each step.




                                                                   jj
                                      Automatic Extrapolation of Missing Road Network Data

                                Germany      Poland                   in both approaches the feature expressing whether the seg-
             d(s, t) < 500m       64,194     44,332                   ment connects two name components with the same name
             classified           18,970     11,432
             extreme                 216        128                   had most influence. Despite a high AuC score, the number
             real holes                7         19                   of close-by name components made only little difference
                                                                      in the learned classifiers. We assume the feature indicating
Table 1. Number of hole candidates after each step of our detec-      the shortest path distance to the closest name component
tion pipeline and number of correctly recognized holes in the end.    shadows the number of name components, as a very small
                                                                      shortest path distance and a small number of close-by com-
                                                                      ponents are highly correlated.
                                                                      For real-world validation of our learned classifier, we se-
                                                                      lected 2000 unnamed road segments in each Germany and
                                                                      Poland and computed the feature vectors. Our classifier de-
                                                                      clared 235 road segments in Germany extrapolatable and
                                                                      164 in Poland. A visual analysis showed that most of
                                                                      the segments not declared extrapolatable lied in larger un-
                                                                      tagged areas. For the road segments where the classifier
Figure 10. OSM based map (left) and GoogleMaps (right) on a           indicated extrapolation might be possible, we selected the
cutout of Ulm, Germany. The red segments on the left indicate         closest name component for name suggestion or, if the seg-
missing street names in the OSM. In GoogleMaps the correct            ment connects two components with the same name, this
name for all those segments is ’Im Lehrer Feld’. As the surround-     name is the obvious choice. We relied on a comparison
ing streets are tagged with this name in OSM and our approach         to GoogleMaps and BingMaps data for evaluation. Un-
classified the road segments as extrapolatable, the OSM data cov-     fortunately, in surprisingly many cases (about 10%) the
erage can be increased here.
                                                                      street segments in question were unlabeled or unclear or not
                                                                      even present in GoogleMaps or BingMaps. We excluded
For the remaining extreme candidates we checked one-by-               these cases from the evaluation. For the remaining cases,
one if they are correctly classified by comparing to satellite        we achieved an accuracy of 96% in Germany and 91% in
images and map data from GoogleMaps. Figure 9 shows                   Poland (see Figure 10 for a positive example).
examples for a falsely identified hole and a real hole. The
falsely identified hole shows that it is nearly impossible to
design a perfect classifier as the very same configuration of         7. Conclusions and Future Work
streets might very well indicate a real hole at some other lo-        We showed that machine learning is a useful tool to de-
cation. Main sources of misclassifying non-holes as holes             tect missing and possibly extrapolatable road network data
were clusters of one-way streets in the middle of cities, vil-        in OSM. Making classified holes and nameless street seg-
lages close to federal streets that are not directly connected        ments with a good name suggestion available to the OSM
and tree-like network structures in rural areas with many             community might raise attention to such locations and fi-
dead-ends. In future work, training data could be created             nally lead to a faster improvement of the OSM data quality.
in a way that the classifier can better deal with such scenar-
ios.                                                                  There are various directions for future research. Our cur-
                                                                      rent methods do not work in regions where road data is
Nevertheless, the number of pairs that have to be manu-               completely missing. But e.g. mapped building footprints
ally checked is significantly smaller than the number of ini-         could be a strong indicator for the existence of infrastruc-
tially created candidate pairs. The percentage of real holes          ture in an area. Considering buildings could also improve
among the extreme pairs is 3% for Germany and about 15%               our classifiers. Including (large) buildings as obstacles for
for Poland. The difference might result from the much bet-            hole detection could lead to more realistic local stretch val-
ter overall OSM data quality in Germany or could also be              ues in cities. Moreover, considering house numbers could
seen as an indicator for already high data coverage.                  significantly help to decide if a street segment should be
                                                                      tagged with a certain name. If the house numbers of tagged
6.2. Street Name Extrapolation                                        and untagged segments complement each other there is a
We extracted 10,000 feature vectors for edge segments                 good chance that they share the same name.
where we assume extrapolation is possible and 10,000 for              Finally, many other aspects of the OSM data might be suit-
edge segments where we are sure extrapolation is impossi-             able for extrapolation or classification using machine learn-
ble. Then we used Logistic Regression and Random Forest               ing, e.g. distinguishing living and industrial areas or ex-
to learn feature importance/weights. In our cross-validation          trapolating missing house numbers.
both approaches achieved an accuracy of 99.95%. Also




                                                                     j9
                                  Automatic Extrapolation of Missing Road Network Data

References                                                         Proceedings of the Sixth ACM SIGSPATIAL Interna-
                                                                   tional Workshop on Computational Transportation Sci-
Baum, Moritz, Dibbelt, Julian, Pajor, Thomas, and Wagner,
                                                                   ence, IWCTS ’13, pp. 19:19–19:24. ACM, 2013b.
  Dorothea. Energy-optimal routes for electric vehicles.
  In Proceedings of the 21st ACM SIGSPATIAL Interna-           Jilani, Musfira, Corcoran, Padraig, and Bertolotto,
  tional Conference on Advances in Geographic Informa-            Michela. Automated highway tag assessment of open-
  tion Systems, pp. 54–63. ACM, 2013.                             streetmap road networks. 2014.
Delling, Daniel and Werneck, Renato F. Faster customiza-       Long, Ying and Liu, Xingjian. Automated identifica-
  tion of road networks. In Experimental Algorithms, 12th        tion and characterization of parcels (AICP) with open-
  International Symposium, SEA 2013, Rome, Italy, June           streetmap and points of interest. CoRR, abs/1311.6165,
  5-7, 2013. Proceedings, pp. 30–42, 2013.                       2013.

Efentakis, Alexandros, Brakatsoulas, Sotiris, Grivas,          Mitchell, Joseph SB. Shortest paths among obstacles in the
  Nikos, and Pfoser, Dieter. Crowdsourcing turning re-          plane. International Journal of Computational Geome-
  strictions for openstreetmap. In EDBT/ICDT Workshops,         try & Applications, 6(03):309–332, 1996.
  pp. 355–362, 2014.
                                                               Mooney, Peter and Corcoran, Padraig. Using OSM for
Fan, Hongchao, Zipf, Alexander, Fu, Qing, and Neis, Pas-        LBS–an analysis of changes to attributes of spatial ob-
  cal. Quality assessment for building footprints data on       jects. Springer, 2012.
  openstreetmap. International Journal of Geographical
                                                               Neis, Pascal, Goetz, Marcus, and Zipf, Alexander. To-
  Information Science, 28(4):700–719, 2014.
                                                                 wards automatic vandalism detection in openstreetmap.
Funke, Stefan. Topological hole detection in wireless sen-       ISPRS International Journal of Geo-Information, 1(3):
  sor networks and its applications. In Proceedings of the       315–332, 2012.
  2005 joint workshop on Foundations of mobile comput-         Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
  ing, pp. 44–53. ACM, 2005.                                     Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
                                                                 Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
Funke, Stefan, Nusser, André, and Storandt, Sabine. On k-
                                                                 napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
  path covers and their applications. In International Con-
                                                                 Scikit-learn: Machine learning in Python. Journal of
  ference on Very Large Databases (VLDB), 2014.
                                                                 Machine Learning Research, 12:2825–2830, 2011.
Girres, Jean-François and Touya, Guillaume. Quality as-
                                                               Rahman, Kazi Mujibur, Alam, Tauhidul, and Chowd-
  sessment of the french openstreetmap dataset. Transac-
                                                                 hury, Mashrur. Location based early disaster warning
  tions in GIS, 14(4):435–459, 2010.
                                                                 and evacuation system on mobile phones using open-
Haklay, Mordechai. How good is volunteered geographical          streetmap. In Open Systems (ICOS), 2012 IEEE Con-
  information? a comparative study of openstreetmap and          ference on, pp. 1–6. IEEE, 2012.
  ordnance survey datasets. Environment and Planning B         Tao, Sha, Manolopoulos, Vasileios, Rodriguez, Saul, Rusu,
  Planning and Design, (37):682–703, 2010.                       Ana, et al. Real-time urban traffic state estimation with
                                                                 a-gps mobile phones as probes. Journal of Transporta-
Holone, Harald, Misund, Gunnar, and Holmstedt, Hakon.
                                                                 tion Technologies, 2(01):22, 2012.
  Users are doing it for themselves: Pedestrian navigation
  with user generated content. In Next Generation Mo-          Vetter, Christian. Fast and exact mobile navigation with
  bile Applications, Services and Technologies, 2007. NG-        openstreetmap data. Master’s thesis, Karlsruhe Institute
  MAST’07. The 2007 International Conference on, pp.             of Technology, 2010.
  91–99. IEEE, 2007.

Jilani, Musfira, Corcoran, Padraig, and Bertolotto,
   Michela. Automated quality improvement of road net-
   work in openstreetmap. In Agile Workshop (Action
   and Interaction in Volunteered Geographic Informa-
   tion), 2013a.

Jilani, Musfira, Corcoran, Padraig, and Bertolotto,
   Michela. Multi-granular street network representation
   towards quality assessment of openstreetmap data. In




                                                              j8