<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Extrapolation of Missing Road Network Data in OpenStreetMap</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Funke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabine Storandt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Japan</institution>
          ,
          <addr-line>Germany, North and South America and</addr-line>
          <country country="AU">Australia)</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Freiburg</institution>
          ,
          <addr-line>79110 Freiburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Stuttgart</institution>
          ,
          <addr-line>70569 Stuttgart</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Road network data from OpenStreetMap (OSM) is the basis of various real-world applications such as fleet management or traffic flow estimation, and has become a standard dataset for research on route planning and related subjects. The quality of such applications and conclusiveness of research crucially relies on correctness and completeness of the underlying road network data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>OpenStreetMap (OSM) is a huge collection of
crowdsourced spatial information. The goal of the OSM project
is to map the whole world with all its road networks,
buildings, regions and other kinds of natural and man-made
entities. The OSM data set size increases significantly
every year as more and more parts of the world are covered,
and information becomes more detailed. For example, the
world-wide road network in OSM contained at the
beginning of 2007 less than 30 million data points whereas in
2013 this number has grown to more than two billions.
Nowadays, the quality of OSM data often even exceeds the
quality of proprietary data.</p>
      <p>OSM is the basis for numerous applications and research
projects, concerned e.g. with pedestrian and vehicle
navidk</p>
      <sec id="sec-1-1">
        <title>FUNKE@FMI.UNI-STUTTGART.DE</title>
      </sec>
      <sec id="sec-1-2">
        <title>SCHIRRMR@INFORMATIK.UNI-FREIBURG.DE</title>
      </sec>
      <sec id="sec-1-3">
        <title>STORANDT@INFORMATIK.UNI-FREIBURG.DE</title>
        <p>
          gation
          <xref ref-type="bibr" rid="ref6">(Holone et al., 2007; Vetter, 2010)</xref>
          ,
location-basedservices (Mooney &amp; Corcoran, 2012), disaster warning
          <xref ref-type="bibr" rid="ref4">(Rahman et al., 2012)</xref>
          , fleet management (Efentakis et al.,
2014), traffic estimation
          <xref ref-type="bibr" rid="ref5">(Tao et al., 2012)</xref>
          and many other
related topics1.
        </p>
      </sec>
      <sec id="sec-1-4">
        <title>Completeness of the road network data</title>
        <p>is mandatory for these applications to guarantee usability
in practice. Road networks extracted from OSM (mainly
have also become standard benchmarks in route planning
papers, see (Delling &amp; Werneck, 2013; Baum et al., 2013;
Funke et al., 2014). For research on route planning the
completeness of the data plays an important role, as the
developed algorithms are designed to take typical
connectivity characteristics of road networks into account.
But while the OSM data is of very high quality already,
there is still structural information missing, as e.g. road
or path sections (see Figure 1).</p>
      </sec>
      <sec id="sec-1-5">
        <title>Moreover street names</title>
        <p>are far from being complete. The correct street name is
necessary for specifying start and destination in a route
planning query, for location-based services (’all shops on
Norris Street’) and for answering complex route planning
queries like ’from A to B avoiding Park Street’ accurately.</p>
        <p>In this paper we design a classifier based on learning
topological and semantic characteristics of road networks which
can then be used to identify pairs of candidate locations
where road segments are likely to be missing inbetween.</p>
        <p>We refer to missing structural data as holes in the following.</p>
        <p>Furthermore we show how to instrument machine learning
techniques to identify road segments where the name tag
can be extrapolated with high confidence. Our
experimental results prove the ability of our methods to enhance the
quality of OSM road network data considerably.</p>
        <p>1http://wiki.openstreetmap.org/wiki/List_
of_OSM-based_services
is much more detailed and more building footprints and house numbers are available. For the coloured points in the left image, though,
there are road segments missing as observable on the right.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Studies on completeness and correctness of the OSM data
are numerous, see e.g. (Girres &amp; Touya, 2010) or
(Haklay, 2010). The problem is that there either needs to be a
ground truth one can compare to in an automated way (as
investigated in
        <xref ref-type="bibr" rid="ref1">(Fan et al., 2014)</xref>
        for building footprints in
Munich), or data has to be manually compared to
proprietary data. In both cases, the ground truth sample sizes are
typically limited. Machine learning was applied to OSM
data in order to automatically assess the quality of the road
network data (Jilani et al., 2013b). Here, characteristics
of certain street types (as e.g.
motorways) are learned,
including features such as total street length, number of
dead-ends, number of intersection points and connectivity
to other street types. In the quality analysis, feature vectors
of streets are compared to the learned feature vector for the
respective street type. If they do not resemble each other it
is assumed that the data quality of the considered street is
poor. The authors state that these learned features are also
useful to classify streets with unknown type.
      </p>
      <p>Preliminary results on automated quality improvement of
OSM data were also reported in (Jilani et al., 2013a). Here,
Artificial Neural Networks (ANN) are applied to
distinguish residential and pedestrian streets; features include
node count within a bounding box and betweenness
centrality. In (Jilani et al., 2014), the automated street type
classification for OSM was considered in more detail. In
addition to the above mentioned features, the shape of the
street is considered. About 20 different OSM street types
were used in the experimental evaluation. The
classification accuracy varies widely but for some types even an
accuracy of 100% is achieved.</p>
      <p>Further research on enhancing or correcting OSM data
automatically (not necessarily using machine learning
techniques) include the deduction of turn restrictions from GPS
tracks (Efentakis et al., 2014), the detection of vandalism
using a rule-based approach (Neis et al., 2012), or the
identification of basic spatial units (so called parcels) for
finescale urban modeling (Long &amp; Liu, 2013).</p>
      <p>To the best of our knowledge, tools for detecting missing
parts of the OSM road network automatically were not
investigated before, the quality assurance tools in the OSM
project2 mostly focus on detecting syntactic errors in the
map specification.</p>
      <p>Hole detection in other kind of
networks, as e.g. sensor networks, is an important and well
established problem, though. Here also topological
characteristics of the networks were taken into account (Funke,
2005). But as such networks differ significantly from road
networks in many aspects results are hardly transferable.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Basics</title>
      <p>OSM data comes in form of nodes, ways and relations.
Nodes are single locations with latitude, longitude and
additional tags (like the name of the location). Ways are
ordered sets of nodes, describing e.g. a road or a building
footprint. Relations are compositions of multiple nodes or
ways, e.g. to aggregate all buildings and roads within an
industrial area. Ways and relations are typically also
augmented with tags that provide various information (e.g. the
street or region name).</p>
      <p>To be able to identify meaningful features for hole
classification and street name extrapolation later on, we extracted
all nodes and ways that describe roads in OSM and
mod2http://wiki.openstreetmap.org/wiki/
Quality_assurance
eled them into a directed graph G(V, E) where V is the
⊆ !</p>
      <p>V
2
set of vertices and E
is the set of edges.
Additionally, we define a weight function w : E
provides the Euclidean length (computed on the sphere) of
each edge. For given vertices s, t ∈ V we also define the
shortest path π(s, t) as the path from s to t minimizing the
→ R+ which
summed weight of the edges Pe∈π w(e).</p>
      <p>Moreover we break down name tags associated with ways
by assigning the respective name n(e) to every edge e
making up the way.</p>
      <p>Edges without names are labeled
n(e) = null. Similarly we associate with every edge e
a type t(e) as inherited from the respective way it is part
of. Here, t(e) ∈ {0, 15} and reflects the hierarchy of the
network (small numbers indicate important streets as
motorways, while high numbers refer to living streets).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Hole Detection in Road Networks</title>
      <p>The basic question is how to identify pairs of locations
in the network where road segments are very likely to be
missing inbetween. To be able to deal with the enormous
amount of OSM data, we aim for methods which work
without the need for manually checking large portions of
the data set. Therefore, we will apply machine learning to
design a hole classifier which can be used to automatically
check location pair candidates.</p>
      <p>In the following we discuss several features that are
relevant for hole detection. Obviously, connectivity
characteristics of the network play an important role. Therefore,
we will describe thoroughly how to measure connectivity
between two nodes in a road network reasonably. Finally,
we sketch the complete pipeline for hole detection,
including the generation of suitable training data for the machine
learning approach as well as a redundancy filter.</p>
      <sec id="sec-4-1">
        <title>4.1. Road Network Characteristics</title>
        <p>To identify holes, we make use of several characteristics
of road networks. To be specific, we are interested in the
following features:
• Connectivity. The most important feature is how well
two locations are connected via the street network.
Measuring connectivity is non-trivial and therefore
discussed in detail in the next section.
• Street type difference.</p>
        <p>Missing links between
locations exhibit most likely the same street type as the
mapped streets adjacent to those locations.</p>
        <p>So a
hole/missing link between an interstate and a dirt road
is very unlikely. Therefore we compute the minimal
street type difference for two locations v, w by
iterating over all their adjacent edges E(v), E(w) and
calculating mine1∈E(v),e2∈E(w) |t(e1) − t(e2)|. If the
Nk
street type is not available for one or more of these
edge, we set the feature value to 0.
• Node degree. In typical (OSM) road networks, the
average node out-degree and in-degree (i.e. number
of outgoing/incoming adjacent edges) is about 2, the
maximum rarely exceeds 9. Nodes with a high
degree are typically important intersections and
therefore have a better chance to be in a well mapped area
in OSM. On the other hand, nodes with a low degree
and especially dead-ends might be indicators for poor
data coverage.</p>
        <p>While street type difference and node degree can be
computed quite easily, coming up with a reasonable measure
for connectivity is more difficult. In the following, we
design a measure based on the notion of local stretch that fits
our purpose of hole detection well.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Measuring Connectivity</title>
        <p>Intuitively, two locations in a road network that are in close
proximity of each other should also have a short path within
the street network. If the shortest path in the street network
is much longer than the straight line distance, it might well
be that parts of a street are missing.</p>
        <p>We will first formalize this condition and provide empirical
evidence for the assumption that the shortest path distance
and the straight line distance are highly correlated in road
networks. Subsequently, we describe common exceptions
from this observation and introduce methods to deal with
those.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. LOCAL STRETCH COMPUTATION</title>
          <p>Our goal is to automatically identify pairs of vertices s, t ∈
V for which we assume that there exists a shorter path in
reality than the one derived from the OSM network. We
already outlined that the ratio of the shortest path distance
between s and t and the straight line distance might be
a good indicator.</p>
          <p>This ratio is called local stretch. In
the following we refer to the length of a shortest path by
l(s, t) = |π(s, t)|, and to the straight line or Euclidean
distance by d(s, t). Then the local stretch can be formally
defined as LS(s, t) = dl((ss,,tt)) . As d(s, t) is a lower bound for
the shortest path length, the local stretch is always greater
or equal to 1. The closer it is to 1 the better the connectivity
between s and t in the road network. To compute π(s, t) a
Dijkstra run from s to t is the method of choice (or some
accelerated variant).</p>
          <p>In Figure 2, the correlation of straight line and shortest path
distance is visualized via the local stretch value. We
observe that for large shortest path distances the local stretch
is remarkable small, in fact it converges to about 1.25 (so
almost 8 kilometers, as the next bridge over the river is not
closeby. This results in a local stretch value of 16.
the shortest path is only 25% longer than the straight line
distance). For smaller shortest path distances (&lt; 50 km),
the local stretch values vary more and exceed 2 for some
of the queries. Such point pairs with a higher local stretch
than the average are good candidates for hole indication as
they imply poor connectivity of the road network. Also it
makes only sense to search for holes on a very local level
anyhow. Holes between two far away locations are likely to
be caused by (several) local holes, i.e. many missing road
segments. Furthermore holes between two locations that
are many kilometers apart are at least equally likely to
result from poor infrastructure in the area than from missing
road network data.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. INCORPORATING OBSTACLES</title>
          <p>Unfortunately, local stretch alone is not a sufficient
measure for connectivity in road networks. Consider e.g. a
river which can only be crossed via few bridges, then the
local stretch for two points on opposite sides of the river is
high as well (see Figure 3 for an illustration). Other kinds
of natural or artificial obstacles have the same effect, as e.g.
lakes or interstates.</p>
          <p>One way to overcome this problem would be to add a
postprocessing phase in which for each identified pair of nodes
with high LS it is automatically checked whether there is
some obstacle between them. But this approach imposes
several problems:
• How to decide if an obstacle blocks the hole enough.</p>
          <p>Consider e.g. the two green points in Figure 1 (left):
There is a building on the straight line between them.</p>
          <p>Nevertheless, these two points indicate a real hole.
• Increased Runtime. If the number of pairs is large
(e.g. along every river, we expect a multitude of
candidates), then to check for every single one if there is
a blockage inbetween is very time-consuming even if
a suitable spatial data structure for managing the
obstacles is used.
• Distorted Learning. In the end, we want to use
connectivity as a feature in our machine learning
approach for hole detection. If we consider these
obstacle induced high LS values in the learning process,
it might affect the ability to identify real holes later on.
To overcome these problems, we introduce an approach
that avoids reporting such obstacle induced high LS values
in the first place. The basic idea is to incorporate obstacles
already in the local stretch computation phase.
Comparing the shortest path distance to the straight line distance is
somewhat unfair if the straight line is blocked with
obstacles. Therefore we should rather compare the shortest path
between two locations in the street network to the shortest
path in the plane with movement-blocking obstacles, see
Figure 4 (left) for an illustration. The exact computation of
the shortest path with obstacles is rather complicated and
expensive (see e.g. (Mitchell, 1996)), therefore we
suggest an easy way to get the approximative distance: We
construct a two-dimensional grid graph covering the whole
area with a cell width of e.g. 10 meter. For every obstacle,
we determine all grid points that are blocked by this
obstacle and remove them and all adjacent edges from the grid
graph. Then a conventional Dijkstra computation in the
resulting grid graph provides a feasible path, see again Figure
4 (right). We refer to the length of this path as g(s, t) in the
following.</p>
          <p>On this basis, we redefine local stretch as the ratio of l(s, t)
and g(s, t) – abbreviated by LS′(s, t). Note that due to
the approximative nature of our shortest path length in the
plane and the fact that bridges etc. are not incorporated in
this calculation, LS′ might be smaller than 1 (while LS
always is ≥ 1). Still, the smaller LS′, the better the
connectivity between two locations in the road network.
black locations is much longer than the straight line distance (red).
But it is not much longer than the shortest path in the plane with
the lake considered as obstacle (orange). Right image:
Approximative shortest path (purple) in the plane with obstacles using a
grid approach.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Learning a Hole Classifier</title>
        <p>With our newly designed connectivity measure LS′, we are
now able to compute all described road network features
for hole detection. As already outlined above, we are
going to search for holes only between locations with a small
straight line distance as otherwise we cannot hope for good
accuracy – furthermore, considering every pair of locations
in a large road network is computationally infeasible.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. GENERATING TRAINING DATA</title>
          <p>Manual creation of a ground truth data set large enough for
training the classifier is very time-consuming. Moreover,
one needs to rely on the correctness/completeness of other
data (e.g. GoogleMaps) for this purpose. Hence to
construct a large ground truth set of classified node pairs, we
used the following method: Nodes in the network that are
directly connected with an edge (LS
= 1) are no holes
for sure. Also node pairs with a small local stretch do not
indicate a hole with high confidence (we used 2 as a
threshold in the experiments). We repeatedly selected a node in
the network randomly and then searched for other nodes
in close proximity with small LS value. Among those we
randomly picked one to form a respective pair. For each
such pair we computed the feature vector and added it to
the training data set. To generate training data for actual
holes, we used a similar approach but removed all edges
on the shortest path between the two selected nodes before
computing the feature vector. In this way, we created
artificial holes. For the final evaluation of the accuracy of our
method, real holes will be used.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. CLASSIFIER CHOICE Using the described feature vectors, the goal is to learn a good classifier which distinguishes between holes and nonholes.</title>
          <p>We expect the relationships between our features
and the existence of a hole to be rather simple; for
example, we expect the higher the local stretch the more likely
there is a hole between two nodes. Due to these expected
feature-target correlations, one suitable method for
learning is Logistic Regression. Nevertheless, we also want to
Rj
(e.g. considering node degrees). Therefore, we also used
Random Forest as it might be able to exploit these more
complex relationships.</p>
          <p>Both of these classifiers often work well with default
parameters3 and their learned models are fairly easily
interpretable. This makes them more suitable for our task than
e.g. Artificial Neural Networks.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. EXTREMENESS CHECK</title>
          <p>Feeding all reasonable node pairs into the learned classifier
provides us with the set of potential holes in the network.
Unfortunately, it is very likely that a single missing road
segment leads to a multitude of reported candidate node
pairs. If between two vertices s, t ∈ V a segment is
missing, the classifier might not only declare s, t a hole but also
s′, t′ with s′ in close proximity of s and t′ in close
proximity of t. In the example in Figure 5 the problem is
illustrated. This unnecessarily decreases the accuracy of our
method and leads to more candidate locations that have to
be manually checked in the end.</p>
          <p>To avoid this overhead, we introduce a filter in form of an
extremeness check: For every pair s, t classified as hole,
we inspect LS′ for all vertex pairs s′, t′ with (s, s′) ∈ E
and (t′, t) ∈ E, i.e. all neighbors of s and t in G. If for
one of those pairs LS′(s′, t′) is larger than LS′(s, t), we
prune s, t from the candidate list. The image in Figure 5
on the right shows the result of applying the extremeness
procedure for the considered example.</p>
          <p>The remaining candidate location pairs are then reported as
the result of the automatic hole detection procedure.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Extrapolation of Missing Street Names</title>
      <p>Another important part of the road network data in OSM
are the street name tags. If a user issues a query to a route
planning service, start and destination are often specified
by their respective street names. This only works well if
street names are complete. In OSM, though, unlabeled
or only partially labeled streets are quite frequent.
Often, there are multiple ways in OSM with the same street
name tag but these ways are not connected (as the ways</p>
      <sec id="sec-5-1">
        <title>3using e.g. scikit-learn (Pedregosa et al., 2011)</title>
        <p>street name a random color was chosen and all segments with the
same name share the same colour. Thin gray road segments are
not tagged in OSM. In the second row, close-ups of (a) are shown:
(b) illustrates a set of disconnected road segments with the same
name (orange), and (c) shows small untagged side roads which
are likely to have the same name as the red street.
might be contributed by different volunteers, but none of
them mapped the complete course). If a user searches for
a specific street (e.g. to see which shops are close-by),
he expects a single entity to be returned and not multiple
ways. Also if in a route planning query a user prefers
certain streets or wants to avoid them, their names have to be
fully contained in the data to account for that.</p>
        <p>In the following we try to extrapolate missing street name
tags from given data. We want to connect multiple ways
with the same street name and extend partially tagged roads
to completely tagged roads where possible. We describe
semantic and topological characteristics of the road network
that used as features in machine learning help to decide
whether an untagged road segment can be labeled with high
confidence.</p>
        <sec id="sec-5-1-1">
          <title>5.1. Feature Extraction</title>
          <p>We primarily rely on the assumption that all the road
segments that belong to a street with one name are connected.
According to our study in completely tagged areas this
assumption is true for almost 99% of all streets. The
visualization in Figure 6 (a) also shows typical connectivity
characteristics of road segments with the same name. We refer
to a connected set of edges with the same name as name
component. A first feature we consider is whether an edge
is on a shortest path between two disconnected name
components with the same name (see Figure 6(b) for an
example). We initially set this feature to 0 for all edges. Then
we extract all name components in the network and
identify street names which exhibit multiple name components
in close proximity of each other (as the same street name
might also occur in many villages/cities, as e.g. ’Main
Street’). For every such street name we run Dijkstra
computations between all pairs of nodes in different
components. For all untagged edges on one of the resulting
shortest paths we set the feature value to 1.</p>
          <p>As a second feature we consider the number of close-by
name components. So we run a Dijkstra computation from
each of the two endpoints of an untagged edge until all
nodes in the Dijkstra search tree are either dead-ends or
are only adjacent to unrelaxed edges with n(e) 6= null. For
all nodes in the Dijkstra search trees we compute the set
of name tags of adjacent edges. Figure 6(c) shows a small
example where the feature vector entry equals 1. Being
connected to a single name component might be a strong
indicator for the segment to belong to this component.
But as connectivity to a single name component could also
mean that only one street in the area is tagged, we also
consider the shortest path distance to the closest name
component (retrievable from the two Dijkstra runs described
above) and the number of intersections on the shortest path
from the edge to the nearest name component. The higher
those two values the less likely it is that the edge belongs to
that name component. Finally, we again consider the street
type. Typically, a name component consists only of edges
of the same street type. Hence the feature value is
computed as the absolute street type difference of the edge and
the most frequent street type in the closest name
component.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.2. Training Data and Machine Learning</title>
          <p>Again, we generated a large training data set
automatically. For that purpose we first extracted completely tagged
streets, i.e.</p>
          <p>we searched for name components with all
nodes in that component only being adjacent to tagged
streets, so no surrounding street name data is missing. Then
we randomly deleted less than half of the name tags from
edges on this street and also from edges inside a certain
radius around the street. Afterwards, we computed the
feature vector for each now untagged edge on the selected
street and added the result to the training data set.
Furthermore we selected completely tagged streets in the same
way, but removed all of its tags and some tags on edges in
the neighborhood. These are examples where extrapolation
is not possible. Again, we computed the feature vectors and
added them to the training data.</p>
          <p>Like for hole classification, we deem Logistic Regression
and Random Forest as suitable learning methods to infer
which street segments can be extrapolated.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental Evaluation</title>
      <p>We implemented the described feature extraction methods
in C++. For machine learning, we used the scikit-learn
package for Python (Pedregosa et al., 2011). Experiments
were conducted on a single core of an Intel i5-4300U CPU
with 1.90GHz and 12GB RAM. We used the OSM road
network data of Germany (22.3 million nodes) and Poland
(6.3 million nodes) for learning and evaluation.</p>
      <sec id="sec-6-1">
        <title>6.1. Hole Classification</title>
        <p>We created a data set containing 10,000 feature vectors of
holes and 10,000 feature vectors of non-holes with the
procedure described in Section 4.3.1 on the Germany data set.
For evaluating our machine learning pipeline on this data,
we used 10-fold stratified cross-validation applying
Logistic Regression and Random Forest to our training data.
The outcomes are summarized in Figure 7. We observe
that both Logistic Regression and Random Forest work
remarkably well; both predict correctly in over 99% of the
cases. While Logistic Regression achieves a better overall
accuracy, Random Forest misclassifies slightly less holes
as non-holes.</p>
        <p>Having a closer look at the importance of the considered
features (see Figure 8) for Random Forest, we see that
local stretch is most important followed by the out-degree
of s and the in-degree of t. We also evaluated the AuC
score of the features. Local stretch achieved a score of
0.97 which underpins its importance for classification. The
other features achieved AuC scores below 0.6.
Nevertheless the combination of all features led to a 1-2% higher
accuracy than considering only local stretch. The
hierarchy difference resulting from the street types adjacent to
s, t does not really contribute to the classification process.
One reason might be the way we generated the training
data. Non-holes between e.g. parallel running motorways
and federal streets which exhibit rather high local stretch
but also high hierarchy difference are not likely to be
included in our data set. Manually selecting such examples
and adding them to the training data might increase the
importance of the hierarchy feature. Another problem is that
there are road segments without a type which distorts the
learning process.</p>
        <p>Finally, we used our complete pipeline for real hole
detection on Germany and Poland. For evaluation, we firstly
selected 2000 nodes randomly in Germany and Poland. For
each such node s, we extracted all nodes t within a radius of
500m (straight line distance) to form candidate pairs (s, t).
For each candidate pair we computed the respective
feature vector. We extracted rivers and lakes from OSM and
treated them as obstacles for the LS′ computation. Due to
our efficient implementation of the local stretch
computation, it took less than 5 minutes to process all nodes. Then
we applied our classifier (Random Forest) to decide which
of the candidate pairs are likely to be holes. Afterwards,
we used the extremeness check to filter superfluous
candidates by classifying also node pairs close to the identified
holes and selecting those with the highest LS′ value
(therefore final node pairs not necessarily include one of the
randomly selected nodes in the beginning). Table 1 shows an
overview of the number of pairs resulting from each step.
d(s, t) &lt; 500m
classified
extreme
real holes</p>
        <p>Germany
64,194
18,970
216
7</p>
        <p>Poland
44,332
11,432
128
19
tion pipeline and number of correctly recognized holes in the end.</p>
        <p>Nevertheless, the number of pairs that have to be
manually checked is significantly smaller than the number of
initially created candidate pairs. The percentage of real holes
among the extreme pairs is 3% for Germany and about 15%
for Poland. The difference might result from the much
better overall OSM data quality in Germany or could also be
seen as an indicator for already high data coverage.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Street Name Extrapolation</title>
        <p>We extracted 10,000 feature vectors for edge segments
where we assume extrapolation is possible and 10,000 for
edge segments where we are sure extrapolation is
impossible. Then we used Logistic Regression and Random Forest
to learn feature importance/weights. In our cross-validation
both approaches achieved an accuracy of 99.95%. Also
in both approaches the feature expressing whether the
segment connects two name components with the same name
had most influence. Despite a high AuC score, the number
of close-by name components made only little difference
in the learned classifiers. We assume the feature indicating
the shortest path distance to the closest name component
shadows the number of name components, as a very small
shortest path distance and a small number of close-by
components are highly correlated.</p>
        <p>For real-world validation of our learned classifier, we
selected 2000 unnamed road segments in each Germany and
Poland and computed the feature vectors. Our classifier
declared 235 road segments in Germany extrapolatable and
164 in Poland.</p>
        <p>A visual analysis showed that most of
the segments not declared extrapolatable lied in larger
untagged areas. For the road segments where the classifier
indicated extrapolation might be possible, we selected the
closest name component for name suggestion or, if the
segment connects two components with the same name, this
name is the obvious choice. We relied on a comparison
to GoogleMaps and BingMaps data for evaluation.
Unfortunately, in surprisingly many cases (about 10%) the
street segments in question were unlabeled or unclear or not
even present in GoogleMaps or BingMaps. We excluded
these cases from the evaluation. For the remaining cases,
we achieved an accuracy of 96% in Germany and 91% in</p>
        <sec id="sec-6-2-1">
          <title>Poland (see Figure 10 for a positive example).</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>We showed that machine learning is a useful tool to
detect missing and possibly extrapolatable road network data
in OSM. Making classified holes and nameless street
segments with a good name suggestion available to the OSM
community might raise attention to such locations and
finally lead to a faster improvement of the OSM data quality.
There are various directions for future research. Our
current methods do not work in regions where road data is
completely missing. But e.g. mapped building footprints
could be a strong indicator for the existence of
infrastructure in an area. Considering buildings could also improve
our classifiers. Including (large) buildings as obstacles for
hole detection could lead to more realistic local stretch
values in cities. Moreover, considering house numbers could
significantly help to decide if a street segment should be
tagged with a certain name. If the house numbers of tagged
and untagged segments complement each other there is a
good chance that they share the same name.</p>
      <p>Finally, many other aspects of the OSM data might be
suitable for extrapolation or classification using machine
learning, e.g. distinguishing living and industrial areas or
extrapolating missing house numbers.
In Proceedings of the 21st ACM SIGSPATIAL
International Conference on Advances in Geographic
Information Systems, pp. 54–63. ACM, 2013.</p>
      <p>Delling, Daniel and Werneck, Renato F. Faster
customization of road networks. In Experimental Algorithms, 12th
International Symposium, SEA 2013, Rome, Italy, June
5-7, 2013. Proceedings, pp. 30–42, 2013.</p>
      <sec id="sec-7-1">
        <title>Efentakis,</title>
      </sec>
      <sec id="sec-7-2">
        <title>Alexandros, Brakatsoulas, Sotiris, Grivas,</title>
      </sec>
      <sec id="sec-7-3">
        <title>Nikos, and Pfoser, Dieter. Crowdsourcing turning restrictions for openstreetmap. In EDBT/ICDT Workshops, pp. 355–362, 2014.</title>
        <p>cal. Quality assessment for building footprints data on
openstreetmap. International Journal of Geographical</p>
        <p>Funke, Stefan. Topological hole detection in wireless
sensor networks and its applications. In Proceedings of the
2005 joint workshop on Foundations of mobile
computing, pp. 44–53. ACM, 2005.</p>
        <p>Funke, Stefan, Nusser, Andre´, and Storandt, Sabine. On
kpath covers and their applications. In International
Conference on Very Large Databases (VLDB), 2014.
Girres, Jean-Franc¸ois and Touya, Guillaume. Quality
assessment of the french openstreetmap dataset.
Transactions in GIS, 14(4):435–459, 2010.</p>
        <p>Haklay, Mordechai. How good is volunteered geographical
information? a comparative study of openstreetmap and
ordnance survey datasets. Environment and Planning B
Planning and Design, (37):682–703, 2010.</p>
        <p>Holone, Harald, Misund, Gunnar, and Holmstedt, Hakon.</p>
        <p>Users are doing it for themselves: Pedestrian navigation
with user generated content. In Next Generation
Mobile Applications, Services and Technologies, 2007.
NGMAST’07. The 2007 International Conference on, pp.
91–99. IEEE, 2007.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Jilani,</title>
        <p>Musfira,</p>
      </sec>
      <sec id="sec-7-5">
        <title>Corcoran, Padraig, and Bertolotto, Michela. Automated quality improvement of road network in openstreetmap.</title>
        <p>In Agile Workshop (Action
and Interaction in Volunteered Geographic
Informa</p>
      </sec>
      <sec id="sec-7-6">
        <title>Jilani,</title>
        <p>Musfira,</p>
      </sec>
      <sec id="sec-7-7">
        <title>Corcoran, Padraig, and Bertolotto,</title>
      </sec>
      <sec id="sec-7-8">
        <title>Michela.</title>
      </sec>
      <sec id="sec-7-9">
        <title>Multi-granular street network representation towards quality assessment of openstreetmap data. In</title>
        <p>Proceedings of the Sixth ACM SIGSPATIAL
International Workshop on Computational Transportation
Science, IWCTS ’13, pp. 19:19–19:24. ACM, 2013b.</p>
      </sec>
      <sec id="sec-7-10">
        <title>Jilani,</title>
        <p>Musfira,</p>
      </sec>
      <sec id="sec-7-11">
        <title>Corcoran, Padraig, and</title>
      </sec>
      <sec id="sec-7-12">
        <title>Bertolotto,</title>
        <p>Michela. Automated highway tag assessment of
open</p>
      </sec>
      <sec id="sec-7-13">
        <title>Automated identification and characterization of parcels (AICP) with openstreetmap and points of interest. CoRR, abs/1311.6165, 2013.</title>
        <p>Mitchell, Joseph SB. Shortest paths among obstacles in the
plane. International Journal of Computational
Geometry &amp; Applications, 6(03):309–332, 1996.</p>
      </sec>
      <sec id="sec-7-14">
        <title>Mooney, Peter and Corcoran, Padraig.</title>
        <p>Using OSM for
LBS–an analysis of changes to attributes of spatial
obTowards automatic vandalism detection in openstreetmap.
ISPRS International Journal of Geo-Information, 1(3):</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , Hongchao, Zipf, Alexander, Fu, Qing, and
          <string-name>
            <surname>Neis</surname>
          </string-name>
          , Passtreetmap road networks.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          jects. Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Rahman</surname>
          </string-name>
          , Kazi Mujibur, Alam, Tauhidul, and
          <string-name>
            <surname>Chowdhury</surname>
          </string-name>
          , Mashrur.
          <article-title>Location based early disaster warning and evacuation system on mobile phones using openstreetmap</article-title>
          .
          <source>In Open Systems (ICOS)</source>
          ,
          <source>2012 IEEE Conference on</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Tao</surname>
          </string-name>
          , Sha, Manolopoulos, Vasileios, Rodriguez, Saul, Rusu, Ana, et al.
          <article-title>Real-time urban traffic state estimation with a-gps mobile phones as probes</article-title>
          .
          <source>Journal of Transportation Technologies</source>
          ,
          <volume>2</volume>
          (
          <issue>01</issue>
          ):
          <fpage>22</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Vetter</surname>
            ,
            <given-names>Christian.</given-names>
          </string-name>
          <article-title>Fast and exact mobile navigation with openstreetmap data</article-title>
          .
          <source>Master's thesis</source>
          , Karlsruhe Institute of Technology,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>