<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automation of process to load database from OSM for the design of public routes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>External File</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>G. Bejarano, J. Astuvilca, P. Vega Pattern Recognition and Applied Artificial Intelligence Group Pontifical Catholic University of Peru 1801 Univesitaria Avenue</institution>
          ,
          <addr-line>Lima</addr-line>
          ,
          <country country="PE">Peru</country>
        </aff>
      </contrib-group>
      <fpage>99</fpage>
      <lpage>105</lpage>
      <abstract>
        <p>This paper presents an automatic process to load the transport information of Lima Peru city as network and public transportation routes from OpenStreetMap (OSM) to a PostgreSQL database. Moreover, this work shows how OSM data is transformed in two graphs and in several actual bus routes. The work is a combination of SQL commands and python programs to transform OSM data and combine it with external information to specific structures in a final database. This information was necessary for a second goal that was the approach and study of the Transit Network Design Problem (TNDP). Since OSM is a free collaborative tool, it is subject to manual errors in the map that produce mistaken graphs and routes. These errors are corrected daily because the area information is updated frequently. The obtained results confirm that this process could be automated in a one-click step. Finally, we tried other methods to upload OSM data and none of them got an exact graph except for osm2pgrouting.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Until 2014, the city of Lima, capital of Peru, had
4031 formal routes, 1 Bus Rapid Transit2 (BRT)
line and 1 metro line. There are projects to build
a new BRT corridor3 and 5 more metro lines4,
1http://lima.datosabiertos.pe/home/
2https://www.itdp.org/library/
standards-and-guides/the-bus-rapidtransit-standard/what-is-brt/</p>
      <p>3http://archivo.larepublica.pe/11-042015/municipalidad-de-lima-castanedaanuncia-ampliacion-de-la-ruta-limanorte-del-metropolitano
4http://www.aate.gob.pe/metro-de-lima/
in order to improve public transportation.
Actually, some months ago, the number of formal bus
routes was reduced to 322 urban routes, 77
beltway routes and 15 routes in unattended zones5
because of the oversupply of routes that exist in the
city.</p>
      <p>
        The task of deciding which set of bus routes
must continue in order to attend the public
mobility demand and to optimize the cost of the user
and the cost of the operator is a huge and full of
possibilities task
        <xref ref-type="bibr" rid="ref7">(Farahani et al., 2013)</xref>
        . For this
reason, a metaheuristic algorithm might be used
to find a nearly global optimal solution
        <xref ref-type="bibr" rid="ref7">(Farahani
et al., 2013)</xref>
        to this problem defined by Fan and
Machemehl (2004) as the Transit Network Design
Problem (TNDP). Besides, according to Fan and
Machemehl (2004), metaheuristics are more
suitable to work with practical and real cases.
Nevertheless, a real problem was to load a complete
network as an input to the algorithm and form a
graph by which a set of routes will pass. For this
reason, the purpose of this work is to implement
an automatic process that provides the necessary
data input to an algorithm that solves the TNDP.
      </p>
      <p>
        For our algorithm, two levels of graphs would
be tested in order to find a solution progressively.
The first graph represents the adjacent Traffic
Zones
        <xref ref-type="bibr" rid="ref1">(Agency, 2005)</xref>
        , which we call minizones
because they can be smaller boundaries in the
future, while the second graph is the road and
highways network level. First, sequences of minizones
are created to generate general routes that satisfy
soft and hard restrictions. Then, using these first
routes, real bus routes are defined in a network of
roads level. Once we have all the necessary
information for the algorithm, the sequence and calls
of execution are organized in a single executable
command.
      </p>
      <p>5http://www.elperuano.com.pe/
NormasElPeruano/2015/02/28/12055281.html</p>
      <p>This work is organized as follows. Section 2
explains the information sources for the database
created to keep the data needed and the solutions
produced by the algorithm for the TNDP. Section 3
describes the extraction and transformation
processes used to get the information mentioned and
load it to a default database. Section 4 explains the
organization of the files and instructions used in
the single process and the order in which they are
executed to automatically finish the process.
Conclusions of this work are presented in Section 5
along with future recommendations.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Sources of Information</title>
      <p>To solve a basic case of the TNDP, two sets of
data are needed as we can see in Mautone and
Uqhuart’s work (2009). The first group is the
network that generates one of the graphs required to
evaluate solutions at a network road level, and
information about the edges of this graph like the
road owner of those edges and the road’s type. The
second group consists of the demand information,
the number of travels made from an origin zone
to a destiny zone and the adjacency graph among
zones. Using the Pair Insertion algorithm (PIA),
random routes that cover certain percentage of the
total demand of the zones are generated. Besides
those two sets of groups, Lima has an actual real
solution and that is why we decided to test this
one as an initial solution. The actual information
about routes would be combined using a genetic
algorithm in different forms in order to optimize
the set of routes that solve the TNDP in both
levels of graphs: zones and network.</p>
      <p>The next two subsections describe the sources
for all the groups of data mentioned.
2.1</p>
      <sec id="sec-2-1">
        <title>Road Network</title>
        <p>The two sources analyzed to generate the public
transportation network are shown in Figure 1. The
first source was the one given by the Ministry of
Transportation and Communications of Peru. The
problem with this source was that it was difficult
to identify the edges and nodes of the network due
to the format of the file: shapefile6 (.shp), i.e. one
set of edges was represented as a single edge and
vice versa. In order to verify the correctness of
the shapefile, it was loaded into OSM and in some
cases, the roads were displayed crossing a block.</p>
        <p>6Geospatial vector data format for Geographic
Information System (GIS) Software.</p>
        <p>
          The second analyzed source was OSM, which has
user generated geographic content
          <xref ref-type="bibr" rid="ref8">(Janssen and
Crompvoets, 2012)</xref>
          . After some transformation,
OSM produces a valid, but not official, graph that
represents the public transportation network. This
network would be enough for our TNDP studies.
Besides, this contribution on OSM could be used
for future studies.
In order to get the actual public transportation
routes, two complementary sources were
analyzed. The first source was the Metropolitan Lima
Municipality, which has one SHP file with the
whole set of routes. A process to get a SHP file per
route was made through the ogr2ogr7 command
and a batch program. After the command
produces a comma separated values file (.csv) which
contains a WKT8 (or LineString9) for each route,
the batch program creates a SHP file from every
record of the CSV file. Once a route is individually
identified, it is drawn manually in the OSM
interface and matched with all the ways (set of edges)
by which it passes in order to maintain the
geometric consistency with the road network. In this
process, a lot of errors are generated and they would
be solved thanks to daily reports that indicate their
coordinates and details.
        </p>
        <p>7A command line utility from the “Geospatial Data
Abstraction Library” (GDAL).</p>
        <p>8Well-known text, markup language for representing
vector geometry objects.</p>
        <p>9Vector geometry object that represent a line.
Due to a study of demand and transportation in
Lima made by the Japan International
Cooperation Agency (2005), a set of 427 Traffic zones was
established in the city of Lima. This information
implied two types of data: polygons that represent
the zones and the number of travels made among
every pair of O-D zones (origin and destiny). The
polygons are used to delimit a set of edges for
which a route must pass by in order to satisfy the
demand of that zone (like an origin or destiny).
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Transformation process</title>
      <p>This section describes the steps followed to extract
data from OSM sources, upload it into a database
and transform it into a final database called routes.</p>
      <sec id="sec-3-1">
        <title>3.1 Extraction from the sources</title>
        <p>
          In order to get the OSM data about the public
transportation network and routes of Lima, the
files containing the information of Peru and the
boundary of Lima must be downloaded and used
in a command line. There is a server named
Geofabrik10 which has data extracts of countries of all
the continents and is updated daily
          <xref ref-type="bibr" rid="ref10 ref11">(Zielstra and
Zipf, 2010)</xref>
          . However, as this server provides the
information of the whole country, a boundary must
be applied to get the information of only the city
of Lima. This can be done through an OSM
relation ID which can be obtained from the MapIt11
website, through the insertion of a coordinate
(latitude, longitude) of the objective city. After that,
the OSM relation ID is introduced in a polygons
generator page12 and the .poly file is downloaded.
        </p>
        <p>Osmosis is the command line application from
OSM with which a file containing the information
of the primary, secondary and tertiary highways of
Lima is produced as well as the trunk, motorway,
residential and their links. Figure 2 shows how
the osmosis command takes as input the
information of Peru (peru-latest.osm.pbf) and the
boundary of Lima (lima-callao.poly) and produces the
lima-callao.osm file.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Default and final databases</title>
        <p>Once the lima-callao.osm file is produced, it is
necessary to upload this information to a database
to manage the information. For this task, a tool
10http://download.geofabrik.de/
11http://global.mapit.mysociety.org/
12http://polygons.openstreetmap.fr/
like osm2pgrouting13 was used. It provides a
process that converts OSM data into a topology and
it is uploaded in database. First, we must create a
database called pgrouting-workshop14. After that,
the command shown in Figure 3 must be executed.</p>
        <p>This command filters the road types mentioned
in mapconfig.xml and create tables like ways
(edges), classes (types of roads), nodes,
relations (routes for example), relation ways (which
relates the edges of a route), types, way tag,
way vertices pgr. The full schema for this
database is shown in Figure 4.</p>
        <p>Certainly, there are other tools to import OSM
data into a local database (osm2pgsql, imposm and
osmosis). However, only osm2pgrouting builds an
exact graph with edges and nodes, and Section 3
will show how that makes a difference.</p>
        <p>In order to have the direct information of the
sources and the information of the application or
algorithm separated, a new database routes is
created. Table 1 shows the source of every table in the
new database (from default pgrouting-workshop
database or from external data).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Transformation</title>
        <p>In this section, a brief logic of the load of every
table would be shown. See the Figure 5 for more
details. every table would be shown.</p>
        <p>The table road types contains the different types
of roads that are presented in Lima’s OSM data
13http://pgrouting.org/docs/tools/
osm2pgrouting.html</p>
        <p>14ftp://ftp.remotesensing.org/
pgrouting/foss4g2010/workshop/docs/html/
chapters/topology.html</p>
      </sec>
      <sec id="sec-3-4">
        <title>Database</title>
        <p>Classes (default)
Ways (default)
Ways (default)
Ways (defaut)
Ways (default)
routes, routes edges, edges, transit zones (routes)
i4 districts.csv
i8 census zones.csv
demand matrix.csv
lima-callao.osm
list final routes.csv
like primary, secondary and tertiary road among
others. Besides, this table has an additional field
(not from OSM) that indicates the maximum
number of routes that would be used in the future when
solving the TNDP.</p>
        <p>The table roads is filled from the table ways of
the pgrouting-workshop database, which contains
the longitude and latitude of both nodes that form
an edge (known as source and target). However,
as a road is formed by several edges or ways, a
distinct query by the name of the way is made to
obtain unique roads.</p>
        <p>The table nodes is filled by searching every way
and saving each source or target as a unique key
in a hash table. Besides that, a function from
the PostGIS extension is applied to define which
minizone every node belongs to. Additionaly,
the edges table is similar to the table ways (on
pgrouting-workshop database) but the road id is
brought from the previous step (roads table) to
complete its load.</p>
        <p>We use several steps to fill table routes. First, a
file is generated from the lima-callao.osm file
containing just the relations-routes from the users of
the project. After that, just the routes which do not
have any errors are listed in a CSV file and they
are finally uploaded to the table.</p>
        <p>The process to define which edges belong to
certain route has been a continual feedback. First,
a hash table of different sources and targets nodes
from the table ways was created. Second, every
node was linked to its respective previous and next
node. Each node has its own edge’s gid15, relative
to the current edge. After that, a search is made to
identify which nodes (source or target) are used in
the graph.</p>
        <p>15Road link if of the table ways.</p>
        <p>Finally, a whole loop is made to search along
the hash table from the start node and get the
edge’s to which the actual node belongs to and its
respective order. It is important to mention that
there are some mistakes in this logic due to some
errors or missing information in the direction of
the ways (edges).</p>
        <p>Based on the filling of the route’s edges, the
logic to fill the table routes minizones is to analyze
the source and target node of every edge that form
the route. As every node belongs to a minizone,
a list of every minizone that contains a route’s
nodes is made. Moreover, this list is ordered by
the edge’s order calculated in the previous step as
an attribute of the table routes edges. This list is
grouped by the minizone id to avoid the repetition
of a minizone on different edges. This logic is
applied using some functions of the PostGIS
extension like ST Contains(polygon, point) that decides
whether a point is contained in a polygon or not
and ST SetSRID(point, system) that sets the 4326
system reference16 of a point. This logic is better
shown in Figure 6.</p>
        <p>When nodes are on the limit of the polygon like
the boundary of a demand zone, they could belong
to more than one zone. However, this work did not
analyzed which minizone is selected by the
function ST Contains from the PostGIS extension.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results: Automatic process organization</title>
      <p>Before getting a stable database in which you
can execute an algorithm to the TNDP, several
databases loads must be done in order to evaluate
the accuracy of the routes drawn manually. That
16http://suite.opengeo.org/opengeodocs/glossary.html
is the reason why an automatic process was set to
run daily. In Figure 7, a sequence of programs,
commands, input and output files are shown to
explain the process of downloading the information
from OSM, combine it with external information
and load them to a final database.</p>
      <p>Some tools must be installed in the server and
the local computer before executing this process.
These are: gdal, osmosis, PostGIS, pgrouting17
and psycopg2.</p>
      <p>The automated process has a series of steps
called from an executable file (final.sh) in Linux.
It lasts 30 minutes approximately and the topology
(graph) size is about 150000 edges and 100000
nodes; the number of routes is 300. The
content and the structure of the commands and
process called from final.sh are shown in Figure 8.
Also, the process generates error reports about the
current drawn routes, for each one evaluates how
many edges exist in each node, so if more than
two edges exist in one node, then it reports that
the route has an error. This process was carried out
daily until no error is found in the routes. It is
important to mention that this process is
recommendable when working in a local database where the
password could be stored in files to allow the
interaction of calculus inside and outside the database.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and recommended work</title>
      <p>This section presents the conclusions after
implementing an automated process for uploading OSM
data combined with external information. One of
them was the confirmation of a tool that generates
the topology of the map or graph rather than other
tools that also upload the same data but in a
different scheme.
5.1</p>
      <sec id="sec-5-1">
        <title>Conclusions</title>
        <p>In subsection 3.1, it was mentioned that there were
other tools to upload OSM data. Osm2pgsql and
imposm, after installed and executed, generate a
table where the information of ways can be found.
However, the field that represents the geometry
does not generate the sequences of edges.
Actually, osm2pgsql generates edges that are not
necessary for the graph and make impossible to
distinguish which ones are. We could have worked
with edges that were not required but it would
have been an unnecessary addition of data to a
problem that is already complex. In addition to
that, imposm generates the same edges composed
of only the first and last node of the way. The
manner these edges are stored in these schemes
(osm2pgsql and imposm) hinders the recognition
of edges in a way as seen in Figure 9 where just
three edges (osm2pgrouting) should be generated
from the selected way instead one (imposm) or
seven (osm2pgsql).</p>
        <p>
          On the other hand, there are a lot of routing
applications that use OSM data to combine it with
other type of information at some point
          <xref ref-type="bibr" rid="ref5">(Amat et
al., 2014)</xref>
          <xref ref-type="bibr" rid="ref10">(Vetter, 2010)</xref>
          . However, some of them
are implemented just for certain cities, for other
routing problems like private cars and bicycles or
just do not work very well18. We found some
commercial routing applications but obviously they do
not public the process to combine their sources
into a unique database.
5.2
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Recommended work</title>
        <p>As OSM is a free collaborative tool, more analysis
is recommended in order to establish several
logics that allow us to maintain the coherence of the
data despite of new errors.</p>
        <p>A full review of the possible scheme obtained
and filled from osmosis should be finished in order
to know if there is a faster and simpler way of
loading the information. However, it seems that there
is no direct form to identify the relations routes
with the osmosis’ scheme. This is because of the
different order that the route tag has in a field to
recognized that a relation is a public transportation
route.</p>
        <p>There is also another tool that seems to convert
OSM data into a graph topology that must be
analyzed: OSM2PostGIS19. However, it is still in an
early development.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Japanese</given-names>
            <surname>International Cooperation Agency</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Plan maestro de transporte urbao para el a´rea metropolitana de lima y callao en la repu´blica del peru´ fase 1-9. problemas y temas actuales del transporte urbano</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>18http://wiki.openstreetmap.org/wiki/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>comparison_matrix 19http://pgrouting.org/docs/tools/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>osm2PostGIS.html</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Guillermo</given-names>
            <surname>Amat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Javier</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          ´ lvaro Arranz, and
          <string-name>
            <given-names>Angel</given-names>
            <surname>Ramos</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Using open street maps data and tools for indoor mapping in a smart city scenario</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Wei</given-names>
            <surname>Fan and Randy B Machemehl</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Optimal transit route network design problem: Algorithms, implementations, and numerical results</article-title>
          .
          <source>Technical report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Reza</given-names>
            <surname>Zanjirani</surname>
          </string-name>
          <string-name>
            <surname>Farahani</surname>
          </string-name>
          , Elnaz Miandoabchi, WY Szeto, and
          <string-name>
            <given-names>Hannaneh</given-names>
            <surname>Rashidi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A review of urban transportation network design problems</article-title>
          .
          <source>European Journal of Operational Research</source>
          ,
          <volume>229</volume>
          (
          <issue>2</issue>
          ):
          <fpage>281</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Katleen</given-names>
            <surname>Janssen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joep</given-names>
            <surname>Crompvoets</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Geographic Data and the Law: Defining New Challenges</article-title>
          . Leuven University Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Mauttone</surname>
          </string-name>
          and Mar´ıa
          <string-name>
            <given-names>E</given-names>
            <surname>Urquhart</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>A route set construction algorithm for the transit network design problem</article-title>
          .
          <source>Computers &amp; Operations Research</source>
          ,
          <volume>36</volume>
          (
          <issue>8</issue>
          ):
          <fpage>2440</fpage>
          -
          <lpage>2449</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Christian</given-names>
            <surname>Vetter</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Fast and exact mobile navigation with openstreetmap data</article-title>
          .
          <source>Master's thesis</source>
          , Karlsruhe Institute of Technology.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Zielstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Zipf</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A comparative study of proprietary geodata and volunteered geographic information for germany</article-title>
          .
          <source>In 13th AGILE international conference on geographic information science</source>
          , volume
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>