<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GovLOD: Towards a Linked Open Data Portal</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Octavian Rinciog</string-name>
          <email>octavian.rinciog@cs.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vlad Posea</string-name>
          <email>vlad.posea@cs.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politehnica University of Bucharest</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Nowadays, governments and public agencies publish open data on dedicated portals at an exponentially growing rate. These open data sources provide semi-structured or unstructured data, because the focus is on publishing the data rather than on how they are later used. GovLOD is a platform that aims to transform the information found in these heterogeneous files into Linked Open Data, using RDF triples.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Open Data</kwd>
        <kwd>RDF</kwd>
        <kwd>OCR</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this article we present GovLOD, a platform that aims to change
the paradigm of open data portals by publishing their data as Linked Open
Data with a well-defined structure. We propose a semi-automatic solution
for the ingestion, processing, and publication of information that is hidden in files
published on governmental open data portals. By publishing the data found in
these files in Linked Open Data format, our platform also offers developers a very easy way
to build applications without having to process the initial files.</p>
    </sec>
    <sec id="sec-2">
      <title>Platform</title>
      <p>
        Our platform consists of several layers, each with a clearly defined role. Some of
these layers are automatic (for example, the ingestion and processing layers), while others,
such as vocabulary selection, currently are not. The purpose of the platform is to transform
the open data files into Linked Open Data, according to the requirements from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Architecture</p>
      <p>The platform architecture is shown in Figure 1 and is structured according to
the following workflow. First, files are taken from open data portals. Second,
these files are processed so that they have a well-defined structure. Third,
the information from the files is converted to RDF triples and stored in a
semantic repository. Developers who want to use the public data can then access
it with SPARQL queries, without needing to process the initial files.</p>
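      <p>The four-step workflow can be summarized as a small pipeline. The sketch below is purely illustrative: the function names and data shapes are hypothetical, not the platform's actual API.</p>

```python
def ingest(portal_files):
    """Step 1: fetch the new files from an open data portal."""
    return [f for f in portal_files if f["is_new"]]

def process(files):
    """Step 2: give each file a well-defined (tabular) structure."""
    return [{"name": f["name"], "rows": f.get("rows", [])} for f in files]

def to_rdf(tables):
    """Step 3: convert structured rows to (subject, predicate, object) triples."""
    triples = []
    for t in tables:
        for i, row in enumerate(t["rows"]):
            subject = f"http://opendata.cs.pub.ro/resource/{t['name']}/{i}"
            triples.extend((subject, p, o) for p, o in row.items())
    return triples

def store(triples, repository):
    """Step 4: persist triples in a semantic repository (here, a plain list)."""
    repository.extend(triples)
    return repository

# Hypothetical sample input: one newly published file with one data row.
repo = []
files = [{"name": "hospitals", "is_new": True,
          "rows": [{"label": "Spitalul Universitar", "city": "Bucharest"}]}]
store(to_rdf(process(ingest(files))), repo)
```

      <p>In the real platform, each step is a separate layer, described in turn below.</p>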
      <p>Ingestion layer Most open data portals are implemented on top of the CKAN
platform, which provides APIs for retrieving information from the portals. Our system
connects to this API and processes the new information published since the last
query. This process is automatic: the layer keeps track of all downloaded
files and, at each step, downloads only the files added since the
last query time.</p>
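      <p>The incremental-download logic amounts to filtering package metadata by its modification timestamp. A minimal sketch, assuming metadata records carry an ISO-formatted <code>metadata_modified</code> field as CKAN package metadata does (the sample records below are hypothetical):</p>

```python
from datetime import datetime

def new_packages(packages, last_sync):
    """Keep only the packages modified after the previous query time."""
    return [p for p in packages
            if datetime.fromisoformat(p["metadata_modified"]) > last_sync]

# Hypothetical package metadata, as returned by a portal API.
packages = [
    {"name": "hospitals-2015", "metadata_modified": "2015-03-01T10:00:00"},
    {"name": "schools-2016",   "metadata_modified": "2016-01-15T08:30:00"},
]
last_sync = datetime(2015, 12, 31)
fresh = new_packages(packages, last_sync)  # only "schools-2016" is new
```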
      <p>Processing layer In the second step, the retrieved files pass through a
processing stage whose purpose is to identify and define the structure of each file.
Currently we have modules for processing PDF files and tabular files, such as
CSV or XLS. From PDF files, whether scanned or not, our platform
extracts the tables and makes them available as structured data. This process
relies on techniques that identify the table structure within the pages of the
file, recognize the text by applying OCR to each table cell, and afterwards
export the tables as CSV files. Data cleaning and normalization are also
performed in this step, by identifying incorrectly formatted numbers and strings.</p>
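      <p>As one concrete instance of the normalization step, OCR output for numeric cells often uses Romanian formatting ("1.234,56") and stray whitespace. A minimal sketch of such a cleaning rule, under the assumption of that input format (the platform's actual cleaning rules may differ):</p>

```python
def normalize_number(raw):
    """Convert a Romanian-formatted numeric string ("1.234,56") to a float."""
    # Drop surrounding whitespace, remove thousands separators ("."),
    # then turn the decimal comma into a decimal point.
    cleaned = raw.strip().replace(".", "").replace(",", ".")
    return float(cleaned)

normalize_number(" 1.234,56 ")  # -> 1234.56
normalize_number("7,5")         # -> 7.5
```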
      <p>
        Semantic layer The data taken from the CSV files produced in the previous stage are
classified into two categories: a) statistical data and b) information about physical entities
(e.g. monuments, hospitals, schools, churches, and people). In this phase, the
data from the files are converted to RDF triples. The URI schema is consistent, so
every time an entity is recognized, it is mapped to the same URI. Files
containing statistical data are converted using the RDF Data Cube Vocabulary1: we
map every dimension of the data to one semantic dimension of the cube. Because
all statistical open data must be located in time and space, all such datasets share these
two common dimensions. The remaining dimensions are variable, and the user must
identify the properties to be mapped to each of them. We must also identify
an ontology onto which to map the data about physical entities. At this time, in
our system this mapping is done manually, but in the future we plan to have it
done automatically or semi-automatically. If no existing ontology can be found
for a given subject, the user has the possibility to define new properties inside
the platform. Also in this step, we augment the existing data using several external
web services, such as geolocation for physical entities. We then link the created
resources to existing resources from the Linked Open Data Cloud. Currently
the system supports linking resources to existing resources in dbpedia.org through the
&lt;owl:sameAs&gt; property. Similar resources are identified by matching
the &lt;dbp:name&gt; property from dbpedia.org exactly against the &lt;rdf:name&gt; property of our
created resources. In the future we intend to integrate a framework for automatic
link discovery, such as Silk [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
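      <p>The exact-name linking step described above can be sketched as a lookup from local resource names into a name-to-URI index built from dbpedia.org, emitting one &lt;owl:sameAs&gt; triple per match. The dictionaries below are hypothetical samples, not the platform's actual data:</p>

```python
def link_same_as(local_resources, dbpedia_index):
    """Emit (subject, owl:sameAs, object) triples for exact name matches."""
    triples = []
    for uri, name in local_resources.items():
        if name in dbpedia_index:
            triples.append((uri,
                            "http://www.w3.org/2002/07/owl#sameAs",
                            dbpedia_index[name]))
    return triples

# Hypothetical sample data: one local resource and a small DBpedia name index.
local = {"http://opendata.cs.pub.ro/resource/museum/1": "Muzeul National de Arta"}
dbpedia = {"Muzeul National de Arta":
           "http://dbpedia.org/resource/National_Museum_of_Art_of_Romania"}
links = link_same_as(local, dbpedia)
```

      <p>Exact matching is deliberately conservative: it produces no false links, at the cost of missing name variants, which is one motivation for adopting a dedicated link-discovery framework such as Silk.</p>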
      <p>Storage layer The RDF triples produced in the above step are stored in the Apache
Marmotta platform, which contains a SPARQL engine and a reasoning engine2. For
effective data storage, Apache Marmotta can use either an H2 or a PostgreSQL database.
We use the latter, as it provides additional scalability.</p>
      <p>App layer As already mentioned, using the SPARQL engine from the previous
layer, developers can implement applications simply by running SPARQL queries,
without having to process the initial files. The major advantage of this layer is
that the data structure is documented and well-defined. Each dataset that was
transformed into RDF triples has an attached wiki page3, which explains the
properties of its RDF triples and gives a few sample SPARQL queries showing
how to access the data.
1 https://www.w3.org/TR/vocab-data-cube
2 http://opendata.cs.pub.ro/repo/sparql
3 http://opendata.cs.pub.ro/wiki</p>
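      <p>A query in the spirit of those wiki samples might look like the following. The property names and graph layout here are hypothetical, chosen from common vocabularies, and are not the platform's documented schema:</p>

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>

# List ten resources with their names and coordinates (hypothetical schema).
SELECT ?entity ?name ?lat ?long
WHERE {
  ?entity rdfs:label ?name ;
          geo:lat    ?lat ;
          geo:long   ?long .
}
LIMIT 10
```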
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In order to test this platform, we used data taken from Romanian national public
data portal4. As already mentioned, in this portal 441 datasets are published,
comprising 3,231 files. From there, we selected a number of 117 documents,
divided into 23 datasets. They share information in the following areas: health
(data about hospitals and pharmacies ), education (schools), culture (museums,
archaeological sites and churches), and statistical data (types of cars registered
in this country). From these files converted to RDF, our platform stored 165,279
resources and 2.4 million RDF triples.</p>
      <p>The ease of using the data provided by this platform is demonstrated by the
Romanian Linked Open Data Map5 application, whose purpose is to display the
museums, hospitals, or pharmacies nearest to the user’s location. This application
was implemented using only data provided by our platform, accessed through SPARQL
queries.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this article, we presented GovLOD, a solution that ingests and transforms
files from national open data portals and publishes RDF triples into a semantic
repository. The platform also provides a SPARQL engine that developers can use
to implement applications without processing the initial documents.</p>
      <p>
        In recent years, a number of research projects have aimed to transform open data
into Linked Open Data. For example, City Data Pipeline [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] creates a single model
under which it aggregates all available data about a given city, and DataLift [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
is a tool that converts CSV files and relational databases taken from city open
data portals into RDF triples.
      </p>
      <p>Our platform differs from these solutions by automatically ingesting files that
contain open data, and by transforming both structured data taken from scanned PDF files
and statistical data into RDF triples.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bischof</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polleres</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>City data pipeline</article-title>
          .
          <source>Proc. of the I-SEMANTICS</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Böhm</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitag</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heise</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Govwild: integrating open government data for transparency</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on World Wide Web</source>
          . pp.
          <fpage>321</fpage>
          -
          <lpage>324</lpage>
          . ACM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>How to publish linked data on the web</article-title>
          .
          <source>In: Tutorial in the 7th International Semantic Web Conference</source>
          , Karlsruhe, Germany (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Scharfe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Atemezing</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Enabling linked-data publication with the datalift platform</article-title>
          .
          <source>In: Proc. AAAI workshop on semantic cities</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Volz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Discovering and maintaining links on the web of data</article-title>
          . In: International Semantic Web Conference. pp.
          <fpage>650</fpage>
          -
          <lpage>665</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>