<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Sustainable View-based Extract-Transform-Load (ETL) Fusion of Open Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kay Muller</string-name>
          <email>kay.mueller@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claus Stadler</string-name>
          <email>cstadler@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ritesh Kumar Singh</string-name>
          <email>ritesh.kumar.singh@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <email>hellmann@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AKSW/KILT, Leipzig University &amp; DBpedia Association</institution>
        </aff>
      </contrib-group>
      <fpage>70</fpage>
      <lpage>73</lpage>
      <abstract>
        <p>Openly available datasets originate from different data providers, ranging from government agencies and commercial enterprises to communities of data enthusiasts. Integrating different source datasets into a single RDF graph by using ETL (Extract-Transform-Load) systems, which perform offline transformation, ontology matching and linking, usually takes many iterations of revisions until the target dataset is free of the most obvious mapping, linking and consistency errors. Since ETL systems produce the RDF offline, any mapping or content change requires a re-ingest of the relevant source data. When dealing with heterogeneous source datasets, creating a unified target dataset can be a tedious undertaking. This paper therefore proposes an RDF view-based ingestion approach which allows real-time "debugging" of the unified dataset, where mappings and links can be changed with immediate effect. Once the unified graph passes all data quality tests, the RDF can be materialized. This process poses an alternative to existing ETL solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>ETL</kwd>
        <kwd>RDF View</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Traditionally, ETL (Extract-Transform-Load) systems are used to convert different source datasets into a structured target dataset. These systems are commonly used in the area of data warehousing. As the name suggests, these frameworks follow the workflow of: 1.) Extract the required information from the source datasets, 2.) Transform portions of the source data into a target model using schema mappings, normalizations and deduplications, 3.) Load the data into a store, possibly with versioning support.</p>
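      <p>As a minimal sketch, these three steps can be expressed directly over RDF with SPARQL 1.1 Update; the graph names, vocabulary IRIs and source URL below are illustrative placeholders, not part of any concrete system:</p>
      <preformat># 1.) Extract: load the raw source data into a staging graph
LOAD &lt;http://example.org/dumps/companies.nt&gt;
  INTO GRAPH &lt;http://example.org/graph/staging&gt; ;

# 2.) Transform: map the source schema to the target ontology,
#     normalizing literals along the way
INSERT {
  GRAPH &lt;http://example.org/graph/target&gt; {
    ?s a &lt;http://example.org/onto/Company&gt; ;
       &lt;http://www.w3.org/2000/01/rdf-schema#label&gt; ?cleanName .
  }
}
WHERE {
  GRAPH &lt;http://example.org/graph/staging&gt; {
    ?s &lt;http://example.org/src/companyName&gt; ?name .
    BIND (UCASE(STR(?name)) AS ?cleanName)
  }
}
# 3.) Load: the target graph is now materialized; in an offline ETL
#     setup, any later mapping fix means re-running steps 1 and 2</preformat>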
      <p>In the Semantic Web community, tools such as LDIF (http://ldif.wbsg.de/), OpenRefine (http://openrefine.org/) and UnifiedViews (https://www.semantic-web.at/unifiedviews) have been developed to allow Linked Data developers to create new Linked Data content. Since many Linked Data publishers do not always follow existing best practices and (quasi-)standards of the Linked Data community, linking tasks are potentially complicated and the overall data quality of source datasets is decreased. The LOD Laundromat [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was created in response to this dilemma: it crawls the Linked Data Cloud (http://lod-cloud.net/) and creates corresponding standards-compliant datasets which can be downloaded from its website.</p>
      <p>Since data quality can vary between datasets, using an ETL system to convert the source data into the target format might result in mapping and content errors which are only discovered once all the data has been converted. Very often this forces a re-ingest of the source data through the whole ETL pipeline, and such trial-and-error attempts can be very time- and resource-consuming. When integrating different company datasets into a knowledge graph using the standard ETL approach, we realised that it was necessary to rerun the pipeline many times due to conversion, linking or mapping errors: once the ETL pipeline had finished, we examined its output for errors, fixed them in the pipeline and reran it, repeating this process until a satisfactory output was generated. Hence the idea was born to create an ETL system which can perform online dataset updates. As was shown in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], RDF views can present a feasible alternative to standard ETL approaches in the context of data quality management. Based on the concept of RDF views, this paper proposes a novel architecture which makes it possible to "debug" knowledge graphs. The term "debugging" is used in a broad sense: the effects of changes to the linking, mapping or data normalization pipeline steps can be examined immediately, without having to wait for a complete re-ingest, allowing an immediate "debugging/evaluation" of changes and their impact on the target dataset.</p>
    </sec>
    <sec id="sec-2">
        <title>2 Design</title>
      <p>In order to ease the quality assessment of heterogeneous datasets, we propose a view-based RDF transformation design. This design has the advantage that many data- and transformation-related changes can be reviewed promptly instead of potentially having to re-ingest the data. The proposed design is shown in Figure 1. In order to support the handling of heterogeneous datasets, we identified the following main features (a sketch of a view definition follows the list):
- RDF views: Since our proposed architecture uses RDF-based views, we suggest converting the source data directly to RDF without cleaning the input data. As RDF-2-RDF mapping languages, SPARQL CONSTRUCT (https://www.w3.org/TR/rdf-sparql-query/#construct) queries with minor syntax extensions for naming views and quad support could be exploited. Another option is to adapt RML (http://semweb.mmlab.be/rml/spec.html) for this purpose. Compared to SQL-based views, the RDF-based view approach has the advantage that SPARQL queries of one layer can be transformed into SPARQL queries of another layer without the limitations of SPARQL-2-SQL conversions.
- Functional indexes for transformation functions: The input data can rarely be expected to fit the target ontology and all defined requirements. In order to speed up access to data normalized via views, the proposed system supports functional indices similar to those supported by traditional RDBMSs (for example, Postgres expression indexes: http://www.postgresql.org/docs/9.5/static/indexes-expressional.html).
- Real-time updates: Due to the virtualization of the RDF and Normalization Layers, changes applied in these layers have an immediate effect on the virtualized RDF views. This feature allows ontology engineers to follow an iterative process when creating a new dataset, rather than re-ingesting and checking parts of the dataset, in order to save time.
- Unified querying and ETL: All relevant source data is expected to have been mapped to appropriate target ontologies. Using SPARQL, it is possible to retrieve portions of the data on-the-fly as well as to export all data at once.</p>
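      <p>As a sketch of the first feature, a view over the raw RDF can be expressed as a plain SPARQL CONSTRUCT query; the src: and onto: vocabularies below are invented for illustration, and the view name in the comment stands in for the naming extension mentioned above, whose concrete syntax is left open here:</p>
      <preformat># View name (hypothetical naming extension): onto:CompanyView
PREFIX src:  &lt;http://example.org/src/&gt;
PREFIX onto: &lt;http://example.org/onto/&gt;

CONSTRUCT {
  ?company a onto:Company ;
           onto:legalName ?name ;
           onto:locatedIn ?city .
}
WHERE {
  # raw, uncleaned triples as produced by the Source Mapping Layer
  ?company src:name ?name ;
           src:city ?city .
}</preformat>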
      <p>Our design consists of six layers: The Source Data Layer is used to ingest the original source data via the Source Mapping Layer into the RDF backend, using mappings which alter the source data as little as possible. The Intermediate RDF Layer provides direct access to the non-normalized RDF data. The Normalization Layer expresses normalizations by means of RDF2RDF views, possibly supported by functional indices and caches which are intended to improve query performance against the normalized RDF data. There are two kinds of views: term-based views for crafting RDF terms, especially IRIs, and triple-based views for relating the terms to each other. If required, the Enrichment Layer can be used to trigger the additional generation of information about existing entities. Finally, the Query Federation Layer can be used to integrate other (possibly virtual) SPARQL endpoints. Note that changes made in one layer are automatically propagated to the upper layers.</p>
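      <p>A minimal sketch of the two view kinds, again over invented vocabulary: the first query is a term-based view that only crafts normalized IRIs, while the second is a triple-based view that relates those terms, consuming the output of the first view as its input layer:</p>
      <preformat># Term-based view: mint a stable IRI for each company from its
# registration number
PREFIX src:  &lt;http://example.org/src/&gt;
PREFIX onto: &lt;http://example.org/onto/&gt;

CONSTRUCT { ?src onto:canonicalIri ?iri . }
WHERE {
  ?src src:regNo ?regNo .
  BIND (IRI(CONCAT("http://example.org/id/company/",
                   ENCODE_FOR_URI(STR(?regNo)))) AS ?iri)
}

# Triple-based view (a separate query; prefixes as above):
# relate the crafted terms to each other
CONSTRUCT { ?companyIri onto:hasCeo ?personIri . }
WHERE {
  ?c onto:canonicalIri ?companyIri ;
     src:ceo ?p .
  ?p onto:canonicalIri ?personIri .
}</preformat>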
      <sec id="sec-8-1">
        <title>Discussion and Future Work</title>
        <p>
          The authors are aware that RDF views come with performance implications [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The situation is comparable to a debug executable, which runs an order of magnitude slower than a release build but offers many features that support the debugging process. In the same way, the proposed design enables unique knowledge graph "debugging/evaluation" capabilities. For some use cases the performance of this design might be sufficient, so that no full RDF export is required. If a "release" version of the dataset is required, SPARQL queries can be used to retrieve portions of the data on-the-fly as well as to export all data at once; this "release" data can then be loaded into a graph backend. The generic architecture can be used to add different data sources, such as databases and REST APIs, to the knowledge graph. Since a lot of source data is only accessible via REST APIs or databases, this opens new integration opportunities; such backends could be integrated via the Query Federation Layer. In addition, it would be possible to add a Meta Data Layer which could store provenance, confidence and other meta-information relevant for relations and entities, as well as a Fusion Layer which would use owl:sameAs relationships and entity fusion algorithms to present a fused view of all entities across the imported source datasets. Despite a possible performance loss from using RDF views, the advantages of such a design outweigh the disadvantages, especially since it can be combined with other existing solutions at each layer. Hence we believe that this novel design will allow the creation and curation of knowledge graphs in a new, iterative way.
        </p>
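        <p>Both uses can be sketched with standard SPARQL, using placeholder endpoint and vocabulary IRIs: the SERVICE clause below pulls labels for linked entities from an external (possibly virtual) endpoint via the Query Federation Layer, and the trivial CONSTRUCT afterwards is the kind of query that exports the full "release" dataset for loading into a graph backend:</p>
        <preformat>PREFIX owl:  &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;

# Federated lookup through the Query Federation Layer
SELECT ?company ?extLabel
WHERE {
  ?company a &lt;http://example.org/onto/Company&gt; ;
           owl:sameAs ?ext .
  SERVICE &lt;http://example.org/external/sparql&gt; {
    ?ext rdfs:label ?extLabel .
  }
}

# Exporting the "release" version (a separate query):
# materialize everything the views expose
CONSTRUCT { ?s ?p ?o }
WHERE     { ?s ?p ?o }</preformat>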
        <p>Acknowledgments. This work was supported by grants from the Federal Ministry for Economic Affairs and Energy of Germany (BMWi) for the SmartDataWeb Project (GA-01MD15010B, http://smartdataweb.de/).</p>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>W.</given-names>
            <surname>Beek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rietveld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Bazoobandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wielemaker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Schlobach</surname>
          </string-name>
          .
          <article-title>LOD Laundromat: A Uniform Way of Publishing Other People's Dirty Data</article-title>
          .
          In
          <source>ISWC</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>N.</given-names>
            <surname>Konstantinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Spanos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Mitrou</surname>
          </string-name>
          .
          <article-title>Transient and Persistent RDF Views over Relational Databases in the Context of Digital Repositories</article-title>
          .
          <source>Communications in Computer and Information Science</source>
          ,
          <volume>390 CCIS</volume>
          :
          <fpage>342</fpage>
          -
          <lpage>354</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>