<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Kontokostas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Freudenberg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rik Van de Walle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University - iMinds - Multimedia Lab</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universität Leipzig, Institut für Informatik</institution>
          ,
          <addr-line>AKSW</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RDF dataset quality assessment is currently performed primarily after data is published. Incorporating its results, by applying corresponding adjustments to the dataset, happens manually and occurs rarely. In the case of (semi-)structured data (e.g., CSV, XML), the root of the violations often derives from the mappings that specify how the RDF dataset will be generated. Thus, we suggest shifting the quality assessment from the RDF dataset to the mapping definitions that generate it. The proposed test-driven approach for assessing mappings relies on RDFUnit test cases applied over mappings specified with RML. Our evaluation is applied to different cases, e.g., DBpedia, and indicates that the overall quality of an RDF dataset is quickly and significantly improved.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data Mapping</kwd>
        <kwd>Data Quality</kwd>
        <kwd>RML</kwd>
        <kwd>R2RML</kwd>
        <kwd>RDFUnit</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Although more and more data is published as Linked Data, there are significant
variations in quality [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], commonly conceived as "fitness for use" for a certain
application or use case. Similar violation patterns reoccur frequently, and most
encountered violations are related to the dataset's schema, namely the
vocabularies or ontologies used to annotate the data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. When datasets stem originally
from semi-structured formats (e.g., CSV, XML), the schema is derived from the
set of classes and properties specified by the mappings, which are applied
repeatedly. Consequently, the same violations are repeated in the dataset as well.
Lately, combinations of different ontologies and vocabularies are used to
annotate data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This increases the likelihood of such violations, as they often derive
from incorrect usage or incorrect combinations of schemas in the mappings.
      </p>
      <p>
        Treating the mapping of data to RDF as a software engineering task, a set of unit
test cases can be assigned to the mappings to ensure the correct generation of
RDF datasets from input data. Incorporating quality assessment as part of the
mapping is essential to prevent the same violations from appearing repeatedly
within the dataset and over distinct entities. After all, structural adjustments
can still be applied in this phase, as violations are identified at their root.
Moreover, if mappings are assessed, every new data source mapped using
them directly benefits from the improvements. Therefore, we proposed a uniform
solution [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that assesses the quality of an RDF dataset, covering both the
mappings and the dataset. In this work, we elaborate on how RDFUnit
patterns [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for dataset test cases were extended to cover RML mappings [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], too.
      </p>
    </sec>
    <sec>
      <title>2 [R2]RML Mappings Quality Assessment with RDFUnit</title>
      <p>
Our solution relies on the RDF Mapping Language (RML) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which allows specifying
mapping definitions expressed in RDF, and on the RDFUnit validation framework,
chosen for its test-case-based architecture [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For our proof-of-concept
implementation<sup>3</sup>, RDFUnit test cases are applied to mappings defined with RML.
RML extends R2RML<sup>4</sup>, the W3C-recommended language for defining mappings
of data in relational databases to RDF, and also covers mappings from sources in
different semi-structured formats, such as CSV and JSON [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. RML documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
specify how the input data can be represented in RDF. The main building blocks
of RML documents are Triples Maps, which define how triples are generated and
consist of three main parts: the Logical Source, the Subject Map and zero or more
Predicate-Object Maps. Term Maps define how RDF terms (IRI, blank node or literal)
are generated. A Term Map can be constant-valued, always generating the same
RDF term; reference-valued, generating the data value of a referenced data fragment
in a given Logical Source; or template-valued, generating a term from a valid string
template that can contain referenced data fragments of a given Logical Source.
      </p>
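      <p>The three kinds of Term Map described above can be illustrated with a minimal Python sketch. It is not part of any RML processor: the dictionary-based representation of a Term Map, the function name generate_term and the sample record are assumptions made for this illustration, and the IRI-safe encoding of template values is omitted.</p>
      <preformat>
```python
import re


def generate_term(term_map, record):
    """Return the RDF term (as a plain string) that a Term Map yields for one record.

    term_map is a hypothetical dictionary with exactly one of the keys
    "constant", "reference" or "template"; record is one row of input data.
    """
    if "constant" in term_map:
        # Constant-valued: always the same term, whatever the record contains.
        return term_map["constant"]
    if "reference" in term_map:
        # Reference-valued: the data value of the referenced fragment.
        return str(record[term_map["reference"]])
    if "template" in term_map:
        # Template-valued: fill every {name} placeholder from the record.
        return re.sub(
            r"\{([^{}]+)\}",
            lambda match: str(record[match.group(1)]),
            term_map["template"],
        )
    raise ValueError("unsupported Term Map")


# Illustrative input record (not taken from the assessed datasets).
record = {"id": 7, "age": "42.5"}

subject = generate_term({"template": "http://example.com/{id}"}, record)
predicate = generate_term({"constant": "foaf:age"}, record)
obj = generate_term({"reference": "age"}, record)
print(subject, predicate, obj)
```
      </preformat>
      <p>For the sample record, the subject is built from the template, the predicate is the constant, and the object is the referenced value.</p>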
      <p>
        RDFUnit [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is an RDF validation framework inspired by test-driven software
development. In RDFUnit, every vocabulary, ontology, dataset or application can
be associated with a set of data quality test cases. The test case definition language
of RDFUnit is SPARQL, which is convenient for directly querying for violations. For
rapid test case instantiation, a pattern-based SPARQL-template engine, running
over a library of common patterns<sup>5</sup>, is supported, in which variables can be easily
bound into patterns. RDFUnit has a Test Auto Generator (TAG) component. TAG
searches for schema information and automatically instantiates new test cases.
      </p>
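      <p>How variables are bound into a pattern can be sketched in a few lines of Python. This is only an illustration of the idea, not RDFUnit's own template engine: the function name instantiate and the plain string substitution are assumptions, while the %%P1%% and %%D1%% placeholders follow the notation of the RDFUnit patterns.</p>
      <preformat>
```python
def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder in a SPARQL pattern with its bound value."""
    query = pattern
    for name, value in bindings.items():
        query = query.replace("%%" + name + "%%", value)
    if "%%" in query:
        # A placeholder without a binding would yield an invalid query.
        raise ValueError("unbound placeholder left in pattern")
    return query


# WHERE clause of a datatype pattern, written on one line for brevity.
PATTERN = "?resource %%P1%% ?c. FILTER (DATATYPE(?c) != %%D1%%)"

query = instantiate(PATTERN, {"P1": "foaf:age", "D1": "xsd:int"})
print(query)
```
      </preformat>
      <p>A generator in the spirit of TAG would derive such bindings from schema information instead of listing them by hand.</p>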
      <p>As [r]rml mappings can be processed as rdf documents, because of their
native rdf representation and viewpoint (written as the generated triples), the
same set of schema validation patterns normally applied on the rdf dataset is
also applicable on the mappings that state how it is generated. Nevertheless,
instead of validating the triple's predicate against its subject and object, the
predicate is extracted from the Predicate Map and is validated against the Term
Maps that generate the subject and object. To achieve this, the properties and
classes are identi ed and their namespaces are used to retrieve the schemas and</p>
    </sec>
    <sec id="sec-2">
      <title>3 https://github.com/mmlab/RMLValidator 4 http://www.w3c.org/TR/R2RML 5 https://github.com/AKSW/RDFUnit/blob/master/configuration/patterns.ttl</title>
      <p>Test-driven Assessment of [R2]RML Mappings
generate the test cases as if they were the actual dataset. The expected value,
as derived from the Predicate Map, is compared to the de ned one, as derived
from the corresponding Subject Map and Object Map. For example, the extracted
predicate is foaf:age normally expects an instance of foaf:Agent type for its
domain and an integer datatype for its range, but the Term Map that generates
the subject is de ned to be of foaf:Project type and the object is de ned to
have a oat value. Its mapping document follows:
&lt;#Mapping&gt; rr:subjectMap [rr:template "http://example.com/{id}"; rr:class foaf:Project];
rr:predicateObjectMap [rr:predicate foaf:age; rr:objectMap [rml:reference "age"]].</p>
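      <p>The comparison between the expected and the defined values can be sketched in Python as follows. The sketch is an assumption-laden illustration, not the RDFUnit implementation: the SCHEMA dictionary is a hand-written excerpt that encodes the domain and range stated in the text, the MAPPING dictionary is a flattened stand-in for the mapping document, its xsd:float datatype is taken from the prose, and subclass reasoning is ignored.</p>
      <preformat>
```python
# Hypothetical schema excerpt: expected domain and range per predicate,
# as stated in the text for foaf:age.
SCHEMA = {"foaf:age": {"domain": "foaf:Agent", "range": "xsd:int"}}

# The example mapping, flattened into a plain dictionary (illustrative
# structure; the float datatype is assumed from the prose).
MAPPING = {
    "subject_class": "foaf:Project",
    "predicate": "foaf:age",
    "object_datatype": "xsd:float",
}


def assess(mapping, schema):
    """Compare what the mapping defines with what the schema expects."""
    expected = schema.get(mapping["predicate"])
    if expected is None:
        # No schema information for this predicate, so nothing to check.
        return []
    violations = []
    if mapping["subject_class"] != expected["domain"]:
        violations.append(
            "domain: expected %s, mapping defines %s"
            % (expected["domain"], mapping["subject_class"])
        )
    if mapping["object_datatype"] != expected["range"]:
        violations.append(
            "range: expected %s, mapping defines %s"
            % (expected["range"], mapping["object_datatype"])
        )
    return violations


violations = assess(MAPPING, SCHEMA)
for violation in violations:
    print(violation)
```
      </preformat>
      <p>Both checks inspect the mapping only once, independently of how many triples the mapping would later generate.</p>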
      <p>Corresponding RDFUnit test cases and patterns were defined to apply to
the mappings, adjusting the assessment queries.<sup>6</sup> The defined test cases cover
all possible alternative ways of defining equivalent mappings that generate the
same triples. RDFUnit can annotate test cases by requesting additional variables
and binding them to specific result properties. The test case pattern applied to
the aforementioned example and its instantiation are indicatively presented. First,
the WHERE clause of a SPARQL template query that assesses the datatype
is presented, followed by its instantiation:</p>
      <preformat>?resource %%P1%% ?c.
FILTER (DATATYPE(?c) != %%D1%%)</preformat>
      <preformat>?resource foaf:age ?c.
FILTER (DATATYPE(?c) != xsd:int)</preformat>
      <p>The following is the WHERE clause of the same test case applied to the mapping,
again as a template followed by its instantiation:</p>
      <preformat>?resource rr:predicateObjectMap ?poMap.
?poMap rr:predicate %%P1%%;
       rr:objectMap ?objM.
?objM rr:datatype ?c.
FILTER (?c != %%D1%%)</preformat>
      <preformat>?resource rr:predicateObjectMap ?poMap.
?poMap rr:predicate foaf:age;
       rr:objectMap ?objM.
?objM rr:datatype ?c.
FILTER (?c != xsd:int)</preformat>
    </sec>
    <sec>
      <title>3 Evaluation and Discussion</title>
      <p>
The assessed datasets and corresponding mappings, as well as the assessment
results, are summarized in Table 1. The DBpedia mappings<sup>7</sup>, after being
converted from wikitext markup to RML<sup>8</sup>, and the DBpedia dataset were assessed. The DBLP
mappings, after being converted to RML<sup>9</sup>, and the corresponding
dataset were assessed, too. The results show that the required quality assessment
time is significantly reduced if the mappings are assessed instead of the RDF
dataset, especially in the case of medium/large datasets. That happens because
the dataset assessment requires examining each triple separately to identify,
for instance, that 12M triples violated the predicate's range, whereas mapping
assessment requires only 1 triple to be examined. The effectiveness of mapping
assessments is also high: the identified violations can be accurately indicated.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Dataset</th>
              <th rowspan="2">Size</th>
              <th colspan="3">Dataset assessment</th>
            </tr>
            <tr>
              <th>Time</th>
              <th>#Fail.</th>
              <th>#Viol.</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>DBpEn</td><td>62M</td><td>16h</td><td>1,128</td><td>3.2M</td></tr>
            <tr><td>DBpNL</td><td>21M</td><td>1.5h</td><td>683</td><td>815K</td></tr>
            <tr><td>DBLP</td><td>12M</td><td>12h</td><td>7</td><td>8.1M</td></tr>
            <tr><td>iLastic<sup>10</sup></td><td>150K</td><td>12s</td><td>23</td><td>37K</td></tr>
            <tr><td>CDFLG<sup>11</sup></td><td>0.6K</td><td>7s</td><td>15</td><td>678</td></tr>
            <tr><td>CEUR-WS<sup>12</sup></td><td>2.4K</td><td>6s</td><td>7</td><td>783</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>6</sup> https://github.com/AKSW/RDFUnit/blob/master/data/tests/Manual/www.w3.org/ns/r2rml/rr.tests.Manual.ttl</p>
      <p><sup>7</sup> http://mappings.dbpedia.org/</p>
      <p><sup>8</sup> https://goo.gl/GPB1Ar</p>
      <p>Acknowledgements. This paper's research activities were funded by Ghent
University, iMinds, the Institute for the Promotion of Innovation by Science and
Technology in Flanders, the Fund for Scientific Research-Flanders and by grants
from the EU's 7th &amp; H2020 Programmes for projects ALIGNED (GA 644055),
GeoKnow (GA 318159) and LIDER (GA 610782).</p>
    </sec>
    <sec id="sec-3">
      <p><sup>9</sup> https://github.com/RMLio/D2RQ_to_R2RML.git</p>
      <p><sup>10</sup> http://explore.ilastic.be/</p>
      <p><sup>11</sup> http://ewi.mmlab.be/cd/all</p>
      <p><sup>12</sup> http://rml.io/rml/data/SPC2015/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freudenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mannens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          , and R. Van de Walle.
          <article-title>Assessing and Refining Mappings to RDF to Improve Dataset Quality</article-title>
          .
          <source>In Proceedings of the 14th International Semantic Web Conference</source>
          , Oct.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, and R. Van de Walle.
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>In Workshop on Linked Data on the Web</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          , M. Brummer, S. Hellmann,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          .
          <article-title>NLP data cleansing based on Linguistic Ontology constraints</article-title>
          .
          <source>In Proc. of the Extended Semantic Web Conference 2014</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Test-driven Evaluation of Linked Data Quality</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on World Wide Web</source>
          , pages
          <fpage>747</fpage>
          –
          <lpage>758</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmachtenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the Linked Data Best Practices in Different Topical Domains</article-title>
          . volume
          <volume>8796</volume>
          of
          <source>LNCS</source>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietrobon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Quality Assessment for Linked Data: A Survey</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>