<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DBpedia Mappings Quality Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Kontokostas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Freudenberg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University - iMinds - Data Science Lab</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universität Leipzig, Institut für Informatik</institution>
          ,
          <addr-line>AKSW</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Schema violations in RDF data generated from (semi-)structured data often originate in the mappings, which specify how an RDF dataset is generated and are applied repeatedly. The DBpedia dataset, which derives from Wikipedia infoboxes, is no exception. To mitigate the violations, we proposed in previous work to validate the mappings that generate the data, instead of validating the generated data afterwards. In this work, we demonstrate how mapping validation is applied to DBpedia. DBpedia mappings are automatically translated to RML and validated by RDFUnit. The DBpedia mappings assessment can be executed frequently, because it requires significantly less time than validating the dataset. The validation results are made available via a user-friendly interface. The DBpedia community takes them into consideration to refine the DBpedia mappings or ontology and thus increase the dataset quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data Mapping</kwd>
        <kwd>Data Quality</kwd>
        <kwd>DBpedia</kwd>
        <kwd>RML</kwd>
        <kwd>RDFUnit</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Although more and more data is published as Linked Data, there are significant
variations in quality [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], commonly conceived as “fitness for use” for a certain
application or use case. When datasets originally stem from semi-structured
formats (e.g., CSV, XML), the schema is derived from the set of classes and properties
specified by the mappings, which are applied repeatedly. Consequently, if those
mappings contain inaccuracies, the same violations are repeated over and over
in the dataset. Incorporating quality assessment as part of the mapping
activity is therefore essential to prevent most recurring schema-based violations. To
this end, we have proposed a uniform approach for assessing the quality of both the
mappings and the dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We implemented our approach based on the RDFUnit
validation framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the RML mapping language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our solution
incrementally assesses the quality of an RDF dataset, covering both the mappings and the
dataset itself. Since RML mappings are expressed in RDF, the RDFUnit validation
framework can apply its test cases to RML mappings similarly to how it applies
them to RDF datasets. Assessing an RDF dataset takes a lot of time, so it
cannot be executed frequently and, when it is executed, the root of a violation is not
easily identified. In contrast, directly assessing the mappings that generate
a dataset requires significantly less time and pinpoints the root of each violation.
      </p>
      <p>This paper’s research activities were funded by Ghent University, iMinds, the Institute for
the Promotion of Innovation by Science and Technology in Flanders, the Fund for Scientific
Research-Flanders and grants from the EU’s 7th &amp; H2020 Programmes for the projects ALIGNED
(GA 644055), GeoKnow (GA 318159) and LIDER (GA 610782).
      </p>
      <p>In this work, we demonstrate how we incorporated our solution in the DBpedia
validation workflow. DBpedia mappings are automatically translated to RML and
subsequently assessed using RDFUnit. In this demo, the validation results are
shown via a user-friendly interface and users can directly contribute to improving
the DBpedia mappings. Once they update a mapping or the DBpedia ontology,
users can trigger a new validation round and immediately see the
updated validation results, without the violation they just corrected.</p>
    </sec>
    <sec id="sec-2">
      <title>Expressing DBpedia Mappings with RML</title>
      <p>
        DBpedia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] provides a collaborative approach for mapping Wikipedia infoboxes
to the DBpedia ontology<sup>3</sup>. The mappings are maintained and edited through the
DBpedia mappings wiki<sup>4</sup>, and are defined using the same wiki markup syntax as
Wikipedia. However, the quality of wikitext-based mappings cannot
be assessed directly, and certainly not in the same way as the resulting dataset.
      </p>
      <p>RML covers mappings from sources in different (semi-)structured formats.
Furthermore, it is highly extensible towards other structures and formalizations.
Taking advantage of this, we introduced the wikitext serialisation as a new Reference
Formulation. A Reference Formulation indicates the grammar that should
be used to refer to data of a certain structure and format. In total, 674 distinct mapping
documents for English, 463 for Dutch and 4,468 for all languages
supported in the DBpedia mappings wiki are translated to RML and are available
at http://mappings.dbpedia.org/server/mappings/en/pages/rdf/.</p>
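      <p>The translation of a wiki mapping to RML can be illustrated with a minimal sketch. The function name <monospace>to_rml</monospace>, its input shape (a template name, an ontology class and a dictionary of infobox properties) and the omission of prefix declarations are assumptions made for illustration; this is not the translator used by the DBpedia extraction framework.</p>
      <preformat><![CDATA[
```python
# Hypothetical sketch: function name and input shape are illustrative
# assumptions, not the DBpedia extraction framework's translator.

BASE = "http://mappings.dbpedia.org/server/mappings/en/"


def to_rml(template, ontology_class, properties):
    """Render a simplified RML mapping (Turtle, prefixes omitted).

    properties maps an infobox property name to an ontology property name.
    """
    subject = "<%s%s>" % (BASE, template)
    # The subject map fixes the class of every generated resource.
    statements = [
        "  rr:subjectMap [ rr:class dbpedia:%s ; rr:termType rr:IRI ;\n"
        '    rr:constant "http://dbpedia.org/resource/Template:%s" ]'
        % (ontology_class, template)
    ]
    # One predicate-object map per mapped infobox property.
    for infobox_prop, onto_prop in properties.items():
        statements.append(
            "  rr:predicateObjectMap [ rr:predicate dbpedia:%s ;\n"
            '    rr:objectMap [ a rr:ObjectMap ; rml:reference "%s" ] ]'
            % (onto_prop, infobox_prop)
        )
    return subject + "\n" + " ;\n".join(statements) + " ."


print(to_rml("Infobox_person", "Person", {"birth_place": "birthPlace"}))
```
]]></preformat>
      <p>Because the result is plain RDF, it can be handed to the same validation tooling as any other RDF document.</p>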
      <p>A DBpedia mapping is defined for the Infobox Person<sup>5</sup>; its corresponding RML mapping, after translation, is<sup>6</sup>:</p>
      <preformat>&lt;http://mappings.dbpedia.org/server/mappings/en/Infobox_person&gt;
  rr:subjectMap [ rr:class dbpedia:Person ; rr:termType rr:IRI ;
    rr:constant "http://dbpedia.org/resource/Template:Infobox_person" ] ;
  rr:predicateObjectMap [ rr:predicate dbpedia:birthPlace ;
    rr:objectMap [ a rr:ObjectMap ; rml:reference "birth_place" ] ] .</preformat>
      <p><sup>3</sup> http://wiki.dbpedia.org/Ontology</p>
      <p><sup>4</sup> http://mappings.dbpedia.org</p>
      <p><sup>5</sup> http://mappings.dbpedia.org/index.php?title=Mapping_en:Infobox_person&amp;action=edit</p>
      <p><sup>6</sup> The example is adjusted for readability. The full RML translation can be found at http://mappings.dbpedia.org/server/mappings/en/pages/rdf/Mapping_en%3AInfobox_person</p>
    </sec>
    <sec id="sec-3">
      <title>Continuous and Automated Assessment of DBpedia Mappings</title>
      <p>
        Since RML mappings can be processed as RDF documents and are written from
the viewpoint of the generated triples, the same set of schema validation patterns
normally applied to the RDF dataset is also applicable to the mappings that
state how the dataset is generated. RDFUnit was extended to also support quality
assessment over RML mappings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, instead of validating each triple’s
predicate from the final RDF dataset against its subject and object, the predicate
is extracted from the Predicate Map, which defines what the triple’s predicate will
be in RML, and is validated against the Term Maps that define how the subject
and object will be generated. The expected value, as derived from the DBpedia
ontology, is compared to the specified one, as derived from the corresponding
mapping. To achieve this, the schemas and their namespaces are retrieved and
the test cases are generated as if the mappings were the actual dataset. For instance,
if an extracted predicate expects a Literal as object according to the DBpedia
ontology, but the mapping that defines how the object is generated specifies
that a resource should be generated instead, a violation is reported.
      </p>
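      <p>The comparison between the expected and the specified value can be sketched as follows. This is a simplified illustration and not RDFUnit's implementation: the dictionary <monospace>ONTOLOGY_RANGE_KIND</monospace>, the function <monospace>validate</monospace> and the flat representation of predicate-object maps are hypothetical stand-ins for the ontology and the parsed mapping.</p>
      <preformat><![CDATA[
```python
# Hypothetical sketch: the structures below stand in for the DBpedia
# ontology and a parsed RML mapping; this is not RDFUnit's implementation.

# Kind of object each predicate expects, as it would be derived from the
# ontology (illustrative entries only).
ONTOLOGY_RANGE_KIND = {
    "dbpedia:birthPlace": "IRI",      # object property: expects a resource
    "dbpedia:birthName": "Literal",   # datatype property: expects a literal
}


def validate(predicate_object_maps):
    """Return one message per predicate whose object term type, as specified
    in the mapping, disagrees with what the ontology expects."""
    violations = []
    for pom in predicate_object_maps:
        expected = ONTOLOGY_RANGE_KIND.get(pom["predicate"])
        specified = pom["term_type"]
        # Predicates unknown to the ontology are skipped in this sketch.
        if expected is not None and expected != specified:
            violations.append(
                "%s expects %s but the mapping generates %s"
                % (pom["predicate"], expected, specified)
            )
    return violations


mapping = [
    {"predicate": "dbpedia:birthPlace", "term_type": "IRI"},
    {"predicate": "dbpedia:birthName", "term_type": "IRI"},  # mismatch
]
print(validate(mapping))
```
]]></preformat>
      <p>The check inspects one entry per mapped property, regardless of how many triples that property would produce in the generated dataset.</p>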
      <p>To systematically validate DBpedia mappings and keep the reports up to date,
we created a script<sup>7</sup> that triggers the validation of all DBpedia mappings and is executed
every night. The script exports the DBpedia mapping violations as a JSON file
that, in turn, is visualized (cf. Figure 1) in a user-friendly interface
available at http://mappings.dbpedia.org/validation. The assessment
and report generation are automated, streamlined, and executed frequently. The
DBpedia community uses the list of violations as feedback to correct violating
mappings or enhance the DBpedia ontology and, thus, improve the dataset’s quality.
      </p>
      <p><sup>7</sup> https://github.com/AKSW/RDFUnit/blob/master/rdfunit-examples/src/main/java/org/aksw/rdfunit/examples/DBpediaMappingValidator.java</p>
    </sec>
    <sec id="sec-4">
      <title>DBpedia Mappings and Dataset Assessment</title>
      <p>We compared the assessment of the DBpedia 2014 release to the assessment of the DBpedia
mappings. The English and Dutch DBpedia mappings, as well as the DBpedia mappings of
all 27 supported languages, were validated. The results show that the quality
assessment time is significantly reduced when assessing the mappings instead of
the complete RDF dataset. It takes only 11 seconds to assess the English DBpedia
mappings, while assessing the whole DBpedia dataset takes 16 hours, because the
dataset assessment requires examining each triple separately to identify, for
instance, the 12M triples violating the range of foaf:primaryTopic, whereas the mapping assessment
requires only 1 triple to be examined. The evaluation of all mappings
for all 27 language editions resulted in a total of 1,316 domain-level violations.</p>
      <table-wrap>
        <table>
          <thead>
            <tr>
              <th rowspan="2">dataset</th>
              <th colspan="3">dataset assessment</th>
              <th colspan="3">mapping assessment</th>
            </tr>
            <tr>
              <th>#triples</th>
              <th>time</th>
              <th>#viol.</th>
              <th>#triples</th>
              <th>time</th>
              <th>#viol.</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>DBpEn</td>
              <td>62M</td>
              <td>16h</td>
              <td>3.2M</td>
              <td>115K</td>
              <td>11s</td>
              <td>160</td>
            </tr>
            <tr>
              <td>DBpNl</td>
              <td>21M</td>
              <td>1.5h</td>
              <td>815K</td>
              <td>53K</td>
              <td>6s</td>
              <td>124</td>
            </tr>
            <tr>
              <td>DBpAll</td>
              <td>–</td>
              <td>–</td>
              <td>–</td>
              <td>511K</td>
              <td>32s</td>
              <td>1,316</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
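      <p>The reported times can be turned into speed-up factors with a few lines of arithmetic. The snippet below uses only the English and Dutch timings given above; the names <monospace>timings</monospace> and <monospace>speedup</monospace> are introduced here for illustration.</p>
      <preformat><![CDATA[
```python
# Assessment times reported above, converted to seconds:
# (dataset assessment, mapping assessment).
HOUR = 3600

timings = {
    "DBpEn": (16 * HOUR, 11),
    "DBpNl": (int(1.5 * HOUR), 6),
}


def speedup(dataset_seconds, mapping_seconds):
    """How many times faster the mapping assessment is."""
    return dataset_seconds / mapping_seconds


for name, (dataset_s, mapping_s) in timings.items():
    print("%s: about %.0fx faster" % (name, speedup(dataset_s, mapping_s)))
```
]]></preformat>
      <p>For the reported numbers this gives a factor of roughly 5,200 for English and 900 for Dutch.</p>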
      <p>
        The latest DBpedia releases rely on results of this work. We are currently
incorporating the RML toolchain in the DBpedia extraction framework and plan
to integrate the mapping validation in the editing step and, thus, prevent the
creation of violating mappings. This will enable the complete assessment and
refinement workflow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to automatically improve the DBpedia dataset quality.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freudenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mannens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          , and R. Van de Walle.
          <article-title>Assessing and Refining Mappings to RDF to Improve Dataset Quality</article-title>
          .
          <source>In Proceedings of the 14th International Semantic Web Conference</source>
          , Oct.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, and R. Van de Walle.
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>In Workshop on Linked Data on the Web</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brümmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          .
          <article-title>NLP data cleansing based on Linguistic Ontology constraints</article-title>
          .
          <source>In Proc. of the Extended Semantic Web Conference</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kleef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia - a Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietrobon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Quality Assessment for Linked Data: A Survey</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>