DBpedia Mappings Quality Assessment*

      Anastasia Dimou¹, Dimitris Kontokostas², Markus Freudenberg²,
 Ruben Verborgh¹, Jens Lehmann², Erik Mannens¹, and Sebastian Hellmann²

             ¹ Ghent University – iMinds – Data Science Lab, Belgium
                            {firstname.lastname}@ugent.be
             ² Universität Leipzig, Institut für Informatik, AKSW, Germany
                       {lastname}@informatik.uni-leipzig.de


        Abstract. Schema violations in RDF data generated from
        (semi-)structured data often originate in the mappings, which specify
        how an RDF dataset is generated and are applied repeatedly. The DBpedia
        dataset, which derives from Wikipedia infoboxes, is no exception. To
        mitigate the violations, we proposed in previous work to validate the
        mappings that generate the data, instead of validating the generated
        data afterwards. In this work, we demonstrate how mapping validation
        is applied to DBpedia. DBpedia mappings are automatically translated to
        RML and validated by RDFUnit. The DBpedia mappings assessment can be
        executed frequently, because it requires significantly less time than
        validating the dataset. The validation results are made available via a
        user-friendly interface. The DBpedia community takes them into
        consideration to refine the DBpedia mappings or ontology and, thus,
        increase the dataset quality.

         Keywords: Linked Data Mapping, Data Quality, DBpedia, RML, RDFUnit


1      Introduction
Although more and more data is published as Linked Data, there are significant
variations in quality [5], commonly conceived as “fitness for use” for a certain
application or use case. When datasets stem originally from semi-structured
formats (e.g., CSV, XML), the schema is derived from the set of classes and
properties specified by the mappings, which are applied repeatedly.
Consequently, if those mappings contain inaccuracies, the same violations are
repeated over and over in the dataset. Incorporating quality assessment as part
of the mapping activity is therefore essential to prevent most recurring
schema-based violations. To this end, we have proposed a uniform approach for
assessing the quality of both the mappings and the dataset [1]. We implemented
our approach based on the RDFUnit validation framework [3] and the RML mapping
language [2]. Our solution incrementally assesses the quality of an RDF
dataset, covering both the mappings and the dataset itself. Since RML mappings
are expressed in RDF, the RDFUnit validation framework can apply its test cases
to RML mappings similarly to how it applies them to RDF datasets. Assessing an
RDF dataset requires a lot of time; thus, it cannot be executed frequently and,
when it is, the root of the violations is not easily identified. In contrast,
directly assessing the mappings that generate a dataset requires significantly
less time, and the root of each violation is identified.
    In this work, we demonstrate how we incorporated our solution in the DBpedia
validation workflow. DBpedia mappings are automatically translated to RML and
subsequently assessed using RDFUnit. In this demo, the validation results will
be shown via a user-friendly interface, and users can directly contribute to
improving the DBpedia mappings. Once they update a mapping or the DBpedia
ontology, users will be able to trigger a new validation round and immediately
see the updated validation results, without the violation they just corrected.

*
    This paper’s research activities were funded by Ghent University, iMinds, the Institute for
    the Promotion of Innovation by Science and Technology in Flanders, the Fund for Scientific
    Research-Flanders and grants from the EU’s 7th & H2020 Programmes for the projects ALIGNED
    (GA 644055), GeoKnow (GA 318159) and LIDER (GA 610782).


2    Expressing DBpedia Mappings with RML
DBpedia [4] provides a collaborative approach for mapping Wikipedia infoboxes
to the DBpedia ontology³. The mappings are maintained and edited through the
DBpedia mappings wiki⁴, using the same wiki markup syntax as Wikipedia to
define the mappings. However, the quality of wikitext-based mappings cannot
be assessed directly, and certainly not in the same way as the resulting dataset.
    RML covers mappings from sources in different (semi-)structured formats.
Furthermore, it can readily be extended to other structures and formalizations.
Taking advantage of this, we introduced wikitext serialisation as a new Reference
Formulation. A Reference Formulation indicates the grammar that should be used
to refer to data of a certain structure and format. 674 distinct mapping
documents for English, 463 for Dutch and a total of 4,468 for all languages
supported in the DBpedia mappings wiki are translated to RML and are available
at http://mappings.dbpedia.org/server/mappings/en/pages/rdf/.
    The DBpedia mapping for the Infobox Person⁵ follows:
    {{TemplateMapping
    | mapToClass = Person
    | mappings =
      {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
      {{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthDate }}
      {{PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}}}

    Its corresponding RML mapping, after translation, is⁶:
    <http://mappings.dbpedia.org/server/mappings/en/Infobox_person>
        rr:subjectMap [ rr:class dbpedia:Person ; rr:termType rr:IRI ;
            rr:constant "http://dbpedia.org/resource/Template:Infobox_person" ] ;
        rr:predicateObjectMap [ rr:predicate dbpedia:birthPlace ;
            rr:objectMap [ a rr:ObjectMap ; rml:reference "birth_place" ] ] .
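
The correspondence between the two listings can be illustrated with a minimal Python sketch. It is not the translator used by the DBpedia extraction framework: the regular expressions, the function names and the default `dbpedia:` prefix for unqualified terms are assumptions made for illustration, and only the simple `PropertyMapping` form shown above is handled.

```python
import re

# The wikitext mapping from the listing above.
WIKITEXT = """{{TemplateMapping
| mapToClass = Person
| mappings =
  {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
  {{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthDate }}
  {{PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}}}"""

# Hypothetical patterns; they cover only the simple form shown in the example.
MAP_TO_CLASS = re.compile(r"\|\s*mapToClass\s*=\s*(\S+)")
PROPERTY_MAPPING = re.compile(
    r"\{\{PropertyMapping\s*\|\s*templateProperty\s*=\s*(\S+)"
    r"\s*\|\s*ontologyProperty\s*=\s*([^\s}]+)"
)


def parse_template_mapping(wikitext):
    """Extract the mapped class and (templateProperty, ontologyProperty) pairs."""
    class_match = MAP_TO_CLASS.search(wikitext)
    return {
        "class": class_match.group(1) if class_match else None,
        "properties": PROPERTY_MAPPING.findall(wikitext),
    }


def qualify(term, default_prefix="dbpedia"):
    """Assumption: terms without a prefix belong to the DBpedia ontology."""
    return term if ":" in term else f"{default_prefix}:{term}"


def to_rml_sketch(mapping_iri, parsed):
    """Render a Turtle-like RML fragment (prefix declarations omitted)."""
    head = [
        f"<{mapping_iri}>",
        f"    rr:subjectMap [ rr:class {qualify(parsed['class'])} ] ;",
    ]
    maps = [
        f"    rr:predicateObjectMap [ rr:predicate {qualify(prop)} ;\n"
        f'        rr:objectMap [ rml:reference "{reference}" ] ]'
        for reference, prop in parsed["properties"]
    ]
    return "\n".join(head) + "\n" + " ;\n".join(maps) + " ."


parsed = parse_template_mapping(WIKITEXT)
rml = to_rml_sketch(
    "http://mappings.dbpedia.org/server/mappings/en/Infobox_person", parsed
)
print(rml)
```

Each `PropertyMapping` becomes one Predicate Object Map whose `rml:reference` is the infobox property name, which is the structure the validation in the next section operates on.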


³ http://wiki.dbpedia.org/Ontology
⁴ http://mappings.dbpedia.org
⁵ http://mappings.dbpedia.org/index.php?title=Mapping_en:Infobox_person&action=edit
⁶ The example is adjusted to improve readability. A full RML transformation can be found at
  http://mappings.dbpedia.org/server/mappings/en/pages/rdf/Mapping_en%3AInfobox_person


Fig. 1. Screenshot of a violations list presented to the DBpedia community. For every violating
mapping, the predicate with the existing RDF term, according to the corresponding DBpedia
mapping, and the expected value, according to the DBpedia ontology, are presented.


3     DBpedia Mappings Quality Assessment
Since RML mappings can be processed as RDF documents, and are written from
the viewpoint of the generated triples, the same set of schema validation patterns
normally applied to the RDF dataset is also applicable to the mappings that
state how the dataset is generated. RDFUnit was extended to also support quality
assessment over RML mappings [1]. Indicatively, instead of validating each triple’s
predicate from the final RDF dataset against its subject and object, the predicate
is extracted from the Predicate Map, which defines in RML what the triple’s
predicate will be, and is validated against the Term Maps that define how the
subject and object will be generated. The expected value, as derived from the
DBpedia ontology, is compared to the specified one, as derived from the
corresponding mapping. To achieve this, the schemas and their namespaces are
retrieved and the test cases are generated as if the mappings were the actual
dataset. For instance, an extracted predicate expects a Literal as object
according to the DBpedia ontology, but the mapping that defines how the object
is generated specifies that a resource should be generated instead; in this
case, a violation is reported.
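
The range check described above can be sketched in plain Python as follows. This is an illustration of the idea only, not RDFUnit's API or its test-case format: the class names, the `validate_mapping` function and the small ontology excerpt are hypothetical, chosen to mirror the Literal-versus-resource example.

```python
from dataclasses import dataclass

LITERAL = "rr:Literal"
IRI = "rr:IRI"

# Hypothetical ontology excerpt: the kind of object each predicate expects.
EXPECTED_OBJECT_KIND = {
    "dbpedia:birthDate": LITERAL,
    "dbpedia:birthPlace": IRI,
}


@dataclass(frozen=True)
class PredicateObjectMap:
    predicate: str         # taken from the Predicate Map
    object_term_type: str  # taken from the Term Map that generates the object


@dataclass(frozen=True)
class Violation:
    mapping: str
    predicate: str
    expected: str
    found: str


def validate_mapping(mapping_iri, predicate_object_maps,
                     expected=EXPECTED_OBJECT_KIND):
    """Compare each mapped predicate with what the ontology expects.

    Every Predicate Object Map is inspected once, no matter how many
    triples the mapping would later generate.
    """
    violations = []
    for pom in predicate_object_maps:
        expected_kind = expected.get(pom.predicate)
        if expected_kind is not None and expected_kind != pom.object_term_type:
            violations.append(
                Violation(mapping_iri, pom.predicate,
                          expected_kind, pom.object_term_type)
            )
    return violations


found = validate_mapping(
    "Mapping_en:Infobox_person",
    [
        PredicateObjectMap("dbpedia:birthPlace", IRI),  # consistent
        PredicateObjectMap("dbpedia:birthDate", IRI),   # Literal expected
    ],
)
for violation in found:
    print(violation)
```

Here the second Predicate Object Map is reported because a Literal is expected but a resource would be generated, while the first one passes.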
    To systematically validate DBpedia mappings and have up-to-date reports,
we created a script⁷ that triggers the validation of all DBpedia mappings and is
executed every night. The script exports the DBpedia mapping violations as a
JSON file that, in turn, is visualized (cf. Figure 1) using a user-friendly
interface, which is available at http://mappings.dbpedia.org/validation. The
assessment and report generation are automated, streamlined, and frequently
executed. The DBpedia community uses the violations list as feedback to correct
violating mappings or enhance the DBpedia ontology and, thus, improve the
dataset’s quality.
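
As a rough illustration of the export step, the sketch below groups violations per mapping and serializes them to JSON. The field names, the mapping identifiers and the `export_violations` function are assumptions; the actual layout of the JSON file produced by the script is not described here.

```python
import json
from collections import defaultdict


def export_violations(violations):
    """Group violations by mapping and serialize them as a JSON document."""
    grouped = defaultdict(list)
    for violation in violations:
        grouped[violation["mapping"]].append(
            {
                "predicate": violation["predicate"],
                "expected": violation["expected"],
                "found": violation["found"],
            }
        )
    return json.dumps(grouped, indent=2, sort_keys=True)


# Hypothetical violations, shaped like the entries listed in Figure 1.
violations = [
    {"mapping": "Mapping_en:Infobox_person", "predicate": "dbpedia:birthDate",
     "expected": "rr:Literal", "found": "rr:IRI"},
    {"mapping": "Mapping_nl:Infobox_persoon", "predicate": "dbpedia:birthPlace",
     "expected": "rr:IRI", "found": "rr:Literal"},
]

report = export_violations(violations)
print(report)
```

Grouping by mapping keeps all violations of one wiki page together, so an editor can correct them in a single edit.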

DBpedia Mappings and Dataset Assessment
We compared the DBpedia 2014 release assessment to the DBpedia mappings
assessment. English and Dutch DBpedia mappings, as well as the DBpedia mappings
of all 27 supported languages, were validated. The results show that the quality
assessment time is significantly reduced when assessing the mappings compared to
the complete RDF dataset. It takes only 11 seconds to assess the English DBpedia
mappings, while assessing the whole DBpedia dataset takes 16 hours, because the
dataset assessment requires examining each triple separately to identify, for
instance, the 12M triples violating the range of foaf:primaryTopic, whereas the
mapping assessment requires only 1 triple to be examined. Indicatively, the
evaluation of all mappings for all 27 language editions resulted in a total of
1,316 domain-level violations.

⁷ https://github.com/AKSW/RDFUnit/blob/master/rdfunit-examples/src/main/java/org/aksw/rdfunit/examples/DBpediaMappingValidator.java


                        dataset assessment            mapping assessment
    dataset          #triples    time   #viol.     #triples    time   #viol.
    DBpEn               62M       16h    3.2M        115K       11s     160
    DBpNl               21M      1.5h    815K         53K        6s     124
    DBpAll                –         –       –        511K       32s   1,316

Table 1. Number of triples, evaluation time and total number of individual violations
for the DBpedia dataset assessment and the DBpedia mapping assessment, respectively.
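
To make the time difference in Table 1 concrete, the following sketch computes the ratio between dataset and mapping assessment time. The variable names are illustrative, and because the table reports rounded durations, the resulting factors are approximate.

```python
# Assessment times from Table 1, converted to seconds.
HOUR = 3600
ASSESSMENT_TIME = {
    # dataset: (dataset assessment, mapping assessment)
    "DBpEn": (16 * HOUR, 11),
    "DBpNl": (1.5 * HOUR, 6),
}

speedups = {
    name: dataset_seconds / mapping_seconds
    for name, (dataset_seconds, mapping_seconds) in ASSESSMENT_TIME.items()
}

for name, factor in speedups.items():
    print(f"{name}: mapping assessment is about {factor:,.0f} times faster")
```

With these values, assessing the mappings is roughly 5,200 times faster than assessing the dataset for English and 900 times faster for Dutch.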

    The latest DBpedia releases rely on the results of this work⁸. We are currently
incorporating the RML toolchain in the DBpedia extraction framework⁹ and plan
to integrate the mapping validation in the editing step and, thus, prevent the
creation of violating mappings. This will enable the use of the complete
assessment and refinement workflow [1] to automatically improve the DBpedia
dataset quality.


References
1. A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens,
   S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to
   Improve Dataset Quality. In Proceedings of the 14th International Semantic Web
   Conference, Oct. 2015.
2. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de
   Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous
   Data. In Workshop on Linked Data on the Web, 2014.
3. D. Kontokostas, M. Brümmer, S. Hellmann, J. Lehmann, and L. Ioannidis. NLP
   data cleansing based on Linguistic Ontology constraints. In Proc. of the Extended
   Semantic Web Conference 2014, 2014.
4. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes,
   S. Hellmann, M. Morsey, P. Kleef, S. Auer, and C. Bizer. DBpedia - a Large-scale,
   Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, 2014.
5. A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality
   Assessment for Linked Data: A Survey. Semantic Web Journal, 2015.


⁸ https://github.com/dbpedia/mappings-tracker/issues/57
⁹ https://github.com/dbpedia/dbpedia-gsoc/wiki/2016-Integrating-RML-in-the-DBpedia-extraction-framework