DBpedia Mappings Quality Assessment*

Anastasia Dimou¹, Dimitris Kontokostas², Markus Freudenberg², Ruben Verborgh¹, Jens Lehmann², Erik Mannens¹, and Sebastian Hellmann²

¹ Ghent University – iMinds – Data Science Lab, Belgium
{firstname.lastname}@ugent.be
² Universität Leipzig, Institut für Informatik, AKSW, Germany
{lastname}@informatik.uni-leipzig.de

Abstract. Schema violations in RDF data generated from (semi-)structured data often originate in the mappings, which specify how an RDF dataset is generated and are applied repeatedly. The DBpedia dataset, which derives from Wikipedia infoboxes, is no exception. To mitigate the violations, we proposed in previous work to validate the mappings which generate the data, instead of validating the generated data afterwards. In this work, we demonstrate how mappings validation is applied to DBpedia. DBpedia mappings are automatically translated to RML and validated by RDFUnit. The DBpedia mappings assessment can be executed frequently, because it requires significantly less time than validating the dataset. The validation results are made available via a user-friendly interface. The DBpedia community takes them into consideration to refine the DBpedia mappings or ontology and, thus, increase the dataset quality.

Keywords: Linked Data Mapping, Data Quality, DBpedia, RML, RDFUnit

1 Introduction

Although more and more data is published as Linked Data, there are significant variations in quality [5], commonly conceived as “fitness for use” for a certain application or use case. When datasets stem originally from semi-structured formats (e.g., CSV, XML), the schema is derived from the set of classes and properties specified by the mappings, which are applied repeatedly. Consequently, if those mappings contain inaccuracies, the same violations are repeated over and over in the dataset.
Incorporating quality assessment as part of the mapping activity is therefore essential to prevent most recurring schema-based violations. To this end, we have proposed a uniform approach for assessing the quality of both the mappings and the dataset [1]. We implemented our approach based on the RDFUnit validation framework [3] and the RML mapping language [2]. Our solution incrementally assesses the quality of an RDF dataset, covering both the mappings and the dataset itself. Since RML mappings are expressed in RDF, the RDFUnit validation framework can apply its test cases to RML mappings similarly to how it applies them to RDF datasets.

Assessing an RDF dataset takes a lot of time, so it cannot be executed frequently, and when it is executed, the root of a violation is not easily identified. In contrast, directly assessing the mappings that generate a dataset requires significantly less time and pinpoints the root of each violation.

In this work, we demonstrate how we incorporated our solution in the DBpedia validation workflow. DBpedia mappings are automatically translated to RML and subsequently assessed using RDFUnit. In this demo, the validation results will be shown via a user-friendly interface, and users can directly contribute to improving the DBpedia mappings. Once they update a mapping or the DBpedia ontology, users will be able to trigger a new validation round and immediately see the updated validation results, without the violation they just corrected.

* This paper’s research activities were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders, the Fund for Scientific Research-Flanders and grants from the EU’s 7th & H2020 Programmes for the projects ALIGNED (GA 644055), GeoKnow (GA 318159) and LIDER (GA 610782).

2 Expressing DBpedia Mappings with RML

DBpedia [4] provides a collaborative mapping approach of Wikipedia infoboxes to the DBpedia ontology³.
The mappings are maintained and edited through the DBpedia mappings wiki⁴, using the same wiki markup syntax as Wikipedia to define the mappings. However, the quality of wikitext-based mappings cannot be assessed directly, and certainly not in the same way as the resulting dataset.

RML covers mappings from sources in different (semi-)structured formats. Furthermore, it can readily be extended to other structures and formalizations. Taking advantage of this, we introduced the wikitext serialisation as a new Reference Formulation. A Reference Formulation indicates the grammar that should be used to refer to data of a certain structure and format. In total, 4,468 mapping documents for all languages supported in the DBpedia mappings wiki, among them 674 for English and 463 for Dutch, are translated to RML and are available at http://mappings.dbpedia.org/server/mappings/en/pages/rdf/.

The DBpedia mapping for the Infobox Person⁵ follows:

    {{TemplateMapping
    | mapToClass = Person
    | mappings =
      {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name }}
      {{PropertyMapping | templateProperty = birth_date | ontologyProperty = birthDate }}
      {{PropertyMapping | templateProperty = birth_place | ontologyProperty = birthPlace }}
    }}

Its corresponding RML mapping, after translation, is⁶:

    rr:subjectMap [ rr:class dbpedia:Person ; rr:termType rr:IRI ;
        rr:constant "http://dbpedia.org/resource/Template:Infobox_person" ] ;
    rr:predicateObjectMap [ rr:predicate dbpedia:birthPlace ;
        rr:objectMap [ a rr:ObjectMap ; rml:reference "birth_place" ] ] .

³ http://wiki.dbpedia.org/Ontology
⁴ http://mappings.dbpedia.org
⁵ http://mappings.dbpedia.org/index.php?title=Mapping_en:Infobox_person&action=edit
⁶ The example is adjusted to improve reading. A full RML transformation can be found at http://mappings.dbpedia.org/server/mappings/en/pages/rdf/Mapping_en%3AInfobox_person

Fig. 1.
Screenshot of a violations list presented to the DBpedia community. For every violating mapping, the list shows the predicate together with the existing RDF term, according to the corresponding DBpedia mapping, and the expected value, according to the DBpedia ontology.

3 DBpedia Mappings Quality Assessment

Since RML mappings can be processed as RDF documents, and are written from the viewpoint of the generated triples, the same set of schema validation patterns normally applied to the RDF dataset is also applicable to the mappings that state how the dataset is generated. RDFUnit was extended to also support quality assessment over RML mappings [1]. For example, instead of validating each triple’s predicate from the final RDF dataset against its subject and object, the predicate is extracted from the Predicate Map, which defines in RML what the triple’s predicate will be, and is validated against the Term Maps that define how the subject and object will be generated. The expected value, as derived from the DBpedia ontology, is compared to the specified one, as derived from the corresponding mapping. To achieve this, the schemas and their namespaces are retrieved, and the test cases are generated as if the mappings were the actual dataset. For instance, if an extracted predicate expects a Literal as object according to the DBpedia ontology, but the mapping that defines how the object is generated specifies that a resource should be generated instead, a violation is reported.

To systematically validate DBpedia mappings and have up-to-date reports, we created a script⁷ that triggers the validation of all DBpedia mappings and is executed every night. The script exports the DBpedia mapping violations as a JSON file that, in turn, is visualized (cf. Figure 1) using a user-friendly interface, which is available at http://mappings.dbpedia.org/validation. The assessment and report generation are automated, streamlined, and frequently executed.
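
The range check described in this section can be illustrated with a minimal sketch. It assumes a heavily simplified, in-memory stand-in for a mapping and for the ontology: the class names, the two term types, the ontology excerpt and the JSON field names are illustrative assumptions, not the data model or report format of RDFUnit or RML.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical, simplified model: neither RDFUnit nor RML exposes these classes.
LITERAL = "Literal"
IRI = "IRI"


@dataclass(frozen=True)
class PredicateObjectMap:
    predicate: str         # constant taken from the Predicate Map
    object_term_type: str  # kind of term the Object Map will generate


@dataclass(frozen=True)
class Violation:
    predicate: str
    expected: str  # according to the ontology
    existing: str  # according to the mapping


def validate_ranges(po_maps, ontology_ranges):
    """Compare each mapped predicate's object term type with the ontology."""
    found = []
    for po_map in po_maps:
        expected = ontology_ranges.get(po_map.predicate)
        if expected is None:
            continue  # the ontology excerpt says nothing about this predicate
        if expected != po_map.object_term_type:
            found.append(
                Violation(po_map.predicate, expected, po_map.object_term_type)
            )
    return found


# Assumed ontology excerpt: the kind of object each property expects.
ONTOLOGY_RANGES = {"dbpedia:birthDate": LITERAL, "dbpedia:birthPlace": IRI}

# Assumed mapping excerpt with one deliberate mistake.
MAPPING = [
    PredicateObjectMap("dbpedia:birthDate", IRI),   # resource where a literal is expected
    PredicateObjectMap("dbpedia:birthPlace", IRI),  # matches the ontology
]

violations = validate_ranges(MAPPING, ONTOLOGY_RANGES)
# Serialize the violations; the field names are assumptions for illustration.
report = json.dumps([asdict(v) for v in violations], indent=2)
print(report)
```

In this sketch, only the entry for dbpedia:birthDate is reported. The check inspects one entry per predicate-object map rather than one per generated triple, so its cost grows with the size of the mapping and not with the size of the dataset.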
The DBpedia community uses the violations list as feedback to correct violating mappings or enhance the DBpedia ontology and, thus, improve the dataset’s quality.

⁷ https://github.com/AKSW/RDFUnit/blob/master/rdfunit-examples/src/main/java/org/aksw/rdfunit/examples/DBpediaMappingValidator.java

DBpedia Mappings and Dataset Assessment

We compared the assessment of the DBpedia 2014 release with the assessment of the DBpedia mappings. We validated the English and Dutch DBpedia mappings, as well as the DBpedia mappings of all 27 supported languages. The results show that the quality assessment time is significantly reduced when assessing the mappings compared to the complete RDF dataset. It takes only 11 seconds to assess the English DBpedia mappings, while assessing the corresponding DBpedia dataset takes 16 hours (Table 1). The dataset assessment has to examine each triple separately to identify, for instance, the 12M triples violating the range of foaf:primaryTopic, whereas the mapping assessment needs to examine only one triple to detect the same violation. The evaluation of all mappings for all 27 language editions resulted in a total of 1,316 domain-level violations.

              dataset assessment          mapping assessment
    dataset   #triples  time   #viol.     #triples  time   #viol.
    DBpEn     62M       16h    3.2M       115K      11s    160
    DBpNl     21M       1.5h   815K       53K       6s     124
    DBpAll    –         –      –          511K      32s    1,316

Table 1. Number of triples, evaluation time, and total number of individual violations for the DBpedia dataset assessment and the mapping assessment.

The latest DBpedia releases rely on the results of this work⁸. We are currently incorporating the RML toolchain in the DBpedia extraction framework⁹ and plan to integrate the mapping validation in the editing step and, thus, prevent the creation of violating mappings. This will enable the complete assessment and refinement workflow of [1] to automatically improve the DBpedia dataset quality.

References

1. A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S.
Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of the 14th International Semantic Web Conference, Oct. 2015.
2. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Workshop on Linked Data on the Web, 2014.
3. D. Kontokostas, M. Brümmer, S. Hellmann, J. Lehmann, and L. Ioannidis. NLP data cleansing based on Linguistic Ontology constraints. In Proceedings of the Extended Semantic Web Conference, 2014.
4. J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. Kleef, S. Auer, and C. Bizer. DBpedia - a Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal, 2014.
5. A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web Journal, 2015.

⁸ https://github.com/dbpedia/mappings-tracker/issues/57
⁹ https://github.com/dbpedia/dbpedia-gsoc/wiki/2016-Integrating-RML-in-the-DBpedia-extraction-framework