=Paper=
{{Paper
|id=Vol-1486/paper_108
|storemode=property
|title=Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality
|pdfUrl=https://ceur-ws.org/Vol-1486/paper_108.pdf
|volume=Vol-1486
|dblpUrl=https://dblp.org/rec/conf/semweb/DimouKFVLMHW15a
}}
==Test-driven Assessment of [R2]RML Mappings to Improve Dataset Quality==
Anastasia Dimou (1), Dimitris Kontokostas (2), Markus Freudenberg (2), Ruben Verborgh (1), Jens Lehmann (2), Erik Mannens (1), Sebastian Hellmann (2), and Rik Van de Walle (1)

(1) Ghent University - iMinds - Multimedia Lab, Belgium, {firstname.lastname}@ugent.be

(2) Universität Leipzig, Institut für Informatik, AKSW, Germany, {lastname}@informatik.uni-leipzig.de

Abstract. RDF dataset quality assessment is currently performed primarily after data is published. Incorporating its results, by applying corresponding adjustments to the dataset, happens manually and occurs rarely. In the case of (semi-)structured data (e.g., CSV, XML), the root of the violations often derives from the mappings that specify how the RDF dataset will be generated. Thus, we suggest shifting the quality assessment from the RDF dataset to the mapping definitions that generate it. The proposed test-driven approach for assessing mappings relies on RDFUnit test cases applied over mappings specified with RML. Our evaluation is applied to different cases, e.g., DBpedia, and indicates that the overall quality of an RDF dataset is quickly and significantly improved.

Keywords: Linked Data Mapping, Data Quality, RML, R2RML, RDFUnit

===1 Introduction===
Although more and more data is published as Linked Data, there are significant variations in quality [6], commonly conceived as “fitness for use” for a certain application or use case. Similar violation patterns reoccur frequently, and most encountered violations are related to the dataset’s schema, namely the vocabularies or ontologies used to annotate the data [4]. When datasets stem originally from semi-structured formats (e.g., CSV, XML), the schema is derived from the set of classes and properties specified by the mappings, which are applied repeatedly. Consequently, the same violations are repeated in the dataset as well.
Lately, combinations of different ontologies and vocabularies are used to annotate data [5]. This increases the likelihood of such violations, as they often derive from incorrect usage or incorrect combinations of schemas in the mappings.

Taking mappings of data to RDF as a software engineering task, a set of unit test cases can be assigned to the mappings to ensure the correct generation of RDF datasets from input data. Incorporating quality assessment as part of the mapping is essential to prevent the same violations from appearing repeatedly within the dataset and over distinct entities. After all, structural adjustments can still be applied in this phase, as violations are identified at their root. Moreover, if mappings are assessed, every other new data source mapped using them directly benefits from the improvements. Therefore, we proposed a uniform solution [1] that assesses the quality of an RDF dataset, covering both the mappings and the dataset. In this work, we elaborate on how RDFUnit patterns [4] for dataset test cases were extended to cover RML mappings [2], too.

===2 [R2]RML Mappings Quality Assessment with RDFUnit===
Our solution relies on the RDF Mapping Language (RML) [2], which allows specifying mapping definitions expressed in RDF, and on the RDFUnit validation framework, due to its associated test-case-based architecture [3]. For our proof-of-concept implementation (footnote 3), RDFUnit test cases are applied to mappings defined with RML.

RML extends R2RML (footnote 4), the W3C-recommended language for defining mappings of data in relational databases to RDF, and also covers mappings from sources in different semi-structured formats, such as CSV and JSON [2]. RML documents [2] specify how the input data can be represented in RDF. The main building blocks of RML documents are Triples Maps, which define how triples are generated and consist of three main parts: the Logical Source, the Subject Map and zero or more Predicate-Object Maps.
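The three parts of a Triples Map can be pictured as a small data structure. The following Python sketch is only an illustration under simplifying assumptions: the class names `TriplesMap` and `PredicateObjectMap`, the list of dictionaries standing in for the Logical Source, and the use of `str.format` to expand the subject template are all hypothetical and are not part of RML or of any RML processor.

```python
from dataclasses import dataclass, field


@dataclass
class PredicateObjectMap:
    predicate: str   # e.g. "foaf:age"
    reference: str   # name of the input field that supplies the object value


@dataclass
class TriplesMap:
    logical_source: list      # input records, e.g. rows parsed from a CSV file
    subject_template: str     # e.g. "http://example.com/{id}"
    predicate_object_maps: list = field(default_factory=list)

    def generate(self):
        """Yield one (subject, predicate, object) tuple per record and map."""
        for record in self.logical_source:
            subject = self.subject_template.format(**record)
            for po_map in self.predicate_object_maps:
                yield (subject, po_map.predicate, record[po_map.reference])


# Hypothetical input data with the fields used by the example mapping.
rows = [{"id": "1", "age": "42"}, {"id": "2", "age": "37"}]
triples_map = TriplesMap(
    logical_source=rows,
    subject_template="http://example.com/{id}",
    predicate_object_maps=[PredicateObjectMap("foaf:age", "age")],
)
triples = list(triples_map.generate())
```

Because the same Triples Map is applied to every record, a single mistake in it is repeated in every generated triple, which is the motivation for assessing the mapping itself.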
Term Maps define how RDF terms (IRI, blank node or literal) are generated. A Term Map can be constant-valued, always generating the same RDF term; reference-valued, generating the data value of a referenced data fragment in a given Logical Source; or template-valued, generating a term from a valid string template that can contain referenced data fragments of a given Logical Source.

RDFUnit [4] is an RDF validation framework inspired by test-driven software development. In RDFUnit, every vocabulary, ontology, dataset or application can be associated with a set of data quality test cases. The test case definition language of RDFUnit is SPARQL, which is convenient for directly querying for violations. For rapid test case instantiation, a pattern-based SPARQL-template engine, running over a library of common patterns (footnote 5), is supported, where variables can be easily bound into patterns. RDFUnit has a Test Auto Generator (TAG) component. TAG searches for schema information and automatically instantiates new test cases.

As [R2]RML mappings can be processed as RDF documents, because of their native RDF representation and viewpoint (written as the generated triples), the same set of schema validation patterns normally applied to the RDF dataset is also applicable to the mappings that state how it is generated. Nevertheless, instead of validating the triple’s predicate against its subject and object, the predicate is extracted from the Predicate Map and is validated against the Term Maps that generate the subject and object. To achieve this, the properties and classes are identified and their namespaces are used to retrieve the schemas and generate the test cases as if they were the actual dataset. The expected value, as derived from the Predicate Map, is compared to the defined one, as derived from the corresponding Subject Map and Object Map.

Fig. 1. (i) RML mapping (left) and (ii) corresponding example of a generated triple (right). The range (xsd:integer) of the specified predicate (foaf:age) differs from the one in the Object Map (xsd:float), causing a violation. The same holds for the Subject Map.

For example, the extracted predicate foaf:age normally expects an instance of type foaf:Agent for its domain and an integer datatype for its range, but the Term Map that generates the subject is defined to be of type foaf:Project and the object is defined to have a float value. Its mapping document follows:

 <#Mapping>
   rr:subjectMap [ rr:template "http://example.com/{id}"; rr:class foaf:Project ];
   rr:predicateObjectMap [ rr:predicate foaf:age; rr:objectMap [ rml:reference "age" ] ].

Corresponding RDFUnit test cases and patterns were defined to apply to the mappings, adjusting the assessment queries (footnote 6). The defined test cases cover all possible alternative ways of defining equivalent mappings that generate the same triples. RDFUnit can annotate test cases by requesting additional variables and binding them to specific result properties. The test case pattern applied to the aforementioned example and its instantiation are indicatively presented. First, the WHERE clause of the SPARQL template query that assesses the datatype on the dataset is shown, followed by its instantiation.

Template:

 ?resource %%P1%% ?c.
 FILTER (DATATYPE(?c) != %%D1%%)

Instantiation:

 ?resource foaf:age ?c.
 FILTER (DATATYPE(?c) != xsd:int)

The following is the WHERE clause of the same test case applied to the mapping.

Template:

 ?resource rr:predicateObjectMap ?poMap.
 ?poMap rr:predicate %%P1%%;
        rr:objectMap ?objM.
 ?objM rr:datatype ?c.
 FILTER (?c != %%D1%%)

Instantiation:

 ?resource rr:predicateObjectMap ?poMap.
 ?poMap rr:predicate foaf:age;
        rr:objectMap ?objM.
 ?objM rr:datatype ?c.
 FILTER (?c != xsd:int)

===3 Evaluation and Discussion===
The assessed datasets and corresponding mappings, as well as the assessment results, are summarized in Table 1. The DBpedia mappings (footnote 7), after being converted from wikitext markup to RML (footnote 8), and the DBpedia dataset were assessed. The DBLP mappings, after being converted to RML (footnote 9), and the corresponding dataset were assessed, too.

{| class="wikitable"
|+ Table 1. The number of triples (size), number of test cases, evaluation time, failed test cases and total individual violations appear for both dataset and mapping assessment.
! rowspan="2" | Dataset
! colspan="4" | Dataset assessment
! colspan="4" | Mapping assessment
! rowspan="2" | Triples
|-
! Size !! Time !! #Fail. !! #Viol. !! Size !! Time !! #Fail. !! #Viol.
|-
| DBpEn || 62M || 16h || 1,128 || 3.2M || 115K || 11s || 1 || 160 || 255K
|-
| DBpNL || 21M || 1.5h || 683 || 815K || 53K || 6s || 1 || 124 || 106K
|-
| DBLP || 12M || 12h || 7 || 8.1M || 368 || 12s || 2 || 8 || 8M
|-
| iLastic (footnote 10) || 150K || 12s || 23 || 37K || 825 || 15s || 3 || 26 || 37K
|-
| CDFLG (footnote 11) || 0.6K || 7s || 15 || 678 || 558 || 13s || 4 || 16 || 631
|-
| CEUR-WS (footnote 12) || 2.4K || 6s || 7 || 783 || 702 || 5s || 3 || 12 || 783
|}

The results show that the required quality assessment time is significantly reduced if the mappings are assessed instead of the RDF dataset, especially in the case of medium/large datasets. That happens because the dataset assessment requires examining each triple separately to identify, for instance, that 12M triples violated the predicate’s range, whereas mapping assessment requires only 1 triple to be examined. The effectiveness of mapping assessments is also high: the identified violations can be accurately indicated.

Acknowledgements. This paper’s research activities were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders, the Fund for Scientific Research-Flanders and by grants from the EU’s 7th & H2020 Programmes for projects ALIGNED (GA 644055), GeoKnow (GA 318159) and LIDER (GA 610782).

Footnotes:

3 https://github.com/mmlab/RMLValidator

4 http://www.w3c.org/TR/R2RML

5 https://github.com/AKSW/RDFUnit/blob/master/configuration/patterns.ttl

6 https://github.com/AKSW/RDFUnit/blob/master/data/tests/Manual/www.w3.org/ns/r2rml/rr.tests.Manual.ttl

7 http://mappings.dbpedia.org/

8 https://goo.gl/GPB1Ar
9 https://github.com/RMLio/D2RQ_to_R2RML.git

10 http://explore.ilastic.be/

11 http://ewi.mmlab.be/cd/all

12 http://rml.io/rml/data/SPC2015/

===References===
1. A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle. Assessing and Refining Mappings to RDF to Improve Dataset Quality. In Proceedings of the 14th International Semantic Web Conference, Oct. 2015.

2. A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, and R. Van de Walle. RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In Workshop on Linked Data on the Web, 2014.

3. D. Kontokostas, M. Brümmer, S. Hellmann, J. Lehmann, and L. Ioannidis. NLP data cleansing based on Linguistic Ontology constraints. In Proc. of the Extended Semantic Web Conference 2014, 2014.

4. D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven Evaluation of Linked Data Quality. In Proceedings of the 23rd International Conference on World Wide Web, pages 747–758, 2014.

5. M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the Linked Data Best Practices in Different Topical Domains. Volume 8796 of LNCS. Springer, 2014.

6. A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web Journal, 2015.
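As a supplementary illustration of the mapping-level datatype test case from Section 2, the same check can be sketched in plain Python over a list of triples. This is not RDFUnit's implementation, which evaluates SPARQL queries; the names `MAPPING`, `objects` and `datatype_violations` are hypothetical, and the `rr:datatype xsd:float` triple is an assumption taken from the statement that the object is defined to have a float value, since the mapping snippet in the text does not show it.

```python
# The example mapping as (subject, predicate, object) tuples.
MAPPING = [
    ("#Mapping", "rr:predicateObjectMap", "_:po1"),
    ("_:po1", "rr:predicate", "foaf:age"),
    ("_:po1", "rr:objectMap", "_:om1"),
    ("_:om1", "rml:reference", "age"),
    ("_:om1", "rr:datatype", "xsd:float"),  # assumed, see lead-in
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def datatype_violations(triples, predicate, expected_datatype):
    """Mimic the mapping-level WHERE clause with %%P1%% = predicate and
    %%D1%% = expected_datatype; return (resource, object map, datatype)
    for every Object Map whose declared datatype differs."""
    violations = []
    for (resource, p, po_map) in triples:
        if p != "rr:predicateObjectMap":
            continue
        if predicate not in objects(triples, po_map, "rr:predicate"):
            continue
        for obj_map in objects(triples, po_map, "rr:objectMap"):
            for datatype in objects(triples, obj_map, "rr:datatype"):
                if datatype != expected_datatype:
                    violations.append((resource, obj_map, datatype))
    return violations


violations = datatype_violations(MAPPING, "foaf:age", "xsd:int")
print(violations)
```

The check inspects only the handful of triples that make up the mapping, regardless of how many triples the mapping would generate, which is consistent with the short mapping assessment times reported in Table 1.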