Linked Data Quality: Identifying and Tackling the Key Challenges

Magnus Knuth, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany, magnus.knuth@hpi.de
Dimitris Kontokostas, AKSW, University of Leipzig, Leipzig, Germany, kontokostas@informatik.uni-leipzig.de
Harald Sack, Hasso Plattner Institute, University of Potsdam, Potsdam, Germany, harald.sack@hpi.de

ABSTRACT
The awareness of quality issues in Linked Data is constantly rising as new datasets and applications that consume Linked Data are emerging. In this paper we summarize key problems of Linked Data quality that data consumers are facing and propose approaches to tackle these problems. The majority of challenges presented here have been collected in a Lightning Talk Session at the First Workshop on Linked Data Quality (LDQ2014).

Keywords
linked data, data quality, RDF

1. INTRODUCTION
Since the start of the Linking Open Data initiative, we have seen an unprecedented volume of structured data published on the web, in most cases as RDF and Linked (Open) Data. The quality of these datasets varies a lot and can hardly be better than that of the original data source from which the data has been created. Datasets may originate from crowdsourcing projects like Wikipedia and OpenStreetMap as well as from highly curated sources, e. g. in the library domain.

A consumer's perception of data quality is highly individual and strongly depends on the field of application. Therefore, data quality is often regarded as fitness for use, e. g. the DBpedia dataset might be appropriate for a simple end-user application but should not be used in critical applications such as treatment decisions in the medical domain. Nevertheless, quality is key to the success of the data web, and a lack of it is a major barrier to further industry adoption.

In this paper we want to highlight contemporary quality problems that occur in Linked Data and that are already being addressed or need to be addressed in the future. Likewise, we want to suggest solutions that have been developed in order to tackle these difficulties.

2. LINKED DATA QUALITY
Although quality is a commonly used term in Linked Data, its definition is far from straightforward. The reason is that Linked Data quality can have different meanings for different people and in different contexts. During the First Workshop on Linked Data Quality (LDQ2014)¹, a discussion session was held where people from different backgrounds raised their personal thoughts on Linked Data Quality². It was surprising to notice the variety of definitions and concerns, which included, among others, stale data, version management, changeset updates, RDF typo identification, and proper ontology modeling.

RDF validation is a core part of Linked Data quality, but validation alone cannot solve the quality problem. Quality is fitness for use, thus a general methodology [11] is required to assess the results of a validation. Only the validation results, taken together with other factors and interpreted in the application context, can provide a meaningful quality overview. Such a quality assessment methodology should be an integral part of the Linked Data life-cycle.

RDF version management is an additional quality issue. It is not natively covered in the Semantic Web technology stack, although it would facilitate error provenance and tracking. The non-deterministic statement order and blank nodes make graph comparison equivalent to the graph isomorphism problem, for which no polynomial-time algorithm is known³.

Reusing popular vocabularies or manually creating a correct ontology model can also be seen as a general data quality issue. General purpose vocabularies such as FOAF⁴, SKOS⁵, schema.org⁶ or the DBpedia ontology⁷ usually reflect a shallow depiction of the real world. For many people, for example, the dbo:Actor class is not correct, since a profession is a role in a person's life and a person can have many different roles at different stages of life, e. g. student or spouse. In the end it depends on the level of granularity one wants to reflect in the data, but granularity usually comes at the cost of data integration.

¹ http://ldq.semanticmultimedia.org/ co-located with 10th SEMANTiCS conference on September 2nd, 2014 in Leipzig, Germany
² http://tinyurl.com/LDQ14LightningTalks
³ http://mathworld.wolfram.com/GraphIsomorphism.html
⁴ http://xmlns.com/foaf/0.1/
⁵ http://www.w3.org/2004/02/skos/core#
⁶ http://schema.org/
⁷ http://dbpedia.org/ontology/, prefixed dbo:

Copyright is held by the author/owner(s).
LDQ 2014, 1st Workshop on Linked Data Quality, Sept. 2, 2014, Leipzig, Germany.

3. TACKLING THE PROBLEM
Specific solutions for tackling the problem of Linked Data Quality as a whole are currently far from reality. Nevertheless, in the following subsections we provide an overview of existing work and possible future directions to cope with Linked Data Quality.

3.1 Linked Data validation and Quality assessment
Validation is a core part of the quality assessment of Linked Data. Although RDF has existed for many years, there is no official standard for Linked Data validation at the time of writing; a W3C working group has just been formed to define one⁸. Existing Linked Data applications can either rely on ad-hoc options or use independently defined solutions such as RDF Data Shapes [2], SPIN [6], SWRL [12], Dublin Core Profiles [10], RDFUnit [8], or OWL under the closed-world assumption (CWA) and a weaker form of the unique name assumption (UNA).

However, validation alone is not adequate. A general assessment methodology that can interpret the validation results and assess the quality of the data has to be built around validation. Rula and Zaveri [11] propose a general three-phase, six-step methodology for assessing the quality of Linked Data that involves manual, semi-automatic, and automated steps. On top of an assessment methodology, different applications can be built that automatically evaluate the quality of a dataset and provide automatic quality overviews or quality certifications [9].

3.2 Linked Data Cleansing
Data curation in general is a costly process for the publisher. The distributed nature of Linked (Open) Data may demand the involvement of multiple data providers to achieve satisfactory results. On the other hand, those who suffer from low data quality, and who most often identify these issues, are the data consumers. Unfortunately, they provide feedback to create awareness of particular problems only in some cases. A typical approach for Linked Data consumers is to duplicate a dataset and fix relevant problems within the local copy. These efforts are rarely communicated and hence not replicated in the original data, so other consumers cannot benefit equally. The Patch Request vocabulary [7] provides a standardized way to communicate change requests to data publishers and other consumers of a dataset. Additionally, Embury et al. [1] examine the feasibility of identifying data corrections in revisioned datasets that can then be applied to copies of that dataset.

The correction of errors in Linked Data should be distributed onto the shoulders of many, and possibilities to distribute such changes should be researched.

3.3 Best Practices for Linked Data Creation and Reuse
Linked Data is primarily made for machine interpretation and therefore needs to comply with the technical standards. Typical RDF parser implementations do not cope with syntactical errors, even minor ones. Many publishers create RDF data using scripts or perform changes manually. Such practices raise the risk of introducing syntactical errors, which can be avoided by using RDF tools and programming libraries, such as the Redland RDF API⁹ and Apache Jena¹⁰. In addition, generated RDF data should be checked by a subsequent syntax validation prior to publication with appropriate tools, such as the Raptor RDF parser utility¹¹ and the Apache Jena CLI tools¹².

Hogan et al. [5] name prominent problems with publicly available RDF datasets and survey general RDF validation tools. Heath and Bizer [4] summarize best practices for publishing Linked Data.

Beyond validating the syntax of their RDF serialization, data providers should also keep an eye on the correct usage of vocabularies. RDFUnit [8] checks for proper vocabulary utilisation by creating tests in the form of SPARQL queries from the vocabulary specification. These tests are created automatically and can also be executed on large-scale datasets that provide a SPARQL endpoint, such as DBpedia.

To ensure that entities within a dataset are described in the form required by a particular application, such structures can be defined with RDF Data Shapes, as has been done for the WebIndex data portal [2]. RDF Data Shapes allow expressing the expected structure of data, e. g. that a person entity has an xsd:string connected by the property foaf:name. Such shape templates can be used as a contract between data publisher and data consumer in order to guarantee that an application can digest the given data properly.

3.4 Versioning Linked Data
Continuous updates to and curation of datasets raise the need for tracking changes in Linked Data. Support for versioning and for the provenance of changes is rare in Linked Data publishing. Apache Marmotta¹³ is one Linked Data publishing platform that supports versioning. R43ples [3] provides versioning for any triplestore implementation: it acts as a proxy SPARQL endpoint that allows referring to prior revisions by extending the SPARQL query language, while standard SPARQL queries always work transparently on the master revision.

⁸ http://www.w3.org/blog/data/2014/09/30/data-shapes-working-group-launched/
⁹ http://librdf.org/
¹⁰ https://jena.apache.org/
¹¹ http://librdf.org/raptor/rapper.html
¹² https://jena.apache.org/documentation/io/#command-line-tools
¹³ http://marmotta.apache.org/

4. CONCLUSION
In this paper we have discussed key issues of Linked Data quality and possible ways to tackle them. We see quality as a core (and grey) component of the semantic web stack that, if addressed correctly and systematically, will enable further adoption.

Acknowledgements
We would like to thank all attendees of the 1st Linked Data Quality Workshop (LDQ2014), especially the people contributing to the Lightning Talks Session. These discussions delivered valuable input and showed that this topic is of high importance to researchers and practitioners working with Linked Data. In addition, we would like to thank all reviewers who have helped us and the authors with their valuable comments and feedback.

Program Committee
• Maribel Acosta – Karlsruhe Institute of Technology, AIFB, Germany
• Volha Bryl – University of Mannheim, Germany
• Ioannis Chrysakis – ICS FORTH, Greece
• Stefan Dietze – L3S, Germany
• Marco Fossati – SpazioDati, Italy
• Fumihiro Kato – Kyushu University, Japan
• Christoph Lange – University of Bonn, Fraunhofer IAIS, Germany
• Maristella Matera – Politecnico di Milano, Italy
• Felix Naumann – Hasso Plattner Institute, Germany
• Matteo Palmonari – University of Milan-Bicocca, Italy
• Adrian Paschke – Free University of Berlin, Germany
• Heiko Paulheim – University of Mannheim, Germany
• Mariano Rico – Universidad Politécnica de Madrid, Spain
• Anisa Rula – Università di Milano-Bicocca, Italy
• Elena Simperl – University of Southampton, United Kingdom
• Patrick Westphal – AKSW, University of Leipzig, Germany
• Amrapali Zaveri – AKSW, University of Leipzig, Germany
• Jun Zhao – Lancaster University, United Kingdom
• Antoine Zimmermann – ISCOD / LSTI – École Nationale Supérieure des Mines de Saint-Étienne, France

Lightning Talk presenters
• Behshid Behkamal – Ferdowsi University of Mashhad, Iran
• Jeremy Debattista – University of Bonn, Germany
• Riccardo Del Gratta – Istituto di linguistica Computazionale CNR, Italy
• Nidhi Kushwaha – Ferdowsi University of Mashhad, Iran
• Gerard Kuys – Ordina NV, Netherlands
• Peter Vandenabeele – self-employed, Belgium
• Lieke Verhelst – Linked Data Factory, Netherlands
• Amrapali Zaveri – AKSW, University of Leipzig, Germany

5. REFERENCES
[1] S. Embury, B. Jin, S. Sampaio, and I. Eleftheriou. On the feasibility of crawling linked data sets for reusable defect corrections. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[2] J. E. L. Gayo, E. Prud'Hommeaux, H. Solbrig, and J. M. A. Rodriguez. Validating and describing linked data portals using RDF shape expressions. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[3] M. Graube, S. Hensel, and L. Urbas. R43ples: Revisions for triples - an approach for version control in the semantic web. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[4] T. Heath and C. Bizer. Linked Data: Evolving the web into a global data space. Morgan & Claypool, 2011.
[5] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the pedantic web. In Linked Data on the Web Workshop (LDOW 2010) at WWW 2010, volume 628, pages 30–34. CEUR Workshop Proceedings, 2010.
[6] H. Knublauch, J. A. Hendler, and K. Idehen. SPIN - overview and motivation. W3C Member Submission, W3C, February 2011.
[7] M. Knuth, J. Hercher, and H. Sack. Collaboratively patching linked data. In Proceedings of 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD 2012), co-located with the 21st International World Wide Web Conference 2012 (WWW 2012), Lyon, France, April 2012.
[8] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, 2014.
[9] J. P. Mccrae, C. Wiljes, and P. Cimiano. Towards assured data quality and validation by data certification. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[10] M. Nilsson. Description set profiles: a constraint language for dublin core application profiles, 2008.
[11] A. Rula and A. Zaveri. Methodology for assessment of linked data quality: A framework. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[12] World Wide Web Consortium (W3C). SWRL: A semantic web rule language combining OWL and RuleML, 2004.
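As an illustration of the shape contract described in Section 3.3 (a person entity has an xsd:string connected by the property foaf:name), the following minimal Python sketch checks that single constraint over an in-memory list of triples. It is not the RDF Data Shapes syntax of [2] and not the RDFUnit API of [8]; the tuple-based triple representation, the Literal class, the function name person_shape_violations, and the example.org data are all hypothetical and chosen only for illustration. The sketch also assumes that person entities are typed as foaf:Person.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple, Union

# Well-known IRIs used by the check.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal: lexical value plus datatype IRI."""
    value: str
    datatype: str = XSD_STRING


# A triple is (subject IRI, predicate IRI, object); the object is an IRI string or a Literal.
Triple = Tuple[str, str, Union[str, Literal]]


def person_shape_violations(triples: Iterable[Triple]) -> List[str]:
    """Return the sorted IRIs of foaf:Person entities that lack an xsd:string foaf:name."""
    persons = set()
    named = set()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE and obj == FOAF_PERSON:
            persons.add(subject)
        elif (predicate == FOAF_NAME and isinstance(obj, Literal)
              and obj.datatype == XSD_STRING):
            named.add(subject)
    return sorted(persons - named)


# Illustrative data (assumed, not taken from any real dataset).
data = [
    ("http://example.org/alice", RDF_TYPE, FOAF_PERSON),
    ("http://example.org/alice", FOAF_NAME, Literal("Alice")),
    ("http://example.org/bob", RDF_TYPE, FOAF_PERSON),
    # bob's name is an IRI rather than a string literal, so the shape is violated
    ("http://example.org/bob", FOAF_NAME, "http://example.org/names/bob"),
]

print(person_shape_violations(data))  # prints ['http://example.org/bob']
```

A consumer could run such a check before ingesting a dataset and report the returned IRIs back to the publisher, which is the contract role that Section 3.3 assigns to shape templates. A real deployment would rather express the constraint in a shape or SPARQL-based test language than hard-code it.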