Linked Data Quality: Identifying and Tackling the Key Challenges

Magnus Knuth
Hasso Plattner Institute, University of Potsdam
Potsdam, Germany
magnus.knuth@hpi.de

Dimitris Kontokostas
AKSW, University of Leipzig
Leipzig, Germany
kontokostas@informatik.uni-leipzig.de

Harald Sack
Hasso Plattner Institute, University of Potsdam
Potsdam, Germany
harald.sack@hpi.de

ABSTRACT
The awareness of quality issues in Linked Data is constantly rising as new datasets and applications that consume Linked Data are emerging. In this paper we summarize key problems of Linked Data quality that data consumers are facing and propose approaches to tackle these problems. The majority of challenges presented here have been collected in a Lightning Talk Session at the First Workshop on Linked Data Quality (LDQ2014).

Keywords
linked data, data quality, RDF

Copyright is held by the author/owner(s).
LDQ 2014, 1st Workshop on Linked Data Quality, Sept. 2, 2014, Leipzig, Germany.

1.    INTRODUCTION
Since the start of the Linking Open Data initiative, we have seen an unprecedented volume of structured data published on the web, in most cases as RDF and Linked (Open) Data. The quality of these datasets varies a lot and can hardly be better than the original data source from which the data has been created. Datasets may originate from crowdsourcing projects like Wikipedia and OpenStreetMap as well as from highly curated sources, e.g. the library domain.

A consumer’s perception of data quality is highly individual and strongly depends on the field of application. Therefore, data quality is often regarded as fitness for use, e.g. the DBpedia dataset might be appropriate for a simple end-user application but should not be used in critical applications such as the medical domain for treatment decisions. However, quality is key to the success of the data web, and a lack of quality is a major barrier to further industry adoption.

In this paper we want to highlight contemporary quality problems that occur in Linked Data and that are already being addressed or need to be addressed in the future. Likewise, we want to suggest solutions that have been developed in order to tackle these difficulties.

2.    LINKED DATA QUALITY
Although quality is a commonly used term in Linked Data, its definition is far from straightforward. The reason is that Linked Data quality can have different meanings for different people and in different contexts. During the First Workshop on Linked Data Quality (LDQ2014)^1, a discussion session was held where people from different backgrounds raised their personal thoughts on Linked Data Quality^2. It was surprising to notice the variety of definitions and concerns, which included, among others, stale data, version management, changeset updates, RDF typo identification, and proper ontology modeling.

RDF validation is a core part of Linked Data quality, but validation alone cannot solve the quality problem. Quality is fitness for use, thus a general methodology [11] is required to assess the results of a validation. Only the validation results together with other factors, interpreted in the application context, can provide a meaningful quality overview. Such a quality assessment methodology should be an integral part of the Linked Data life-cycle.

RDF version management, which can facilitate error provenance and tracking, is an additional quality issue that is not natively covered in the Semantic Web technology stack. The non-deterministic statement order and blank nodes make graph comparison equivalent to the graph isomorphism problem, for which no polynomial-time algorithm is known^3.

Reusing popular vocabularies or manually creating a correct ontology model can also be seen as a general data quality issue. General purpose vocabularies such as foaf^4, skos^5, schema.org^6 or the dbpedia ontology^7 usually reflect a shallow depiction of the real world. For many people, for example, the dbo:Actor class is not correct since a profession is a role in a person’s life and a person can have many different roles at different stages of life, e.g. student or spouse. In the end it depends on the granularity level one wants to reflect in the data, but granularity usually comes at the cost of data integration.

^1 http://ldq.semanticmultimedia.org/ co-located with 10th SEMANTiCS conference on September 2nd, 2014 in Leipzig, Germany
^2 http://tinyurl.com/LDQ14LightningTalks
^3 http://mathworld.wolfram.com/GraphIsomorphism.html
^4 http://xmlns.com/foaf/0.1/
^5 http://www.w3.org/2004/02/skos/core#
^6 http://schema.org/
^7 http://dbpedia.org/ontology/, prefixed dbo:

3.    TACKLING THE PROBLEM
Specific solutions for tackling the problem of Linked Data Quality as a whole are currently far from reality. Nevertheless, in the following subsections we provide an overview of existing work and possible future directions to cope with Linked Data Quality.

3.1    Linked Data Validation and Quality Assessment
Validation is a core part of the quality assessment of Linked Data. Although RDF has existed for many years, there is no official standard for Linked Data validation at the time of writing, and a W3C working group has just been formed to define one^8. Existing Linked Data applications can either rely on ad-hoc options or use independently defined solutions such as RDF Data Shapes [2], SPIN [6], SWRL [12], Dublin Core Profiles [10], RDFUnit [8], or OWL under the closed-world assumption (CWA) with a weaker form of the unique name assumption (UNA).

However, validation alone is not adequate. A general assessment methodology has to be built around validation that can interpret the validation results and assess the quality of the data. Rula and Zaveri [11] propose a general three-phase, six-step methodology for assessing the quality of Linked Data involving manual, semi-automatic, and automated steps in the process. On top of an assessment methodology, different applications can be built that automatically evaluate the quality of a dataset and provide automatic quality overviews or quality certifications [9].

3.2    Linked Data Cleansing
Data curation in general is a costly process for the publisher. The distributed nature of Linked (Open) Data may demand the involvement of multiple data providers to achieve satisfactory results. On the other hand, those who suffer from low data quality and most often identify these issues are the data consumers. Unfortunately, only in some cases do they provide feedback to create awareness of particular problems. A typical approach for Linked Data consumers is to duplicate a dataset and fix relevant problems within the local copy. These efforts are rarely communicated and hence not replicated on the original data, so other consumers cannot benefit equally. The Patch Request vocabulary [7] provides a standardized way to communicate change requests to data publishers and other consumers of a dataset. Additionally, Embury et al. [1] examine the feasibility of identifying data corrections in revisioned datasets that can then be applied to copies of that dataset.

The correction of errors in Linked Data should be distributed onto the shoulders of many, and possibilities to distribute such changes should be researched.

3.3    Best Practices for Linked Data Creation and Reuse
Linked Data is primarily made for machine interpretation and therefore needs to comply with the technical standards. Typical RDF parser implementations do not cope with even minor syntactical errors. Many publishers create RDF data using scripts or perform changes manually. Such practices raise the risk of introducing syntactical errors, which can be avoided by using RDF tools and programming libraries, such as the Redland RDF API^9 and Apache Jena^10. Additionally, generated RDF data should be checked by syntax validation prior to publication with appropriate tools, such as the Raptor RDF parser utility^11 and the Apache Jena CLI tools^12.

Hogan et al. [5] name prominent problems with publicly available RDF datasets and survey general RDF validation tools. Heath and Bizer [4] summarize best practices for publishing Linked Data.

Beyond validating the syntax of their RDF serialization, data providers should also keep an eye on the correct usage of vocabularies. RDFUnit [8] checks for proper vocabulary utilisation by creating tests in the form of SPARQL queries from the vocabulary specification. These tests are created automatically and can also be executed on large-scale datasets that provide a SPARQL endpoint, such as DBpedia.

To ensure that entities within a dataset are described in the form required by a particular application, such structures can be defined with RDF Data Shapes, as has been done for the WebIndex data portal [2]. RDF Data Shapes allow one to express the expected structure of data, e.g. that a person entity has an xsd:string connected by the property foaf:name. Such shape templates can be used as a contract between data publisher and data consumer in order to guarantee that an application can digest the given data properly.

3.4    Versioning Linked Data
Continuous updates to and curation of datasets raise the need to track changes in Linked Data. There is little support in Linked Data publishing for versioning as well as for provenance of changes. Apache Marmotta^13 is one Linked Data publishing platform that supports versioning. R43ples [3] provides versioning for any triplestore implementation: it acts as a proxy SPARQL endpoint that allows referring to prior revisions by extending the SPARQL query language, while standard SPARQL queries always work transparently on the master revision.

4.    CONCLUSION
In this paper we have discussed key issues of Linked Data quality and possible ways to tackle them. We see quality as a core (and grey) component of the semantic web stack that, if addressed correctly and systematically, will enable further adoption.

^8 http://www.w3.org/blog/data/2014/09/30/data-shapes-working-group-launched/
^9 http://librdf.org/
^10 https://jena.apache.org/
^11 http://librdf.org/raptor/rapper.html
^12 https://jena.apache.org/documentation/io/#command-line-tools
^13 http://marmotta.apache.org/
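
As a rough illustration of the change tracking discussed in Section 3.4, the following Python sketch computes a changeset between two revisions of a dataset and replays it on a copy. It is not how Apache Marmotta or R43ples implement versioning; the names Changeset, diff_revisions and apply_changeset, as well as the example.org IRIs, are assumptions made for this example. Triples are modelled as plain string tuples held in sets, so statement order is irrelevant, and revisions containing blank nodes are rejected because comparing them would require the graph isomorphism check mentioned in Section 2.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

# A triple is modelled as three N-Triples-style strings (an assumption of this sketch).
Triple = Tuple[str, str, str]


class Changeset(NamedTuple):
    """Triples added to and removed from a revision (hypothetical structure)."""
    added: FrozenSet[Triple]
    removed: FrozenSet[Triple]


def has_blank_node(triple: Triple) -> bool:
    """True if the subject or object is a blank node label such as '_:b0'."""
    subject, _, obj = triple
    return subject.startswith("_:") or obj.startswith("_:")


def diff_revisions(old: Iterable[Triple], new: Iterable[Triple]) -> Changeset:
    """Compute a changeset between two revisions that contain no blank nodes."""
    old_set, new_set = frozenset(old), frozenset(new)
    if any(has_blank_node(t) for t in old_set | new_set):
        # Blank node labels are not stable across serializations, so a plain
        # set difference would report spurious changes.
        raise ValueError("blank nodes need an isomorphism check, not a set diff")
    return Changeset(added=new_set - old_set, removed=old_set - new_set)


def apply_changeset(triples: Iterable[Triple], changes: Changeset) -> FrozenSet[Triple]:
    """Replay a changeset on a (local) copy of a dataset."""
    return (frozenset(triples) - changes.removed) | changes.added


NAME = "<http://xmlns.com/foaf/0.1/name>"

# Two revisions with different statement order and one corrected typo.
old_revision = [
    ("<http://example.org/alice>", NAME, '"Alice"'),
    ("<http://example.org/bob>", NAME, '"Bbo"'),
]
new_revision = [
    ("<http://example.org/bob>", NAME, '"Bob"'),
    ("<http://example.org/alice>", NAME, '"Alice"'),
]

changes = diff_revisions(old_revision, new_revision)
```

Applying `changes` to `old_revision` with `apply_changeset` yields the same set of triples as `new_revision`, which mirrors the idea of communicating a correction once and replaying it on other copies of the dataset.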
Acknowledgements
We would like to thank all attendees of the 1st Linked Data Quality Workshop (LDQ2014), especially the people contributing to the Lightning Talks Session. These discussions delivered valuable input and showed that this topic is of high importance to researchers and practitioners working with Linked Data. In addition, we would also like to thank all reviewers who have helped us and the authors with their valuable comments and feedback.

Program Committee
   • Maribel Acosta – Karlsruhe Institute of Technology, AIFB, Germany
   • Volha Bryl – University of Mannheim, Germany
   • Ioannis Chrysakis – ICS FORTH, Greece
   • Stefan Dietze – L3S, Germany
   • Marco Fossati – SpazioDati, Italy
   • Fumihiro Kato – Kyushu University, Japan
   • Christoph Lange – University of Bonn, Fraunhofer IAIS, Germany
   • Maristella Matera – Politecnico di Milano, Italy
   • Felix Naumann – Hasso Plattner Institute, Germany
   • Matteo Palmonari – University of Milan-Bicocca, Italy
   • Adrian Paschke – Free University of Berlin, Germany
   • Heiko Paulheim – University of Mannheim, Germany
   • Mariano Rico – Universidad Politécnica de Madrid, Spain
   • Anisa Rula – Università di Milano-Bicocca, Italy
   • Elena Simperl – University of Southampton, United Kingdom
   • Patrick Westphal – AKSW, University of Leipzig, Germany
   • Amrapali Zaveri – AKSW, University of Leipzig, Germany
   • Jun Zhao – Lancaster University, United Kingdom
   • Antoine Zimmermann – ISCOD / LSTI – École Nationale Supérieure des Mines de Saint-Étienne, France

Lightning Talk presenters
   • Behshid Behkamal – Ferdowsi University of Mashhad, Iran
   • Jeremy Debattista – University of Bonn, Germany
   • Riccardo Del Gratta – Istituto di linguistica Computazionale CNR, Italy
   • Nidhi Kushwaha – Ferdowsi University of Mashhad, Iran
   • Gerard Kuys – Ordina NV, Netherlands
   • Peter Vandenabeele – self-employed, Belgium
   • Lieke Verhelst – Linked Data Factory, Netherlands
   • Amrapali Zaveri – AKSW, University of Leipzig, Germany

5.    REFERENCES
 [1] S. Embury, B. Jin, S. Sampaio, and I. Eleftheriou. On the feasibility of crawling linked data sets for reusable defect corrections. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
 [2] J. E. L. Gayo, E. Prud’Hommeaux, H. Solbrig, and J. M. A. Rodriguez. Validating and describing linked data portals using RDF shape expressions. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
 [3] M. Graube, S. Hensel, and L. Urbas. R43ples: Revisions for triples - an approach for version control in the semantic web. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
 [4] T. Heath and C. Bizer. Linked Data: Evolving the web into a global data space. Morgan & Claypool, 2011.
 [5] A. Hogan, A. Harth, A. Passant, S. Decker, and A. Polleres. Weaving the pedantic web. In Linked Data on the Web Workshop (LDOW 2010) at WWW 2010, volume 628, pages 30–34. CEUR Workshop Proceedings, 2010.
 [6] H. Knublauch, J. A. Hendler, and K. Idehen. SPIN - overview and motivation. W3C Member Submission, W3C, February 2011.
 [7] M. Knuth, J. Hercher, and H. Sack. Collaboratively patching linked data. In Proceedings of 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD 2012), co-located with the 21st International World Wide Web Conference 2012 (WWW 2012), Lyon, France, April 2012.
 [8] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. J. Zaveri. Test-driven evaluation of linked data quality. In Proceedings of the 23rd International Conference on World Wide Web, 2014.
 [9] J. P. McCrae, C. Wiljes, and P. Cimiano. Towards assured data quality and validation by data certification. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[10] M. Nilsson. Description set profiles: a constraint language for Dublin Core application profiles, 2008.
[11] A. Rula and A. Zaveri. Methodology for assessment of linked data quality: A framework. In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014), volume 1215. CEUR Workshop Proceedings, 2014.
[12] World Wide Web Consortium (W3C). SWRL: A semantic web rule language combining OWL and RuleML, 2004.