<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Linked Data Quality Sept.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Linked Data Quality: Identifying and Tackling the Key Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Magnus Knuth</string-name>
          <email>magnus.knuth@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Kontokostas</string-name>
          <email>kontokostas@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <email>harald.sack@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AKSW, University of Leipzig</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hasso Plattner Institute, University of Potsdam</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>2</volume>
      <issue>2014</issue>
      <abstract>
        <p>Awareness of quality issues in Linked Data is rising steadily as new datasets and applications that consume Linked Data emerge. In this paper we summarize key problems of Linked Data quality that data consumers face and propose approaches to tackle them. The majority of the challenges presented here were collected in a Lightning Talk Session at the First Workshop on Linked Data Quality (LDQ2014). A consumer's perception of data quality is highly individual and depends strongly on the field of application. Data quality is therefore often regarded as fitness for use: for example, the DBpedia dataset might be appropriate for a simple end-user application but should not be used for critical purposes such as treatment decisions in the medical domain. Nevertheless, quality is key to the success of the data web, and its absence is a major barrier to further industry adoption.</p>
      </abstract>
      <kwd-group>
        <kwd>linked data</kwd>
        <kwd>data quality</kwd>
        <kwd>RDF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In this paper we highlight contemporary quality
problems that occur in Linked Data and that are already being
addressed or need to be addressed in the future. We also suggest
solutions that have been developed to tackle these
difficulties.</p>
    </sec>
    <sec id="sec-2">
      <title>2. LINKED DATA QUALITY</title>
      <p>Although quality is a commonly used term in Linked Data,
its definition is far from straightforward. The reason is that
Linked Data quality can mean different things to different
people and in different contexts. During the First
Workshop on Linked Data Quality (LDQ2014)<sup>1</sup>, a discussion
session was held in which people from different backgrounds shared
their personal thoughts on Linked Data quality<sup>2</sup>. The variety of
definitions and concerns was striking; among others, they included
stale data, version management,
changeset updates, RDF typo identification, and proper
ontology modeling.</p>
      <p>
        RDF validation is a core part of Linked Data quality, but
validation alone cannot solve the quality problem. Because quality
is fitness for use, a general methodology [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is required
to assess the results of a validation. Only the validation results,
taken together with other factors and interpreted in the application
context, can provide a meaningful quality overview. Such a
quality assessment methodology should be an integral part
of the Linked Data life cycle.
      </p>
      <p>RDF version management is an additional quality issue that
is not natively covered by the Semantic Web technology stack,
although it can facilitate error provenance and tracking. The
non-deterministic statement order and blank nodes make graph
comparison equivalent to the graph isomorphism problem,
for which no polynomial-time algorithm is
known<sup>3</sup>.</p>
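      <p>The role of statement order and blank nodes can be illustrated with a minimal sketch, assuming that triples are encoded as plain string tuples and that blank nodes carry the "_:" label prefix known from N-Triples. The function name diff_ground_graphs is hypothetical and not part of any RDF library. For graphs without blank nodes, a set difference yields the changes regardless of statement order; graphs with blank nodes are rejected, because comparing them would require an isomorphism check.</p>
      <preformat><![CDATA[
```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def is_blank(term: str) -> bool:
    """Blank node labels are written with the '_:' prefix, as in N-Triples."""
    return term.startswith("_:")


def diff_ground_graphs(old: Set[Triple], new: Set[Triple]):
    """Return (added, removed) triples between two blank-node-free graphs.

    Set semantics make the result independent of statement order. Graphs with
    blank nodes are rejected: their labels are local, so two graphs may be
    equal up to relabelling, which a plain set difference cannot detect.
    """
    for triple in old | new:
        subject, _predicate, obj = triple
        if is_blank(subject) or is_blank(obj):
            raise ValueError(f"blank node in {triple}; set difference is not sound")
    return new - old, old - new


# Illustrative data only; the ex: resources are made up.
v1 = {
    ("ex:Berlin", "rdf:type", "dbo:City"),
    ("ex:Berlin", "dbo:country", "ex:Germany"),
}
# The same statements in a different order, plus one new statement.
v2 = {
    ("ex:Berlin", "dbo:country", "ex:Germany"),
    ("ex:Berlin", "rdf:type", "dbo:City"),
    ("ex:Berlin", "rdfs:label", '"Berlin"'),
}
added, removed = diff_ground_graphs(v1, v2)
print("added:", added)
print("removed:", removed)
```
]]></preformat>
      <p>The sketch deliberately stops where the hard part begins: once blank nodes are involved, a versioning tool needs either canonical labelling or an isomorphism test before changes can be computed.</p>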
      <p>Reusing popular vocabularies or manually creating a correct
ontology model can also be seen as a general data quality
issue. General-purpose vocabularies such as FOAF<sup>4</sup>, SKOS<sup>5</sup>,
schema.org<sup>6</sup>, or the DBpedia ontology<sup>7</sup> usually reflect a shallow
depiction of the real world. For many people, for example,
the dbo:Actor class is not correct, since a profession is a role
in a person's life and a person can have many different roles
at different stages of life, e.g. student or spouse. In the
end it depends on the level of granularity one wants to reflect
in the data, but granularity usually comes at the cost of data
integration.</p>
      <fn-group>
        <fn id="fn1"><label>1</label><p>http://ldq.semanticmultimedia.org/ co-located with the 10th SEMANTiCS conference on September 2nd, 2014 in Leipzig, Germany</p></fn>
        <fn id="fn2"><label>2</label><p>http://tinyurl.com/LDQ14LightningTalks</p></fn>
        <fn id="fn3"><label>3</label><p>http://mathworld.wolfram.com/GraphIsomorphism.html</p></fn>
        <fn id="fn4"><label>4</label><p>http://xmlns.com/foaf/0.1/</p></fn>
        <fn id="fn5"><label>5</label><p>http://www.w3.org/2004/02/skos/core#</p></fn>
        <fn id="fn6"><label>6</label><p>http://schema.org/</p></fn>
        <fn id="fn7"><label>7</label><p>http://dbpedia.org/ontology/, prefixed dbo:</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>3. TACKLING THE PROBLEM</title>
      <p>Specific solutions for tackling the problem of Linked Data
quality as a whole are currently far from reality.
Nevertheless, in the following subsections we provide an overview
of existing work and possible future directions for coping with
Linked Data quality.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Linked Data validation and Quality assessment</title>
      <p>
        Validation is a core part of the quality assessment of Linked
Data. Although RDF has existed for many years, there
is no official standard for Linked Data validation at the
time of writing, and a W3C working group has just been
formed to define one<sup>8</sup>. Existing Linked Data applications
can either rely on ad hoc options or use independently
defined solutions such as RDF Data Shapes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], SPIN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
SWRL [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Dublin Core Profiles [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], RDFUnit [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or OWL
under the closed-world assumption (CWA) and a weaker form of the unique name assumption (UNA).
      </p>
      <p>
        However, validation alone is not sufficient. A general
assessment methodology has to be built around validation
so that the validation results can be interpreted and the
quality of the data assessed. Rula and Zaveri [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] propose a general three-phase,
six-step methodology for assessing the quality of Linked
Data that involves manual, semi-automatic, and automated steps.
On top of an assessment methodology, different
applications can be built that automatically evaluate the
quality of a dataset and provide automatic quality overviews
or quality certifications [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Linked Data Cleansing</title>
      <p>
        Data curation in general is a costly process for the publisher.
The distributed nature of Linked (Open) Data may demand
the involvement of multiple data providers to achieve
satisfactory results. On the other hand, those who suffer from
low data quality, and who most often identify these issues, are
the data consumers. Unfortunately, they only occasionally
provide feedback that would create awareness of particular problems.
A typical approach for Linked Data consumers is to
duplicate a dataset and fix the relevant problems within the local
copy. These efforts are rarely communicated and hence not
applied to the original data, so other consumers cannot
benefit equally. The Patch Request vocabulary [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provides
a standardized way to communicate change requests to data
publishers and other consumers of a dataset. Additionally,
Embury et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] examine the feasibility of identifying data
corrections in revisioned datasets that can then be applied
to copies of that dataset.
      </p>
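      <p>Conceptually, a change request of this kind amounts to a set of triples to delete and a set of triples to insert, together with some context. The following is a minimal sketch under that assumption. The Patch class, its field names, and the example data are hypothetical; they do not reproduce the terms of the Patch Request vocabulary.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Patch:
    """A hypothetical change request: what to delete, what to insert, and why."""
    dataset: str
    delete: FrozenSet[Triple] = frozenset()
    insert: FrozenSet[Triple] = frozenset()
    comment: str = ""


def apply_patch(graph: Set[Triple], patch: Patch) -> Set[Triple]:
    """Return a patched copy of the graph.

    A patch whose deletions are not all present is refused, since it was
    probably written against a different state of the dataset.
    """
    missing = patch.delete - graph
    if missing:
        raise ValueError(f"patch does not apply, missing triples: {sorted(missing)}")
    return (graph - patch.delete) | patch.insert


# A consumer spots an error in a local copy and describes the fix once.
local_copy = {("ex:Oregon", "ex:country", "ex:Canada")}
fix = Patch(
    dataset="http://example.org/dataset",
    delete=frozenset({("ex:Oregon", "ex:country", "ex:Canada")}),
    insert=frozenset({("ex:Oregon", "ex:country", "ex:USA")}),
    comment="Oregon is a US state",
)
patched = apply_patch(local_copy, fix)
print(patched)
```
]]></preformat>
      <p>Because the patch is a value of its own, the same object could be sent to the publisher or shared with other consumers instead of remaining buried in one local copy.</p>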
      <p>The correction of errors in Linked Data should be distributed
onto the shoulders of many, and ways to distribute
such changes should be researched.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Best Practices for Linked Data Creation and Reuse</title>
      <p>Linked Data is primarily made for machine interpretation
and therefore needs to comply with the relevant technical standards.
Typical RDF parser implementations do not cope with even
minor syntactical errors. Many publishers create RDF data
using scripts or perform changes manually. Such practices
raise the risk of introducing syntactical errors, which can be
avoided by using RDF tools and programming libraries such
as the Redland RDF API<sup>9</sup> and Apache Jena<sup>10</sup>. Optionally,
generated RDF data can be checked by a subsequent
syntax validation prior to publication with appropriate tools,
such as the Raptor RDF parser utility<sup>11</sup> and the Apache Jena
CLI tools<sup>12</sup>.</p>
      <fn-group>
        <fn id="fn8"><label>8</label><p>http://www.w3.org/blog/data/2014/09/30/data-shapes-working-group-launched/</p></fn>
      </fn-group>
      <p>
        Hogan et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] identify prominent problems with publicly
available RDF datasets and survey general RDF validation
tools. Heath and Bizer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] summarize best practices for
publishing Linked Data.
      </p>
      <p>
        Beyond validating the syntax of their RDF serialization,
data providers should also keep an eye on the correct usage
of vocabularies. RDFUnit [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] checks for proper vocabulary
usage by creating tests in the form of SPARQL queries from
the vocabulary specification. These tests are created
automatically and can also be executed on large-scale datasets
that provide a SPARQL endpoint, such as DBpedia.
      </p>
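      <p>The idea of deriving tests from a vocabulary can be sketched as follows, assuming the vocabulary declares a range for a property: the declaration is turned into a SPARQL query that selects the statements violating it. The function name, the query shape, and the sample range declaration are illustrative assumptions; they are not RDFUnit's actual test patterns or API, and the query ignores subclass reasoning.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List


def range_violation_query(prop_iri: str, class_iri: str) -> str:
    """Build a SPARQL query selecting objects of prop_iri not typed as class_iri.

    Simplification: only explicit rdf:type statements are considered, so
    instances of subclasses would be reported as violations as well.
    """
    return (
        "SELECT DISTINCT ?s ?o WHERE {\n"
        f"  ?s <{prop_iri}> ?o .\n"
        f"  FILTER NOT EXISTS {{ ?o a <{class_iri}> . }}\n"
        "}"
    )


def generate_tests(declared_ranges: Dict[str, str]) -> List[str]:
    """Create one test query per (property, range) declaration."""
    return [range_violation_query(p, c) for p, c in declared_ranges.items()]


# Range declaration assumed here for illustration.
declared_ranges = {
    "http://dbpedia.org/ontology/birthPlace": "http://dbpedia.org/ontology/Place",
}
queries = generate_tests(declared_ranges)
print(queries[0])
```
]]></preformat>
      <p>Each generated query could then be sent to a SPARQL endpoint; an empty result would mean that no violation of the declaration was found.</p>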
      <p>
        To ensure that entities within a dataset are described in the
form required by a particular application,
such structures can be defined with RDF Data Shapes, as
has been done for the WebIndex data portal [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. RDF Data
Shapes make it possible to express the expected structure of data, e.g.
that a person entity has an xsd:string connected by the
property foaf:name. Such shape templates can be used as a
contract between data publisher and data consumer in order
to guarantee that an application can digest the given data
properly.
      </p>
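      <p>The foaf:name example can be made concrete with a minimal sketch of what checking such a shape means. It assumes an in-memory encoding in which IRIs are plain strings and literals are (lexical form, datatype) pairs; the function check_person_shape is hypothetical and does not use any shape language syntax.</p>
      <preformat><![CDATA[
```python
from typing import Iterable, List, Tuple, Union

Literal = Tuple[str, str]   # (lexical form, datatype), e.g. ("Alice", "xsd:string")
Term = Union[str, Literal]  # IRIs are plain strings
Triple = Tuple[str, str, Term]


def check_person_shape(triples: Iterable[Triple]) -> List[str]:
    """Return the foaf:Person nodes that lack an xsd:string foaf:name."""
    triples = list(triples)
    persons = {s for s, p, o in triples if p == "rdf:type" and o == "foaf:Person"}
    named = {
        s
        for s, p, o in triples
        if p == "foaf:name" and isinstance(o, tuple) and o[1] == "xsd:string"
    }
    return sorted(persons - named)


data = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", ("Alice", "xsd:string")),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "ex:Bob"),  # an IRI instead of a string literal
]
violations = check_person_shape(data)
print(violations)  # ['ex:bob']
```
]]></preformat>
      <p>A consumer could run such a check before ingesting a dataset and reject or report the nodes that break the agreed contract.</p>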
    </sec>
    <sec id="sec-7">
      <title>3.4 Versioning Linked Data</title>
      <p>
        Continuous updates to and curation of datasets create the
need to track changes in Linked Data. Support for versioning
and for the provenance of changes is rare in Linked Data
publishing. Apache Marmotta<sup>13</sup> is one Linked
Data publishing platform that supports versioning. R43ples
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provides versioning for any triplestore implementation.
It acts as a proxy SPARQL endpoint that allows clients to refer to
prior revisions by extending the SPARQL query language,
while standard SPARQL queries always work transparently
on the master revision.
      </p>
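      <p>One common way to make prior revisions addressable is to store each revision as a changeset and to rebuild a graph by replaying changesets. The sketch below illustrates this under simplifying assumptions (a linear history, triples as string tuples). The RevisionStore class is hypothetical; it reflects neither the R43ples query syntax nor the Apache Marmotta API.</p>
      <preformat><![CDATA[
```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
Changeset = Tuple[Set[Triple], Set[Triple]]  # (added, removed)


class RevisionStore:
    """Keeps every revision reachable by replaying changesets from an empty graph."""

    def __init__(self) -> None:
        self._changesets: List[Changeset] = []

    def commit(self, added: Set[Triple], removed: Set[Triple]) -> int:
        """Record a changeset and return the new revision number (starting at 1)."""
        self._changesets.append((set(added), set(removed)))
        return len(self._changesets)

    def checkout(self, revision: int = -1) -> Set[Triple]:
        """Rebuild the graph at a revision; the default is the latest (master) one."""
        upto = len(self._changesets) if revision == -1 else revision
        if upto not in range(len(self._changesets) + 1):
            raise ValueError(f"unknown revision {revision}")
        graph: Set[Triple] = set()
        for added, removed in self._changesets[:upto]:
            graph = (graph - removed) | added
        return graph


store = RevisionStore()
r1 = store.commit({("ex:doc", "ex:status", '"draft"')}, set())
r2 = store.commit(
    {("ex:doc", "ex:status", '"final"')},
    {("ex:doc", "ex:status", '"draft"')},
)
print("revision 1:", store.checkout(r1))
print("master:", store.checkout())
```
]]></preformat>
      <p>Calling checkout without an argument mirrors the behaviour described above, where queries that do not name a revision operate on the master revision.</p>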
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSION</title>
      <p>In this paper we have discussed key aspects of Linked Data
quality and possible ways to tackle them. We see
quality as a core (and grey) component of the Semantic Web
stack that, if addressed correctly and systematically, will
enable further adoption.</p>
      <fn-group>
        <fn id="fn9"><label>9</label><p>http://librdf.org/</p></fn>
        <fn id="fn10"><label>10</label><p>https://jena.apache.org/</p></fn>
        <fn id="fn11"><label>11</label><p>http://librdf.org/raptor/rapper.html</p></fn>
        <fn id="fn12"><label>12</label><p>https://jena.apache.org/documentation/io/#command-line-tools</p></fn>
        <fn id="fn13"><label>13</label><p>http://marmotta.apache.org/</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>We would like to thank all attendees of the 1st Linked Data
Quality Workshop (LDQ2014), especially the people who
contributed to the Lightning Talks Session. These discussions
delivered valuable input and confirmed that this topic is of
high importance to researchers and practitioners working
with Linked Data. In addition, we would like to thank
all reviewers who helped us and the authors with their
valuable comments and feedback.</p>
    </sec>
    <sec id="sec-10">
      <title>Program Committee</title>
      <list list-type="simple">
        <list-item><p>Maribel Acosta, Karlsruhe Institute of Technology, AIFB, Germany</p></list-item>
        <list-item><p>Volha Bryl, University of Mannheim, Germany</p></list-item>
        <list-item><p>Ioannis Chrysakis, ICS FORTH, Greece</p></list-item>
        <list-item><p>Stefan Dietze, L3S, Germany</p></list-item>
        <list-item><p>Marco Fossati, SpazioDati, Italy</p></list-item>
        <list-item><p>Fumihiro Kato, Kyushu University, Japan</p></list-item>
        <list-item><p>Christoph Lange, University of Bonn, Fraunhofer IAIS, Germany</p></list-item>
        <list-item><p>Maristella Matera, Politecnico di Milano, Italy</p></list-item>
        <list-item><p>Felix Naumann, Hasso Plattner Institute, Germany</p></list-item>
        <list-item><p>Matteo Palmonari, University of Milan-Bicocca, Italy</p></list-item>
        <list-item><p>Adrian Paschke, Free University of Berlin, Germany</p></list-item>
        <list-item><p>Heiko Paulheim, University of Mannheim, Germany</p></list-item>
        <list-item><p>Mariano Rico, Universidad Politécnica de Madrid, Spain</p></list-item>
        <list-item><p>Anisa Rula, Università di Milano-Bicocca, Italy</p></list-item>
        <list-item><p>Elena Simperl, University of Southampton, United Kingdom</p></list-item>
        <list-item><p>Patrick Westphal, AKSW, University of Leipzig, Germany</p></list-item>
        <list-item><p>Amrapali Zaveri, AKSW, University of Leipzig, Germany</p></list-item>
        <list-item><p>Jun Zhao, Lancaster University, United Kingdom</p></list-item>
        <list-item><p>Antoine Zimmermann, ISCOD / LSTI, École Nationale Supérieure des Mines de Saint-Étienne, France</p></list-item>
      </list>
    </sec>
    <sec id="sec-11">
      <title>Lightning Talk presenters</title>
      <list list-type="simple">
        <list-item><p>Behshid Behkamal, Ferdowsi University of Mashhad, Iran</p></list-item>
        <list-item><p>Jeremy Debattista, University of Bonn, Germany</p></list-item>
        <list-item><p>Riccardo Del Gratta, Istituto di Linguistica Computazionale CNR, Italy</p></list-item>
        <list-item><p>Peter Vandenabeele, self-employed, Belgium</p></list-item>
        <list-item><p>Lieke Verhelst, Linked Data Factory, Netherlands</p></list-item>
        <list-item><p>Amrapali Zaveri, AKSW, University of Leipzig, Germany</p></list-item>
      </list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Embury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sampaio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Eleftheriou</surname>
          </string-name>
          .
          <article-title>On the feasibility of crawling linked data sets for reusable defect corrections</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014)</source>
          , volume
          <volume>1215</volume>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'Hommeaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Solbrig</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M. A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          .
          <article-title>Validating and describing linked data portals using RDF shape expressions</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014)</source>
          , volume
          <volume>1215</volume>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Graube</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hensel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Urbas</surname>
          </string-name>
          .
          <article-title>R43ples: Revisions for triples - an approach for version control in the semantic web</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014)</source>
          , volume
          <volume>1215</volume>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked Data: Evolving the web into a global data space</article-title>
          . Morgan &amp; Claypool,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>Weaving the pedantic web</article-title>
          .
          <source>In Linked Data on the Web Workshop (LDOW 2010) at WWW 2010</source>
          , volume
          <volume>628</volume>
          , pages
          <fpage>30</fpage>
          -
          <lpage>34</lpage>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Knublauch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hendler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Idehen</surname>
          </string-name>
          .
          <article-title>SPIN - overview and motivation</article-title>
          .
          <source>W3C Member Submission, W3C</source>
          ,
          <year>February 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Knuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hercher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          .
          <article-title>Collaboratively patching linked data</article-title>
          .
          <source>In Proceedings of the 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD 2012), co-located with the 21st International World Wide Web Conference 2012 (WWW 2012)</source>
          , Lyon, France,
          <year>April 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Test-driven evaluation of linked data quality</article-title>
          .
          <source>In Proceedings of the 23rd International Conference on World Wide Web</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wiljes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Towards assured data quality and validation by data certification</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014)</source>
          , volume
          <volume>1215</volume>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nilsson</surname>
          </string-name>
          .
          <article-title>Description set profiles: a constraint language for Dublin Core application profiles</article-title>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Methodology for assessment of linked data quality: A framework</article-title>
          .
          <source>In Proceedings of the 1st Workshop on Linked Data Quality (LDQ2014)</source>
          , volume
          <volume>1215</volume>
          .
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <collab>World Wide Web Consortium (W3C)</collab>
          .
          <article-title>SWRL: A semantic web rule language combining OWL and RuleML</article-title>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>