<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Linked Data Quality</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Methodology for Assessment of Linked Data Quality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anisa Rula</string-name>
          <email>anisa.rula@disco.unimib.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amrapali Zaveri</string-name>
          <email>zaveri@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Leipzig, Institute of Computer Science, AKSW Group</institution>
          ,
          <addr-line>Augustusplatz 10, D-04009 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milano-Bicocca, Department of Computer Science</institution>
          ,
          <addr-line>Systems and Communication (DISCo), Viale Sarca 336, Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>2</volume>
      <issue>2014</issue>
      <abstract>
        <p>With the expansion in the amount of data being produced as Linked Data (LD), the opportunity to build use cases has also increased. However, a crippling problem for the reliability of these use cases is the underlying poor data quality. Moreover, the ability to assess the quality of the consumed LD, based on the satisfaction of the consumers' quality requirements, significantly influences the usability of such data for a given use case. In this paper, we propose a data quality assessment methodology specifically designed for LD. This methodology consists of three phases and six steps, with specific emphasis on considering a use case.</p>
      </abstract>
      <kwd-group>
        <kwd>data quality</kwd>
        <kwd>linked data</kwd>
        <kwd>assessment</kwd>
        <kwd>improvement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>2. DATA QUALITY ASSESSMENT METHODOLOGY</title>
      <p>
        A data quality assessment methodology is defined as the
process of evaluating whether a piece of data meets the information
consumers' needs in a specific use case [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In a
comprehensive survey [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it was observed that the 30 identified
approaches followed no standardized set of steps
to assess the quality of a dataset. Inspired by the
methodology proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and by the lack of a standardized
methodology for LD, we propose a methodology consisting
of three phases and six steps. In particular, from each of
the 30 approaches, we extracted the common steps that were
proposed to assess the quality of a dataset. We then adapted
and revised these steps to propose a data quality assessment
methodology particularly for LD, as depicted in Figure 1.
Our methodology thus consists of the following phases and
steps:
1. Phase I: Requirements Analysis
(a) Step I: Use Case Analysis
2. Phase II: Quality Assessment
(a) Step II: Identification of Quality Issues
(b) Step III: Statistical and Low-level Analysis
(c) Step IV: Advanced Analysis
      </p>
      <p>
        3. Phase III: Quality Improvement
(a) Step V: Root Cause Analysis
(b) Step VI: Fixing Quality Problems
The following sections describe each of the steps in detail,
along with the list of data quality dimensions (from the 18
dimensions identified in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) that are applicable for each step.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Phase I: Requirements Analysis</title>
      <p>The multi-dimensional nature of data quality makes it
dependent on a number of factors that can be determined
by analyzing the users' requirements. Thus, the use case
in question is highly important when assessing the quality
of a dataset. This requirements analysis phase thus includes
the gathering of requirements and the subsequent analysis of those
requirements based on the use case.</p>
      <sec id="sec-3-1">
        <title>2.1.1 Step I: Use Case Analysis</title>
        <p>In this step, the user provides the details of a use case or an
application that best describes the usage of the dataset, in
order to provide a tailored quality assessment process. For
this step, we identify two types of users: (a) those who are
already consumers of the dataset and can thus provide their data
quality experiences through use cases, and (b) those who are
potential consumers of the dataset and thus cannot provide
such experiences. The first kind of user already knows which
data quality problems they faced or are prone to face. In
this case, the user guides the assessment process, since they
know the dataset's problems beforehand; in the second case
the assessment process guides the user. However, both kinds of users
are exploring the fitness for use of their dataset. This step
facilitates the choice regarding not only which dataset should
be assessed first, but also which aspects of an individual dataset
should be the initial target.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Phase II: Data Quality Assessment</title>
      <p>
        In the previous phase, we identified the user's requirements
for her dataset with the particular use case she has in mind.
This second phase involves the actual quality assessment
based on those requirements. In particular, amongst the set
of dimensions and metrics discussed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the most
relevant ones are selected. Thereafter, a quantitative evaluation
of the quality of the dataset is performed using the metrics
specific to each selected dimension. Thus, this phase
consists of three steps: (II) Identification of Quality Issues, (III)
Statistical and Low-level Analysis and (IV) Advanced
Analysis.
      </p>
      <sec id="sec-4-1">
        <title>2.2.1 Step II: Identification of quality issues</title>
        <p>The goal of this step is to identify a set of the most relevant
data quality issues based on the use case. This identification
is done with the help of a checklist, which can be filled in by
the user. The questions in the checklist implicitly refer to
quality problems and their related quality dimensions. For
example, questions such as whether the dataset provides
a message board or a mailing list (pointing to the
understandability dimension) or whether the data is provided in
different serialization formats or languages (pointing to the
versatility dimension) are presented to the user. In this
step, the user involvement is entirely manual and the user
must have knowledge about the details of the dataset to
answer these questions. The output of this step is the result
of the evaluation of the boolean dimensions, that is, a sum
of 0's (no) and 1's (yes) which adds to the final data quality
assessment score. Using this information, it is then possible
to determine a set of relevant dimensions.</p>
        <p>[Figure 1: The three phases (Requirements Analysis, Quality Assessment, Quality Improvement) and six steps (I: Use Case Analysis; II: Identification of Quality Issues; III: Statistical and Low-level Analysis; IV: Advanced Analysis; V: Root Cause Analysis; VI: Fixing Quality Problems) of the methodology.]</p>
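The boolean checklist evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' tool: the concrete questions, the mapping to dimensions, and the rule that a "no" answer flags its dimension as relevant for deeper assessment are all assumptions made for the example.

```python
# Hypothetical checklist: each question implicitly points to a quality dimension.
CHECKLIST = {
    "Does the dataset provide a message board or mailing list?": "understandability",
    "Is the data provided in different serialization formats?": "versatility",
    "Is the data provided in different languages?": "versatility",
}

def evaluate_checklist(answers):
    """Sum the 1s (yes) and 0s (no), and collect the dimensions whose
    questions were answered 'no' as candidates for deeper assessment."""
    score = sum(1 for answered_yes in answers.values() if answered_yes)
    relevant = {CHECKLIST[q] for q, answered_yes in answers.items() if not answered_yes}
    return score, relevant

answers = {q: False for q in CHECKLIST}
answers["Does the dataset provide a message board or mailing list?"] = True
print(evaluate_checklist(answers))  # → (1, {'versatility'})
```

The boolean score feeds into the final assessment score, while the flagged dimensions steer the choice of metrics in Steps III and IV.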
      </sec>
      <sec id="sec-4-2">
        <title>2.2.2 Step III: Statistical and Low-level Analysis</title>
        <p>This step performs basic statistical and low-level analysis
of the dataset. That is, generic statistics that can be
calculated automatically are included in this step. For example,
the number of blank nodes, pointing towards the
completeness of the dataset, or the number of interlinks between datasets,
showcasing the interlinking degree of the dataset, are
calculated. After the analysis, generic statistics on the dataset
based on certain pre-defined heuristics are calculated and
provided to the user. The end result is a score indicating
the value for each of the metrics assessed.</p>
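Such low-level statistics can be gathered with a single pass over the serialized triples. The sketch below assumes a simplified N-Triples input and a hypothetical local namespace prefix; it is a line-based heuristic, not a full RDF parser, and treating any object IRI outside the local namespace as an interlink is our simplification.

```python
def basic_stats(ntriples, local_prefix="http://example.org/"):
    """Simplified N-Triples scan (not a full RDF parser): counts triples,
    distinct blank nodes (relating to completeness) and objects outside
    the local namespace (a rough proxy for the interlinking degree)."""
    blank_nodes, triples, interlinks = set(), 0, 0
    for line in ntriples.strip().splitlines():
        s, _, o = line.rstrip(" .").split(None, 2)
        triples += 1
        for term in (s, o):
            if term.startswith("_:"):
                blank_nodes.add(term)
        if o.startswith("<") and not o.startswith("<" + local_prefix):
            interlinks += 1
    return {"triples": triples, "blank_nodes": len(blank_nodes),
            "interlinks": interlinks}

data = """\
<http://example.org/a> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/A> .
<http://example.org/a> <http://example.org/knows> _:b0 .
_:b0 <http://example.org/name> "Anna" ."""
print(basic_stats(data))  # → {'triples': 3, 'blank_nodes': 1, 'interlinks': 1}
```

In practice a real RDF parser would be used, but the counting logic stays the same.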
      </sec>
      <sec id="sec-4-3">
        <title>2.2.3 Step IV: Advanced Analysis</title>
        <p>
          This step, in combination with Steps II and III, is used for
assessing the overall quality of the dataset. The assessment
can be performed in different ways for different quality
dimensions. For example, in order to assess the accuracy of
data values, a pattern-based approach can be applied, which
generates data quality tests for RDF knowledge bases [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
These patterns capture incorrect values such as postal
addresses, phone numbers, email addresses, personal
identification numbers, etc.
        </p>
        <p>This step is performed by comparing values from the
transformed dataset to gold standard values (i.e. values from
the original source) or to a dataset in the same domain.
For example, in the case of measuring the population
completeness of a dataset, it needs to be compared with the
original dataset. Thus, this step requires the target or derived
dataset as well as the original or source dataset as input.
The outputs of this step are (i) the evaluation results of the comparison
between the target and original datasets, or those in the same
domain, and (ii) an aggregated value (score) of the results.
The data quality score metrics are based on a simple ratio
calculation. The score is obtained by subtracting the ratio
between the total number of instances that violate a data
quality rule (V) and the total number of relevant instances
(T) from one, as the following formula shows:</p>
        <p>DQscore = 1 - (V / T)    (1)
This score can be applied to each property of the dataset.
In case we want to calculate the quality over all
properties/attributes in a dataset, the above DQscore is multiplied
by a weight wi representing the importance of each property in the
dataset for the intended task, and the sum of
the weighted DQscores is divided by the sum of all weighting factors of
the regarded properties (W):</p>
        <p>DQweightedscore = Σi=1..n (DQscorei · wi) / W    (2)
In case of equal importance of the properties for the task
at hand, or in case it is not possible to annotate importance
values, all wi are considered equal to 1 and W
is the number of all properties that are tested in the
dataset. While in the former case the DQweightedscore is a
contextual metric, in the latter case it is considered to be an
intrinsic metric.</p>
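The per-property score and the weighted aggregate defined above can be computed directly. A minimal sketch (function names and the violation counts in the example are ours, chosen for illustration):

```python
def dq_score(violations, total):
    """Per-property score: 1 minus the ratio of violating instances (V)
    to relevant instances (T)."""
    if total <= 0:
        raise ValueError("total number of relevant instances must be positive")
    return 1.0 - violations / total

def dq_weighted_score(scores, weights=None):
    """Weighted aggregate: sum(score_i * w_i) divided by W, the sum of the
    weights. With no weights given, every property counts equally
    (all w_i = 1), i.e. the intrinsic variant."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Three hypothetical properties with 2/100, 10/50 and 0/80 violating instances:
scores = [dq_score(2, 100), dq_score(10, 50), dq_score(0, 80)]
print(round(dq_weighted_score(scores), 3))  # → 0.927
```

Passing explicit weights yields the contextual variant; omitting them yields the intrinsic one.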
        <p>At the end of this phase, the total scores from Steps II to IV
are aggregated and provided as a result to the user,
indicating the quality of the dataset. A breakdown of the scores
for each of the metrics assessed is provided so that the user
is able to look at each metric separately. Additionally,
explanations of how the assessment was performed, i.e. details
of the metrics, are available to the user so that she is able to
interpret the results in a meaningful way.</p>
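One way to present the aggregated result together with the per-metric breakdown is sketched below; the arithmetic-mean aggregation and the metric names are illustrative choices, not prescribed by the methodology.

```python
def aggregate_report(metric_scores):
    """Overall dataset score plus a per-metric breakdown, so each metric
    can also be inspected separately (mean used only for illustration)."""
    overall = sum(metric_scores.values()) / len(metric_scores)
    return {"overall_score": round(overall, 3), "breakdown": metric_scores}

report = aggregate_report({"completeness": 0.93, "interlinking": 0.6,
                           "versatility": 1.0})
print(report["overall_score"])  # → 0.843
```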
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Phase III: Quality Improvement</title>
      <p>This phase focuses on improving the quality of the
datasets based on the analysis performed in Phase II,
focusing on the use case identified in Phase I. This phase consists
of two steps: (V) Root Cause Analysis and (VI) Fixing
Quality Problems.</p>
      <sec id="sec-5-1">
        <title>2.3.1 Step V: Root Cause Analysis</title>
        <p>In this step, the main aim is to find an explanation for the
cause of the detected data quality issues, i.e. performing root
cause analysis. This step helps the user interpret and
understand the results of the data quality assessment that is
performed on her dataset. Moreover, this step is important
as the decision of whether to trust the assessment results
depends highly on a precise understanding of the evaluation
of the data quality. Essentially, this step involves (i)
detecting whether the problem occurs in the original
dataset and (ii), in case the original dataset is not available,
analyzing the dataset to detect the cause.
For example, if the data quality assessment reports a problem
of inconsistency in the dataset, the data modeling should be
checked, or if a problem of completeness is reported, the
values in the original dataset and the target dataset should be
compared to find the cause.</p>
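For the completeness example above, comparing values between the original and the target dataset reduces to a set difference. A small sketch (the function name and the set-based value representation are assumptions for illustration):

```python
def completeness_gap(original, target):
    """Root-cause aid: values present in the original dataset but missing
    from the target, and target values with no counterpart in the original."""
    return {
        "missing_in_target": original - target,
        "unexpected_in_target": target - original,
    }

original = {"Berlin", "Leipzig", "Milan"}
target = {"Berlin", "Milano"}
gap = completeness_gap(original, target)
```

Here the gap report would point at "Leipzig"/"Milan" as missing and "Milano" as a candidate transformation error.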
      </sec>
      <sec id="sec-5-2">
        <title>2.3.2 Step VI: Fixing Quality Problems</title>
        <p>In this step, strategies to address the identified root causes of
the problems are implemented. There are several strategies
that can be implemented in this step, such as:</p>
        <sec id="sec-5-2-1">
          <title>Semi-automatic or automated approaches</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>Crowdsourcing mechanisms</title>
          <p>
            Semi-automated or automated approaches can help detect
quality issues and their causes at a large scale. For example,
inconsistencies in the ontology can be detected by running
a reasoner over the entire ontology. Crowdsourcing, on the
other hand, is highly appropriate for any assignment
involving large numbers of small tasks requiring human
judgment. In terms of LD, crowdsourcing quality
assessment may involve, for example, verifying the completeness
or correctness of a fact w.r.t. the original dataset. Such a task
does not require underlying knowledge about the structure
of the data and can be done in a time- and cost-effective
manner [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3. RELATED WORK</title>
      <p>
        A number of data quality assessment methodologies and
tools have been introduced, particularly those focusing on
LD. These methodologies can be broadly classified into three
categories: (i) automated, (ii) semi-automated and (iii)
manual. There exist data quality assessment tools which work
completely automatically, such as LinkQA2, which is
designed to assess the quality of links in an automated way,
and LODStats3, which gathers comprehensive statistics (no.
of classes, properties, links etc.) about a dataset available
as RDF. On the other hand, there are generic tools for
validating the structure of an RDF document4, which only
provide a high-level analysis of the quality in terms of
representational (or modeling) problems. Tools which
semi-automatically assess data quality include Flemming's data
quality assessment tool [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; LODRefine5; DL-Learner6 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and ORE (Ontology Repair and Enrichment)7 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Tools
which entail manual assessment are Sieve [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which
assesses the quality of data using an integration process, and
WIQA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which allows users to apply a wide range of
quality-based policies to filter information.
2https://github.com/cgueret/LinkedData-QA
3http://stats.lod2.eu/
4http://swse.deri.org/RDFAlerts/, http://www.w3.org/RDF/Validator/
5http://code.zemanta.com/sparkica/
6http://dl-learner.org
      </p>
      <p>However, the automatic tools are bound to certain datasets
and neither give the user the freedom to choose a
particular dataset nor focus on a specific use case. In the case of
semi-automated tools, the user needs adequate knowledge
about the dataset in order to use them; however, these
tools are not bound to a use case. Manual tools
demand a huge amount of user involvement and
expertise and are not sensitive towards the use case.</p>
      <p>Our data quality assessment methodology is at the
intersection of these tools, as it not only focuses on a particular use
case but also allows the user to obtain low-level as well as
aggregated, higher-level analysis of the dataset. Moreover, the
methodology supports the interpretation of the results and
allows the user to retrace or, if required, even change the
input metrics to obtain the desired quality for the particular
use case. Furthermore, the methodology incorporates the
one important component missing from the existing ones:
the improvement of data quality problems once they are identified.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, we have introduced a data quality assessment
methodology consisting of three phases and six steps. This
methodology is generic enough to be applied to any use case.
In order to validate its usability, we plan to apply it to
specific use cases to assess the feasibility and effectiveness of
the methodology. This validation will also help us measure
its applicability in various domains. Moreover, we plan to
build a tool based on this methodology so as to assist users
in assessing the quality of any linked dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          .
          <source>Data Quality: Concepts</source>
          ,
          <source>Methodologies and Techniques (Data-Centric Systems and Applications)</source>
          . Springer-Verlag New York, Inc., Secaucus, NJ, USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          .
          <article-title>Quality-driven information ltering using the WIQA policy framework</article-title>
          .
          <source>Web Semantics</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , Jan.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Flemming</surname>
          </string-name>
          .
          <article-title>Quality characteristics of linked data publishing datasources</article-title>
          .
          <source>Master's thesis</source>
          , Humboldt-Universitat zu Berlin,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked Data: Evolving the Web into a Global Data Space, chapter 2, pages 1-136. Number 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology</article-title>
          . Morgan and Claypool,
          <source>1st edition</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>An empirical survey of Linked Data conformance</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juran</surname>
          </string-name>
          .
          <source>The Quality Control Handbook</source>
          . McGraw-Hill, New York,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          .
          <article-title>Test-driven evaluation of linked data quality</article-title>
          .
          <source>In WWW</source>
          , pages
          <fpage>747</fpage>
          -
          <lpage>758</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>DL-Learner: Learning Concepts in Description Logics</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>10</volume>
          :
          <fpage>2639</fpage>
          -
          <lpage>2642</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Bühmann</surname>
          </string-name>
          .
          <article-title>ORE - A Tool for Repairing and Enriching Knowledge Bases</article-title>
          .
          <source>In ISWC, LNCS</source>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          , H. Muhleisen, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Sieve: Linked Data Quality Assessment and Fusion</article-title>
          . In LWDM, March
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sherif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bühmann</surname>
          </string-name>
          , M. Morsey,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <surname>J. Lehmann.</surname>
          </string-name>
          <article-title>User-driven Quality Evaluation of DBpedia</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13</source>
          , Graz, Austria, September 4-6,
          <year>2013</year>
          , pages
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pietrobon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Quality Assessment Methodologies for Linked Data: A Survey</article-title>
          . Under review, available at http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>