        Methodology for Assessment of Linked Data Quality

                               Anisa Rula                                               Amrapali Zaveri
                 University of Milano-Bicocca,                                       University of Leipzig,
         Department of Computer Science, Systems and                    Institute of Computer Science, AKSW Group
                   Communication (DISCo)                                Augustusplatz 10, D-04009 Leipzig, Germany
                 Viale Sarca 336, Milan, Italy                                zaveri@informatik.uni-leipzig.de
                   anisa.rula@disco.unimib.it


ABSTRACT
With the expansion in the amount of data being produced as Linked Data (LD), the opportunity to build use cases has also increased. However, a crippling problem for the reliability of these use cases is the underlying poor data quality. Moreover, the ability to assess the quality of the consumed LD, based on the satisfaction of the consumers' quality requirements, significantly influences the usability of such data for a given use case. In this paper, we propose a data quality assessment methodology specifically designed for LD. This methodology consists of three phases and six steps, with specific emphasis on considering a use case.

Keywords
data quality, linked data, assessment, improvement

1.    INTRODUCTION
Recently, Linked Data (LD) has contributed a sea of information to the Web, all represented in structured formats, linked with one another and made publicly available [4]. This information belongs to an enormous number of datasets covering various domains such as life sciences, geographic data or government^1. The publication of this information as Linked Data has enabled users to aggregate data from different sources to build mashups that assist in discovering new, valuable information. However, recent studies have shown that the majority of these datasets suffer from several data quality problems, such as representational, inconsistency or interoperability issues [5]. These problems significantly hinder the uptake of these datasets in particular use cases and affect the results, as the poor quality is propagated into the aggregated datasets. The ability to assess the quality of the consumed LD, based on the satisfaction of the consumers' quality requirements, significantly influences the usability of such data for any given use case.

Data quality is usually defined as fitness for use [6] and comprises several data quality dimensions (e.g. completeness, accuracy, conciseness) along with their respective metrics (means to measure a dimension). Several methodologies have been proposed to assess the quality of a dataset [2, 8, 10]. Even though these methodologies provide useful ways to assess the quality of a dataset, they often do not address a particular use case (usually involving several datasets) and demand a considerable amount of user involvement and expertise. Also, most of the output is not interpretable by humans, and the methodologies are bound to one particular dataset and its characteristics.

Therefore, in this paper, we propose a data quality assessment methodology comprising three phases and six steps (Section 2). In contrast to the previously introduced methodologies, our methodology aims to provide an overview of the entire assessment process, from identifying the problems to fixing them. We discuss related work in Section 3 and provide directions for future work in Section 4.

^1 http://lod-cloud.net/versions/2011-09-19/lod-cloud_colored.html

Copyright is held by the author/owner(s).
LDQ 2014, 1st Workshop on Linked Data Quality, Sept. 2, 2014, Leipzig, Germany.

2.    DATA QUALITY ASSESSMENT METHODOLOGY
A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' need in a specific use case [2]. In a comprehensive survey [12], it was observed that across the 30 identified approaches there was no standardized set of steps followed to assess the quality of a dataset. Inspired by the methodology proposed in [1] and by the lack of a standardized methodology for LD, we propose a methodology consisting of three phases and six steps. In particular, from each of the 30 approaches, we extracted the common steps that were proposed to assess the quality of a dataset. We then adapted and revised these steps to propose a data quality assessment methodology particularly for LD, as depicted in Figure 1.

Our methodology thus consists of the following phases and steps:

     1. Phase I: Requirements Analysis

        (a) Step I: Use Case Analysis

     2. Phase II: Quality Assessment

        (a) Step II: Identification of quality issues
        (b) Step III: Statistical and Low-level Analysis
        (c) Step IV: Advanced Analysis

     3. Phase III: Quality Improvement

        (a) Step V: Root Cause Analysis
        (b) Step VI: Fixing Quality Problems

[Figure 1: The quality assessment methodology — the three phases (Requirements Analysis, Quality Assessment, Quality Improvement) and their six steps.]

The following sections describe each of the steps in detail, along with the list of data quality dimensions (from the 18 dimensions identified in [12]) that are applicable for each step.

2.1     Phase I: Requirements Analysis
The multi-dimensional nature of data quality makes it dependent on a number of factors that can be determined by analyzing the users' requirements. Thus, the use case in question is highly important when assessing the quality of a dataset. This requirements analysis phase therefore includes the gathering of requirements and the subsequent analysis of these requirements based on the use case.

2.1.1     Step I: Use Case Analysis
In this step, the user provides the details of a use case or an application that best describes the usage of the dataset, in order to provide a tailored quality assessment process. For this step, we identify two types of users: (a) those who are already consumers of the dataset and can thus provide their data quality experiences through use cases, and (b) those who are potential consumers of the dataset and thus cannot provide such experiences. The first kind of user already knows which data quality problems they faced or are prone to face. In this case, the user guides the assessment process, since they know the dataset's problems beforehand; in the second case, the assessment process guides the user. However, both kinds of users are exploring the fitness for use of their dataset. This step facilitates the choice regarding not only which dataset should be assessed first, but also which aspects of an individual dataset should be the initial target.

2.2     Phase II: Data Quality Assessment
In the previous phase, we identified the user's requirements for her dataset with the particular use case she has in mind. This second phase involves the actual quality assessment based on those requirements. In particular, among the set of dimensions and metrics discussed in [12], the most relevant ones are selected. Thereafter, a quantitative evaluation of the quality of the dataset is performed using the metrics specific to each selected dimension. Thus, this phase consists of three steps: (II) Identification of quality issues, (III) Statistical and Low-level Analysis and (IV) Advanced Analysis.

2.2.1     Step II: Identification of quality issues
The goal of this step is to identify a set of the most relevant data quality issues based on the use case. This identification is done with the help of a checklist, which can be filled in by the user. The questions in the checklist implicitly refer to quality problems and their related quality dimensions. For example, questions such as whether the dataset provides a message board or a mailing list (pointing to the understandability dimension), or whether the data is provided in different serialization formats or languages (pointing to the versatility dimension), are presented to the user. In this step, the user involvement is entirely manual, and the user must have knowledge about the details of the dataset to answer these questions. The output of this step is the result of the evaluation of the boolean dimensions, that is, a sum of 0s (no) and 1s (yes), which contributes to the final data quality assessment score. Using this information, it is then possible to determine a set of relevant dimensions.

2.2.2     Step III: Statistical and Low-level Analysis
This step performs basic statistical and low-level analysis of the dataset; that is, it covers generic statistics that can be calculated automatically. For example, the number of blank nodes, pointing towards the completeness of the dataset, or the number of interlinks between datasets, indicating the interlinking degree of the dataset, are calculated. After the analysis, generic statistics on the dataset, based on certain pre-defined heuristics, are calculated and provided to the user. The end result is a score indicating the value of each of the metrics assessed.

2.2.3     Step IV: Advanced Analysis
This step, in combination with Steps II and III, is used for assessing the overall quality of the dataset. The assessment can be performed in different ways for different quality dimensions. For example, in order to assess the accuracy of data values, a pattern-based approach can be applied, which generates data quality tests for RDF knowledge bases [7]. These patterns capture incorrect values such as postal addresses, phone numbers, email addresses, personal identification numbers, etc.

This step is performed by comparing values from the transformed dataset to gold standard values (i.e. values from the original source) or to a dataset in the same domain. For example, in the case of measuring the population completeness of a dataset, it needs to be compared with the original dataset. Thus, this step requires the target or derived dataset as well as the original or source dataset as input. The output of this step is (i) the evaluation results obtained by comparing the target dataset with the original dataset or one in the same domain, and (ii) an aggregated value (score) of the results.

The data quality score metrics are based on a simple ratio calculation. The score is obtained by subtracting the ratio between the total number of instances that violate a data quality rule (V) and the total number of relevant instances (T) from one, as the following formula shows:

                    DQscore = 1 − (V / T)                  (1)

This score can be computed for each property of the dataset. In case we want to calculate the quality over all properties/attributes in a dataset, the DQscore of each property is multiplied by a weight w_i representing the importance of that property for the intended task, and the sum of the weighted DQscores is divided by the sum W of the weighting factors of the regarded properties:

                    DQweightedscore = (Σ_{i=1}^{n} DQscore_i · w_i) / W          (2)

In case the properties are of equal importance for the task at hand, or in case it is not possible to annotate importance values, all w_i are set to 1 and W equals the number of properties tested in the dataset. In the former case, the DQweightedscore is a contextual metric; in the latter case, it is considered an intrinsic metric.

At the end of this phase, the total scores from Steps II to IV are aggregated and provided as a result to the user, indicating the quality of the dataset. A breakdown of the scores for each of the metrics assessed is provided so that the user is able to look at each metric separately. Additionally, explanations of how the assessment was performed, i.e. details of the metrics, are available to the user so that she is able to interpret the results in a meaningful way.

2.3     Phase III: Quality Improvement
This phase focuses on improving the quality of the datasets based on the analysis performed in Phase II, focusing on the use case identified in Phase I. This phase consists of two steps: (V) Root Cause Analysis and (VI) Fixing Quality Problems.

2.3.1    Step V: Root Cause Analysis
In this step, the main aim is to find an explanation for the cause of the detected data quality issues, i.e. to perform root cause analysis. This step helps the user interpret and understand the results of the data quality assessment that is performed on her dataset. Moreover, this step is important because the decision of whether to trust the assessment results depends highly on a precise understanding of the evaluation of the data quality. Essentially, this step involves:

     • detecting whether the problem occurs in the original dataset
     • in case the original dataset is not available, analyzing the dataset to detect the cause

For example, if the data quality assessment reports a problem of inconsistency in the dataset, the data modeling should be checked; if a problem of completeness is reported, the values in the original dataset and the target dataset should be compared to find the cause.

2.3.2     Step VI: Fixing Quality Problems
In this step, strategies to address the identified root causes of the problems are implemented. Several strategies can be applied in this step, such as:

     • semi-automatic or automated approaches
     • crowdsourcing mechanisms

Semi-automated or automated approaches can help detect quality issues and their causes on a large scale. For example, inconsistencies in the ontology can be detected by running a reasoner over the entire ontology. Crowdsourcing, on the other hand, is highly appropriate for any assignment involving a large number of small tasks requiring human judgment. In terms of LD, crowdsourcing quality assessment may involve, for example, verifying the completeness or correctness of a fact w.r.t. the original dataset. Such a task does not require underlying knowledge about the structure of the data and can be done in a time- and cost-effective manner [11].

3.    RELATED WORK
A number of data quality assessment methodologies and tools have been introduced, in particular those focusing on LD. These can be broadly classified into three categories: (i) automated, (ii) semi-automated and (iii) manual. There exist data quality assessment tools that work completely automatically, such as LinkQA^2, which is designed to assess the quality of links in an automated way, and LODStats^3, which gathers comprehensive statistics (no. of classes, properties, links etc.) about a dataset available as RDF. On the other hand, there are generic tools for validating the structure of an RDF document^4, which only provide a high-level analysis of the quality in terms of representational (or modeling) problems.

^2 https://github.com/cgueret/LinkedData-QA
^3 http://stats.lod2.eu/
^4 http://swse.deri.org/RDFAlerts/, http://www.w3.org/RDF/Validator/
^5 http://code.zemanta.com/sparkica/
^6 http://dl-learner.org

Tools that semi-automatically assess data quality include Flemming's data quality assessment tool [3], LODRefine^5, DL-Learner^6 [8]
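The checklist evaluation in Step II can be sketched in a few lines: each yes/no answer contributes a 1 or a 0 to the score, and the dimensions behind the "no" answers become candidates for closer assessment. The concrete questions and dimension names below are illustrative placeholders, not taken from the paper.

```python
# Sketch of the Step II checklist: yes/no answers are scored as 1/0 and the
# sum contributes to the final quality score; dimensions behind "no" answers
# are flagged as relevant. Questions and dimensions are illustrative only.

CHECKLIST = [
    ("Does the dataset provide a message board or mailing list?", "understandability"),
    ("Is the data available in several serialization formats?", "versatility"),
    ("Are licensing terms stated explicitly?", "licensing"),
]

def evaluate_checklist(answers):
    """answers: dict mapping question text -> bool (supplied by the user)."""
    score = sum(1 for q, _ in CHECKLIST if answers.get(q, False))
    # Dimensions whose question was answered "no" (or left unanswered)
    relevant = {dim for q, dim in CHECKLIST if not answers.get(q, False)}
    return score, relevant

answers = {CHECKLIST[0][0]: True, CHECKLIST[1][0]: False}
score, relevant = evaluate_checklist(answers)  # score == 1
```

The boolean score feeds into the final assessment score, while the set of flagged dimensions narrows down which metrics Steps III and IV should compute.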
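The kind of generic statistics computed in Step III can be sketched over an in-memory list of triples; a real implementation would operate on a parsed RDF graph (e.g. via a library such as rdflib). The triples, the "_:" blank-node convention and the host-comparison heuristic for interlinks are illustrative assumptions.

```python
from urllib.parse import urlparse

# Sketch of Step III: generic statistics over an in-memory list of triples.
# Blank nodes use the conventional "_:" prefix; an interlink is counted when
# the subject and object URIs live under different hosts. Example data is
# invented for illustration.

triples = [
    ("http://example.org/Berlin", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Berlin"),
    ("http://example.org/Berlin", "http://example.org/population", "_:b0"),
    ("_:b0", "http://example.org/value", "3500000"),
]

def is_blank(term):
    return isinstance(term, str) and term.startswith("_:")

def dataset_statistics(triples):
    blank_nodes = {t for s, p, o in triples for t in (s, o) if is_blank(t)}
    interlinks = sum(
        1 for s, p, o in triples
        if o.startswith("http") and not is_blank(s)
        and urlparse(s).netloc != urlparse(o).netloc
    )
    return {
        "triples": len(triples),
        "blank_nodes": len(blank_nodes),  # hints at completeness issues
        "interlinks": interlinks,         # interlinking degree
    }

stats = dataset_statistics(triples)
```

Each statistic maps to a metric score that is reported back to the user, as described above.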
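A pattern-based accuracy test of the kind generated in Step IV can be sketched as a regular expression over property values: the test yields the violation count V and the total count T that the scoring formulas of this section consume. The regex and the example values are invented for illustration and are deliberately simplistic.

```python
import re

# Sketch of a pattern-based quality test in the spirit of Step IV: a regular
# expression encodes the expected shape of a value; the test returns the
# counts V (violations) and T (relevant instances) used by the DQscore.
# Pattern and data are illustrative only.

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def run_pattern_test(values, pattern):
    relevant = len(values)                                        # T
    violations = sum(1 for v in values if not pattern.match(v))   # V
    return violations, relevant

emails = ["zaveri@informatik.uni-leipzig.de", "not-an-email", "a@b.org"]
v, t = run_pattern_test(emails, EMAIL_PATTERN)  # v == 1, t == 3
```

Analogous patterns could be written for postal addresses, phone numbers or identification numbers.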
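Equations (1) and (2) translate directly into code. The following is a minimal sketch; the per-property violation counts, totals and weights are invented example numbers.

```python
# Sketch of the scoring formulas of Section 2.2.3. Equation (1) gives the
# per-property simple-ratio score; Equation (2) aggregates the weighted
# per-property scores, dividing by W, the sum of the weights.

def dq_score(violations, total):
    """Equation (1): DQscore = 1 - (V / T)."""
    return 1.0 - violations / total

def dq_weighted_score(scored_properties):
    """Equation (2): sum of weighted DQscores divided by W = sum of weights.

    scored_properties: list of (violations, total, weight) tuples, one per
    property. With all weights equal to 1, W is simply the property count.
    """
    w_total = sum(w for _, _, w in scored_properties)
    return sum(dq_score(v, t) * w for v, t, w in scored_properties) / w_total

# Two example properties: one important (weight 2), one less so (weight 1).
props = [(10, 100, 2.0), (50, 100, 1.0)]
overall = dq_weighted_score(props)  # (0.9*2 + 0.5*1) / 3 ≈ 0.767
```

Setting every weight to 1 reproduces the unweighted (intrinsic) case described above.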
and ORE (Ontology Repair and Enrichment)^7 [9]. Tools that entail manual assessment are Sieve [10], which assesses the quality of data using an integration process, and WIQA [2], which allows users to apply a wide range of quality-based policies to filter information.

However, the automated tools are bound to certain datasets and allow the user neither the freedom to choose a particular dataset nor to focus on a specific use case. In the case of semi-automated tools, the user needs adequate knowledge about the dataset in order to use the tool; however, these tools are not bound to a use case. Manual tools demand a huge amount of user involvement and expertise and are not sensitive to the use case.

Our data quality assessment methodology is at the intersection of these tools, as it not only focuses on a particular use case but also allows the user to obtain low-level as well as aggregated, higher-level analyses of the dataset. Moreover, the methodology supports the interpretation of the results and allows the user to retrace or, if required, even change the input metrics to obtain the desired quality for the particular use case. Furthermore, the methodology incorporates one important component missing from the existing ones: the improvement of data quality problems once they are identified.

^7 http://ore-tool.net

4.    CONCLUSIONS AND FUTURE WORK
In this paper, we have introduced a data quality assessment methodology consisting of three phases and six steps. The methodology is generic enough to be applied to any use case. In order to validate its usability, we plan to apply it to specific use cases to assess its feasibility and effectiveness. This validation will also help us measure its applicability in various domains. Moreover, we plan to build a tool based on this methodology so as to assist users in assessing the quality of any linked dataset.

5.    REFERENCES
 [1] C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
 [2] C. Bizer and R. Cyganiak. Quality-driven information filtering using the WIQA policy framework. Web Semantics, 7(1):1-10, Jan 2009.
 [3] A. Flemming. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
 [4] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space, chapter 2, pages 1-136. Number 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 1st edition, 2011.
 [5] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. Journal of Web Semantics, 2012.
 [6] J. Juran. The Quality Control Handbook. McGraw-Hill, New York, 1974.
 [7] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In WWW, pages 747-758, 2014.
 [8] J. Lehmann. DL-Learner: Learning Concepts in Description Logics. Journal of Machine Learning Research, 10:2639-2642, 2009.
 [9] J. Lehmann and L. Bühmann. ORE - A Tool for Repairing and Enriching Knowledge Bases. In ISWC, LNCS. Springer, 2010.
[10] P. Mendes, H. Mühleisen, and C. Bizer. Sieve: Linked Data Quality Assessment and Fusion. In LWDM, March 2012.
[11] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013, pages 97-104. ACM, 2013.
[12] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment Methodologies for Linked Data: A Survey. Under review, available at http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data.