=Paper=
{{Paper
|id=Vol-1215/paper-04
|storemode=property
|title=Methodology for Assessment of Linked Data Quality
|pdfUrl=https://ceur-ws.org/Vol-1215/paper-04.pdf
|volume=Vol-1215
|dblpUrl=https://dblp.org/rec/conf/i-semantics/RulaZ14
}}
==Methodology for Assessment of Linked Data Quality==
Methodology for Assessment of Linked Data Quality

Anisa Rula
University of Milano-Bicocca, Department of Computer Science, Systems and Communication (DISCo)
Viale Sarca 336, Milan, Italy
anisa.rula@disco.unimib.it

Amrapali Zaveri
University of Leipzig, Institute of Computer Science, AKSW Group
Augustusplatz 10, D-04009 Leipzig, Germany
zaveri@informatik.uni-leipzig.de

ABSTRACT
With the expansion in the amount of data being produced as Linked Data (LD), the opportunity to build use cases has also increased. However, a crippling problem for the reliability of these use cases is the underlying poor data quality. Moreover, the ability to assess the quality of the consumed LD, based on the satisfaction of the consumers' quality requirements, significantly influences the usability of such data for a given use case. In this paper, we propose a data quality assessment methodology specifically designed for LD. This methodology consists of three phases and six steps, with specific emphasis on considering a use case.

Keywords
data quality, linked data, assessment, improvement

Copyright is held by the author/owner(s). LDQ 2014, 1st Workshop on Linked Data Quality, Sept. 2, 2014, Leipzig, Germany.

1. INTRODUCTION
Recently, Linked Data (LD) has contributed a sea of information to the Web, all represented in structured formats, linked with one another and made publicly available [4]. This information belongs to an enormous number of datasets covering various domains such as life sciences, geographic data, or governmental data (see http://lod-cloud.net/versions/2011-09-19/lod-cloud_colored.html). Publication of this information as Linked Data has enabled users to aggregate data from different sources to build mashups that assist in discovering new valuable information. However, recent studies have shown that the majority of these datasets suffer from several data quality problems, such as representational, inconsistency or interoperability issues [5]. These problems significantly hinder the uptake of the datasets in particular use cases and affect the results, as the poor quality is propagated into the aggregated datasets. The ability to assess the quality of the consumed LD, based on the satisfaction of the consumers' quality requirements, significantly influences the usability of such data for any given use case.

Data quality is usually defined as fitness for use [6] and comprises several data quality dimensions (e.g. completeness, accuracy, conciseness) along with their respective metrics (means to measure the dimension). Several methodologies have been proposed to assess the quality of a dataset [2, 8, 10]. Even though these methodologies provide useful ways to assess the quality of a dataset, they often do not address a particular use case (usually involving several datasets) and demand a considerable amount of user involvement and expertise. Also, most of the output is not interpretable by humans, and the methodologies are bound to one particular dataset and its characteristics.

Therefore, in this paper, we propose a data quality assessment methodology comprising three phases and six steps (Section 2). In contrast to the previously introduced methodologies, our methodology aims to provide an overview of the entire assessment process, right from identifying the problems to fixing them. We discuss related work in Section 3 and provide directions for future work in Section 4.

2. DATA QUALITY ASSESSMENT METHODOLOGY
A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumers' need in a specific use case [2]. In a comprehensive survey [12], it was observed that among the 30 identified approaches there was no standardized set of steps followed to assess the quality of a dataset. Inspired by the methodology proposed in [1] and the lack of a standardized methodology for LD, we propose a methodology consisting of three phases and six steps. In particular, from each of the 30 approaches, we extracted the common steps that were proposed to assess the quality of a dataset. We then adapted and revised these steps to propose a data quality assessment methodology particularly for LD, as depicted in Figure 1. Our methodology thus consists of the following phases and steps:

1. Phase I: Requirements Analysis
   (a) Step I: Use Case Analysis
2. Phase II: Quality Assessment
   (a) Step II: Identification of quality issues
   (b) Step III: Statistical and Low-level Analysis
   (c) Step IV: Advanced Analysis
3. Phase III: Quality Improvement
   (a) Step V: Root Cause Analysis
   (b) Step VI: Fixing Quality Problems

Figure 1: The quality assessment methodology.

The following sections describe each of the steps in detail, along with the list of data quality dimensions (from the 18 dimensions identified in [12]) that are applicable for each step.

2.1 Phase I: Requirements Analysis
The multi-dimensional nature of data quality makes it dependent on a number of factors that can be determined by analyzing the users' requirements. Thus, the use case in question is highly important when assessing the quality of a dataset. This requirements analysis phase therefore includes the gathering of requirements and their subsequent analysis based on the use case.

2.1.1 Step I: Use Case Analysis
In this step, the user provides the details of a use case or an application that best describes the usage of the dataset, in order to provide a tailored quality assessment process. For this step, we identify two types of users: (a) those who are already consumers of the dataset and thus provide their data quality experiences through use cases, and (b) those who are potential consumers of the dataset and thus cannot provide such experiences. The first kind of users already know what data quality problems they faced or are prone to face. In this case, the user guides the assessment process, since they know the dataset's problems beforehand; in the second case, the assessment process guides the user. However, both kinds of users are exploring the fitness for use of their dataset. This step facilitates the choice regarding not only which dataset should be assessed first, but also which aspects of an individual dataset should be the initial target.

2.2 Phase II: Data Quality Assessment
In the previous phase, we identified the user's requirements for her dataset with the particular use case she has in mind. This second phase involves the actual quality assessment based on those requirements. In particular, among the set of dimensions and metrics discussed in [12], the most relevant ones are selected. Thereafter, a quantitative evaluation of the quality of the dataset is performed using the metrics specific to each selected dimension. Thus, this phase consists of three steps: (II) Identification of quality issues, (III) Statistical and Low-level Analysis, and (IV) Advanced Analysis.

2.2.1 Step II: Identification of quality issues
The goal of this step is to identify a set of the most relevant data quality issues based on the use case. This identification is done with the help of a checklist, which can be filled in by the user. The questions in the checklist implicitly refer to quality problems and their related quality dimensions. For example, questions such as whether the dataset provides a message board or a mailing list (pointing to the understandability dimension), or whether the data is provided in different serialization formats or languages (pointing to the versatility dimension), are presented to the user. In this step, the user involvement is entirely manual, and the user must have knowledge about the details of the dataset to answer these questions. The output of this step is the result of the evaluation of the boolean dimensions, that is, a sum of 0s (no) and 1s (yes), which contributes to the final data quality assessment score. Using this information, it is then possible to determine a set of relevant dimensions.

2.2.2 Step III: Statistical and Low-level Analysis
This step performs basic statistical and low-level analysis on the dataset; that is, generic statistics that can be calculated automatically are included in this step. For example, the number of blank nodes, pointing towards the completeness of the dataset, or the number of interlinks between datasets, showcasing the interlinking degree of the dataset, are calculated. After the analysis, generic statistics on the dataset based on certain pre-defined heuristics are calculated and provided to the user. The end result is a score indicating the value for each of the metrics assessed.

2.2.3 Step IV: Advanced Analysis
This step, in combination with Steps II and III, is used for assessing the overall quality of the dataset. The assessment can be performed in different ways for different quality dimensions. For example, in order to assess the accuracy of data values, a pattern-based approach can be applied, which generates data quality tests for RDF knowledge bases [7]. These patterns capture incorrect values such as postal addresses, phone numbers, email addresses, personal identification numbers, etc.

This step is performed by comparing values from the transformed dataset to the gold standard values (i.e. values from the original source) or to a dataset in the same domain. For example, in the case of measuring the population completeness of a dataset, it needs to be compared with the original dataset. Thus, this step requires the target or derived dataset as well as the original or source dataset as input. The output of this step is (i) the evaluation results of the comparison between the target and original datasets, or those in the same domain, and (ii) an aggregated value (score) of the results.

The data quality score metrics are based on a simple ratio calculation. The ratio is measured by subtracting the ratio between the total number of instances that violate a data quality rule (V) and the total number of relevant instances (T) from one, as the following formula shows:

DQscore = 1 - (V / T)    (1)

This score can be applied to each property of the dataset. In case we want to calculate the quality over all properties/attributes in a dataset, the above DQscore is multiplied with a weight w_i representing the importance of the property for the intended task, and the sum of the weighted DQscores is divided by the sum of all weighting factors of the regarded properties (W):

DQweightedscore = Σ_{i=1}^{n} (DQscore_i * w_i) / W    (2)

In case of equal importance of the properties for the task at hand, or in case it is not possible to annotate importance values, all w_i are considered equal to 1, and W is the number of all properties that are tested in the dataset. While in the former case the DQweightedscore is a contextual metric, in the latter case it is considered to be an intrinsic metric.

At the end of this phase, the total scores from Steps II to IV are aggregated and provided as a result to the user, indicating the quality of the dataset. A breakdown of the scores for each of the metrics assessed is provided so that the user is able to look at each metric separately. Additionally, explanations of how the assessment was performed, i.e. details of the metrics, are available to the user so that she is able to interpret the results in a meaningful way.

2.3 Phase III: Quality Improvement
This phase focuses on improving the quality of the datasets based on the analysis performed in Phase II, focusing on the use case identified in Phase I. It consists of two steps: (V) Root Cause Analysis and (VI) Fixing Quality Problems.

2.3.1 Step V: Root Cause Analysis
In this step, the main aim is to find an explanation for the cause of the detected data quality issues, i.e. performing root cause analysis. This step helps the user interpret and understand the results of the data quality assessment performed on her dataset. Moreover, this step is important as the decision of whether to trust the assessment results depends highly on a precise understanding of the evaluation of the data quality. Essentially, this step involves:

- detecting whether the problem occurs in the original dataset
- in case the original dataset is not available, analyzing the dataset to detect the cause

For example, if the data quality assessment reports a problem of inconsistency in the dataset, the data modeling should be checked; if a problem of completeness is reported, the values in the original dataset and the target dataset should be compared to find the cause.

2.3.2 Step VI: Fixing Quality Problems
In this step, strategies to address the identified root causes of the problems are implemented. Several strategies can be applied in this step, such as:

- Semi-automatic or automated approaches
- Crowdsourcing mechanisms

Semi-automated or automated approaches can help detect quality issues and their causes on a large scale. For example, inconsistencies in the ontology can be detected by running a reasoner over the entire ontology. Crowdsourcing, on the other hand, is highly appropriate for any assignment involving large numbers of small tasks requiring human judgment. In terms of LD, crowdsourcing quality assessment may involve, for example, verifying the completeness or correctness of a fact with respect to the original dataset. Such a task does not require underlying knowledge about the structure of the data and can be done in a time- and cost-effective manner [11].

3. RELATED WORK
A number of data quality assessment methodologies and tools have been introduced, particularly those focusing on LD. These methodologies can be broadly classified into three categories: (i) automated, (ii) semi-automated and (iii) manual. There exist data quality assessment tools which work completely automatically, such as LinkQA (https://github.com/cgueret/LinkedData-QA), which is designed to assess the quality of links in an automated way, and LODStats (http://stats.lod2.eu/), which gathers comprehensive statistics (number of classes, properties, links etc.) about a dataset available as RDF. On the other hand, there are generic tools for validating the structure of an RDF document (e.g. http://swse.deri.org/RDFAlerts/, http://www.w3.org/RDF/Validator/), which only provide a high-level analysis of quality in terms of representational (or modeling) problems. Tools which semi-automatically assess data quality include Flemming's data quality assessment tool [3], LODRefine (http://code.zemanta.com/sparkica/), DL-Learner (http://dl-learner.org) [8] and ORE (Ontology Repair and Enrichment, http://ore-tool.net) [9]. Tools which entail manual assessment are Sieve [10], which assesses the quality of data using an integration process, and WIQA [2], which allows users to apply a wide range of quality-based policies to filter information.

However, the automatic tools are bound to certain datasets and allow the user neither the freedom to choose a particular dataset nor to focus on a specific use case. In the case of semi-automated tools, the user needs adequate knowledge about the dataset in order to use the tool; however, these tools are not bound to a use case. Manual tools demand a huge amount of user involvement and expertise and are not sensitive to the use case.

Our data quality assessment methodology is at the intersection of these tools, as it not only focuses on a particular use case but also allows the user to obtain low-level as well as aggregated, higher-level analyses of the dataset. Moreover, the methodology supports the interpretation of the results and allows the user to retrace or, if required, even change the input metrics to obtain the desired quality for the particular use case. Furthermore, the methodology incorporates the one important component missing from the existing ones: the improvement of data quality once problems are identified.

4. CONCLUSIONS AND FUTURE WORK
In this paper, we have introduced a data quality assessment methodology consisting of three phases and six steps. This methodology is generic enough to be applied to any use case. In order to validate its usability, we plan to apply it to specific use cases to assess its feasibility and effectiveness. This validation will also help us measure its applicability in various domains. Moreover, we plan to build a tool based on this methodology so as to assist users in assessing the quality of any linked dataset.

5. REFERENCES
[1] C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[2] C. Bizer and R. Cyganiak. Quality-driven information filtering using the WIQA policy framework. Web Semantics, 7(1):1-10, Jan 2009.
[3] A. Flemming. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[4] T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space, chapter 2, pages 1-136. Number 1:1 in Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan and Claypool, 1st edition, 2011.
[5] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker. An empirical survey of Linked Data conformance. Journal of Web Semantics, 2012.
[6] J. Juran. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[7] D. Kontokostas, P. Westphal, S. Auer, S. Hellmann, J. Lehmann, R. Cornelissen, and A. Zaveri. Test-driven evaluation of linked data quality. In WWW, pages 747-758, 2014.
[8] J. Lehmann. DL-Learner: Learning Concepts in Description Logics. Journal of Machine Learning Research, 10:2639-2642, 2009.
[9] J. Lehmann and L. Bühmann. ORE - A Tool for Repairing and Enriching Knowledge Bases. In ISWC, LNCS. Springer, 2010.
[10] P. Mendes, H. Mühleisen, and C. Bizer. Sieve: Linked Data Quality Assessment and Fusion. In LWDM, March 2012.
[11] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven Quality Evaluation of DBpedia. In Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013, pages 97-104. ACM, 2013.
[12] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment Methodologies for Linked Data: A Survey. Under review, available at http://www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data.
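The output of Step I (Use Case Analysis) could be captured as a small structured record that later phases consume. The following is an illustrative sketch only, not from the paper: the field names, the keyword-to-dimension table, and the dimension names are our own assumptions about how a use case might be mapped to an initial set of quality dimensions.

```python
# Hypothetical sketch of a Step I "use case" record. The two user types from
# the paper are modelled by the existing_consumer flag: type (a) users report
# known problems, type (b) users do not.
from dataclasses import dataclass, field

@dataclass
class UseCase:
    description: str                                      # free-text description of the application
    existing_consumer: bool                               # (a) current consumer vs (b) potential consumer
    known_problems: list = field(default_factory=list)    # filled only by type (a) users

def relevant_dimensions(use_case):
    """Map reported problems to candidate quality dimensions to assess first."""
    # Invented keyword -> dimension table, for illustration only.
    table = {"missing": "completeness",
             "wrong": "accuracy",
             "duplicate": "conciseness"}
    dims = {dim for word, dim in table.items()
            for problem in use_case.known_problems if word in problem}
    # Type (b) users report no problems, so the assessment process guides
    # them: start from all candidate dimensions instead.
    return sorted(dims) if dims else sorted(set(table.values()))

uc = UseCase("city mashup over geographic datasets", True, ["missing population values"])
print(relevant_dimensions(uc))  # -> ['completeness']
```

In this sketch a type (a) user guides the assessment (their known problems select the dimensions), while a type (b) user is guided by it (all candidate dimensions are proposed), mirroring the two cases described above.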
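The two scoring formulas of Step IV can be written down directly. This is a minimal sketch of DQscore = 1 - (V/T) per property and the weighted aggregate over all tested properties; the function and variable names are our own, not from the paper.

```python
# Per-property score: 1 minus the ratio of instances violating a data
# quality rule (V) to the relevant instances (T)  -- formula (1).
def dq_score(violations, total):
    return 1 - violations / total

# Weighted aggregate over all tested properties -- formula (2). With no
# weights given, all w_i = 1, so W is simply the number of tested
# properties (the intrinsic-metric case described in the paper).
def dq_weighted_score(scores, weights=None):
    if weights is None:
        weights = [1] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Example: 2 of 10 population values violate a completeness rule -> 0.8;
# a second property has no violations -> 1.0; equal weights give 0.9.
s1 = dq_score(2, 10)
s2 = dq_score(0, 10)
print(s1, dq_weighted_score([s1, s2]))  # -> 0.8 0.9
```

Passing explicit weights turns the same aggregate into the contextual metric: properties that matter more for the task at hand pull the overall score towards their own value.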
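The end of Phase II, where the scores from Steps II to IV are aggregated but a per-metric breakdown and explanations are kept for the user, can be sketched as follows. The metric names, the explanation strings, and the simple averaging are invented for illustration; the paper does not prescribe a specific aggregation beyond formulas (1) and (2).

```python
# Illustrative sketch: aggregate per-metric scores into a total while
# preserving the breakdown and an explanation per metric, so the user can
# look at each metric separately and interpret the result.
def quality_report(metric_scores, explanations):
    """Return the aggregated score plus the per-metric breakdown."""
    total = sum(metric_scores.values()) / len(metric_scores)
    return {"total": round(total, 3),
            "breakdown": metric_scores,
            "explanations": explanations}

report = quality_report(
    {"completeness": 0.8, "interlinking": 0.5},
    {"completeness": "1 - ratio of blank-node subjects",          # invented metric detail
     "interlinking": "ratio of externally linked instances"},     # invented metric detail
)
print(report["total"])  # -> 0.65
```

Keeping the explanations alongside the scores is what lets the user decide whether to trust the assessment, which is exactly the concern Root Cause Analysis (Step V) picks up afterwards.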