-

(Linked) Data Quality Assessment: An Ontological Approach

Aparna Nayak ID

Bojan Bozic ID

Luca Longo ID

luca.longog@tudublin.ie 0 0 SFI Centre for Research Training in Machine Learning, School of Computer Science, Technological University Dublin , Dublin , Republic of Ireland

The e ective functioning of data-intensive applications usually requires that the dataset should be of high quality. The quality depends on the task they will be used for. However, it is possible to identify task-independent data quality dimensions which are solely related to data themselves and can be extracted with the help of rule mining/pattern mining. In order to assess and improve data quality, we propose an ontological approach to report data quality violated triples. Our goal is to provide data stakeholders with a set of methods and techniques to guide them in assessing and improving data quality.

Data quality assessment Data quality improvement Linked data Root cause analysis

Data quality can be perceived as ` tness for use' for a given application or a use case. Data quality is often determined by assessing if it meets the user's requirement. Assessing the data quality usually requires a large number of quality metrics to be computed rather than a single metric for a particular application. A broad range of data quality dimensions and categories of such dimensions as well as metrics for measuring these dimensions are de ned in [ 20 ]. High-quality data leads to better decision-making across the application whereas poor quality data can be re ned using multiple available techniques [ 14 ].

The linked data principles promote publishing data and interlinking them in a machine-readable format using Semantic Web standards. Knowledge graphs are seen as one of the essential components in envisioning the Semantic Web's idea. A knowledge graph is a graph-based knowledge base that can be considered as an RDF graph. Nodes in these graphs represent entities or literal, while edge represents the relation either between entities or between entities and literals. An RDF graph consists of a RDF triple where each triple (s,p,o) is an ordered set of the following RDF terms: a subject s 2 U [ B, a predicate p 2 U , and an object o 2 U [ B [ L. An RDF term is either a Uniform Resource Identi er (URI, U), a blank node (B) or a literal (L). Nodes are usually associated with a type which a class in case of an entity or a datatype in the case of literal.

In the proposed model it is expected to design a uni ed approach to publish data quality along with improved quality dataset while understanding the root causes of data quality violated triples. The following is the general work ow for assessing quality and determining the root causes of violations: 1) identify a dataset, 2) uplift data in case of non-RDF data, 3) choose quality dimensions 4) detect root causes of data quality violated triples 5) apply data quality improvement techniques. Without the use of external knowledge bases, the model provides stakeholders with a set of techniques for identifying quality problems and automatic suggestions for improving the overall quality of the dataset.

In conclusion, the proposed method can be summarised as the de nition of strategies to assess data quality. The remainder of this article is structured as follows. Section 2 describes the applicability of the proposed method. Existing methods and ontologies to assess data quality are discussed in section 3. Section 4 discusses the research questions that the proposed research aims to solve. Section 5 tries to answer the research questions in detail. Section 6 discusses preliminary proposed ontology along with gaps of existing frameworks. Section 7 mentions the evaluation plan. Finally, section 8 re ects the overall strengths of the framework. 2

Relevancy

This research proposal aims to deliver an end-to-end system that uplifts non-RDF to RDF, assesses its quality, and identi es root causes of quality violated triples to improve the quality. The e ort and time required to preprocess the data will be reduced if methods to identify the root causes of data quality violated triples are provided. The proposed system is relevant for all data publishers, contributors, and consumers as the assessment of data quality will also locate the quality violated triples. Moreover, the proposed approach aims to publish the data in RDF which is a machine-understandable format. Users can decide whether or not to x a problem in the dataset by looking at suggested quality improvement suggestions and the exact cause of the problem. 3

Related work

Data quality is described as a multidimensional concept [ 19 ]. Assessment of data quality involves inspecting multiple dimensions. Relevant dimensions for linked data quality have been elucidated exhaustively in literature [ 20 ], [ 15 ], [ 17 ]. Data quality assessment is followed by generating the quality report in a standard format. To manage data quality assessment reports e ectively, several data quality ontologies such as Data Quality Management(DQM) [ 6 ], Reasoning Violation Ontology(RVO) [ 2 ], Data Quality Ontology (daQ) [ 4 ], Data Quality Vocabulary(DQV) 1 [ 1 ] have been proposed. All the ontologies in the literature are helpful to represent data quality assessment reports except for RVO. RVO is a dedicated reasoning error ontology that helps to process errors.

1 https://www.w3.org/TR/vocab-dqv/

There exist several frameworks such as Luzzu [ 3 ], SemQuire [ 10 ], LD Sni er [ 13 ], TripleCheckMate [ 9 ], RDFUnit [ 8 ] and SWIQA [ 5 ] which aims to assess linked dataset. These frameworks di er in terms of scalability, generation of the quality report to publish results, total number of metrics assessed and use of external knowledge base. All the frameworks discussed here lack adoption of data quality improvement techniques as well as identifying root causes of quality violated triples. LiQuate [ 16 ], Sieve [ 12 ] and RDF improvements [ 7 ] are some frameworks that improves data quality after assessing couple of metrics.

The root cause analysis technique aims at identifying triples that have violated data quality. In the literature, there exists multiple data validation techniques such as SHACL 2 and rule-based reasoning [ 11 ]. These validation techniques help to validate the shape of RDF rather than assessment and improvement. An extension of Luzzu [ 18 ] identi es data quality violated triples based on data quality metric evaluation that helps stakeholders to prioritize and x the errors. However, all the data quality assessment or validation approaches require either an external knowledge base or an ontology or both. In our research, we found 14 quality dimensions in the literature that are both quanti able and have quality violated triples. We focus on the intrinsic dimension as these are explicitly relevant to assess the quality at the A-Box level. 4

Research questions

Research questions (RQs) deal with uplifting structured data, assessment and the enhancement steps considering the quality dimensions. The RQ related to the proposed method can be summarised as follows.

RQ1: To what extent the comma separated values be uplifted to linked data format dependent/independent of the domain vocabulary? Multiple mapping languages exist to uplifting the dataset to RDF. When the domain vocabulary is at hand, we aim to reuse the most promising mapping languages and de ne our ontology to uplift data to ll the gaps in missing vocabulary. On contrary, the input data can be used to learn ontology. Ontology learning helps to assess the consistency of the dataset.

RQ2: To what extent can the quality of linked data be assessed to identify the root causes of data quality violated triples without external knowledge base? In order to identify the root causes of the data quality violated triples we have to investigate the quality dimensions in particular focusing on the intrinsic dimension can be de ned independently of the external data sources.

RQ3: To what extent the performance of the proposed model can be signi cantly improved by identifying domain dependent characteristics? The proposed model is compared with baseline methods with diverse datasets to analyze the scalability, domain-dependent/independent behaviour and applicability. It also requires building a synthetic dataset to showcase data quality metrics coverage.

RQ4: To what extent all the quality metrics under consideration can be improved without degrading existing data quality? Data quality improvement technique for each metric is identi ed and applied to understand the overall data

2 https://www.w3.org/TR/shacl

quality. This helps to identify correlated metrics and understand the e cacy of data quality improvement techniques. 5

Approach

The proposed approach aims to help stakeholders to i) assess data quality problems, ii) identify triples that have violated quality, iii) understand e ective strategies in solving detected problems. The entire framework can be implemented as a linear approach, summarised as follows: { Data quality assessment : De nes actions to acquire the values of the identi ed data quality dimensions.

De nition of the quality dimensions to assess Measurement of the selected quality dimensions

Reporting the data quality measurement { Root cause analysis : Analyses the triples for data quality violated triples.

Identi cation of quality dimensions Designing an ontology to report root cause analysis

Representation of the results { Suggestions to improve triples that have violated quality: De nes a process to rectify the violations when quality is poor.

Building a knowledge base to correct the common errors identi ed in root cause analysis Implementation of rule-based method to identify appropriate data quality improvement techniques. 5.1

Data quality assessment

Data quality assessment computation usually requires an understanding of the metrics. One of the aims of the proposed method is to assess data at the A-Box level without considering the external knowledge base. Metrics that belong to the intrinsic dimension and some of the metrics that belong to the representational dimension focus on evaluating both A-box and T-box. Table 1 gives an overview of the metrics that will be considered to assess data quality. Rule mining algorithms will be used to learn frequent patterns in the linked data. This helps to assess the metrics such as no misuse of properties, correct domain and range de nition and many more. The clustering algorithm helps assess the metrics such as no misplaced classes/properties and detection of outliers. Metrics that either does not require an external knowledge base or requires a static external knowledge base are identi ed and considered for the assessment.

Table 2 gives an overview of metrics that requires only computation. Assessment only column refers that the speci ed metric is used only for assessment purposes, and the root cause of quality violation is not applied. When the raw form of RDF data is considered, metrics that are listed under direct use of RDF data can be assessed.

The root cause violations of the triples will be identi ed using data quality metrics focused on evaluating intrinsic dimensions. Of the 24 quality metrics identi ed by [ 20 ], eleven metrics neither requires an ontology nor dynamic knowledge base. The remaining dimensions are not considered mainly due to the factors 1) Input data requires to be stored on client side. 2) Need of external data sources. For example, quality dimensions in the accessibility class (availability, security, performance, interlinking, licensing) are not taken into account because this dimension focuses on and evaluates metrics when data is stored on the server. Contextual dimension focus on the context of the task at hand is also not considered for evaluation. The supported dimensions are intrinsic and representational. Hence the strengths of the proposed model are in enforcing intrinsic and representational data quality dimensions. 5.2

Root cause analysis

Root cause analysis helps to identify the reason for the violations. To develop an e ective correctable measure to correct and prevent such adverse outcomes in the future, it is critical to rst understand the cause. The evaluation of the root cause is followed by the formulation of recommendations. For example, datatype mismatch usually is a quality problem. When the system detects it, a recommendation can be given for the desired datatype along with subject and object information. 5.3

Data quality improvement

This stage aids in the improvement of data quality. The identi cation of quality violated triples is the result of root cause analyses. Depending on the data quality assessment value for each assessed metric, quality improvement techniques can be applied. Data quality improvement method helps to remove outliers, correct malformed datatype literals, infer missing triples and many more. The proposed method also aims to notify the suggestion in the form of add/delete/modify the erroneous triple, which could lead to an overall data quality improvement. 6

Discussion

Existing data quality assessment approaches compute the quality metric, and the assessment result is given as a numeric value. Prevailing assessment methods require an external knowledge base in the form of vocabulary/ontology or a dictionary. Linked data is veri ed against a given external knowledge base to generate an assessment report. The proposed approach makes use of ontology learning, rule mining, and static dictionaries to assess the quality of the data. Ontology learning and rule mining algorithm help to understand the frequent patterns that can be used to assess metrics such as consistency, syntactic validity and outliers. Dataset users should have the privilege to know the root causes of the data quality violated triples. Root cause analysis helps to identify the location of the problem, such as subject, predicate, or object, along with the type of the quality problem. Quality problem type refers to a metric that has failed on a particular triple. The frameworks presented in section 3 does not identify the quality violated triples rather only helps in assessment of metrics. An ontology will be proposed to report data quality violated triples as well as data quality assessment. A preliminary proposed ontology is as shown in gure 1. Moreover, suggestions over quality violated triples will be given that helps to improve data quality. The suggestions in terms of add/modify/delete help to improve data quality. 7

Evaluation Plan

The outcome of this research project will be a tool for supporting data scientists to uplift datasets to RDF format, assess data quality and analysing root causes of data quality violated triples to improve data quality. Multiple features must be evaluated to understand the success of the proposed approach. However, the main contribution of the work focuses on assessing data quality and analysing the root causes of data quality violated triples to improve the quality. The model's performance is also evaluated considering scalability and time taken to identify the root causes of violations of the knowledge graph. Moreover, metrics coverage can be evaluated by considering a) synthetic dataset, b) the results obtained by other tools to compare and c) results which has to be veri ed manually.

The synthetic dataset is carefully designed to verify the coverage of all the implemented metrics. Apart from the synthetic dataset, the model is evaluated by considering multiple datasets to verify applicability. The correctness of the model is veri ed by comparing the proposed model with existing tools and with the help of manual evaluation. Thus, our hypothesis that answers all the research questions de ned in section 4 is stated as follows.

Hypothesis: The time required to improve data quality after identifying the root causes of violations is less than applying random data quality improvement technique directly on the raw data. 8

Re ections

To the best of our knowledge, quality assessment, root cause analysis and quality improvement are rarely managed simultaneously in linked data. Therefore, our goal is to ll this gap by proposing a framework that helps data providers and consumers to assess and improve data quality. Moreover, the features o ered by this framework will be integrated into the ML framework to understand data quality and test the applicability of our proposal.

Acknowledgements This publication has emanated from research conducted with the nancial support of Science Foundation Ireland under Grant number 18/CRT/6183. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.

1. Albertoni , R. , Isaac , A. : Introducing the data quality vocabulary (DQV) . Semantic Web 12 ( 1 ), 81 { 97 ( 2021 )

2. Bozic , B. , Brennan , R. , Feeney , K. , Mendel-Gleason , G. : Describing reasoning results with rvo, the reasoning violations ontology . In: MEPDaW and LDQ co-located with ESWC . CEUR Workshop Proceedings , vol. 1585 , pp. 62 { 69 . CEUR-WS.org ( 2016 )

3. Debattista , J. , Auer , S. , Lange , C. : Luzzu - A methodology and framework for linked data quality assessment . ACM J. Data Inf. Qual . 8 ( 1 ), 4: 1 {4: 32 ( 2016 )

4. Debattista , J. , Lange , C. , Auer , S. : daq, an ontology for dataset quality information . In: Proceedings of the Workshop on Linked Data on the Web co-located with WWW. CEUR Workshop Proceedings , vol. 1184 . CEUR-WS.org ( 2014 )

5. Furber , C. , Hepp , M. : Swiqa - a semantic web information quality assessment framework . In: 19th European Conference on Information Systems , ECIS. p. 76 ( 2011 )

6. Furber, C. , Hepp , M. : Towards a vocabulary for data quality management in semantic web architectures . In: Proceedings of the 2011 EDBT/ICDT Workshop on Linked Web Data Management . pp. 1 { 8 . ACM ( 2011 )

7. Hogan , A. , Harth , A. , Passant , A. , Decker , S. , Polleres , A. : Weaving the pedantic web . In: Proceedings of the WWW 2010 Workshop on Linked Data on the Web. CEUR Workshop Proceedings , vol. 628 . CEUR-WS.org ( 2010 )

8. Kontokostas , D. , Westphal , P. , Auer , S. , Hellmann , S. , Lehmann , J. , Cornelissen , R. , Zaveri , A. : Test-driven evaluation of linked data quality . In: 23rd International World Wide Web Conference , WWW. pp. 747 { 758 . ACM ( 2014 )

9. Kontokostas , D. , Zaveri , A. , Auer , S. , Lehmann , J.: Triplecheckmate: A tool for crowdsourcing the quality assessment of linked data. In: Knowledge Engineering and the Semantic Web . Communications in Computer and Information Science , vol. 394 , pp. 265 { 272 . Springer ( 2013 )

10. Langer , A. , Siegert , V., Gopfert, C. , Gaedke , M. : Semquire - assessing the data quality of linked open data sources based on DQV . In: Current Trends in Web Engineering - ICWE. Lecture Notes in Computer Science , vol. 11153 , pp. 163 { 175 . Springer ( 2018 )

11. Meester , B.D. , Heyvaert , P. , Arndt , D. , Dimou , A. , Verborgh , R.: RDF graph validation using rule-based reasoning . Semantic Web 12 ( 1 ), 117 { 142 ( 2021 )

12. Mendes , P.N. , Muhleisen, H., Bizer , C. : Sieve: linked data quality assessment and fusion . In: Proceedings of the 2012 Joint EDBT/ICDT Workshops . pp. 116 { 123 . ACM ( 2012 )

13. Mihindukulasooriya , N. , Garc a-Castro, R. , Gomez-Perez , A. : LD sni er: A quality assessment tool for measuring the accessibility of linked data . In: Knowledge Engineering and Knowledge Management - EKAW. Lecture Notes in Computer Science , vol. 10180 , pp. 149 { 152 . Springer ( 2016 )

14. Paulheim , H.: Knowledge graph re nement: A survey of approaches and evaluation methods . Semantic Web 8 ( 3 ), 489 { 508 ( 2017 )

15. Radulovic , F. , Mihindukulasooriya , N. , Garc a-Castro, R. , Gomez-Perez , A. : A comprehensive quality model for linked data . Semantic Web 9 ( 1 ), 3 { 24 ( 2018 )

16. Ruckhaus , E. , Vidal , M. , Castillo , S. , Burguillos , O. , Baldizan , O. : Analyzing linked data quality with liquate . In: The Semantic Web: ESWC. Lecture Notes in Computer Science , vol. 8798 , pp. 488 { 493 . Springer ( 2014 )

17. Strong , D.M. , Lee , Y.W. , Wang , R.Y. : Data quality in context . Communications of the ACM 40 ( 5 ), 103 { 110 ( 1997 )

18. Vaidyambath , R. , Debattista , J. , Srivatsa , N. , Brennan , R.: An intelligent linked data quality dashboard . In: Proceedings for the 27th AIAI Irish Conference on Arti cial Intelligence and Cognitive Science. CEUR Workshop Proceedings , vol. 2563 , pp. 341 { 352 . CEUR-WS.org ( 2019 )

19. Wand , Y. , Wang , R.Y.: Anchoring data quality dimensions in ontological foundations . Commun. ACM 39 ( 11 ), 86 { 95 ( 1996 )

20. Zaveri , A. , Rula , A. , Maurino , A. , Pietrobon , R. , Lehmann , J. , Auer , S. : Quality assessment for linked data: A survey . Semantic Web 7 ( 1 ), 63 { 93 ( 2016 )