<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving maintenance of community-based knowledge graphs</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Nicolas</forename><surname>Ferranti</surname></persName>
							<email>nicolas.ferranti@wu.ac.at</email>
							<affiliation key="aff0">
								<orgName type="institution">Vienna University of Economics and Business</orgName>
								<address>
									<addrLine>Welthandelspl. 1</addrLine>
									<postCode>1020</postCode>
									<settlement>Vienna</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improving maintenance of community-based knowledge graphs</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">83FB6A7008D92375B7B17C6744B22F15</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Quality</term>
					<term>Knowledge Graph</term>
					<term>Wikidata</term>
					<term>Refinement</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Data quality is crucial for the effective utilization of knowledge graphs, yet ensuring it is challenging due to the need for continuous monitoring and maintenance. This research proposal focuses on data quality in open knowledge graphs, with an emphasis on Wikidata. Wikidata, one of the largest collaborative knowledge graphs, has its own approach to data consistency, deviating from regular OWL ontologies and from SHACL, the W3C recommendation. The proposal aims to comprehend and formalize Wikidata's approaches for assessing and resolving data inconsistencies. By formalizing constraints, refinement operations, and repair strategies, this research aims to improve the quality of Wikidata and of other knowledge graphs built on Wikibase. As one of the contributions, our research proposes a semi-automatic refinement pipeline to empower the Wikidata user community by recommending repairs of constraint violations, combining distance-based refinement approaches and ranking heuristics. Establishing a comprehensive framework and engaging users in knowledge graph maintenance enhances the reliability and usability of open knowledge graphs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>A Knowledge Graph (KG) uses a graph-based model to represent real-world entities, their attributes, and relationships <ref type="bibr" target="#b10">[11]</ref>. Entities can be anything that can be uniquely identified and described, such as people, places, things, or concepts. Statements, which represent relationships between entities, are modeled as edges in the graph, while the attributes of entities are again graph nodes. As such, KGs can be used to represent a wide range of information, including encyclopedic knowledge, scientific data, or enterprise information <ref type="bibr" target="#b7">[8]</ref>.</p><p>There are different methods for creating a KG. For instance, a KG can be extracted from semi-structured Web data, like the DBpedia KG <ref type="bibr" target="#b8">[9]</ref>, or edited collaboratively by a community, like the Wikidata KG <ref type="bibr" target="#b15">[16]</ref>. Regardless of the construction approach, KGs are not perfect. Despite the efforts of communities and organizations to increase the coverage of their respective KGs, it is complex to represent all the knowledge available about a domain. Therefore, organizations responsible for KGs usually seek a balance between correctness and information coverage.</p><p>Since its creation by the Wikimedia Foundation in 2012, Wikidata (WD) has become one of the largest KGs publicly available on the Web, with more than 100M items<ref type="foot" target="#foot_0">1</ref> and 14B triples<ref type="foot" target="#foot_1">2</ref>. One of the main factors responsible for this growth is WD's user community, with more than 24k active users (humans and bots). The large user community is primarily motivated by Wikipedia, as the vast majority of Wikipedia pages incorporate content from WD <ref type="bibr" target="#b2">[3]</ref>. 
The WD KG is available in standard RDF format and can be queried via a public SPARQL endpoint, but it does not adhere to established W3C standards, such as OWL/RDFS or SHACL, to express constraints on its schema: unlike other knowledge graphs, WD focuses on the development of the data layer (A-Box), while the terminology layer (T-Box) evolves alongside it without a predefined formal ontology <ref type="bibr" target="#b13">[14]</ref>.</p><p>Problem statement. In our proposal, we address the problem of data quality in KGs with an emphasis on WD. WD uses its own constraint mechanism to ensure consistent vocabulary usage; its projects currently utilize neither SHACL, the W3C recommendation for validating RDF graphs against constraints, nor OWL ontologies. Studies that capture the semantics behind the WD constraint projects and allow violations to be easily tested and consumed by third parties are lacking. Our hypothesis is that such studies could leverage the development of approaches to assist users in refining inconsistent data, as well as promote constraint checking in WD through different approaches, such as mapping constraints to SHACL and SPARQL. Currently, data edits are done manually and without a support tool for the community, which increases the effort required to correct inconsistent data.</p><p>Contributions. In this proposal, our goal is to initially study the semantics of the constraints used by WD. Among the main contributions is the formalization of constraints in both SHACL and SPARQL, which creates one of the biggest benchmarks for SHACL validators and allows other agents to retrieve inconsistent data in real time through WD's SPARQL endpoint. The formalization of constraints is considered the first step to make the contributions summarized in Fig. 1 possible. Paper structure. The remainder of this research proposal is structured as follows. Section 2 presents related work focusing on data quality and constraint formalization. 
Section 3 presents the main research questions and hypotheses, organized in steps. Section 4 discusses some preliminary results concerning the formalization of WD constraints. Section 5 introduces the methodology for our next steps based on the current findings. Finally, Section 6 points to the main future directions.</p></div>
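To make the setting concrete, WD stores its constraint definitions as ordinary statements on properties, so they can themselves be queried from the public SPARQL endpoint. A query sketch (assuming the default Wikidata Query Service prefixes wd:/p:/ps:/wikibase:/bd:; P2302 is WD's "property constraint" property and P569 is "date of birth"):

```sparql
# List the constraint types declared on P569 (date of birth).
SELECT ?constraintType ?constraintTypeLabel WHERE {
  wd:P569 p:P2302 ?constraintStatement .      # one statement per declared constraint
  ?constraintStatement ps:P2302 ?constraintType .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

Qualifiers attached to ?constraintStatement (e.g. the required property of an item-requires-statement constraint) carry the constraint's parameters.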
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Literature</head><p>Ensuring the quality of the data in a KG is fundamental for useful consumption.</p><p>Paulheim <ref type="bibr" target="#b10">[11]</ref> provided a comprehensive survey of KG refinement approaches, classified according to their goal: completing the KG or repairing detected errors. These approaches apply different methods, ranging from machine learning to NLP-related techniques. The results showed that the vast majority of approaches focus on DBpedia, indicating a gap when it comes to WD and other Wikibase-based KGs. Furthermore, Shenoy et al. <ref type="bibr" target="#b14">[15]</ref> presented a quality analysis of WD focusing on correctness, checking for weak statements under three main indicators: constraint violations, community agreement, and deprecation. The premise is that a statement receives a low-quality score when it violates some constraint, highlighting the importance of constraints for KG refinement. Shenoy and colleagues use a subset of constraints to retrieve violations through a toolkit and also mention the challenges of testing WD constraints using SHACL. However, their focus is on combining violation indicators with other indicators toward a quality metric, rather than on formalizing the semantics of the constraints and promoting practical means for testing and repairs.</p><p>Martin and Patel-Schneider <ref type="bibr" target="#b9">[10]</ref> discuss the representation of WD property constraints through multi-attributed relational structures (MARS), a logical framework for WD. Constraints are represented in MARS using extended multi-attributed predicate logic (eMAPL), providing a logical characterization of constraints. Despite covering 26 different constraint types, the authors have not performed practical experiments to evaluate the accuracy of the proposed formalization, nor its efficiency. Ahmetaj et al. 
<ref type="bibr" target="#b0">[1]</ref> propose an approach to repair SHACL violations. The approach involves encoding the problem as an Answer Set Programming (ASP) program: by transforming the graph and a set of SHACL shapes into an ASP program P, the answer sets (stable models) of P represent possible repairs. The use of efficient ASP solvers, such as clingo <ref type="bibr" target="#b6">[7]</ref>, offers a promising means to generate practical data repairs. One of the major benefits of formalizing WD constraints in SHACL is to enable the use of the various solutions already implemented for SHACL constraints in the context of WD.</p><p>In conclusion, the existing literature on KG refinement has primarily focused on KGs like DBpedia, leaving a gap in understanding and refining KGs such as WD and other Wikibase-based KGs. While previous studies have highlighted the importance of constraints in KG quality analysis, there is a need to formalize the semantics of these constraints and establish practical means for testing them. By formalizing constraints, violations and repairs can be collected and serve as valuable input for refinement models.</p></div>
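As a concrete illustration of what such a SHACL formalization can look like, consider a hypothetical SHACL-core shape (an illustrative example, not one of the shapes from the cited works) encoding an item-requires-statement-style rule: every item with a truthy P569 (date of birth) must also carry a P31 (instance of) value.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix ex:  <http://example.org/shapes/> .   # hypothetical shape namespace

ex:ItemRequiresStatementShape
    a sh:NodeShape ;
    sh:targetSubjectsOf wdt:P569 ;   # every item carrying a truthy P569
    sh:property [
        sh:path wdt:P31 ;
        sh:minCount 1 ;              # ...must also have at least one P31
    ] .
```

Shapes of this form are what an off-the-shelf SHACL validator, or an ASP-based repair generator as in [1], would consume.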
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Research Questions and Hypotheses</head><p>Building on the existing challenge of refining community-based KGs, we see the need for semi-automatic refinement approaches able to provide the community with repair suggestions based on both formal definitions of constraints and statistical analysis. Inconsistencies are the primary input for performing corrective repairs. As pointed out by <ref type="bibr" target="#b14">[15]</ref>, WD inconsistency reports are calculated within an ad-hoc extension of Wikibase. On top of that, the approach behind the generation of the reports is not public, and the reports are published on an HTML page with a maximum number of inconsistencies displayed per constraint type. Therefore, it is crucial to formalize the constraints and create an open and efficient method to retrieve inconsistencies.</p><p>In the recent past, embedding approaches have been used to address KG completion and link prediction. Bordes et al. <ref type="bibr" target="#b1">[2]</ref> introduced translation-based embeddings for KGs. We would like to test whether such approaches, and AI models trained on identified inconsistencies and repairs, can provide the community with relevant repair suggestions.</p><p>To this end, we summarise the main hypothesis of our research proposal as follows:</p><p>Effective maintenance of community-based KGs can be achieved by: (i) formally defined constraints in languages that optimize the process of collecting inconsistent data; (ii) a set of approaches to propose refinement suggestions using inconsistent data and previous repairs as input; (iii) a set of heuristics to rank candidate repairs according to different preferences.</p><p>Our hypothesis leads to the following research questions:</p><p>1. Can the semantics of Wikidata property constraints be represented with SHACL-core and SPARQL? 2. 
How can we make use of inconsistencies and historical repairs to propose refinements to knowledge graphs? 3. How can distance-based metrics be used to propose/predict repairs to the knowledge graph? 4. What are the most relevant strategies to rank different repair suggestions?</p></div>
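As a minimal sketch of the distance-based idea behind research question 3 (toy embeddings and hypothetical entity names, not the proposed system), candidate repair values for a violating statement can be ranked by the TransE-style score ||h + r − t||, where a lower score means a more plausible triple:

```python
# Rank candidate repair values by TransE distance d = ||h + r - t||.
# The 3-d embeddings below are hand-picked toy vectors; a real system
# would train them on the KG, as in Bordes et al. [2].
from math import dist  # Euclidean distance between equal-length sequences

embeddings = {  # hypothetical entities/properties
    "Q_person":   (0.9, 0.1, 0.0),
    "P_employer": (0.0, 0.5, 0.2),
    "Q_company":  (0.9, 0.6, 0.2),   # plausible employer value
    "Q_city":     (0.1, 0.9, 0.9),   # implausible employer value
}

def transe_score(head, relation, tail):
    """Lower score = more plausible triple under the TransE assumption."""
    h, r, t = embeddings[head], embeddings[relation], embeddings[tail]
    return dist([hi + ri for hi, ri in zip(h, r)], t)

def rank_repairs(head, relation, candidates):
    """Order candidate repair values from most to least plausible."""
    return sorted(candidates, key=lambda c: transe_score(head, relation, c))

ranking = rank_repairs("Q_person", "P_employer", ["Q_city", "Q_company"])
```

Ranking heuristics (research question 4) would then reorder or filter such candidate lists, e.g. by community editing history or by repair cost.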
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Preliminary results</head><p>In an effort to understand the semantics of, formalize, and operationalize WD property constraints, we first investigated, based on available descriptions, whether and how the 30 WD property constraints could be mapped to SHACL's core language and SPARQL <ref type="bibr" target="#b4">[5]</ref>. This study made it possible to clarify to what extent SHACL can represent community-defined constraints of a widely used real-world KG. One of our results is a collection of practical SHACL constraints that can be used on a large and growing real-world dataset; indeed, the non-availability of practical SHACL performance benchmarks has already been emphasized by <ref type="bibr" target="#b5">[6]</ref>.</p><p>Other results we presented include clarifications of previously uncertain issues, such as the representability of permitted entities and exceptions in WD property constraints within SHACL <ref type="bibr" target="#b14">[15]</ref>. We also argued for the inexpressibility of certain WD constraints in SHACL-core, due to the impossibility of comparing values obtained through different paths matching the same regular path expression. These issues can be addressed when using SPARQL to formalize and validate constraints, where all 30 constraints could in principle be formalized.</p><p>Subsequently, we compared the inconsistencies found by the new constraint set against the primary reference, the inconsistency reports system<ref type="foot" target="#foot_2">3</ref>. In a recently submitted journal paper <ref type="bibr" target="#b3">[4]</ref>, we identified that 5 constraint types account for the vast majority of reported inconsistencies; therefore, an experiment was designed to compare the top five properties of each constraint type against the results obtained by our SPARQL formalization. 
The results summarized in Table <ref type="table">1</ref> and detailed in <ref type="bibr" target="#b3">[4]</ref> show that the inconsistencies identified differ between an approach that considers only the truthy statements and one that considers the whole statement set. Furthermore, we observed that there is still room for discussion on how these constraints should be checked, due to the absence of formal, unambiguous specifications in some cases and also due to the existence of qualifiers not considered by WD's report tool when validating the data.</p><p>Our first research question can be partially answered, as we found that SPARQL can serve as a viable solution to retrieve inconsistent data in real time, as well as stimulate discussion to consolidate the semantics of each constraint. Similar tests using SHACL-core are also under execution. Table <ref type="table">1</ref>. Summary of the inconsistencies compared. RA ∩ SPA represents the intersection between inconsistencies provided by the reports page and those found by the SPARQL query. Abbreviations used: Constraint Type (IRS = Item-requires-statement, VRS = Value-requires-statement). Details in <ref type="bibr" target="#b3">[4]</ref>.</p></div>
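The truthy vs. whole-statement distinction can be made concrete in SPARQL (an illustrative check, not one of the formalized constraints of [4]): in WD's RDF mapping, the wdt: namespace exposes only best-rank ("truthy") statements, while the p:/ps: reification reaches every statement, deprecated ones included.

```sparql
# Truthy variant: items with a best-rank P569 (date of birth)
# but no truthy P31 (instance of).
SELECT ?item WHERE {
  ?item wdt:P569 ?dob .
  FILTER NOT EXISTS { ?item wdt:P31 ?class . }
}
# Whole-statement variant: replace the first pattern with
#   ?item p:P569/ps:P569 ?dob .
# so that non-best-rank statements are also checked.
```

Depending on which variant a checker uses, the same constraint yields different violation sets, which is one source of the differences observed in Table 1.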
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Methodology for next steps</head><p>To address our research questions, we employ two research methodologies: (i) a systematic literature review on KG refinement approaches; and (ii) the design science research methodology (DSRM) put forth by <ref type="bibr" target="#b12">[13]</ref>. While the first contributes to understanding the state of the art in refinement approaches, the second can be used to consume the set of inconsistencies and repairs and develop innovative refinement technologies. Below, the DSRM activities are described, as well as how our research proposal fits into them.</p><p>Problem and motivation. The research is driven by practical problems and aims to develop solutions that address specific challenges faced in practice. It focuses on creating artifacts or designs that have value and utility in solving real-world problems. In the scope of this research, the problem to be solved is facilitating the process of corrective editing in collaborative KGs by developing a semi-automatic tool to provide repair suggestions.</p><p>Design and Creation. DSRM involves the design and creation of new artifacts, such as models, methods, frameworks, or software prototypes. These artifacts are intended to improve or enable certain aspects of a given problem domain. Through a systematic review of the literature, we can identify the main refinement methods and propose the use of inconsistencies and historical repairs in the creation of a distance-based semi-automatic refinement tool. For instance, KG embedding methods can be used to assess the distance between a violating value and the vector that describes the expected values. This method, in combination with the consumption of historical repairs, can help to predict an optimal repair and present options to the community.</p><p>Evaluation. 
The designed artifacts are rigorously evaluated to assess their effectiveness, efficiency, and utility in solving the identified problems. The evaluation process often includes usability testing, performance measurement, and gathering feedback from relevant stakeholders. The artifacts developed in this research are intended to be tested both on their ability to predict corrections based on historical data and on their interaction with KG user editors. Therefore, two main experiments are expected: one over a benchmark of historical repairs and, once the tool is fully operational, a qualitative study with the WD user community.</p></div>
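The historical-repair experiment can be scored with a standard ranking metric. A minimal sketch (assumed setup and toy data, not the actual benchmark) using hits@k, i.e. the fraction of violation cases where the community's actual repair appears among the top-k suggestions:

```python
# Evaluate ranked repair suggestions against historical repairs via hits@k.
def hits_at_k(suggestions_per_case, actual_repairs, k):
    """suggestions_per_case: one ranked candidate list per violation case;
    actual_repairs: the repair the community really applied, per case."""
    hits = sum(
        1 for ranked, actual in zip(suggestions_per_case, actual_repairs)
        if actual in ranked[:k]       # count a hit if the real repair is in the top k
    )
    return hits / len(actual_repairs)

# Toy data: three violation cases with hypothetical candidate entities.
ranked = [["Q1", "Q2"], ["Q3", "Q1"], ["Q2", "Q4"]]
actual = ["Q2", "Q3", "Q5"]
score = hits_at_k(ranked, actual, k=1)   # only the top suggestion counts
```

Sweeping k (1, 3, 5, ...) shows how many suggestions the community would need to inspect before finding the right fix.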
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Reflection and Future Work</head><p>In this research proposal, we explored the problem of refining community-based KGs through the formalization of constraints and the use of inconsistencies as input for the creation of semi-automatic distance-based refinement approaches.</p><p>We note that the main existing approaches do not focus on community-based KGs <ref type="bibr" target="#b11">[12]</ref>; here, we plan to contribute a systematic review analyzing more recent studies. Because Wikidata, the most popular community-based KG, does not use conventional methods such as OWL ontologies to represent data terminology <ref type="bibr" target="#b13">[14]</ref>, our first efforts focused on formalizing the constraints created by the community itself <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. Our next steps consist of analyzing sets of violations and identifying repair patterns in the KG to build semi-automatic repair approaches. In the future, we hope to build a solution capable of suggesting relevant fixes to the community according to different objective functions.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Main contributions grouped by the workflow step they refer to.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.wikidata.org/wiki/Wikidata:Statistics, as of January 2023</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://short.wu.ac.at/7t66, last accessed 13 February 2023</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgements This research is conducted under the supervision of Prof. Dr. Axel Polleres.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Repairing shacl constraint violations using answer set programming</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmetaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>David</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šimkus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web-ISWC 2022: 21st International Semantic Web Conference, Virtual Event</title>
		<title level="s">Proceedings</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">October 23-27, 2022</date>
			<biblScope unit="page" from="375" to="391" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Translating embeddings for modeling multi-relational data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Usunier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>García-Durán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Yakhnenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013</title>
				<meeting><address><addrLine>Lake Tahoe, Nevada, United States</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">December 5-8, 2013</date>
			<biblScope unit="page" from="2787" to="2795" />
		</imprint>
	</monogr>
	<note>Proceedings of a meeting held</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Introducing wikidata to the linked data web</title>
		<author>
			<persName><forename type="first">F</forename><surname>Erxleben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Günther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mendez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International semantic web conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="50" to="65" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Ferranti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<note>Under review</note>
		<title level="m">Formalizing and validating wikidata&apos;s property constraints using shacl+ sparql</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Formalizing property constraints in wikidata</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ferranti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polleres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>De Souza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ahmetaj</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Wikidata Workshop co-located with 21st International Semantic Web Conference (ISWC, 2022)</title>
				<meeting>the Wikidata Workshop co-located with 21st International Semantic Web Conference (ISWC, 2022)<address><addrLine>Hangzhou, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">October 23-27, 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Trav-shacl: Efficiently validating networks of SHACL constraints</title>
		<author>
			<persName><forename type="first">M</forename><surname>Figuera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">D</forename><surname>Rohde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vidal</surname></persName>
		</author>
		<idno type="DOI">10.1145/3442381.3449877</idno>
		<ptr target="https://doi.org/10.1145/3442381.3449877" />
	</analytic>
	<monogr>
		<title level="m">WWW &apos;21: The Web Conference 2021, Virtual Event</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Grobelnik</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zia</surname></persName>
		</editor>
		<meeting><address><addrLine>Ljubljana, Slovenia</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">April 19-23, 2021</date>
			<biblScope unit="page" from="3337" to="3348" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Multi-shot asp solving with clingo</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gebser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaminski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kaufmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schaub</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Theory and Practice of Logic Programming</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="27" to="82" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Knowledge graphs</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hogan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Blomqvist</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cochez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Amato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">D</forename><surname>Melo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gutierrez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kirrane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E L</forename><surname>Gayo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Neumaier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Data, Semantics, and Knowledge</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="1" to="257" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Dbpedia-a large-scale, multilingual knowledge base extracted from wikipedia</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lehmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Isele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jakob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jentzsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kontokostas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Mendes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hellmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Morsey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Van Kleef</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Auer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic web</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="167" to="195" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Wikidata constraints on mars</title>
		<author>
			<persName><forename type="first">D</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Patel-Schneider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Wikidata@ ISWC</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Knowledge graph refinement: A survey of approaches and evaluation methods</title>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Semantic web</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="489" to="508" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Serving dbpedia with dolce-more than just adding a cherry on top</title>
		<author>
			<persName><forename type="first">H</forename><surname>Paulheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gangemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International semantic web conference</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="180" to="196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A design science research methodology for information systems research</title>
		<author>
			<persName><forename type="first">K</forename><surname>Peffers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tuunanen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Rothenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chatterjee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of management information systems</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="45" to="77" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Who models the world? collaborative ontology creation and user roles in wikidata</title>
		<author>
			<persName><forename type="first">A</forename><surname>Piscopo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Simperl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM on Human-Computer Interaction 2(CSCW)</title>
				<meeting>the ACM on Human-Computer Interaction 2(CSCW)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A study of the quality of wikidata</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shenoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ilievski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garijo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Szekely</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Web Semantics</title>
		<imprint>
			<biblScope unit="volume">72</biblScope>
			<biblScope unit="page">100679</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Wikidata: a free collaborative knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
