=Paper=
{{Paper
|id=Vol-3678/paper4
|storemode=property
|title=Improving Maintenance of Community-based Knowledge Graphs
|pdfUrl=https://ceur-ws.org/Vol-3678/paper4.pdf
|volume=Vol-3678
|authors=Nicolas Ferranti
|dblpUrl=https://dblp.org/rec/conf/semweb/Ferranti23
}}
==Improving Maintenance of Community-based Knowledge Graphs==
Nicolas Ferranti [0000-0002-5574-1987]
Vienna University of Economics and Business, Welthandelspl. 1, 1020 Vienna, Austria
nicolas.ferranti@wu.ac.at
Abstract. Data quality is crucial for the effective utilization of knowledge
graphs, yet ensuring it is challenging due to the need for continuous
monitoring and maintenance. This research proposal focuses on data
quality in open knowledge graphs, with an emphasis on Wikidata. Wikidata,
one of the largest collaborative knowledge graphs, has its own approach
for data consistency, deviating from regular OWL ontologies or SHACL,
the W3C recommendation. The proposal aims to comprehend and formalize
Wikidata's approaches for assessing and resolving data inconsistencies.
By formalizing constraints, refinement operations, and repair strategies,
this research aims to improve the quality of Wikidata and of other
knowledge graphs likewise built on Wikibase. As one of the contributions,
our research proposes a semi-automatic refinement pipeline to empower the
Wikidata user community by recommending repairs of constraint violations,
combining distance-based refinement approaches and ranking heuristics.
Establishing a comprehensive framework and engaging users in knowledge
graph maintenance enhances the reliability and usability of open
knowledge graphs.
Keywords: Data Quality · Knowledge Graph · Wikidata · Refinement.
1 Introduction
A Knowledge Graph (KG) uses a graph-based model to represent real-world
entities, their attributes, and relationships [11]. Entities can be anything that can
be uniquely identified and described, such as people, places, things, or concepts.
Statements, representing relationships between entities, are represented as edges
in the graph, while the attributes of entities are again graph nodes. As such, KGs
can be used to represent a wide range of information, including encyclopedic
knowledge, scientific data, or enterprise information [8].
There are different methods that can be used to create a KG. For instance,
they can be extracted from semi-structured Web data, like the DBpedia KG [9], or
edited collaboratively by a community, like the Wikidata KG [16]. Regardless of
the approach used in construction, KGs are not perfect. Despite the efforts of
some communities or organizations to increase the coverage of their respective
KGs, it is complex to represent all the knowledge available about a domain.
Therefore, organizations responsible for KGs usually look for a balance between
correctness and information coverage.
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Since its creation by the Wikimedia Foundation in 2012, Wikidata (WD)
has become one of the largest KGs, publicly available on the Web, with more
than 100M items¹ and 14B triples². One of the main factors responsible for
this growth is WD’s user community, with more than 24k active users (humans
and bots). The large user community is primarily motivated by Wikipedia, as
the vast majority of Wikipedia pages incorporate content from WD [3]. The
WD KG is available in standard RDF format and can be queried via a public
SPARQL endpoint, but it does not adhere to established W3C standards, such
as OWL/RDFS or SHACL for expressing constraints on its schema: unlike other
knowledge graphs, WD focuses on the development of the data layer (A-box),
and the terminology layer (T-Box) evolves alongside it without a predefined
formal ontology [14].
Problem statement. In our proposal, we address the problem of data qual-
ity in KGs with an emphasis on WD. WD uses constraints to ensure consistent
vocabulary usage, yet WD projects do not currently utilize SHACL for
validating RDF graphs against constraints, as recommended by W3C, nor
OWL ontologies. Studies to understand the semantics behind the WD constraint
projects and allow violations to be easily tested and consumed by third parties
are lacking. Our hypothesis is that such studies could leverage the development
of approaches to assist users in refining inconsistent data, as well as promote
constraint checking in WD through different approaches, such as by mapping
constraints to SHACL and SPARQL. Currently, data edits are done manually
and without a support tool for the community, which increases the effort to
correct inconsistent data.
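To illustrate the kind of check involved, a single-value constraint (an item should state at most one value for a given property) can be tested over a toy statement set. This is only a sketch with invented items, not WD's actual validation tooling; in WD such checks would run over the real statement graph, e.g. via SPARQL.

```python
from collections import defaultdict

# Toy statement set: (item, property, value) triples. P569 ("date of birth")
# carries a single-value constraint in Wikidata; the items here are invented.
triples = [
    ("Q1", "P569", "1879-03-14"),
    ("Q1", "P569", "1880-01-01"),   # second value -> violation
    ("Q2", "P569", "1912-06-23"),
]

def single_value_violations(triples, prop):
    """Return the items that state more than one value for `prop`."""
    values = defaultdict(set)
    for item, p, v in triples:
        if p == prop:
            values[item].add(v)
    return {item for item, vs in values.items() if len(vs) > 1}

print(single_value_violations(triples, "P569"))  # → {'Q1'}
```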
Contributions. In this proposal, our goal is to initially study the semantics
of constraints used by WD. Among the main contributions is the formalization
of constraints in both the SHACL and SPARQL languages, consequently creating
one of the biggest benchmarks for SHACL validators and allowing other agents
to retrieve inconsistent data in real-time through WD’s SPARQL endpoint. The
formalization of constraints is considered the first step to make the following
contributions possible:
– Efficient retrieval of inconsistent data
– Mapping of historical repairs
– Creation of models for proposing repair suggestions based on inconsistencies
and repairs
– Use of knowledge graph embeddings as a distance-based refinement model
– Introduction of algorithms to improve the ranking of refinement suggestions
according to different criteria, such as terminology data prevailing over in-
stances or minimizing changes
– Development of a pipeline to combine multiple refinement models with ranking
strategies, for KG editing and maintenance

¹ https://www.wikidata.org/wiki/Wikidata:Statistics, as of January 2023
² https://short.wu.ac.at/7t66, last accessed 13 February 2023
An overview of the main contributions, grouped by the proposed workflow, is
available in Figure 1.
[Figure: a workflow KG → Violation Detection → Repair Suggestion Model → Repair
Ranking, with the contributions grouped under the step they support:
formalization of constraints in SPARQL and SHACL and mapping of historical
repairs (violation detection); distance-based repair models and a KG-embedding
repair model (repair suggestion); strategies to rank the suggested repairs and
re-feed the system with community feedback (repair ranking).]
Fig. 1. Main contributions grouped by the workflow step they refer to.
Paper structure. The remainder of this research proposal is structured as
follows. Section 2 presents related work focusing on data quality and constraints
formalization. Section 3 presents the main research questions and hypotheses
organized in steps. Section 4 discusses some of the preliminary results concerning
the formalization of WD constraints. In section 5, the methodology for our next
steps based on the current findings is introduced. Finally, section 6 points to the
main future directions.
2 Related Literature
Ensuring the quality of the data in a KG is fundamental for useful consumption.
Paulheim [11] provided a comprehensive survey of KG refinement approaches,
further classified according to their goal: KG completion or repairing detected
errors. These approaches apply different methods, ranging from techniques using
machine learning to NLP-related techniques. The results showed that the vast
majority of approaches focused on DBpedia, indicating a gap when it comes to
WD and other Wikibase-based KGs.
Furthermore, Shenoy et al. [15] presented a quality analysis of WD focusing
on correctness, checking for weak statements under three main indicators: con-
straint violation, community agreement, and deprecation. The premise is that
a statement receives a low-quality score when it violates some constraint, high-
lighting the importance of constraints for KG refinement. Shenoy and colleagues
use a subset of constraints to retrieve violations through a toolkit and also men-
tion the challenges of testing WD constraints using SHACL. However, the focus
of the authors is on combining violation indicators with other indicators toward
a quality metric, rather than formalizing the semantics of the constraints and
promoting practical means for testing and repairs.
Martin and Patel-Schneider [10] discuss the representation of WD property
constraints through multi-attributed relational structures (MARS), as a logi-
cal framework for WD. Constraints are represented in MARS using extended
multi-attributed predicate logic (eMAPL), providing a logical characterization
for constraints. Despite covering 26 different constraint types, the authors have
not performed practical experiments to evaluate the accuracy of the proposed
formalization, nor its efficiency.
Ahmetaj et al. [1] propose an approach to provide refinements to SHACL
violations. The approach involves encoding the problem as an Answer Set Pro-
gramming (ASP) program. By transforming the graph and a set of SHACL
shapes into the ASP program P, the answer sets or stable models of P represent
possible repairs. The use of efficient ASP solvers, such as clingo [7], offers a promising
means to generate practical data repairs. One of the major benefits of formal-
izing WD constraints into SHACL is to enable the use of the various solutions
already implemented for SHACL constraints in the context of WD.
In conclusion, the existing literature on KG refinement approaches has pri-
marily focused on KGs like DBpedia, leaving a gap in understanding and refining
KGs such as WD and other Wikibase-based KGs. While previous studies have
highlighted the importance of constraints in KG quality analysis, there is a need
to formalize the semantics of these constraints and establish practical means for
testing. By formalizing constraints, violations and repairs can be collected and
serve as valuable input for refinement models.
3 Research Questions and Hypotheses
Building on the existing challenge of refining community-based KGs, we see the
need for semi-automatic refinement approaches able to provide the community
with repair suggestions based on both formal definitions of constraints and sta-
tistical analysis. Inconsistencies are the primary input for performing corrective
repairs. As pointed out by [15], WD inconsistency reports are calculated within
an ad-hoc extension of Wikibase. On top of that, the approach behind the
generation of the reports is not public, and the reports are published on an
HTML page with a maximum limit of inconsistencies displayed for each
constraint type. Therefore, it is crucial to formalize the constraints and create an
open and efficient method to retrieve inconsistencies.
In the recent past, embedding approaches have been used to address KG
completion and link prediction. Bordes et al. [2] introduced embeddings for KGs.
We would like to test whether these approaches and AI models trained based on
identified inconsistencies and repairs can provide the community with relevant
repair suggestions.
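The translation-based embedding model of Bordes et al. [2] scores a triple (h, r, t) by how closely h + r approximates t in the embedding space, so candidate repairs can be ranked by that distance. A minimal sketch of the idea, using hand-made 2-D vectors (the entity and property names are invented, not trained WD embeddings):

```python
import math

# TransE-style scorer: a triple (h, r, t) is plausible when the head vector
# plus the relation vector lands near the tail vector. Vectors below are
# tiny, hand-made illustrations.
emb = {
    "Q_berlin":  [0.9, 0.1],
    "Q_germany": [1.0, 1.0],
    "Q_france":  [0.0, 1.0],
    "P_country": [0.1, 0.9],  # relation vector
}

def transe_distance(h, r, t):
    """L2 distance ||h + r - t||; smaller means more plausible."""
    return math.dist([a + b for a, b in zip(emb[h], emb[r])], emb[t])

# Rank candidate repairs for the object of (Q_berlin, P_country, ?):
candidates = ["Q_germany", "Q_france"]
ranked = sorted(candidates,
                key=lambda t: transe_distance("Q_berlin", "P_country", t))
print(ranked[0])  # → Q_germany
```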
To this end, we summarise the main hypothesis of our research proposal as
follows:
Effective maintenance of community-based KGs can be achieved
by: (i) formally defined constraints in languages that optimize the pro-
cess of collecting inconsistent data; (ii) a set of approaches to propose
refinement suggestions using inconsistent data and previous repairs
as input; (iii) a set of heuristics to rank candidate repairs according
to different preferences.
Our hypothesis leads to the following research questions:
1. Can the semantics of Wikidata property constraints be represented with
SHACL-core and SPARQL?
2. How can we make use of inconsistencies and historical repairs to propose
refinements to knowledge graphs?
3. How can distance-based metrics be used to propose/predict repairs to the
knowledge graph?
4. What are the most relevant strategies to rank different repair suggestions?
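As a sketch of how ranking strategies (research question 4) could combine the criteria mentioned earlier, the toy scorer below prefers repairs that avoid touching terminology (T-box) statements and that change fewer statements. The repair records and the penalty weight are invented for illustration:

```python
# Hypothetical ranking heuristic: score = number of edits plus a penalty for
# each T-box edit, so instance-level fixes with few changes rank first.
def rank_repairs(candidates, tbox_penalty=10):
    """candidates: list of {'id': str, 'edits': int, 'edits_tbox': int}.
    Returns the candidates sorted best-first (lower score = better)."""
    def score(c):
        return c["edits"] + tbox_penalty * c["edits_tbox"]
    return sorted(candidates, key=score)

candidates = [
    {"id": "drop violating statement", "edits": 1, "edits_tbox": 0},
    {"id": "relax the constraint",     "edits": 1, "edits_tbox": 1},
    {"id": "rewrite three statements", "edits": 3, "edits_tbox": 0},
]
best = rank_repairs(candidates)[0]
print(best["id"])  # → drop violating statement
```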
4 Preliminary results
In an effort to understand the semantics, formalize, and operationalize WD prop-
erty constraints, we first investigated, based on available descriptions, whether
and how the 30 WD property constraints could be mapped to SHACL's core
language and SPARQL [5]. This study made it possible to clarify to what extent
SHACL can represent community-defined constraints of a widely used real-world
KG. One of our results is a collection of practical SHACL constraints that can
be used in a large and growing real-world dataset; indeed the non-availability of
practical SHACL performance benchmarks has already been emphasized by [6].
Other results we presented include clarifications of heretofore uncertain is-
sues, such as the representability of permitted entities and exceptions in WD
property constraints within SHACL [15]. We could also argue for the
inexpressibility of certain WD constraints, due to the impossibility of comparing values
obtained through different paths matching the same regular path expression
within SHACL-core. These issues could be addressed when using SPARQL to
formalize and validate constraints, where all 30 constraints could in principle be
formalized.
Subsequently, we compared the inconsistencies found by the new constraint
set against the primary reference, the inconsistency reports system³. In a recently
submitted journal paper [4], we identified that 5 constraint types represent the
vast majority of reported inconsistencies; therefore, an experiment was designed
to compare the top five properties of each constraint type against the results
obtained by our SPARQL formalization. The results summarized in Table 1
³ https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations
and detailed in [4] show that the inconsistencies identified differ depending on
whether an approach considers only the truthy statements or the whole
statement set. Furthermore, we observed that there is still room for
discussions on how these constraints should be checked due to the absence of
formal, unambiguous specifications in some cases and also due to the existence
of qualifiers not considered by WD’s report tool when validating the data.
Our first research question can be partially answered, as we found that
SPARQL can serve as a viable solution to retrieve inconsistent data in real-time,
as well as stimulate discussion to consolidate the semantics of each constraint.
Similar tests using SHACL-core are also underway.
Constraint Type     | # reported | # available (RA) | # found by our SPARQL (SPA) | RA ∩ SPA
One-of              | 2390363    | 25005            | 1413369                     | 20122
IRS                 | 1241897    | 79812            | 1246789                     | 79812
Single-value        | 236851     | 25152            | 148792                      | 15475
Required Qualifier  | 809174     | 26023            | 807300                      | 26023
VRS                 | 1955726    | 33418            | 1949745                     | 33379

Table 1. Summary of the inconsistencies compared. "# reported" and "# available
(RA)" refer to the WD inconsistency reports. RA ∩ SPA represents the intersection
between inconsistencies provided by the reports page and those found by the
SPARQL query. Abbreviations: IRS = Item-requires-statement, VRS =
Value-requires-statement. Details in [4].
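The agreement between the two sources can be quantified directly from the numbers in Table 1, e.g. the share of available report entries (RA) that the SPARQL formalization also finds:

```python
# Numbers taken from Table 1: (RA = available in WD reports,
# SPA = found by the SPARQL queries, INT = RA ∩ SPA).
rows = {
    "One-of":             (25005, 1413369, 20122),
    "IRS":                (79812, 1246789, 79812),
    "Single-value":       (25152, 148792, 15475),
    "Required Qualifier": (26023, 807300, 26023),
    "VRS":                (33418, 1949745, 33379),
}

for name, (ra, spa, inter) in rows.items():
    # Share of available report entries that the SPARQL queries also found.
    print(f"{name}: {inter / ra:.1%} of RA also found by SPARQL")
```

For IRS and Required Qualifier the overlap is total, while One-of and Single-value show the largest disagreement between the report tool and the SPARQL formalization.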
5 Methodology for next steps
To address our research questions, we employ two research methodologies: (i) a
systematic literature review on KG refinement approaches; and (ii) the design sci-
ence research methodology (DSRM) put forth by [13]. While the first contributes
to understanding the state of the art in refinement approaches, the second can
be used to consume the set of inconsistencies and repairs and develop innovative
refinement technologies. Below, the DSRM activities are described, as well as how
our research proposal fits in them.
Problem and motivation. The research is driven by practical problems and
aims to develop solutions that address specific challenges faced in practice. It
focuses on creating artifacts or designs that have value and utility in solving
real-world problems. In the scope of this research, the problem to be solved is
to facilitate the process of corrective editing in collaborative KGs by developing
a semi-automatic tool to promote repair suggestions.
Design and Creation. DSRM involves the design and creation of new ar-
tifacts, such as models, methods, frameworks, or software prototypes. These
artifacts are intended to improve or enable certain aspects of a given problem
domain. Through a systematic review of the literature, we can identify the main
refinement methods and propose the use of inconsistencies and historical repairs
in the creation of a distance-based semi-automatic refinement tool. For instance,
one can make use of KG embedding methods to assess the distance of the violating
value to the vector that describes the expected values. This method, in com-
bination with the consumption of historical repairs, can help to predict what
would be an optimal repair and present options for the community.
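The distance idea described above can be sketched as follows: embed the values that satisfy a constraint, take their centroid as the vector that describes the expected values, and score candidate repairs by their distance to that centroid. The 2-D vectors below are hand-made for illustration, not trained embeddings:

```python
import math

# Embeddings of values that satisfy the constraint (invented vectors).
expected = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]

# Centroid of the expected values: "the vector that describes the
# expected values" against which a violating value is compared.
centroid = [sum(xs) / len(expected) for xs in zip(*expected)]

def repair_score(candidate_vec):
    """Distance to the centroid of expected values; smaller = better repair."""
    return math.dist(candidate_vec, centroid)

# Two candidate repair values for a violating statement:
candidates = {"A": [1.0, 0.05], "B": [0.0, 1.0]}
best = min(candidates, key=lambda k: repair_score(candidates[k]))
print(best)  # → A
```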
Evaluation. The designed artifacts are rigorously evaluated to assess their ef-
fectiveness, efficiency, and utility in solving the identified problems. The eval-
uation process often includes usability testing, performance measurement, and
gathering feedback from relevant stakeholders. It is intended that the artifacts
developed in this research can be tested both on their ability to predict
corrections from historical data and on their interaction with KG user editors. Therefore,
two main experiments are expected, one over a benchmark of historical repairs
and, once the tool is fully operational, a qualitative study with the WD user
community.
6 Reflection and Future Work
In this research proposal, we explored the problem of refining community-based
KGs through the formalization of constraints and the usage of inconsistencies as
input for the creation of semi-automatic distance-based refinement approaches.
We note that the main approaches do not focus on community-based KGs
[12]; here we plan to contribute with a systematic review analyzing more recent
studies. Due to the fact that Wikidata, the most popular community-based KG,
does not use conventional methods such as OWL ontologies to represent data
terminology [14], our first efforts focused on formalizing constraints created by
the community itself [4, 5]. Our next steps consist of analyzing sets of violations
and identifying repair patterns in KG to build semi-automatic repair approaches.
In the future, we hope to build a solution capable of suggesting relevant fixes to
the community according to different objective functions.
Acknowledgements This research is conducted under the supervision of Prof.
Dr. Axel Polleres.
References
1. Ahmetaj, S., David, R., Polleres, A., Šimkus, M.: Repairing SHACL constraint
violations using answer set programming. In: The Semantic Web – ISWC 2022: 21st
International Semantic Web Conference, Virtual Event, October 23–27, 2022, Pro-
ceedings. pp. 375–391. Springer (2022)
2. Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: Advances in Neural Information
Processing Systems 26: 27th Annual Conference on Neural Information Processing
Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe,
Nevada, United States. pp. 2787–2795 (2013)
3. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing
wikidata to the linked data web. In: International semantic web conference. pp.
50–65. Springer (2014)
4. Ferranti, N., De Souza, J.F., Polleres, A.: Formalizing and validating Wikidata's
property constraints using SHACL+SPARQL.
https://www.semantic-web-journal.net/system/files/swj3378.pdf, under review
5. Ferranti, N., Polleres, A., de Souza, J.F., Ahmetaj, S.: Formalizing property con-
straints in wikidata. In: Proceedings of the Wikidata Workshop co-located with
21st International Semantic Web Conference (ISWC, 2022), Hangzhou, China, Oc-
tober 23-27, 2022 (2022)
6. Figuera, M., Rohde, P.D., Vidal, M.: Trav-SHACL: Efficiently validating
networks of SHACL constraints. In: Leskovec, J., Grobelnik, M., Najork, M.,
Tang, J., Zia, L. (eds.) WWW '21: The Web Conference 2021, Virtual Event /
Ljubljana, Slovenia, April 19-23, 2021. pp. 3337–3348. ACM / IW3C2 (2021).
https://doi.org/10.1145/3442381.3449877
7. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot asp solving with
clingo. Theory and Practice of Logic Programming 19(1), 27–82 (2019)
8. Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., Melo, G.d., Gutierrez, C.,
Kirrane, S., Gayo, J.E.L., Navigli, R., Neumaier, S., et al.: Knowledge graphs.
Synthesis Lectures on Data, Semantics, and Knowledge 12(2), 1–257 (2021)
9. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N.,
Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al.: Dbpedia–a large-scale,
multilingual knowledge base extracted from wikipedia. Semantic web 6(2), 167–195
(2015)
10. Martin, D., Patel-Schneider, P.F.: Wikidata constraints on MARS. In:
Wikidata@ISWC (2020)
11. Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation
methods. Semantic web 8(3), 489–508 (2017)
12. Paulheim, H., Gangemi, A.: Serving dbpedia with dolce–more than just adding a
cherry on top. In: International semantic web conference. pp. 180–196. Springer
(2015)
13. Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S.: A design science
research methodology for information systems research. Journal of management
information systems 24(3), 45–77 (2007)
14. Piscopo, A., Simperl, E.: Who models the world? collaborative ontology creation
and user roles in wikidata. Proceedings of the ACM on Human-Computer Interac-
tion 2(CSCW), 1–18 (2018)
15. Shenoy, K., Ilievski, F., Garijo, D., Schwabe, D., Szekely, P.: A study of the quality
of wikidata. Journal of Web Semantics 72, 100679 (2022)
16. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Com-
munications of the ACM 57(10), 78–85 (2014)