Improving maintenance of community-based knowledge graphs

Nicolas Ferranti [0000-0002-5574-1987]
Vienna University of Economics and Business, Welthandelspl. 1, 1020 Vienna, Austria
nicolas.ferranti@wu.ac.at

Abstract. Data quality is crucial for the effective utilization of knowledge graphs, yet ensuring it is challenging because it requires continuous monitoring and maintenance. This research proposal focuses on data quality in open knowledge graphs, with an emphasis on Wikidata. Wikidata, one of the largest collaborative knowledge graphs, has its own approach to data consistency, deviating both from regular OWL ontologies and from SHACL, the W3C recommendation. The proposal aims to understand and formalize Wikidata's approaches for assessing and resolving data inconsistencies. By formalizing constraints, refinement operations, and repair strategies, this research aims to improve the quality of Wikidata and of other knowledge graphs built on Wikibase. As one of the contributions, our research proposes a semi-automatic refinement pipeline that empowers the Wikidata user community by recommending repairs for constraint violations, combining distance-based refinement approaches with ranking heuristics. Establishing a comprehensive framework and engaging users in knowledge graph maintenance enhances the reliability and usability of open knowledge graphs.

Keywords: Data Quality · Knowledge Graph · Wikidata · Refinement.

1 Introduction

A Knowledge Graph (KG) uses a graph-based model to represent real-world entities, their attributes, and relationships [11]. Entities can be anything that can be uniquely identified and described, such as people, places, things, or concepts. Statements, which express relationships between entities, are represented as edges in the graph, while the attributes of entities are again graph nodes. As such, KGs can be used to represent a wide range of information, including encyclopedic knowledge, scientific data, and enterprise information [8].
There are different methods that can be used to create a KG. For instance, a KG can be extracted from semi-structured Web data, like the DBpedia KG [9], or edited collaboratively by a community, like the Wikidata KG [16]. Regardless of the construction approach, KGs are not perfect. Despite the efforts of communities and organizations to increase the coverage of their respective KGs, it is complex to represent all the knowledge available about a domain. Therefore, organizations responsible for KGs usually look for a balance between correctness and information coverage.

CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Since its creation by the Wikimedia Foundation in 2012, Wikidata (WD) has become one of the largest KGs publicly available on the Web, with more than 100M items¹ and 14B triples². One of the main factors responsible for this growth is WD's user community, with more than 24k active users (humans and bots). The large user community is primarily motivated by Wikipedia, as the vast majority of Wikipedia pages incorporate content from WD [3]. The WD KG is available in standard RDF format and can be queried via a public SPARQL endpoint, but it does not adhere to established W3C standards, such as OWL/RDFS or SHACL, to express constraints on its schema: unlike other knowledge graphs, WD focuses on the development of the data layer (A-box), and the terminology layer (T-box) evolves alongside it without a predefined formal ontology [14].

Problem statement. In our proposal, we address the problem of data quality in KGs with an emphasis on WD. WD uses constraints to ensure consistent vocabulary usage, yet WD projects currently utilize neither SHACL, the W3C recommendation for validating RDF graphs against constraints, nor OWL ontologies. Studies that clarify the semantics behind the WD constraint projects and allow violations to be easily tested and consumed by third parties are lacking.
Our hypothesis is that such studies could support the development of approaches to assist users in refining inconsistent data, as well as promote constraint checking in WD through different means, such as mapping constraints to SHACL and SPARQL. Currently, data edits are done manually and without a support tool for the community, which increases the effort required to correct inconsistent data.

Contributions. In this proposal, our goal is first to study the semantics of the constraints used by WD. Among the main contributions is the formalization of these constraints in both SHACL and SPARQL, which in turn creates one of the biggest benchmarks for SHACL validators and allows other agents to retrieve inconsistent data in real time through WD's SPARQL endpoint. The formalization of constraints is considered the first step towards making the following contributions possible:

– Efficient retrieval of inconsistent data
– Mapping of historical repairs
– Creation of models for proposing repair suggestions based on inconsistencies and repairs
– Use of knowledge graph embeddings as a distance-based refinement model
– Introduction of algorithms to improve the ranking of refinement suggestions according to different criteria, such as terminology data prevailing over instances or minimizing changes
– Development of a pipeline that combines multiple refinement models with ranking strategies for KG editing and maintenance

¹ https://www.wikidata.org/wiki/Wikidata:Statistics, as of January 2023
² https://short.wu.ac.at/7t66, last accessed 13 February 2023

An overview of the main contributions, grouped by the proposed workflow, is given in Figure 1.

Fig. 1. Main contributions grouped by the workflow step they refer to. (Workflow: KG → violation detection → repair suggestion model → repair ranking. Contributions: formalization of constraints in SPARQL and SHACL and mapping of historical repairs; distance-based repair models and a KG-embedding repair model; strategies to rank the suggested repairs and re-feed the system with community feedback.)

Paper structure. The remainder of this research proposal is structured as follows. Section 2 presents related work focusing on data quality and constraint formalization. Section 3 presents the main research questions and hypotheses, organized in steps. Section 4 discusses some of the preliminary results concerning the formalization of WD constraints. In Section 5, the methodology for our next steps, based on the current findings, is introduced. Finally, Section 6 points to the main future directions.

2 Related Literature

Ensuring the quality of the data in a KG is fundamental for useful consumption. Paulheim [11] provided a comprehensive survey of KG refinement approaches, classified according to their goal: completing the KG, or repairing detected errors. These approaches apply different methods, ranging from machine learning to NLP-related techniques. The results showed that the vast majority of approaches focus on DBpedia, indicating a gap when it comes to WD and other Wikibase-based KGs.

Furthermore, Shenoy et al. [15] presented a quality analysis of WD focusing on correctness, checking for weak statements under three main indicators: constraint violation, community agreement, and deprecation. The premise is that a statement receives a low-quality score when it violates some constraint, highlighting the importance of constraints for KG refinement. Shenoy and colleagues use a subset of constraints to retrieve violations through a toolkit and also mention the challenges of testing WD constraints using SHACL. However, the focus of the authors is on combining violation indicators with other indicators towards a quality metric, rather than formalizing the semantics of the constraints and promoting practical means for testing and repairs.
Martin and Patel-Schneider [10] discuss the representation of WD property constraints through multi-attributed relational structures (MARS), a logical framework for WD. Constraints are represented in MARS using extended multi-attributed predicate logic (eMAPL), providing a logical characterization of constraints. Despite covering 26 different constraint types, the authors did not perform practical experiments to evaluate the accuracy of the proposed formalization, nor its efficiency.

Ahmetaj et al. [1] propose an approach to provide repairs for SHACL violations. The approach involves encoding the problem as an Answer Set Programming (ASP) program: by transforming the graph and a set of SHACL shapes into an ASP program P, the answer sets (stable models) of P represent possible repairs. The use of efficient ASP solvers, such as [7], offers a promising means to generate practical data repairs. One of the major benefits of formalizing WD constraints in SHACL is to enable the use, in the context of WD, of the various solutions already implemented for SHACL constraints.

In conclusion, the existing literature on KG refinement approaches has primarily focused on KGs like DBpedia, leaving a gap in understanding and refining KGs such as WD and other Wikibase-based KGs. While previous studies have highlighted the importance of constraints in KG quality analysis, there is a need to formalize the semantics of these constraints and establish practical means for testing them. By formalizing constraints, violations and repairs can be collected and serve as valuable input for refinement models.

3 Research Questions and Hypotheses

Building on the existing challenge of refining community-based KGs, we see the need for semi-automatic refinement approaches able to provide the community with repair suggestions based on both formal definitions of constraints and statistical analysis. Inconsistencies are the primary input for performing corrective repairs.
As pointed out by [15], WD inconsistency reports are calculated within an ad-hoc extension of Wikibase. On top of that, the approach behind the generation of the reports is not public, and the inconsistency reports are published on an HTML page with a maximum limit on the number of inconsistencies displayed per constraint type. Therefore, it is crucial to formalize the constraints and create an open and efficient method to retrieve inconsistencies.

In the recent past, embedding approaches have been used to address KG completion and link prediction. Bordes et al. [2] introduced embeddings for KGs. We would like to test whether such approaches, and AI models trained on identified inconsistencies and repairs, can provide the community with relevant repair suggestions.

To this end, we summarise the main hypothesis of our research proposal as follows: Effective maintenance of community-based KGs can be achieved by: (i) formally defined constraints in languages that optimize the process of collecting inconsistent data; (ii) a set of approaches to propose refinement suggestions using inconsistent data and previous repairs as input; (iii) a set of heuristics to rank candidate repairs according to different preferences.

Our hypothesis leads to the following research questions:

1. Can the semantics of Wikidata property constraints be represented with SHACL-core and SPARQL?
2. How can we make use of inconsistencies and historical repairs to propose refinements to knowledge graphs?
3. How can distance-based metrics be used to propose/predict repairs to the knowledge graph?
4. What are the most relevant strategies to rank different repair suggestions?
4 Preliminary results

In an effort to understand the semantics of, formalize, and operationalize WD property constraints, we first investigated, based on the available descriptions, whether and how the 30 WD property constraints could be mapped to SHACL's core language and to SPARQL [5]. This study made it possible to clarify to which extent SHACL can represent the community-defined constraints of a widely used real-world KG. One of our results is a collection of practical SHACL constraints that can be used on a large and growing real-world dataset; indeed, the non-availability of practical SHACL performance benchmarks has already been emphasized by [6]. Other results we presented include clarifications of heretofore uncertain issues, such as the representability of permitted entities and exceptions in WD property constraints within SHACL [15]. We also argued the inexpressibility of certain WD constraints, due to the impossibility of comparing values obtained through different paths matching the same regular path expression within SHACL-core. These issues can be addressed when using SPARQL to formalize and validate constraints, where all 30 constraints could in principle be formalized.

Subsequently, we compared the inconsistencies found by the new constraint set against the primary reference, the inconsistency reports system³. In a recently submitted journal paper [4], we identified that 5 constraint types account for the vast majority of reported inconsistencies; therefore, an experiment was designed to compare the top five properties of each constraint type against the results obtained by our SPARQL formalization. The results, summarized in Table 1 and detailed in [4], show that the inconsistencies identified differ between an approach that considers only the truthy statements and one that considers the whole statements set.

³ https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations
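To illustrate how such a SPARQL formalization can retrieve inconsistencies in real time, the sketch below assembles a query in the spirit of a "one-of" (allowed values) constraint check. The property wdt:P0000 and the allowed values wd:Q1111 and wd:Q2222 are hypothetical placeholders, not an actual Wikidata constraint definition, and the helper performs only a minimal local structural check before a query would be sent to the public endpoint:

```python
# Hypothetical "one-of" (allowed values) violation query; wdt:P0000,
# wd:Q1111, and wd:Q2222 are placeholders, not a real WD constraint.
ONE_OF_VIOLATIONS = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?value WHERE {
  ?item wdt:P0000 ?value .
  FILTER (?value NOT IN (wd:Q1111, wd:Q2222))
}
"""

def braces_balanced(query: str) -> bool:
    """Minimal structural sanity check: group braces must pair up."""
    depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(braces_balanced(ONE_OF_VIOLATIONS))  # → True
```

Any binding returned by such a query is a violation of the (hypothetical) constraint, which is what makes real-time retrieval of inconsistent data through the SPARQL endpoint possible.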
Furthermore, we observed that there is still room for discussion on how these constraints should be checked, due to the absence of formal, unambiguous specifications in some cases, and also due to the existence of qualifiers not considered by WD's report tool when validating the data. Our first research question can thus be partially answered: we found that SPARQL can serve as a viable solution to retrieve inconsistent data in real time, as well as to stimulate discussion to consolidate the semantics of each constraint. Similar tests using SHACL-core are also under execution.

Constraint Type    | # reported | # available (RA) | # found by our SPARQL (SPA) | RA ∩ SPA
One-of             | 2390363    | 25005            | 1413369                     | 20122
IRS                | 1241897    | 79812            | 1246789                     | 79812
Single-value       | 236851     | 25152            | 148792                      | 15475
Required Qualifier | 809174     | 26023            | 807300                      | 26023
VRS                | 1955726    | 33418            | 1949745                     | 33379

Table 1. Summary of the inconsistencies compared. "# reported" and "# available (RA)" come from the WD inconsistency reports; RA ∩ SPA represents the intersection between the inconsistencies provided by the reports page and those found by the SPARQL query. Abbreviations: IRS = Item-requires-statement, VRS = Value-requires-statement. Details in [4].

5 Methodology for next steps

To address our research questions, we employ two research methodologies: (i) a systematic literature review on KG refinement approaches; and (ii) the design science research methodology (DSRM) put forth by [13]. While the first contributes to understanding the state of the art in refinement approaches, the second can be used to consume the set of inconsistencies and repairs and to develop innovative refinement technologies. Below, the DSRM activities are described, as well as how our research proposal fits into them.

Problem and motivation. The research is driven by practical problems and aims to develop solutions that address specific challenges faced in practice. It focuses on creating artifacts or designs that have value and utility in solving real-world problems.
In the scope of this research, the problem to be solved is to facilitate the process of corrective editing in collaborative KGs by developing a semi-automatic tool that provides repair suggestions.

Design and Creation. DSRM involves the design and creation of new artifacts, such as models, methods, frameworks, or software prototypes. These artifacts are intended to improve or enable certain aspects of a given problem domain. Through a systematic review of the literature, we can identify the main refinement methods and propose the use of inconsistencies and historical repairs in the creation of a distance-based semi-automatic refinement tool, for instance, making use of KG embedding methods to assess the distance of a violating value to the vector that describes the expected values. This method, in combination with the consumption of historical repairs, can help to predict what would be an optimal repair and present options to the community.

Evaluation. The designed artifacts are rigorously evaluated to assess their effectiveness, efficiency, and utility in solving the identified problems. The evaluation process often includes usability testing, performance measurement, and gathering feedback from relevant stakeholders. The artifacts developed in this research are intended to be tested both on their ability to predict corrections based on historical data and on the level of interaction with KG user editors. Therefore, two main experiments are expected: one over a benchmark of historical repairs and, once the tool is fully operational, a qualitative study with the WD user community.

6 Reflection and Future Work

In this research proposal, we explored the problem of refining community-based KGs through the formalization of constraints and the usage of inconsistencies as input for the creation of semi-automatic distance-based refinement approaches.
We note that the main refinement approaches do not focus on community-based KGs [12]; here we plan to contribute with a systematic review analyzing more recent studies. Since Wikidata, the most popular community-based KG, does not use conventional methods such as OWL ontologies to represent data terminology [14], our first efforts focused on formalizing the constraints created by the community itself [4, 5]. Our next steps consist of analyzing sets of violations and identifying repair patterns in the KG to build semi-automatic repair approaches. In the future, we hope to build a solution capable of suggesting relevant fixes to the community according to different objective functions.

Acknowledgements

This research is conducted under the supervision of Prof. Dr. Axel Polleres.

References

1. Ahmetaj, S., David, R., Polleres, A., Šimkus, M.: Repairing SHACL constraint violations using answer set programming. In: The Semantic Web – ISWC 2022: 21st International Semantic Web Conference, Virtual Event, October 23–27, 2022, Proceedings. pp. 375–391. Springer (2022)
2. Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, December 5–8, 2013, Lake Tahoe, Nevada, United States. pp. 2787–2795 (2013)
3. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing Wikidata to the linked data web. In: International Semantic Web Conference. pp. 50–65. Springer (2014)
4. Ferranti, N., De Souza, J.F., Polleres, A.: Formalizing and validating Wikidata's property constraints using SHACL+SPARQL. https://www.semantic-web-journal.net/system/files/swj3378.pdf, under review
5. Ferranti, N., Polleres, A., de Souza, J.F., Ahmetaj, S.: Formalizing property constraints in Wikidata.
In: Proceedings of the Wikidata Workshop co-located with the 21st International Semantic Web Conference (ISWC 2022), Hangzhou, China, October 23–27, 2022 (2022)
6. Figuera, M., Rohde, P.D., Vidal, M.: Trav-SHACL: Efficiently validating networks of SHACL constraints. In: Leskovec, J., Grobelnik, M., Najork, M., Tang, J., Zia, L. (eds.) WWW '21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19–23, 2021. pp. 3337–3348. ACM / IW3C2 (2021). https://doi.org/10.1145/3442381.3449877
7. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot ASP solving with clingo. Theory and Practice of Logic Programming 19(1), 27–82 (2019)
8. Hogan, A., Blomqvist, E., Cochez, M., d'Amato, C., Melo, G.d., Gutierrez, C., Kirrane, S., Gayo, J.E.L., Navigli, R., Neumaier, S., et al.: Knowledge graphs. Synthesis Lectures on Data, Semantics, and Knowledge 12(2), 1–257 (2021)
9. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al.: DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015)
10. Martin, D., Patel-Schneider, P.F.: Wikidata constraints on MARS. In: Wikidata@ISWC (2020)
11. Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3), 489–508 (2017)
12. Paulheim, H., Gangemi, A.: Serving DBpedia with DOLCE – more than just adding a cherry on top. In: International Semantic Web Conference. pp. 180–196. Springer (2015)
13. Peffers, K., Tuunanen, T., Rothenberger, M.A., Chatterjee, S.: A design science research methodology for information systems research. Journal of Management Information Systems 24(3), 45–77 (2007)
14. Piscopo, A., Simperl, E.: Who models the world? Collaborative ontology creation and user roles in Wikidata. Proceedings of the ACM on Human-Computer Interaction 2(CSCW), 1–18 (2018)
15.
Shenoy, K., Ilievski, F., Garijo, D., Schwabe, D., Szekely, P.: A study of the quality of Wikidata. Journal of Web Semantics 72, 100679 (2022)
16. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)