<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semantic Framework for Evaluating Post-hoc Explanations in Link Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Balbi</string-name>
          <email>lbalbi@lasige.di.fc.ul.pt</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felix Bindt</string-name>
          <email>felix.bindt@wur.nl</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katja Breitenfelder</string-name>
          <email>katja.breitenfelder@ibp.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Campi</string-name>
          <email>riccardo.campi@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jitse De Smet</string-name>
          <email>jitse.desmet@ugent.be</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia d'Amato</string-name>
          <email>claudia.damato@uniba.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Lab, Polytechnic University of Milan</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Bari</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fraunhofer Institute for Building Physics</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>LASIGE, Faculty of Sciences, University of Lisbon</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Wageningen University &amp; Research</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Knowledge Graphs (KGs) are often noisy or incomplete, and Link Prediction (LP) methods, especially those based on black-box KG embeddings, are employed to predict missing facts. Pushed by the need for trust in inferred facts, many methods for LP explanation (LP-X) have been created. However, comparing them is still an open issue due to the multiple existing protocols. To address this gap, we envision the design of an automated and unified evaluation framework for post-hoc LP-Xs that allows for a systematic and operationalized computation and comparison of LP explanations. To offer a pragmatic view of our proposition, we extend the Explanation Ontology (EO) by enriching it with evaluation-specific constructs, thus providing a shared semantic model (i.e., a structured knowledge representation such as an ontology) that unifies LP-X methods, evaluation dimensions, and associated metrics. The model could be further extended to broader XAI methods. As a proof-of-concept, we instantiate the proposed EO extension with LP-DIXIT, a user-aware algorithmic explanation evaluation method, demonstrating the ontology's ability to address the targeted problem. Furthermore, we sketch a solution that exploits the semantic model not only for the annotation and retrieval of different evaluation approaches based on multiple dimensions, but also for automating/operationalizing the evaluation of LP-Xs, given dimensions of interest. The paper offers a view towards the foundation for a unified evaluation of post-hoc LP-Xs, and drafts the ground for automated user-centric assessment workflows.</p>
      </abstract>
      <kwd-group>
        <kwd>explainable AI</kwd>
        <kwd>post-hoc explanations</kwd>
        <kwd>link prediction</kwd>
        <kwd>explanation evaluation</kwd>
        <kwd>ontologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) are formal machine-processable representations of knowledge consisting of
entities (nodes) and binary relations (edges) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite their proven utility, KGs are often noisy and/or
incomplete, as they result from a complex construction process. Hence, Link Prediction (LP) methods,
which aim at predicting missing facts, are leveraged to complete KGs. LP tasks are mostly solved by
Knowledge Graph Embedding (KGE) models, which have shown good performance [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Nevertheless, their
black-box nature has raised the need for (post-hoc) explanations, especially in fields where LP may imply
critical decisions, such as finance or pharmacology. For example, LP can be used to find new targets for
existing drugs, reducing drug development costs. LP eXplanations (LP-X) would then clarify why and
how predictions are made, enhancing trust and helping stakeholders to make informed decisions.
      </p>
      <p>
        LP-X methods for KGs [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] often adopt evaluation protocols and metrics tailored to specific LP
algorithms and benchmarks, with limited emphasis on enabling systematic evaluation and comparison
among them. Explanations are sometimes assessed across various dimensions, such as their impact
on predictive task performance, their usefulness to users, and their overall clarity. This lack of
standardisation hampers reproducibility and makes it difficult to identify trade-offs, determine best
practices, and benchmark new methods against established baselines.
      </p>
      <p>An appealing solution involves using ontologies to broadly model LP-X methods and their evaluations.
By representing them in a shared conceptual space, such a conceptualisation may not only categorise
existing approaches but also drive the operationalised design of a systematic and unified evaluation
protocol, with the potential to be expanded to novel/additional evaluation dimensions. This would allow
users to assess LP-X methods and evaluation protocols across domains, model types, and explanation
styles, while covering all evaluation perspectives.</p>
      <p>
        One promising resource is the Explanation Ontology (EO) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which describes XAI methods in terms
of their inputs, outputs, and underlying assumptions, but lacks any formalisation of their evaluation
processes and protocols. Without a conceptualisation for defining and selecting evaluation
dimensions and corresponding metrics, it remains challenging to perform a comprehensive and consistent
comparison of these methods.
      </p>
      <p>In this position paper, we argue that a unified, ontology-driven approach can overcome this
fragmentation. We further sketch a solution that extends the EO to define LP-X evaluation
dimensions jointly with corresponding metrics, with the final goal of providing a comprehensive
conceptualisation for automating the evaluation of explanations, particularly those produced by post-hoc
LP-X solutions. Indeed, encoding explanation methods, metrics, and the evaluation protocol within the same
semantic model (i.e., a structured knowledge representation like an ontology) facilitates retrieving and
applying the most appropriate metrics for any given dimension when evaluating post-hoc explanation
solutions, enabling comparability, reproducibility, and coverage across LP scenarios and beyond.</p>
      <p>We frame two Research Questions (RQs) to support our position:</p>
      <p>RQ 1. What are the dimensions for assessing/evaluating post-hoc LP explanations?
RQ 2. Can a semantic model be adopted to realize a unified automated framework for evaluating
different LP-X methods?</p>
      <p>RQ 1 lays the groundwork by identifying and organising the essential evaluation dimensions. RQ 2
tests the hypothesis that an ontology-driven automated system would be capable of employing the
right evaluation methods and settings across diverse dimensions.</p>
      <p>The paper is organised as follows: Sect. 2 provides an overview of post-hoc LP-X methods and
evaluation protocols. Sect. 3 introduces some key notions for our proposal. Sect. 4 outlines our
suggested direction and the design of the solution. Sect. 5 showcases a proof-of-concept to validate the
solution. Sect. 6 recaps our position and suggestions and outlines directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>In this work, we specifically target post-hoc LP-X solutions. As such, in this section, we survey the
main state-of-the-art approaches in this direction and the classes of metrics adopted for their evaluation.</p>
      <p>
        Post-hoc LP-X methods differ in (1) the form of explanation they produce (e.g., facts, paths, rules,
subgraphs), (2) their compatibility with underlying KGE models, and (3) the evaluation protocol applied,
often without standardisation across studies. Early approaches [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] generate single-fact explanations
via perturbations or influence functions, but are typically limited to specific model types. Successive
solutions [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] enhance flexibility by introducing post-training modules applicable to most embedding
models. Other works extract relevant sets of facts or neighborhood subgraphs: e.g., Baltatzis and
Costabello [9] employ knowledge distillation on sampled subgraphs, while Zhao et al. [10] identify
substructures based on information gain, and Ma et al. [11] use greedy search to isolate the subgraphs
most relevant to a prediction. Broader techniques extend beyond (sets of) triples: Amador-Domínguez
et al. [12] output ontological axioms or factual triples, supported by template-based natural language
generation; Betz et al. [13] use adversarial abduction over learned rules; and Ismaeil et al. [14] extract
interpretable features from embedding vectors for downstream tasks. Path-based approaches [
        <xref ref-type="bibr" rid="ref6 ref15">6, 15</xref>
        ]
identify semantically similar paths using relation and entity similarity. Other solutions [16] adopt rule
mining, evaluated through classification performance.
      </p>
      <p>Existing protocols for evaluating LP-X typically fall under one of the following aspects [17, 18]:
1. Functionally grounded: Assess LP-X without human subjects, by probing the KG and the
model’s scoring. Their measures capture model response, e.g., changes in model decisions
(faithfulness), agreement with a surrogate model of the explanation (fidelity), similarity of
explanations across perturbations (stability), and axiom violations or entailments (consistency),
among others.
2. Human grounded: Evaluation over tasks that probe comprehensibility and practical usefulness,
focusing on measures of accuracy, time-to-decision, preference proportions, trust rate and
user agreement, among others.
3. Application grounded: Human involvement in explanation evaluation within the context of a
given domain application, with measures that capture whether explanations actually improve
decisions and workflows (e.g., utility, expert acceptance rate).</p>
      <p>Each measure can be tied to a metric score that operationalizes its evaluation aspect (e.g., a measure
of faithfulness can be tied to a necessity/sufficiency rate).</p>
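      <p>As an illustrative sketch (one possible operationalization, not a metric prescribed by the works cited above), a necessity-style faithfulness rate over a set of predicted triples T, where each triple x has explanation Y(x) and the LP model re-scored on graph G is denoted f_G, can be written as</p>
      <disp-formula><tex-math>\mathrm{necessity}(T) \;=\; \frac{1}{|T|}\sum_{x \in T} \mathbb{1}\!\left[\, x \notin \mathrm{TopK}\big(f_{G \setminus Y(x)}\big) \right]</tex-math></disp-formula>
      <p>i.e., the fraction of predictions no longer returned once their explanation triples are removed; a sufficiency-style rate can analogously be obtained by keeping only Y(x) instead of removing it.</p>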
      <p>Despite this rich landscape, evaluation approaches have typically been applied only to a subset of
these metrics in isolation, without a standardized and unified protocol for selecting and combining
them. For the human-grounded and application-grounded families in particular, recent work
highlights the benefits of using large language models as a proxy for human judgments [19, 20].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Basics</title>
      <p>This section surveys the evaluation dimensions identified in the state of the art on explanation
evaluation [21] and their relevance to the general evaluation of LP-X. Evaluation dimensions that are
not deemed relevant in the context of this work are marked with †; evaluation dimensions that require
some underlying distance metric, still to be defined, are marked with Δ. We conclude this section by
arguing for the need of actually computable metrics, which are currently mostly missing.</p>
      <sec id="sec-3-1">
        <title>Functionally grounded</title>
        <p>Faithfulness measures the accuracy of the explanation with respect to the prediction. For LP-X,
this means that if the explanation of a triple x is a set of triples Y, then removing a subset of Y should
cause the LP model to no longer predict x. In that case the LP-X model is faithful to the LP model, since
removing the justification does indeed mean that the prediction would not have been made. Location
accuracy† measures the ability of an explanation model to localize the explanation correctly with
respect to some points of interest within the ground truth. Since the concrete meaning of points of
interest is ill-defined within the context of LP-X, we ignore this dimension in the remainder of our work.
Completeness measures how much of the actual reason is covered by the explanation, necessitating a
ground truth. Overlap† is a dimension specifically targeting rule-based systems; since we want to
operate in the general case, we do not consider it further. Accuracy† is a metric usable when a
surrogate model provides the explainability. Again, since we want to model the general evaluation, we
do not consider this metric further. Architectural complexity† and algorithmic complexity† are
two metrics that are hard, if not impossible, to compare across different explainability settings.
StabilityΔ measures the stability of an explanation given a change in the underlying data. Within LP-X,
this means measuring how different the explanation is given an independent modification of the
underlying graph available to the explanator. Consistency†Δ measures the change in the explanator’s
output given a small change in the information to be explained. Since the general LP problem does not
predict literals (e.g. strings, dates, numbers), this is not relevant in the general setting: a ’slightly
different’ prediction does not exist, as each difference in relation concerns a different resource.
SensitivityΔ, in the context of LP-X, measures how different the explanation of the X-model is when
provided with a different input triple to explain. Essentially, it penalizes models that would always
provide the same explanation regardless of what they need to explain.
Expressiveness measures the level of detail used by the X-model within the formal model (e.g. triple count for LP-X).</p>
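        <p>To make the faithfulness check concrete, the following minimal sketch (assuming a hypothetical
LP-model factory and a predicts() interface; the names are illustrative and not tied to any specific KGE
library) tests whether removing the explanation triples flips the prediction:</p>
        <preformat>
def necessity(model_factory, graph, prediction, explanation):
    """Necessity-style faithfulness check: does removing the explanation
    triples make the LP model stop predicting `prediction`?

    model_factory: callable that (re)trains an LP model on a set of triples
    graph: set of (head, relation, tail) triples
    prediction: the predicted triple being explained
    explanation: set of triples returned by the LP-X method
    """
    # Re-train (or re-score) on the graph without the explanation triples.
    ablated_model = model_factory(graph - set(explanation))
    # A faithful explanation implies the prediction is no longer made.
    return not ablated_model.predicts(prediction)


def necessity_rate(model_factory, graph, explained_predictions):
    """Fraction of (prediction, explanation) pairs passing the check."""
    checks = [necessity(model_factory, graph, p, e)
              for p, e in explained_predictions]
    return sum(checks) / len(checks)
        </preformat>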
      </sec>
      <sec id="sec-3-2">
        <title>Human grounded</title>
        <p>Interpretability/complexity measures how well a user can build a mental model of the explanation.
Effectiveness is the accuracy of human reasoning about the predicted triple after seeing the explanation.
It functions as a proxy for interpretability. Time efficiency† measures how long it takes a user to
build a viable mental model; since there is no explanation feedback loop, information amount acts
as a proxy for this measurement. Degree of understanding† measures, in interactive contexts, the
current status of understanding. This is not generally applicable to post-hoc LP-X, since the model
will typically explain once, and not be asked to generate more detailed explanations. Information
amount is the amount of information conveyed through the explanation. It could be measured by
something like the triple count of the explanation, but this is incomplete since an explanation triple could,
for example, involve a singleton property, in which case the human in the loop still receives a lot of
information in reality.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Application grounded</title>
        <p>Satisfaction measures how content the explainee is with the system; a well-known metric is, for
example, the System Usability Scale (SUS) score [22]. Persuasiveness measures how persuasive the
generated explanations are; whether high persuasiveness is a good or bad thing mostly depends on
the context, and the extremes are often avoided. Improvement of human judgment assesses to what
degree the user comes to trust the system; correct explanations should be trusted more. Improvement
of human-AI system performance measures the total system: why the link is to be predicted,
who wants it, and whether the explanation received improves the situation. Automation capability
tries to uncover whether a human actually spends less time identifying missing relations/links;
it asks to what extent the overall system reduces manual labour. Novelty measures whether
the predicted links, and the provided explanations, highlight novel discoveries. An example of a novel
discovery would be the prediction and justification of a triple ‘:somePill a :cureToCancer’.</p>
        <p>We conclude by highlighting that even though functionally grounded evaluation dimensions focus
on enabling automated testing, only a limited set of dimensions is actually applicable in the context
of LP-X, namely faithfulness, completeness and expressiveness. Moreover, both completeness
and expressiveness present challenges: completeness relies on a well-defined ground truth, while
expressiveness lacks robust evaluation metrics. This analysis underscores the need for better functionally
grounded dimensions and metrics, while also motivating the use of LLMs as a proxy for both human-
and application-grounded dimensions, as described by Barile et al. [19].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Semantic Framework</title>
      <p>
        The proposed semantic evaluation framework is grounded on a conceptualisation that describes LP-X
evaluation methods and settings. This framework will be used to enable systems/agents to select and
execute evaluation protocols in a unified manner, fostering the automation and standardisation of
LP-X evaluations. Specifically, given the EO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we showcase its extension to serve as the exoskeleton
of an evaluation system (adding evaluation terms that are missing in EO), by encoding high-level concepts
of LP-X evaluation as ontology classes and properties.
      </p>
      <p>In Sect. 4.1 we describe the EO, and in Sect. 4.2 we illustrate its extension to support our proposal of a
semantics-driven solution to build a unified and automated LP-X evaluation framework. Sect. 4.3 drafts
the envisioned solution for automating LP-X evaluation.</p>
      <sec id="sec-4-1">
        <title>4.1. Explanation Ontology</title>
        <p>
          The EO [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a general purpose semantic model to represent and connect user-centric explanations
to the underlying data and knowledge with the end-goal of making model recommendations more
explainable. As illustrated in Fig. 2, reported in Appendix A.1, the ontology is organised into three
conceptual layers: the User Layer, which models user-centric goals and preferences (e.g. ExplanationGoal
and UserProfile); the Interface Layer, which captures explanation modalities and presentation formats (e.g.
ExplanationModality); and the System Layer, which describes the internal representation of explainers,
linking to data sources, models and provenance (e.g. ExplanationMethod, SystemRecommendation).
        </p>
        <p>Although its current format covers explanation generation thoroughly and supports a broad range
of state-of-the-art explainer methods, EO still lacks formal constructs to describe the evaluation of
(post-hoc) LP-X solutions. For this reason, we propose an extension of the EO in Sect. 4.2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Extending the Explanation Ontology</title>
        <p>One common strategy to guide ontology development and enrichment, and to assess the quality of
an ontology with respect to a specific application, is using competency questions (CQs), i.e. questions
formulated in natural language representing the requirements to be answered using data structured
according to the ontology [23]. Since these are questions with established or verifiable answers, they also
function as a form of content validation, determining whether the ontology fits the requirements
and is structurally sound.</p>
        <p>We therefore propose the following CQs as a first step to assess the capability of our targeted extended
EO (EEO) to support the proposed evaluation system:</p>
        <p>CQ1: Which measure(s) are available for an evaluation aspect/dimension or set of aspects YY?
CQ2: Which method(s) should be used to perform an explanation evaluation on the dimension/
set of aspects YY and set of measures ZZ?</p>
        <p>The extension of EO (EEO), available at https://doi.org/10.5281/zenodo.15658539, follows a top-down
approach: we first define the high-level concepts of LP-X evaluation as classes and properties, and then
specify the more specific concepts they may contain. These concepts were selected from the dimensions
illustrated in Sect. 3.</p>
        <p>We list a minimal set of currently missing high-level classes and properties that we deem essential
for encoding the evaluation process in the ontology and for allowing the instantiation of LP-X methods
and procedures. The full list of classes and properties is reported and documented in Appendices A.3
and A.4, while the figure reported in Appendix A.2 illustrates the schema that results from extending
the EO. At a high level, we introduce classes for Explanation Evaluation, Evaluation Measure and
Quantitative Measure that in turn contain subclasses pertaining to the dimensions, measures and
metrics identified in the SOTA [21] (and summarized in Sect. 3). We deem these classes necessary but
extensible for modeling further LP-X/XAI evaluation dimensions. We have therefore adopted Protégé
5.6.3 to introduce the listed classes in the EO within a novel eeo: namespace corresponding to the EEO.
These classes, as well as the newly created object properties, proved logically consistent with the existing
EO. The EEO retains full backward compatibility with EO while providing the semantic hooks needed to
model evaluation workflows. Sect. 5 illustrates the instantiation of the EEO, showcasing the support of
existing (and newly developed) LP-X solutions.</p>
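        <p>Such a consistency check can be reproduced with any OWL reasoner. The following minimal sketch
uses owlready2 and its bundled HermiT reasoner (the local file path is an assumption about how the
EEO is serialized, not a published artifact):</p>
        <preformat>
from owlready2 import get_ontology, sync_reasoner, default_world

# Load the extended Explanation Ontology from a local file (assumed path).
eeo = get_ontology("file:///path/to/eeo.owl").load()

# Run the HermiT reasoner bundled with owlready2 (requires Java).
with eeo:
    sync_reasoner()

# An empty list means no class was inferred to be unsatisfiable,
# i.e. the added classes and properties are logically consistent.
print(list(default_world.inconsistent_classes()))
        </preformat>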
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Automated LP-X explanation evaluation</title>
        <p>The proposed EEO should enable the automation of evaluation protocols for LP-X models. To support
this, the EEO could be integrated in a holistic (agentic) solution that, given a user-provided LP
problem and LP-X method(s) output data, allows: 1) the querying of the EEO for different explanation
evaluation dimension(s); 2) the collection of LP-X evaluation methods and metrics supporting the
queried dimension(s) and input LP problem; and 3) the automated execution of the LP-X evaluation
protocols that support those methods (see Figure 1). In particular, the agent should be able to recognise
the LP-X method and explanation type provided by the user and translate them into a SPARQL query to
retrieve all relevant dimensions, metrics and protocols from the EEO. The query results should serve as
input to the evaluation module along with the original explanation data. The agent would then return
a complete LP-X evaluation report to the user, with the evaluation methods’ results organized along the
different dimensions. A unified, end-to-end system of this kind could not only reduce the cognitive
burden for researchers but also standardize LP-X evaluation protocols.</p>
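        <p>A minimal sketch of this workflow, assuming the EEO is serialized to a local file and that each
retrieved metric IRI is mapped to a locally implemented metric function (the file name, the query shape
and the METRIC_IMPL registry below are illustrative assumptions, not artifacts shipped with the EEO),
could look as follows:</p>
        <preformat>
from rdflib import Graph

# Load the extended Explanation Ontology (assumed local serialization).
eeo = Graph().parse("eeo.ttl")

# Steps 1-2: retrieve measures and metrics covering the queried aspects.
QUERY = """
PREFIX eeo: &lt;https://purl.org/heals/eo#&gt;
SELECT DISTINCT ?measure ?metric WHERE {
  ?method a eeo:EvaluationMethod ;
          eeo:addressesAspect ?aspect ;
          eeo:usesMeasure ?measure .
  ?measure eeo:quantifiedBy ?metric .
}
"""

# Step 3: dispatch each metric IRI to a local implementation,
# e.g. {"https://purl.org/heals/eo#NecessityRate": necessity_rate}
# (hypothetical IRI and function).
METRIC_IMPL = {}

def evaluate(explanation_data):
    """Run every metric the EEO associates with the retrieved measures."""
    report = {}
    for row in eeo.query(QUERY):
        impl = METRIC_IMPL.get(str(row.metric))
        if impl is not None:
            report[str(row.measure)] = impl(explanation_data)
    return report
        </preformat>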
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Proof-of-Concept</title>
      <p>Our proof-of-concept covers validating the EEO and verifying its compliance with predefined
requirements by answering the CQs established in Sect. 4 through corresponding SPARQL queries. For
this purpose, we instantiate our ontology considering LP-DIXIT [19], an algorithmic and user-centric
LP-X evaluation solution. The method provides an evaluation dimension and measure relevant for
testing this approach: it assesses the user-grounded aspect, over the utility of explanations, with a
quantitative forward-simulatability variation metric capturing the improvement of a user’s prediction
accuracy when given an explanation. The method was used to populate the EEO through the
eeo:EvaluationMethod class and annotated with the pertaining evaluation measure, a subclass
of eeo:EvaluationMeasure.</p>
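      <p>A minimal sketch of such an instantiation with rdflib (the instance IRIs LP-DIXIT and
ForwardSimulatability below are illustrative, not the identifiers published with the EEO) is:</p>
      <preformat>
from rdflib import Graph, Namespace, RDF

EEO = Namespace("https://purl.org/heals/eo#")

g = Graph()
g.parse("eeo.ttl")  # assumed local serialization of the EEO

lp_dixit = EEO["LP-DIXIT"]              # illustrative method instance IRI
fwd_sim = EEO["ForwardSimulatability"]  # illustrative measure instance IRI

# LP-DIXIT is an evaluation method using a forward-simulatability measure
# that targets the user-grounded aspect (the aspect individual is sketched
# here by reusing the class IRI, a simplification).
g.add((lp_dixit, RDF.type, EEO.EvaluationMethod))
g.add((fwd_sim, RDF.type, EEO.EvaluationMeasure))
g.add((lp_dixit, EEO.usesMeasure, fwd_sim))
g.add((fwd_sim, EEO.measuresAspect, EEO.UserGroundedAspect))
      </preformat>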
      <p>To answer our previously defined CQs, we designed SPARQL queries (Listings 1 and 2) specified for the
aspect and measure evaluated in the LP-DIXIT method. Through these, we demonstrate that the utility
of explanations, reflected by user agreement, can be captured within the user-perspective dimension
of our semantic model. Furthermore, not only do the queries provide the classes linking dimensions and
metrics for evaluating explanations (Listing 1), but they also return LP-DIXIT as an instance describing
the methodology (Listing 2), confirming a successful mapping.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Position Summary</title>
      <p>Despite the emerging works on evaluating current XAI methods on their different characteristics and
along several axes of explainability, there is still a lack of systematized benchmarks and of unified
systems for comparing the different evaluation approaches.</p>
      <p>After a broad review of the SOTA in XAI evaluation, we identified several of the evaluation
dimensions/aspects that have been approached so far and propose an ontology-driven evaluation system
for post-hoc explanations in KG LP tasks. Our proposed system maps current methodologies to their
evaluation protocols, aspects and measures to enable structured and unified evaluation workflows.</p>
      <p>To do so, we suggest extending an existing ontology, EO, to incorporate XAI evaluation constructs.
Furthermore, we provide a proof-of-concept approach to validate the EEO and its incorporation into
the proposed evaluation framework via answering an initial pair of CQs. These CQs retrieve
protocols and metrics for evaluating post-hoc LP-X methods along the user-experience dimension.
Moreover, we propose an automated system integrating the EEO together with an agent to relieve
workload and standardize evaluations.</p>
      <p>To conclude, we encourage researchers to adopt ontology-driven or similar approaches to
build universal and systematic XAI evaluation frameworks that allow for consistent comparison and
benchmarking across XAI methods.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Laura Balbi is funded by FCT through grant number 2024.01208.BD and partially funded by the EU
through the KATY project (grant agreement No 101017453). Jitse De Smet is a predoctoral fellow of the
Research Foundation – Flanders (FWO) (1SB8525N). Felix Bindt is funded by the Strategic Program of the
Netherlands National Institute for Public Health and the Environment (RIVM) (No S/133030/01). Katja
Breitenfelder acknowledges the support of Fraunhofer IBP, Valley, Germany, which contributed to the
completion of this research. Riccardo Campi is funded by the European Commission’s Horizon Europe
project ENERGENIUS, Project ID 101160720. Claudia d’Amato was partially supported by project FAIR
Future AI Research (PE00000013), spoke 6 - Symbiotic AI (https://future-ai-research.it/) under the PNRR
MUR program funded by the European Union - NextGenerationEU, and by PRIN project HypeKG
Hybrid Prediction and Explanation with Knowledge Graphs (Prot. 2022Y34XNM, CUP H53D23003700006)
under the PNRR MUR program funded by the European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Appendix</title>
      <sec id="sec-8-1">
        <title>A.1. High Level Picture of the Explanation Ontology</title>
        <p>
          This section reports Figure 2, providing a high-level description of the EO [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. High Level Picture of the Extended Explanation Ontology</title>
        <p>This section reports Figure 3, providing a high-level description of the EEO.</p>
      </sec>
      <sec id="sec-8-3">
        <title>A.3. Class Definitions</title>
        <p>List of added high-level classes:
eeo:Evaluation - Denotes an assessment activity.
eeo:ExplanationEvaluation - Represents an evaluation with an explanation as input.
eeo:PostHocExplanationEvaluation - Subclass of eeo:ExplanationEvaluation that denotes the
evaluation of XAI approaches employed over ML models after training.
eeo:EvaluationMethod - Class describing the evaluation procedure, with subclasses that capture
local/global settings and task types.
eeo:EvaluationAspect - Top-level class for the dimension or quality being assessed
(functionally grounded, application grounded or user grounded).
eeo:FunctionallyGroundedAspect - Aspects measurable without human intervention, e.g.
fidelity, monotonicity.
eeo:ApplicationGroundedAspect - Aspects stemming from evaluation over domain-expert
tasks.
eeo:UserGroundedAspect - Aspects that require a human study or a simulation study with an
agent component, e.g. on interpretability.
eeo:EvaluationMeasure - A measure or metric used to assess explanations on a given
evaluation aspect.
eeo:QuantitativeMeasure - Defines a concrete evaluation metric of quantitative nature (e.g.
accuracy, recall, information content score).
eeo:EvaluationResult - Top-level class that represents the outcome of an Evaluation Method,
linking evaluation measures to their quantitative values.
eeo:EvaluationAgent - Defines the actor conducting the evaluation, of human or automated
nature.</p>
      </sec>
      <sec id="sec-8-3">
        <title>A.4. Property Definition</title>
        <p>List of added object properties between high-level classes:
eeo:evaluatesExplanation Domain: eeo:ExplanationEvaluation Range: eeo:Explanation
eeo:hasMethod Domain: eeo:ExplanationEvaluation Range: eeo:EvaluationMethod
Axiom: eeo:ExplanationEvaluation SubClassOf (eeo:hasMethod min 1 eeo:EvaluationMethod)
eeo:addressesAspect Domain: eeo:EvaluationMethod Range: eeo:EvaluationAspect
Axiom: eeo:EvaluationMethod SubClassOf (eeo:addressesAspect min 1 eeo:EvaluationAspect)
eeo:usesMeasure Domain: eeo:EvaluationMethod Range: eeo:EvaluationMeasure
Axiom: eeo:EvaluationMethod SubClassOf (eeo:usesMeasure min 1 eeo:EvaluationMeasure)
eeo:hasAgent Domain: eeo:EvaluationMethod Range: eeo:EvaluationAgent
Axiom: eeo:EvaluationMethod SubClassOf (eeo:hasAgent min 1 eeo:EvaluationAgent)
eeo:producesResult Domain: eeo:EvaluationMethod Range: eeo:EvaluationResult
Axiom: eeo:EvaluationMethod SubClassOf (eeo:producesResult min 1 eeo:EvaluationResult)
eeo:measuresAspect Domain: eeo:EvaluationMeasure Range: eeo:EvaluationAspect
eeo:quantifiedBy Domain: eeo:EvaluationMeasure Range: eeo:QuantitativeMeasure</p>
        <p>List of added data properties:
eeo:hasValue Domain: eeo:QuantitativeMeasure Range: xsd:decimal
eeo:hasUnit Domain: eeo:QuantitativeMeasure Range: xsd:string</p>
        <p>PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX eeo: &lt;https://purl.org/heals/eo#&gt;
SELECT DISTINCT ?EvalMeasure ?Metric
WHERE {
  VALUES (?EvalAspect) { (Aspect_IRI) }
  ?EvalMethod rdf:type eeo:EvaluationMethod .
  ?EvalMethod eeo:addressesAspect ?EvalAspect .
  ?EvalMethod eeo:usesMeasure ?EvalMeasure .
  ?EvalMeasure eeo:measuresAspect ?EvalAspect .
  ?EvalMeasure eeo:quantifiedBy ?Metric .
}
Listing 1: SPARQL query for proof-of-concept answering CQ1, "Which measure(s) are available for a
specified evaluation aspect or set of aspects YY?"</p>
        <p>PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX rdf: &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX eeo: &lt;https://purl.org/heals/eo#&gt;
SELECT DISTINCT ?EvalMethod ?Metric
WHERE {
  VALUES (?EvalMeasure) { (Utility_IRI) }
  ?EvalMethod rdf:type eeo:EvaluationMethod .
  ?EvalMethod eeo:addressesAspect ?EvalAspect .
  ?EvalMethod eeo:usesMeasure ?EvalMeasure .
  ?EvalMeasure eeo:measuresAspect ?EvalAspect .
  ?EvalMeasure eeo:quantifiedBy ?Metric .
}
Listing 2: SPARQL query for proof-of-concept answering CQ2, "Which method(s) should be used to
perform an explanation evaluation on the dimension/set of aspects YY and set of measures ZZ?"</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. de Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmelzeisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>54</volume>
          (
          <year>2022</year>
          )
          <volume>71</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>71</lpage>
          :
          <fpage>37</fpage>
          . URL: https://doi.org/10.1145/3447772. doi:
          <volume>10</volume>
          .1145/3447772.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Knowledge graph embedding: A survey from the perspective of representation spaces</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>56</volume>
          (
          <year>2024</year>
          )
          <volume>159</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>159</lpage>
          :
          <fpage>42</fpage>
          . URL: https://doi.org/10.1145/ 3643806. doi:
          <volume>10</volume>
          .1145/3643806.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pezeshkpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Investigating robustness and interpretability of link prediction via adversarial modifications</article-title>
          ,
          <source>in: 1st Conference on Automated Knowledge Base Construction, AKBC</source>
          <year>2019</year>
          , Amherst, MA, USA, May
          <volume>20</volume>
          -22,
          <year>2019</year>
          ,
          <year>2019</year>
          . URL: https://openreview.net/forum?id=
          <fpage>Hkg7rbcp67</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , T. Zheng,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Data poisoning attack against knowledge graph embedding</article-title>
          , in: S. Kraus (Ed.),
          <source>Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI</source>
          <year>2019</year>
          , Macao, China,
          <source>August 10-16</source>
          ,
          <year>2019</year>
          , ijcai.org,
          <year>2019</year>
          , pp.
          <fpage>4853</fpage>
          -
          <lpage>4859</lpage>
          . URL: https://doi.org/10.24963/ijcai.
          <year>2019</year>
          /674. doi:
          <volume>10</volume>
          .24963/IJCAI.
          <year>2019</year>
          /674.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Ghalwash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shirai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Gruen</surname>
          </string-name>
          , P. Meyer, P. Chakraborty,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <article-title>Explanation ontology: A general-purpose, semantic representation for supporting user-centered explanations</article-title>
          ,
          <source>Semantic Web</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>959</fpage>
          -
          <lpage>989</lpage>
          . URL: https://doi.org/10.3233/ SW-233282. doi:
          <volume>10</volume>
          .3233/SW-233282.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Paudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Interaction embeddings for prediction and explanation in knowledge graphs</article-title>
          , in: J. S. Culpepper,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          Lerman (Eds.),
          <source>Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM</source>
          <year>2019</year>
          ,
          <article-title>Melbourne</article-title>
          ,
          <string-name>
            <surname>VIC</surname>
          </string-name>
          , Australia,
          <source>February 11-15</source>
          ,
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>96</fpage>
          -
          <lpage>104</lpage>
          . URL: https://doi.org/10.1145/3289600.3291014. doi:
          <volume>10</volume>
          .1145/3289600.3291014.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          , T. Teofili,
          <article-title>Explaining link prediction systems based on knowledge graph embeddings</article-title>
          , in: Z. G. Ives,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. E.</surname>
          </string-name>
          Abbadi (Eds.),
          <source>SIGMOD '22: International Conference on Management of Data</source>
          , Philadelphia, PA, USA, June 12 - 17,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>2062</fpage>
          -
          <lpage>2075</lpage>
          . URL: https://doi.org/10.1145/3514221.3517887. doi:
          <volume>10</volume>
          .1145/3514221.3517887.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Barile</surname>
          </string-name>
          , C. d'Amato,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fanizzi</surname>
          </string-name>
          ,
          <article-title>Explanation of link predictions on knowledge graphs via levelwise filtering and graph summarization</article-title>
          , in: A.
          <string-name>
            <surname>Meroño-Peñuela</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Troncy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Acosta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          , P. Lisena (Eds.),
          <source>The Semantic Web - 21st International Conference, ESWC</source>
          <year>2024</year>
          , Hersonissos, Crete, Greece, May
          <volume>26</volume>
          -30,
          <year>2024</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>I</given-names>
          </string-name>
          , volume 14664 of Lecture Notes in Computer Science, Springer, 2024, pp. 180–198. URL:
          https://doi.org/10.1007/978-3-031-60626-7_10. doi:10.1007/978-3-031-60626-7_10.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] V. Baltatzis, L. Costabello, KGEx: Explaining knowledge graph embeddings via subgraph sampling and knowledge distillation, in: S. Villar, B. Chamberlain (Eds.), Learning on Graphs Conference, 27-30 November 2023, Virtual Event, volume 231 of Proceedings of Machine Learning Research, PMLR, 2023, p. 27. URL: https://proceedings.mlr.press/v231/baltatzis24a.html.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] D. Zhao, G. Wan, Y. Zhan, Z. Wang, L. Ding, Z. Zheng, B. Du, KE-X: towards subgraph explanations of knowledge graph embedding based on knowledge information gain, Knowl. Based Syst. 278 (2023) 110772. URL: https://doi.org/10.1016/j.knosys.2023.110772. doi:10.1016/j.knosys.2023.110772.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Ma, X. Song, W. Tao, M. Li, J. Zhang, X. Pan, J. Lin, B. Song, X. Zeng, KGExplainer: Towards exploring connected subgraph explanations for knowledge graph completion, CoRR abs/2404.03893 (2024). URL: https://doi.org/10.48550/arXiv.2404.03893. doi:10.48550/arXiv.2404.03893. arXiv:2404.03893.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] E. Amador-Domínguez, E. Serrano, D. Manrique, GEnI: A framework for the generation of explanations and insights of knowledge graph embedding predictions, Neurocomputing 521 (2023) 199–212. URL: https://doi.org/10.1016/j.neucom.2022.12.010. doi:10.1016/j.neucom.2022.12.010.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] P. Betz, C. Meilicke, H. Stuckenschmidt, Adversarial explanations for knowledge graph embeddings, in: L. D. Raedt (Ed.), Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, ijcai.org, 2022, pp. 2820–2826. URL: https://doi.org/10.24963/ijcai.2022/391. doi:10.24963/ijcai.2022/391.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Ismaeil, D. Stepanova, T. Tran, H. Blockeel, FeaBI: A feature selection-based framework for interpreting KG embeddings, in: T. R. Payne, V. Presutti, G. Qi, M. Poveda-Villalón, G. Stoilos, L. Hollink, Z. Kaoudi, G. Cheng, J. Li (Eds.), The Semantic Web - ISWC 2023 - 22nd International Semantic Web Conference, Athens, Greece, November 6-10, 2023, Proceedings, Part I, volume 14265 of Lecture Notes in Computer Science, Springer, 2023, pp. 599–617. URL: https://doi.org/10.1007/978-3-031-47240-4_32. doi:10.1007/978-3-031-47240-4_32.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. d'Amato, P. Masella, N. Fanizzi, An approach based on semantic similarity to explaining link predictions on knowledge graphs, in: J. He, R. Unland, E. S. Jr., X. Tao, H. Purohit, W. van den Heuvel, J. Yearwood, J. Cao (Eds.), WI-IAT '21: IEEE/WIC/ACM International Conference on Web Intelligence, Melbourne, VIC, Australia, December 14-17, 2021, ACM, 2021, pp. 170–177. URL: https://doi.org/10.1145/3486622.3493956. doi:10.1145/3486622.3493956.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] N. A. Krishnan, C. R. Rivero, A model-agnostic method to interpret link prediction evaluation of knowledge graph embeddings, in: I. Frommholz, F. Hopfgartner, M. Lee, M. Oakes, M. Lalmas, M. Zhang, R. L. T. Santos (Eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, ACM, 2023, pp. 1107–1116. URL: https://doi.org/10.1145/3583780.3614763. doi:10.1145/3583780.3614763.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, arXiv preprint arXiv:1702.08608 (2017).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. B. Arrieta, N. D. Rodríguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI, Inf. Fusion 58 (2020) 82–115. URL: https://doi.org/10.1016/j.inffus.2019.12.012. doi:10.1016/j.inffus.2019.12.012.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] R. Barile, C. d'Amato, N. Fanizzi, LP-DIXIT: evaluating explanations for link predictions on knowledge graphs using large language models, in: G. Long, M. Blumenstein, Y. Chang, L. Lewin-Eytan, Z. H. Huang, E. Yom-Tov (Eds.), Proceedings of the ACM on Web Conference 2025, WWW 2025, Sydney, NSW, Australia, 28 April 2025 - 2 May 2025, ACM, 2025, pp. 4034–4042. URL: https://doi.org/10.1145/3696410.3714667. doi:10.1145/3696410.3714667.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] G. Schwalbe, B. Finzel, A comprehensive taxonomy for explainable artificial intelligence: a systematic survey of surveys on methods and concepts, Data Min. Knowl. Discov. 38 (2024) 3043–3101. URL: https://doi.org/10.1007/s10618-022-00867-8. doi:10.1007/s10618-022-00867-8.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] J. Brooke, SUS - a quick and dirty usability scale, Taylor &amp; Francis, 1996, pp. 189–194.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] C. Bezerra, F. Freitas, F. Santana, Evaluating ontologies with competency questions, in: 2013 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Atlanta, Georgia, USA, 17-20 November 2013, Workshop Proceedings, IEEE Computer Society, 2013, pp. 284–285. URL: https://doi.org/10.1109/WI-IAT.2013.199. doi:10.1109/WI-IAT.2013.199.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>