<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Study of the Consistency between Protocols for Evaluating Explanations of Predicted Links in Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Barile</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia d'Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Santovito</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Fanizzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CILA, Università degli Studi di Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Since knowledge graphs are often incomplete, link prediction methods are adopted to predict missing facts. Although scalable embedding models are commonly used for this purpose, they lack comprehensibility, which may be crucial in several domains. Explanation methods address this issue by identifying pieces of knowledge that support the predicted facts. Regrettably, quantitatively comparing the resulting explanations is challenging because there are different protocols and no insights on their consistency when evaluating the same explanation method. Filling this important gap, we measure their consistency, specifically as the correlation between the metrics resulting from evaluating the same explanation methods via different protocols. This requires evaluating the LP-X method CrossE in terms of a different protocol in addition to the ones introduced specifically for CrossE. We conduct experiments with different widely known knowledge graphs and embedding models. The outcomes suggest an overall consistency.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Link Prediction</kwd>
        <kwd>Explanation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge Graphs (KGs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are formal machine-processable representations of knowledge that conform
to graph-based data models consisting of entities (nodes) and binary relations (edges). KGs deliver
not only facts, but also intensional knowledge, which enables sound reasoning and is typically
represented through ontologies. Despite their proven utility in academic and business settings [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], KGs are often
noisy and/or incomplete, because the activities characterizing their life-cycle are often semi-automatic,
incremental, and distributed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Link Prediction (LP) methods aim at completing KGs by predicting
missing facts, and they mostly ground on Knowledge Graph Embedding (KGE) models, which offer
competitive accuracy and scalability [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. KGE models are representation learning solutions that encode
the elements of a KG as low-dimensional vectors (embeddings) preserving their structural properties,
which can be leveraged for tackling complex downstream tasks, such as LP, using efficient linear algebra
operations. Despite such advantages, these models lack comprehensibility, i.e., they are not traceable in
terms of operations on symbolic/explicit knowledge. This problem hampers the use of LP via KGE
models, particularly in fields where it is paramount that stakeholders comprehend predictions before
relying on them for making decisions with critical consequences. For example, the prediction of side
effects for a drug can be framed as an LP task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], but it is crucial that stakeholders comprehend the
predictions before relying on them, e.g., for decisions about funding further research on the drug.
      </p>
      <p>
        LP eXplanation (LP-X) methods [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] address this issue. Specifically, a post-hoc (i.e., applied after the prediction)
LP-X method works with a generic LP method and explains a prediction by selecting pieces of knowledge
(e.g., sets of facts) that are associated with the prediction.
      </p>
      <p>
        Nevertheless, multiple protocols exist for evaluating explanations, making it difficult to compare
solutions coming from different LP-X methods. The prominent protocols for evaluating LP-X
methods are re-training, introduced for evaluating Criage [
methods are re-training, introduced for evaluating Criage [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and also used for evaluating Kelpie [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and Kelpie++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and recall and support, proposed for evaluating CrossE [10] and also used for
evaluating SemanticCrossE [11]. Conducting such evaluations is challenging because there is not yet a
consensus on a standard protocol for evaluating explanations, and there are no insights on
the consistency/reliability of the inter-rater (inter-protocol) evaluation, i.e., the evaluation of the same
LP-X method via different protocols. For this purpose, we investigate the following Research Question
(RQ):
Are the evaluation protocols re-training, recall, and support consistent when evaluating the same LP-X
method?
      </p>
      <p>Measuring the consistency between such evaluation protocols requires comparing the values they
produce when applied to the same LP-X method. However, no state-of-the-art LP-X method has been
evaluated with all of the prominent protocols. This is because the lack of a standard evaluation protocol
has led to a proliferation of different ones, sometimes also tailored to the specific LP-X method to be
evaluated. Specifically, in this paper, we address the RQ by evaluating the LP-X methods CrossE and
SemanticCrossE, which can be readily evaluated via recall and support, also via re-training. This results
in an evaluation of the same methods (CrossE and SemanticCrossE) under three different protocols,
thus allowing us to answer our RQ by computing the correlation between the metrics resulting
from the different protocols.</p>
      <p>The rest of the paper is organized as follows. In § 2, we review state-of-the-art methods for computing
and evaluating explanations. In § 3, we recall the essential basics. In § 4, we detail the method for
measuring the consistency between the evaluation protocols, while in § 5, we illustrate the experiments.
In § 6, we summarize the achievements and suggest future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        This section analyzes the state-of-the-art LP-X methods along with the methods or protocols used for
evaluating their performance. The LP-X methods that we target, being generic with respect
to the LP method, explain a prediction by computing pieces of knowledge (e.g., sets of facts) that are
associated with the prediction. The first proposals explain a prediction by returning exactly one fact
(within the KG): DP [12] applies perturbations, whereas Criage [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] computes (approximate)
influence functions; the latter, though, applies only to a limited set of facts and to specific classes of KGE
models. More recent methods explain a prediction by returning a set of facts. Kelpie [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Kelpie++ [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
employ a post-training process. KE-X [13] is based on information gain and KGExplainer [14] adopts
greedy search and perturbations. Notably, GEnI [15] returns explanations also including ontological
axioms based on numerical criteria on (specific classes of) KGEs. Conversely, the method introduced
in [16] is grounded on abduction via learned rules. The resulting explanations are mainly evaluated by
re-training the KGE model, i.e., by comparing the LP performance of the original model with that of
a model trained on a modified KG where the facts in the explanations have been added, removed, or
isolated.
      </p>
      <p>CrossE [10] and SemanticCrossE [11] explain a prediction by identifying a path between the entities
in the prediction. They rely on similarity measures and evaluate explanations as the number of similar
paths connecting similar entities. Other methods return explanations other than sets of facts or paths.
For example, in [17] logical rules are mined to explain a set of predictions and are evaluated in terms
of classification performance on the explained predictions and synthetic negative (false) facts. With
FeaBI [18], interpretable vectors are extracted from KGEs via feature selection and are compared to those
learned with an interpretable LP method. The evaluation measures the influence of the LP explanations
on the solution of related tasks, without considering the user’s perspective. These evaluation protocols
do not allow comparing the explanations coming from the different approaches.</p>
      <p>Another direction for evaluating explanations is to provide datasets containing ground-truth
explanations to be compared with the computed ones. FR200K [19] and FRUNI and FTREE [20] include
hand-crafted rules that reflect domain knowledge and explain a fact by identifying those facts that
underpin the rules generating it. In FR200K each explanation is also rated by users in terms of (subjective)
intuitiveness, whereas in FRUNI and FTREE explanations are assumed to be valuable. Hence, FR200K
enables user-guided evaluation; however, its construction process hardly generalizes to a large scale due
to the required manual intervention.</p>
      <p>A complementary direction is represented by interpretable LP methods, which are LP methods with
a more understandable functioning. A comparison of different interpretable methods would amount to
comparing their functioning, which is beyond our purpose.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Basics</title>
      <p>A KG 𝒢(ℰ, ℛ) is a graph-based data structure, where ℰ is a set of nodes representing entities, and ℛ is
a set of predicates, representing binary relations between entities. A KG can be seen as a collection of
triples ⟨s, p, o⟩ ∈ ℰ × ℛ × ℰ, with a subject s, a predicate p and an object o, where s, o ∈ ℰ and p ∈ ℛ.</p>
      <p>LP methods calculate a ranking function rank : ℰ × ℛ × ℰ → ℕ that computes the position of a
given triple ⟨s, p, o⟩ in the set of triples { ⟨s, p, e⟩ | e ∈ ℰ } according to the confidence/plausibility
score computed via a KGE model. A triple in the KG 𝒢 is correctly predicted by the LP method if it is
top-ranked. The LP performance is typically evaluated in terms of the metrics:
• MRR: the average of the inverse of the obtained ranks
• H@1: the ratio of predictions for which the rank is 1</p>
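As a minimal illustration (our own sketch, not the authors' code), both metrics can be computed from the list of ranks obtained for a set of predictions:

```python
def mrr(ranks):
    """Mean Reciprocal Rank: the average of the inverses of the obtained ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_1(ranks):
    """H@1: the ratio of predictions whose rank is exactly 1."""
    return sum(1 for r in ranks if r == 1) / len(ranks)
```

For example, for ranks [1, 2, 4], MRR is (1 + 0.5 + 0.25) / 3 ≈ 0.583 and H@1 is 1/3.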
      <p>Next, let explain : 𝒢 → 𝒳 be the function denoting an LP-X method, where 𝒳 is the set of all possible
explanations. For example, the LP-X methods CrossE and SemanticCrossE, which we specifically target,
compute explanations as paths (maximum length 2) connecting the entities in the prediction. There are
6 possible types of path for a prediction ⟨s, p, o⟩ ∈ 𝒢:
1. { ⟨s, p′, o⟩ };
2. { ⟨o, p′, s⟩ };
3. { ⟨s, p′, e⟩, ⟨e, q, o⟩ };
4. { ⟨e, p′, s⟩, ⟨e, q, o⟩ };
5. { ⟨s, p′, e⟩, ⟨o, q, e⟩ };
6. { ⟨e, p′, s⟩, ⟨o, q, e⟩ }.
where p′ is a predicate similar to p, q is any other predicate q ∈ ℛ, and e is any other entity e ∈ ℰ.
CrossE and SemanticCrossE return the empty set ∅ when they fail to explain the prediction ⟨s, p, o⟩.
For computing similar predicates p′, CrossE adopts the Euclidean distance, whereas SemanticCrossE
can adopt either the cosine distance or a semantic similarity measure.</p>
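The six path types above can be enumerated naively over a KG stored as a set of triples. The sketch below is purely illustrative: the function name candidate_paths and the similar argument are our own naming (not CrossE's implementation), and we assume the four direction combinations for the two-triple paths.

```python
def candidate_paths(kg, s, p, o, similar):
    """Enumerate CrossE-style explanation paths (length at most 2) for the
    prediction (s, p, o). `kg` is a set of (subject, predicate, object) triples;
    `similar` maps each predicate to the set of predicates deemed similar to it."""
    entities = {x for (subj, _, obj) in kg for x in (subj, obj)}
    predicates = {pred for (_, pred, _) in kg}
    paths = []
    for p2 in similar.get(p, set()):              # p2 plays the role of p'
        # Types 1-2: a single triple directly connecting s and o.
        if (s, p2, o) in kg:
            paths.append([(s, p2, o)])
        if (o, p2, s) in kg:
            paths.append([(o, p2, s)])
        # Types 3-6: two triples through an intermediate entity e,
        # trying every combination of edge directions.
        for e in entities - {s, o}:
            for first in ((s, p2, e), (e, p2, s)):
                if first not in kg:
                    continue
                for q in predicates:
                    for second in ((e, q, o), (o, q, e)):
                        if second in kg:
                            paths.append([first, second])
    return paths
```

On a toy KG where "knows" is similar to the predicted predicate, the function returns both the direct "knows" triple (type 1) and any two-hop path through an intermediate entity (types 3-6).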
    </sec>
    <sec id="sec-4">
      <title>4. The Proposed Approach</title>
      <p>In this section, we illustrate the evaluation protocols to be compared, namely: re-training (§ 4.1), recall
and support (§ 4.2), and how we verify the consistency, if any, between them in order to answer the RQ
(§ 4.3). We specifically consider and compare explanations only for the subset 𝒞 ⊂ 𝒢 of correct predictions
made via a KGE model M, since explanations for wrong predictions may be misleading.</p>
      <sec id="sec-4-1">
        <title>4.1. Re-training</title>
        <p>
          The re-training protocol (introduced for evaluating the LP-X method Criage [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) measures the
importance of the explanations via the impact of their removal from the KG on solving the
very same LP task via a KGE model M. Specifically, it compares the LP performance of the
model used for computing the predictions with that of a model trained on a modified KG from which
the triples in the explanations have been removed. If the removal significantly worsens performance, it
indicates that the explanations are important for the predictions and, as such, they could in principle
be considered valid explanations.
        </p>
        <p>In the following, we formalize the re-training process, considering a KG 𝒢(ℰ, ℛ) (we denote the
set of all the subsets of a set S as 2^S). First, let
remove : 2^𝒳 → 2^(ℰ × ℛ × ℰ) be the function that removes from the KG the triples in each explanation e of a set of
explanations E, formally:
∀E ∈ 2^𝒳, 𝒢′ := remove(E) = 𝒢 ∖ ⋃_{e ∈ E} e
Second, let M′ denote the perturbed KGE model, with the same architecture and hyperparameters as
the KGE model M, but trained on the modified KG 𝒢′ (instead of 𝒢). MRR_{M′} and H@1_{M′} denote the
LP performance metrics of the perturbed KGE model M′. Since the LP performance metrics MRR_{M} and
H@1_{M} of the original KGE model M are both 1.0 (since only the correct predictions are considered),
the re-training metrics ΔMRR and ΔH@1 can be computed as follows:
ΔMRR = 1 − MRR_{M′},  ΔH@1 = 1 − H@1_{M′}
Both fall within the interval [0, 1], where higher values indicate more effective explanations.</p>
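As a sketch (our own illustrative code, with the actual re-training of the KGE model elided), the removal step and the two re-training metrics amount to:

```python
def remove(kg, explanations):
    """G' := G minus the union of the triples of every explanation in E."""
    removed = set().union(*explanations) if explanations else set()
    return kg - removed

def retraining_metrics(mrr_perturbed, hits1_perturbed):
    """Delta-MRR and Delta-H@1: on its own correct predictions the original
    model scores MRR = H@1 = 1.0, so each delta is 1 minus the perturbed value."""
    return 1.0 - mrr_perturbed, 1.0 - hits1_perturbed
```

In use, one would re-train the model on remove(kg, E), evaluate it on the originally correct predictions, and plug its MRR and H@1 into retraining_metrics.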
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Recall and Support</title>
        <p>The recall (introduced for evaluating the LP-X method CrossE) is the ratio of predictions for which the
LP-X method generated an explanation, formally:
∀𝒯 ⊆ 𝒞, recall(𝒯) = |{ t | t ∈ 𝒯, explain(t) ≠ ∅ }| / |𝒯|</p>
        <p>The support of an explanation for a prediction, introduced for evaluating the LP-X method CrossE, is
the number of supports, i.e., triples in the KG that are similar to the prediction and have an explanation
similar to the one of the prediction. The support is computed whilst computing the explanation
and is returned along with the explanation, because CrossE and SemanticCrossE return solely the
explanations with at least one support. Moreover, different similarity functions are defined in CrossE
(Euclidean) and SemanticCrossE (cosine and semantic). In the following, we formalize the support,
considering a KG 𝒢(ℰ, ℛ). First, let neighbors : ℰ → 2^ℰ be the function selecting the k entities that
are most similar to the given entity. Second, let sim_triples : 𝒢 → 2^𝒢 be the function that selects
the triples similar to the given one, via the function neighbors:
∀⟨s, p, o⟩ ∈ 𝒢, sim_triples(⟨s, p, o⟩) = { ⟨u, p, v⟩ | u ∈ neighbors(s) ∧ v ∈ ℰ ∧ ⟨u, p, v⟩ ∈ 𝒢 }.</p>
        <p>Next, let is_support : 𝒢 × 𝒳 × 𝒢 × 𝒳 → { 0, 1 } be the function that determines whether an
explanation for a given prediction is supported by another explanation for another given prediction.
Formally, given a prediction ⟨s, p, o⟩ with its corresponding explanation x₁ and a similar triple ⟨u, p, v⟩
with its corresponding explanation x₂, we specify, for each possible type of the explanation x₁, when
x₂ is a support:</p>
        <p>1. x₁ = { ⟨s, p′, o⟩ }, x₂ = { ⟨u, p′, v⟩ };
2. x₁ = { ⟨o, p′, s⟩ }, x₂ = { ⟨v, p′, u⟩ };
3. x₁ = { ⟨s, p′, e⟩, ⟨e, q, o⟩ }, x₂ = { ⟨u, p′, e⟩, ⟨e, q, v⟩ };
4. x₁ = { ⟨e, p′, s⟩, ⟨e, q, o⟩ }, x₂ = { ⟨e, p′, u⟩, ⟨e, q, v⟩ };
5. x₁ = { ⟨s, p′, e⟩, ⟨o, q, e⟩ }, x₂ = { ⟨u, p′, e⟩, ⟨v, q, e⟩ };
6. x₁ = { ⟨e, p′, s⟩, ⟨o, q, e⟩ }, x₂ = { ⟨e, p′, u⟩, ⟨v, q, e⟩ }.
where p′ is a predicate similar to p, q is any other predicate q ∈ ℛ, and e is any other entity e ∈ ℰ.</p>
        <p>Then, let support : 𝒢 × 𝒳 → ℕ be the function that measures the number of supports for the
explanation of a given prediction, formally: ∀⟨s, p, o⟩ ∈ 𝒢, with x = explain(⟨s, p, o⟩),
support(⟨s, p, o⟩, x) = ∑_{t ∈ sim_triples(⟨s, p, o⟩)} is_support(⟨s, p, o⟩, x, t, explain(t))
Finally, let T = [t₁, …, tₙ] ⊂ 𝒞 be a sequence of predicted triples and X = [x₁, …, xₙ] ⊂ 𝒳 be a
sequence of explanations such that ∀i ∈ { 1, …, n } explanation xᵢ explains prediction tᵢ (explain(tᵢ) = xᵢ);
then let average_support : 2^𝒢 × 2^𝒳 → ℝ be the function measuring the average support of a
sequence of predictions with their corresponding explanations, formally:
average_support(T, X) = (1 / |T|) ∑ᵢ₌₁ⁿ support(tᵢ, xᵢ)
Higher values of average support indicate more effective explanations. Specifically, the sequences of
explanations and predictions are partitioned into six disjoint subsets, one for each explanation type.
The average support is then computed independently for each subset.</p>
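Putting the definitions together, recall and average support can be sketched as follows (illustrative naming only; the explain and support callables are assumed to be provided by the LP-X method at hand):

```python
def recall(predictions, explain):
    """Ratio of predictions for which the LP-X method returns a non-empty explanation."""
    explained = [t for t in predictions if explain(t)]
    return len(explained) / len(predictions)

def average_support(predictions, explanations, support):
    """Mean number of supports over paired predictions and their explanations."""
    total = sum(support(t, x) for t, x in zip(predictions, explanations))
    return total / len(predictions)
```

In the paper's setting, average_support would be applied separately to each of the six per-type partitions of the predictions.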
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Measuring the Consistency between the Evaluation Protocols</title>
        <p>To verify the consistency between the protocols, we employ the standard Pearson correlation coefficient,
as it has been used to assess the inter-rater consistency among raters/evaluators using continuous
scales.</p>
        <p>
          The Pearson correlation coefficient r measures the linear relationship between two sets of variables
X and Y and falls within the interval [−1, 1], where 1 indicates a perfect positive correlation (as X
increases, so does Y), −1 indicates a perfect negative correlation (as X increases, Y decreases), and 0
indicates that there is no linear relationship between the variables.
        </p>
        <p>As for the support protocol, in addition to the average_support for each explanation type, we
compute the total number of supports (#supports) for the set of evaluated explanations, since we
consider the quality of an explanation to be independent of the type of path, and dependent solely on
the number of supports. Hence, we compute the correlation between the following pairs of metrics:
ΔMRR - recall, ΔH@1 - recall, ΔMRR - #supports, and ΔH@1 - #supports.</p>
        <p>For each correlation value, we also perform a permutation test that outputs a p-value, intuitively denoting
the probability that the correlation is due to chance: p &lt; 0.05 denotes a statistically significant (not
due to chance) correlation.</p>
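For concreteness, a plain-Python sketch of the Pearson coefficient and of a permutation test over it (our own minimal implementation, not the one used in the experiments) is:

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(xs, ys, trials=10000, seed=0):
    """Two-sided p-value: the fraction of random re-pairings whose absolute
    correlation is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return hits / trials
```

With short samples the permutation distribution is coarse, so in practice a library routine (e.g., a SciPy permutation test) would typically be preferred.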
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Evaluation</title>
      <p>In this section, we illustrate the experimental setup (§ 5.1) and discuss the results (§ 5.2).</p>
      <sec id="sec-5-5">
        <p>Tab. 1 reports the statistics of the two KGs: the numbers of entities, predicates, train triples,
valid triples, and test triples of DB100K and YAGO4-20.</p>
        <sec id="sec-5-5-1">
          <title>5.1. Experimental Setup</title>
          <p>
            We performed the study on two publicly available KGs, DB100K and YAGO4-20, sampled from DBpedia and
YAGO4, respectively [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]; their statistics are reported in Tab. 1. YAGO4-20 and DB100K contain not only
triples, but also ontological axioms, which SemanticCrossE leverages for computing the explanations
via the semantic similarity measure. In addition, we performed the experiments with respect to three
different LP methods. Specifically, we adopted three seminal KGE models, each representing a prominent
family of such models, namely: TransE [21] (translational), ConvE [22] (neural) and ComplEx [23]
(tensor factorization). Moreover, since KGE models are machine learning solutions, the KGs are further
split into a training set, a validation set, and a test set of triples. For each KGE model and KG, we
computed via CrossE and SemanticCrossE the explanations of the triples in the test set that are
(correctly) top-ranked via the LP method. We employ solely CrossE and SemanticCrossE as it is
difficult to evaluate the other SOTA methods, such as Criage and Kelpie, via the recall and support
protocols. The number of explained triples for each KGE model and KG is reported in Tab. 2. All
the code, datasets, and trained models utilized in our study are openly accessible on GitHub
(https://github.com/LeoSantovito/lpx_evalprotocol_consistency). The
correlations are computed firstly for the complete set of results and then considering separately the
results for each KGE model and each KG.
          </p>
        </sec>
        <sec id="sec-5-5-2">
          <title>5.2. The Outcomes of the Evaluation</title>
          <p>Tab. 3 reports the outcomes of the evaluation of the computed explanations via the protocols re-training,
recall, and support. Based on such values, we computed the correlation coefficients, reported in Tab. 4.
The correlation coefficients suggest a moderate and significant positive correlation for all pairs of
metrics. As for the analysis conducted separately for each KGE model, the outcomes suggest a strong
and significant positive correlation when considering ConvE and ComplEx, but not when considering
TransE. Specifically, considering TransE, the correlation of the re-training metrics with the recall is
close to 0 and not significant, while the one of the re-training metrics with the number of supports is
strongly negative and significant. The low correlation when considering TransE may be due to the
lower performance, in terms of the number of correct predictions in Tab. 2, of such a model compared
to that of the other models. As for the analysis conducted separately for each KG, the outcomes suggest
a strong and significant positive correlation when considering DB100K and a correlation close to 0 and
not significant when considering YAGO4-20. The low correlation when considering YAGO4-20 may be
due to the lower number of explained predictions for YAGO4-20 (Tab. 2) compared to DB100K.</p>
        </sec>
      </sec>
      <sec id="sec-5-6">
        <p>Tab. 2 reports the number of explained triples for each LP-X method (CrossE and the two
SemanticCrossE variants), KGE model (TransE, ConvE, ComplEx), and KG (DB100K, YAGO4-20).
Tab. 3 reports the values of the evaluation metrics, and Tab. 4 reports the correlation coefficients
with their p-values.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We conducted an empirical study of the consistency between the prominent protocols for evaluating
explanations, namely re-training, recall, and support. Specifically, we evaluated CrossE and
SemanticCrossE, originally evaluated via recall and support, also via re-training. Hence, we computed the
correlation between the metrics resulting from the different protocols. The outcomes suggest that the
protocols are overall consistent. A current limitation stems from the number of explained predictions,
which varies across KGE models and KGs. For the future, we aim not only at conducting a study with a
fixed number of explained predictions, but also at extending the study with other protocols, such as
LP-DIXIT [24], other KGs, including those without schema-level knowledge, and other consistency
statistics.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by project FAIR - Future AI Research (PE00000013), spoke 6 - Symbiotic
AI (https://future-ai-research.it/) under the PNRR MUR program funded by the European Union
NextGenerationEU, and by PRIN project HypeKG - Hybrid Prediction and Explanation with Knowledge
Graphs (Prot. 2022Y34XNM, CUP H53D23003700006) under the PNRR MUR program funded by the
European Union - NextGenerationEU.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tool.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>Semantic Web Conference (ESWC 2024), volume 14664, Springer Nature Switzerland, Cham, 2024,
pp. 180–198. doi:10.1007/978-3-031-60626-7_10.
[10] W. Zhang, B. Paudel, W. Zhang, A. Bernstein, H. Chen, S. Culpepper, Interaction Embeddings for
Prediction and Explanation in Knowledge Graphs, in: WSDM ’19: Proceedings of the Twelfth ACM
International Conference on Web Search and Data Mining, Melbourne, Australia, February 11-15,
2019, ACM, New York, New York, USA, 2019, pp. 96–104. doi:10.1145/3289600.3291014.
[11] C. d’Amato, P. Masella, N. Fanizzi, An Approach Based on Semantic Similarity to Explaining Link
Predictions on Knowledge Graphs, in: J. He, R. Unland, E. J. Santos, X. Tao, H. Purohit, W.-J. van den
Heuvel, J. Yearwood, J. Cao (Eds.), IEEE/WIC/ACM International Conference on Web Intelligence,
ACM, New York, New York, USA, 2021, pp. 170–177. doi:10.1145/3486622.349395.
[12] H. Zhang, T. Zheng, J. Gao, C. Miao, L. Su, Y. Li, K. Ren, Data Poisoning Attack against Knowledge
Graph Embedding, in: S. Kraus (Ed.), IJCAI ’19: Proceedings of the 28th International Joint
Conference on Artificial Intelligence; Macao, China; 10-16 August 2019, IJCAI, Online, 2019, pp.
4853–4859. doi:10.24963/ijcai.2019/674.
[13] D. Zhao, G. Wan, Y. Zhan, Z. Wang, L. Ding, Z. Zheng, B. Du, KE-X: Towards subgraph explanations
of knowledge graph embedding based on knowledge information gain, Knowledge-Based Systems
278 (2023) 110772. doi:10.1016/j.knosys.2023.110772.
[14] T. Ma, X. song, W. Tao, M. Li, J. Zhang, X. Pan, J. Lin, B. Song, x. Zeng, KGExplainer:
Towards Exploring Connected Subgraph Explanations for Knowledge Graph Completion, 2024.
arXiv:2404.03893.
[15] E. Amador-Domínguez, E. Serrano, D. Manrique, GEnI: A framework for the generation of
explanations and insights of knowledge graph embedding predictions, Neurocomputing 521 (2023)
199–212. doi:10.1016/j.neucom.2022.12.010.
[16] P. Betz, C. Meilicke, H. Stuckenschmidt, Adversarial Explanations for Knowledge Graph
Embeddings, in: L. De Raedt (Ed.), IJCAI ’22: Proceedings of the 31th International Joint Conference
on Artificial Intelligence; Vienna, Austria; 23-29 July 2022, IJCAI, Online, 2022, pp. 2820–2826.
doi:10.24963/ijcai.2022/391.
[17] N. A. Krishnan, C. R. Rivero, A Model-Agnostic Method to Interpret Link Prediction Evaluation
of Knowledge Graph Embeddings, in: I. Frommholz (Ed.), CIKM ’23: Proceedings of the 32nd
ACM International Conference on Information and Knowledge Management, Birmingham, United
Kingdom, October 21-25, 2023, ACM, New York, New York, USA, 2023, pp. 1107–1116. doi:10.
1145/3583780.3614763.
[18] Y. Ismaeil, D. Stepanova, T.-K. Tran, H. Blockeel, FeaBI: A Feature Selection-Based Framework for
Interpreting KG Embeddings, in: T. R. Payne, V. Presutti, G. Qi, M. Poveda-Villalón, G. Stoilos,
L. Hollink, Z. Kaoudi, G. Cheng, J. Li (Eds.), The Semantic Web – ISWC 2023: 22nd International
Semantic Web Conference, Athens, Greece, November 6-10, 2023, Proceedings, Part I, Springer,
Cham, Switzerland, 2023, pp. 599–617. doi:10.1007/978-3-031-47240-4\_32.
[19] N. Halliwell, F. Gandon, F. Lecue, User Scored Evaluation of Non-Unique Explanations for
Relational Graph Convolutional Network Link Prediction on Knowledge Graphs, in: Proceedings
of the 11th Knowledge Capture Conference, ACM, Virtual Event USA, 2021, pp. 57–64. doi:10.
1145/3460210.3493557.
[20] P. S. Martin, T. Besold, P. Kumari, FRUNI and FTREE Synthetic Knowledge Graphs for Evaluating
Explainability, in: XAI in Action: Past, Present, and Future Applications@NeurIPS2023 (No Formal
Proceedings), OpenReview, Online, 2023.
[21] A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko, Translating Embeddings for
Modeling Multi-Relational Data, in: Proceedings of the 26th International Conference on Neural
Information Processing Systems - Volume 2, NIPS’13, Curran Associates Inc., Red Hook, NY, USA,
2013, pp. 2787–2795.
[22] T. Dettmers, P. Minervini, P. Stenetorp, S. Riedel, Convolutional 2d knowledge graph embeddings,
in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth
Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on
Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18, AAAI Press, Cambridge,
Massachusetts, 2018, pp. 1811–1818. URL: https://dl.acm.org/doi/10.5555/3504035.3504256.
[23] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link
prediction, in: International Conference on Machine Learning, JMLR, Online, 2016, pp. 2071–2080.
doi:10.5555/3045390.3045609.
[24] R. Barile, C. d’Amato, N. Fanizzi, LP-DIXIT: Evaluating Explanations for Link Predictions on
Knowledge Graphs using Large Language Models, in: Proceedings of the ACM on Web Conference
2025, WWW ’25, Association for Computing Machinery, New York, NY, USA, 2025, p. 4034–4042.
URL: https://doi.org/10.1145/3696410.3714667. doi:10.1145/3696410.3714667.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Hogan, E. Blomqvist, M. Cochez, C. d'Amato, G. de Melo, C. Gutierrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, A.-C. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, ACM Computing Surveys 54 (2022) 1–37. doi:10.1145/3447772.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Pellissier Tanon, G. Weikum, F. Suchanek, YAGO 4: A Reason-able Knowledge Base, in: A. Harth, et al. (Eds.), The Semantic Web, volume 12123, Berlin, Heidelberg, 2020, pp. 583–596. doi:10.1007/978-3-030-49461-2_34.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] X. L. Dong, Building a broad knowledge graph for products, in: 2019 IEEE 35th International Conference on Data Engineering (ICDE), IEEE Computer Society, Washington DC, USA, 2019, pp. 25–25. doi:10.1109/ICDE.2019.00010.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Rossi, D. Barbosa, D. Firmani, A. Matinata, P. Merialdo, Knowledge Graph Embedding for Link Prediction: A Comparative Analysis, ACM Transactions on Knowledge Discovery from Data 15 (2021) 1–49. doi:10.1145/3424672.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] V. Nováček, S. K. Mohamed, Predicting Polypharmacy Side-Effects using Knowledge Graph Embeddings, AMIA Summits on Translational Science Proceedings 2020 (2020) 449.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. Schramm, C. Wehner, U. Schmid, Comprehensible Artificial Intelligence on Knowledge Graphs: A survey, Journal of Web Semantics 79 (2023) 100806.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pezeshkpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Irvine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications</article-title>
          , in: Burstein, Jill, Doran, Christy, Solorio, Thamar (Eds.),
          <source>NAACL-HLT '19: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers); Minneapolis, Minnesota, USA;
          <fpage>02</fpage>
          -
          <lpage>07</lpage>
          June 2019, volume
          <volume>1</volume>
          , Association for Computational Linguistics, Kerrville, Texas, USA,
          <year>2019</year>
          , pp.
          <fpage>3336</fpage>
          -
          <lpage>3347</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N19</fpage>
          -1337.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Rossi, D. Firmani, P. Merialdo, T. Teofili, Explaining Link Prediction Systems based on Knowledge Graph Embeddings, in: Z. Ives (Ed.), SIGMOD/PODS '22: Proceedings of the 2022 International Conference on Management of Data, Philadelphia, Pennsylvania, USA, 12–17 June 2022, ACM, New York, New York, USA, 2022, pp. 2062–2075. doi:10.1145/3514221.3517887.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] R. Barile, C. d'Amato, N. Fanizzi, Explanation of Link Predictions on Knowledge Graphs via Levelwise Filtering and Graph Summarization, in: A. Meroño Peñuela, A. Dimou, R. Troncy, O. Hartig, M. Acosta, M. Alam, H. Paulheim, P. Lisena (Eds.), Proceedings of the 26th European</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>