A Simplified Benchmark for Non-ambiguous Explanations of Knowledge Graph Link Prediction using Relational Graph Convolutional Networks

Nicholas Halliwell1, Fabien Gandon1, and Freddy Lecue1,2

1 Inria, Université Côte d'Azur, CNRS, I3S, France
2 CortAIx, Thales, Montreal, Canada
{nicholas.halliwell,fabien.gandon,freddy.lecue}@inria.fr

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Relational Graph Convolutional Networks (RGCNs) identify relationships within a Knowledge Graph to learn real-valued embeddings for each node and edge. Recently, researchers have proposed explanation methods to interpret the predictions of these black-box models. However, comparing explanation methods is difficult without a common dataset and standard metrics for evaluating the explanations. In this paper, we propose a method, including two datasets (Royalty-20k and Royalty-30k), to benchmark explanation methods on the task of explainable link prediction using Graph Neural Networks. We report the results of state-of-the-art explanation methods for RGCNs. The datasets are available at https://gitlab.com/halliwelln/royalty-datasets.

Keywords: link prediction · Explainable AI · knowledge graphs

1 Need for Explaining Links Predicted by RGCNs

Knowledge Graphs often represent facts as triples of the form (subject, predicate, object), where the subject and object are entities linked by some predicate. Link prediction is used to discover new facts from existing ones. Graph embedding algorithms can be used for link prediction, learning a function that maps each subject, object, and predicate to a low-dimensional space. A Relational Graph Convolutional Network (RGCN) [4] combines Graph Convolutional Networks [3] with a scoring function such as DistMult [5] as an output layer, returning the probability that the input triple is a fact.

Recently, methods have been developed to explain the predictions of Graph Neural Networks. GNNExplainer [6] explains the predictions of any Graph Neural Network by learning a mask over the adjacency matrix to identify the most informative subgraph. ExplaiNE [2] quantifies how the predicted probability of a link changes when weakening or removing a link with a neighboring node.

The authors of ExplaiNE acknowledge the difficulty of measuring the quality of the generated explanations, and the lack of available datasets with ground truth explanations [2]. Their approach is evaluated on four datasets, none of which include ground truth explanations, and computing ground truth explanations for these datasets is non-trivial. GNNExplainer remains untested on the task of link prediction on Knowledge Graphs. Moreover, GNNExplainer and ExplaiNE are not evaluated on a common dataset in their respective experiments. For explainable link prediction on Knowledge Graphs, no standard datasets or scoring metrics exist to evaluate explanation methods.

In this work, we propose a method, including two datasets (Royalty-20k and Royalty-30k), to evaluate explanation methods for link prediction using RGCNs. We propose the use of a scoring metric on these datasets, and perform a benchmark comparing the explanations of ExplaiNE and GNNExplainer.
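To make the scoring layer mentioned above concrete, the following is a minimal sketch of DistMult scoring on top of learned embeddings. It is an illustration only, not the model trained in this paper; the embedding dimension and the randomly initialized vectors are purely hypothetical.

```python
import numpy as np

def distmult_score(e_s: np.ndarray, r_p: np.ndarray, e_o: np.ndarray) -> float:
    """DistMult scores a triple (s, p, o) with the trilinear product
    sum_i e_s[i] * r_p[i] * e_o[i] over the embedding dimension."""
    return float(np.sum(e_s * r_p * e_o))

def triple_probability(e_s: np.ndarray, r_p: np.ndarray, e_o: np.ndarray) -> float:
    """Squash the raw DistMult score with a sigmoid to obtain a probability
    that the input triple is a fact, as in an RGCN output layer."""
    return float(1.0 / (1.0 + np.exp(-distmult_score(e_s, r_p, e_o))))

# Toy usage with random 100-dimensional embeddings (illustration only);
# in practice these vectors are produced by the RGCN encoder.
rng = np.random.default_rng(0)
e_subject, r_predicate, e_object = rng.normal(size=(3, 100))
print(triple_probability(e_subject, r_predicate, e_object))
```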
2 Building the Royalty Datasets with Explanations

We define ground truth explanations as a single justification for an entailment. We use an open-source semantic reasoner with rule-tracing capabilities [1] to generate ground truth explanations for each rule, allowing each set of explanations to be precisely controlled and selected. We rely on a set of rules equivalent to strict Horn clauses, i.e. disjunctions of literals with exactly one positive literal lc, all the other li being negated: ¬l1 ∨ ... ∨ ¬ln ∨ lc. The implication form of the clause can be seen as an inference rule stating that, if all li hold (the antecedent of the rule), then the consequent lc also holds, denoted lc ← l1 ∧ ... ∧ ln. In our case, each literal is a binary predicate capturing a triple pattern of the Knowledge Graph, with variables universally quantified over the whole clause.

For a given Knowledge Graph and a given set of rules, the semantic reasoner performs a forward-chaining materialization of all inferences that can be made. Each time the engine finds a mapping of triples T1, ..., Tn making the antecedent of a rule true, it materializes the consequent triple Tc and records the explanation in the form Tc ← (T1, ..., Tn), where Tc is a generated triple and the triples T1, ..., Tn are its explanation.

For example, under the hasGrandparent rule, if the two triples (Princess Marie Anne, hasParent, Louis XIV) and (Louis XIV, hasParent, Anne of Austria) are identified in a Knowledge Graph, the semantic reasoner generates a new triple (Princess Marie Anne, hasGrandparent, Anne of Austria). We compute explanation triples in a similar fashion for each logical rule defined in each dataset. In this example, the explanation for why (Princess Marie Anne, hasGrandparent, Anne of Austria) is a valid fact is that (Princess Marie Anne, hasParent, Louis XIV) and (Louis XIV, hasParent, Anne of Austria) hold.

This approach to generating ground truth explanations can be applied to many Knowledge Graphs and many sets of rules from different domains. It also matches the output format of the RGCN explanation methods, which provide explanations as a set of evidence triples.

Dataset      Predicate        # Triples  # Rule-Generated Triples  # Unique Entities  Explanation Cardinality  Predicate Property
Royalty-20k  hasSpouse            7,526      3,763                     6,114                 1                   Symmetric
             hasSuccessor         6,277      2,003                     6,928                 1                   Inverse
             hasPredecessor       6,277      2,159                     6,928                 1                   Inverse
             Full data           20,080      7,924                     8,861                 -                   -
Royalty-30k  hasSpouse            7,526      3,763                     6,114                 1                   Symmetric
             hasGrandparent       7,736      7,736                     4,330                 2                   Chain
             hasParent           15,472          0                         -                 -                   -
             Full data           30,734     11,499                    11,483                 -                   -

Table 1: Royalty datasets: breakdown of each predicate in the dataset. # Triples denotes the total number of triples with that predicate. Explanation Cardinality denotes the number of triples in the ground truth explanation set.

There may be several ways to define these logical rules. For example, hasSuccessor can be defined using its inverse relation hasPredecessor. In some cases, hasPredecessor could be correlated with the hasParent relation and therefore considered an explanation. Both explanations could be correct, thus the optimal explanation is ambiguous. In this work, we build two datasets with non-ambiguous explanations, where each logical rule is designed to give one and only one ground truth explanation for each triple in the training and test sets. This prevents an explanation method from having to arbitrarily select between alternative explanations. Table 1 gives dataset statistics for each predicate. Each rule is detailed below, after a short sketch of the materialization process.
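The following is a toy sketch of how a forward-chaining step produces both the inferred triple Tc and its recorded explanation (T1, T2), using the hasGrandparent rule and the example triples above. It is only an illustration of the idea; the actual datasets are generated with the reasoner from [1].

```python
from itertools import product

# Toy knowledge graph as a set of (subject, predicate, object) triples.
graph = {
    ("Princess Marie Anne", "hasParent", "Louis XIV"),
    ("Louis XIV", "hasParent", "Anne of Austria"),
}

def materialize_grandparents(triples):
    """Forward-chain the rule
       hasGrandparent(X, Y) <- hasParent(X, P) and hasParent(P, Y),
    returning each inferred triple Tc mapped to its explanation (T1, T2)."""
    inferred = {}
    parents = [t for t in triples if t[1] == "hasParent"]
    for t1, t2 in product(parents, repeat=2):
        # t1 = (X, hasParent, P), t2 = (P, hasParent, Y): chain on P.
        if t1[2] == t2[0]:
            tc = (t1[0], "hasGrandparent", t2[2])
            inferred[tc] = (t1, t2)  # record Tc <- (T1, T2)
    return inferred

for tc, explanation in materialize_grandparents(graph).items():
    print(tc, "<-", explanation)
```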
Spouse. Entity X is the spouse of Y if Y is the spouse of X; more formally, hasSpouse(X, Y) ← hasSpouse(Y, X). This is a symmetric relationship: a predicate p is said to be symmetric for some subject s and object o if and only if (s, p, o) ← (o, p, s). There are 7,526 triples with the hasSpouse predicate in each dataset, 3,763 of which are generated by rules. Note this rule is the same for both datasets.

Successor and Predecessor. A successor, in the context of royalty, is one who immediately follows the current holder of the throne. X is the successor of Y if Y is the predecessor of X; equivalently, hasSuccessor(X, Y) ← hasPredecessor(Y, X). Likewise, a predecessor is one who held the throne immediately before the current holder. X is the predecessor of Y if Y is the successor of X; equivalently, hasPredecessor(X, Y) ← hasSuccessor(Y, X). Thus hasSuccessor and hasPredecessor form an inverse relationship: a predicate p1 is the inverse of p2 if and only if (s, p1, o) ← (o, p2, s). Therefore, triples with the hasSuccessor predicate are used to explain triples with the hasPredecessor predicate, and vice versa. There are 6,277 triples with the hasSuccessor predicate, 2,003 of which are generated by rules. Similarly, there are 6,277 triples with the hasPredecessor predicate, 2,159 of which are generated by rules.

Grandparent. We define hasGrandparent using a chain property pattern. A predicate p is a chain of predicates p1, ..., pn if and only if (s, p, o) ← (s, p1, s2) ∧ ... ∧ (sn, pn, o). Y is the grandparent of X if Y is the parent of X's parent P; equivalently, hasGrandparent(X, Y) ← hasParent(X, P) ∧ hasParent(P, Y). There are 7,736 triples with the hasGrandparent predicate, all of which are generated by rules.

3 Evaluating Explanations of RGCNs

In this benchmark, ExplaiNE and GNNExplainer were asked to identify existing triples in the graph as candidate explanations. We use the Jaccard similarity to score the similarity between predicted and ground truth explanations. This metric is appropriate, as it compares the triples in common between the predicted and ground truth sets without considering their order. Furthermore, it can be applied to data from different domains without relying on further assumptions. Benchmark results can be found in Table 2.
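To make the scoring concrete, here is a minimal sketch of the Jaccard score between a predicted and a ground truth explanation set, assuming explanations are represented as sets of (subject, predicate, object) triples. The predicted set in the usage example is hypothetical and chosen only to illustrate the computation.

```python
def jaccard_score(predicted: set, ground_truth: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the predicted and the
    ground truth explanation sets, both sets of (subject, predicate, object)
    triples; triple order within an explanation does not matter."""
    if not predicted and not ground_truth:
        return 1.0  # convention for two empty sets
    return len(predicted & ground_truth) / len(predicted | ground_truth)

# Ground truth explanation of the hasGrandparent example from Section 2.
ground_truth = {
    ("Princess Marie Anne", "hasParent", "Louis XIV"),
    ("Louis XIV", "hasParent", "Anne of Austria"),
}
# A hypothetical prediction: one correct triple plus one spurious triple.
predicted = {
    ("Princess Marie Anne", "hasParent", "Louis XIV"),
    ("Louis XIV", "hasSpouse", "Maria Theresa of Spain"),
}
print(jaccard_score(predicted, ground_truth))  # 1 / 3 ≈ 0.333
```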
As an example of one of ExplaiNE's errors, for the triple (Albert III, Count of Everstein, hasSpouse, Richeza of Poland) and its ground truth explanation (Richeza of Poland, hasSpouse, Albert III, Count of Everstein), ExplaiNE predicted a first-degree neighbor, (Richeza of Poland, hasSpouse, Alfonso VIII of Leon and Castile), to be its explanation. Note the incorrectly predicted triple uses the hasSpouse predicate, but in the wrong way.

Across both datasets, we observe that GNNExplainer had the highest Jaccard and F1-score performance on the hasSpouse predicate. GNNExplainer's best performing predicates were hasSpouse, hasSuccessor and hasPredecessor. On the Royalty-30k dataset, we observe performance drops across all metrics on the hasGrandparent predicate. We found GNNExplainer to have a recall of 1 for each predicate, meaning the triples in the ground truth explanation are always identified; however, many false positives are included in the predicted explanation set. Indeed, a recall of 1 can be trivially achieved by including the entire input graph in the predicted explanation set. This was not the case for the predicted explanations of GNNExplainer; however, the predicted explanation sets were often too large (up to 20 triples on average) and contained many false positives.

Table 2 also reports the results of ExplaiNE on the Royalty datasets. Similar to GNNExplainer, we observe the best Jaccard and F1-score performance on the hasSpouse predicate. On the Royalty-30k dataset, we see lower relative performance on the predicates where larger explanations need to be predicted, such as the hasGrandparent rule.

Overall, we find ExplaiNE outperformed GNNExplainer in terms of Jaccard score for all rules across both datasets. While GNNExplainer outperforms ExplaiNE in terms of F1 score on hasSpouse and hasGrandparent, along with the Royalty-30k full dataset, this is likely due to GNNExplainer's high recall, indicating that the F1 metric is not appropriate for this task.

                                        Predicates
Dataset      Models        Metrics    Spouse  Successor  Predecessor  Grandparent  Full set
Royalty-20k  RGCN          Accuracy   0.802   0.74       0.75         -            0.746
             GNNExplainer  Precision  0.656   0.182      0.182        -            0.277
                           Recall     1.0     1.0        1.0          -            1.0
                           F1         0.792   0.307      0.308        -            0.433
                           Jaccard    0.328   0.178      0.178        -            0.184
             ExplaiNE      Precision  0.754   0.319      0.368        -            0.397
                           Recall     0.571   0.317      0.365        -            0.577
                           F1         0.65    0.318      0.366        -            0.47
                           Jaccard    0.388   0.314      0.363        -            0.274
Royalty-30k  RGCN          Accuracy   0.802   -          -            0.75         0.703
             GNNExplainer  Precision  0.656   -          -            0.067        0.261
                           Recall     1.0     -          -            1.0          1.0
                           F1         0.792   -          -            0.125        0.414
                           Jaccard    0.328   -          -            0.133        0.174
             ExplaiNE      Precision  0.754   -          -            0.101        0.363
                           Recall     0.571   -          -            0.135        0.412
                           F1         0.65    -          -            0.115        0.386
                           Jaccard    0.388   -          -            0.135        0.216

Table 2: Benchmark results on Royalty-20k and Royalty-30k: link prediction results for the RGCN, and explanation evaluation for GNNExplainer and ExplaiNE.

References

1. Corby, O., Gaignard, A., Faron Zucker, C., Montagnat, J.: KGRAM Versatile Inference and Query Engine for the Web of Linked Data. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 1–8. Macao, China (Dec 2012)
2. Kang, B., Lijffijt, J., De Bie, T.: ExplaiNE: An approach for explaining network embedding-based link predictions. CoRR abs/1904.12694 (2019)
3. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations, ICLR (2017)
4. Schlichtkrull, M.S., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: Gangemi, A., Navigli, R., Vidal, M., Hitzler, P., Troncy, R., Hollink, L., Tordai, A., Alam, M. (eds.) European Semantic Web Conference, ESWC (2018)
5. Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for learning and inference in knowledge bases. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR (2015)
6. Ying, Z., Bourgeois, D., You, J., Zitnik, M., Leskovec, J.: GNNExplainer: Generating explanations for graph neural networks. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems (2019)