=Paper=
{{Paper
|id=Vol-2998/paper2
|storemode=property
|title=Semantic Queries Explaining Opaque Machine Learning Classifiers
|pdfUrl=https://ceur-ws.org/Vol-2998/paper2.pdf
|volume=Vol-2998
|authors=Jason Liartis,Edmund Dervakos,Orfeas Menis Mastromichalakis,Alexandros Chortaras,Giorgos Stamou
|dblpUrl=https://dblp.org/rec/conf/icann/LiartisDMCS21
}}
==Semantic Queries Explaining Opaque Machine Learning Classifiers==
Jason Liartis, Edmund Dervakos, Orfeas Menis-Mastromichalakis, Alexandros Chortaras and Giorgos Stamou
Artificial Intelligence and Learning Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens
Abstract
The success of deep learning in solving a multitude of different problems has highlighted the importance of explainability. In many critical applications, such as in medicine, deep learning cannot be ethically or lawfully utilized due to its intrinsic opacity. On the other hand, description logics and knowledge representation technologies provide a transparent and interpretable framework for describing and classifying data. In this paper we present a methodology for utilizing description logics to explain black-box classifiers. Specifically, given a dataset, a knowledge base which describes it, and a black-box classifier, we attempt to mimic the black-box with conjunctive queries over the knowledge base. These queries are then presented as global explanations of the black-box classifier.
Keywords
XAI, Explainability, Knowledge Graphs
1. Introduction
Over the past decade, Artificial Intelligence and in particular Deep Neural Networks have
accomplished impressive achievements in various tasks such as Image Recognition and Natural
Language Processing in numerous domains like medicine, finance, arts, and many more. In
order to tackle challenging problems, the models become deeper and deeper and their structural
complexity is constantly rising, making them inherently opaque. While traditional machine
learning models, such as decision trees, are interpretable by design, modern deep neural networks
are hard to explain due to their complex architecture and operation. This opacity raises ethical
and legal concerns regarding the real-life use of such models and has led to the emergence of
eXplainable Artificial Intelligence (XAI), which aims to make the operation of opaque AI systems more
comprehensible to humans [1, 2, 3, 4]. XAI is already a major topic in many events of the AI
community and is constantly engaging more researchers due to its immediate and important effect
on the application of AI systems.
Knowledge Graphs (KG) and Description Logics (DL) offer an advanced, adaptable and interpretable framework of high expressiveness that can be employed to explain complex neural networks, exploiting years of research and improvement in the area. Knowledge graphs [5], as a scalable, commonly understandable, structured representation of a domain based on the way humans mentally perceive the world, have emerged as a promising complement or extension to machine learning approaches for achieving explainability [4].

© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073, http://ceur-ws.org).
In this paper, we introduce a novel explainability framework for representing model-agnostic knowledge graph-based explanations as conjunctive queries (resulting in explanations like "animals in images with household items are classified as domestic animals"), approaching the task as a Query Reverse Engineering (QRE) problem, deriving the explanation-queries from the respective outputs of the model. We discuss the creation of appropriate datasets for our methods, and we propose algorithms to compute such explanations. Additionally, we investigate evaluation methods and metrics, and finally, we evaluate the methodology and algorithms by experimenting with data from the CLEVR-Hans3 [6] dataset, extracting explanations for deep learning models.
2. Related Work
Approaches to explainability vary with regard to data domain (images, text, tabular), form of
explanations (rule-based, counterfactual, feature importance etc.), and scope (global, local) [3].
Closely related to this work are global rule-based explanation methods. These attempt to extract
rules based on the predictions of a black-box classifier on a dataset. Many of these methods are
based on statistics [7, 8] or extracting rules from decision trees [9, 10], while others are based on
logics [11, 12]. Also related to this work are global prototype explanation methods [13, 14], which present specific representative samples from a dataset as explanations of black-box classifiers.
A key feature of the algorithms proposed in Section 5 is the computation of the Least Common
Subsumer (LCS). This problem has been extensively studied for various description logic expressivities [15, 16, 17, 18]. Furthermore, this work makes use of the strong theoretical and practical
results in the area of semantic query answering [19, 20, 21, 22] for evaluating and applying the
proposed algorithms in a practical setting.
Numerous works have addressed the Query Reverse Engineering problem, both for knowledge
bases and traditional databases [23, 24, 25], and tools have been developed to offer an easy and
user-friendly way to perform QRE on knowledge bases [26]. The key difference between these
works and ours is that we are able to find an approximate solution in cases where there is no exact
answer, thanks to the heuristic approach of our algorithms (for more details see Section 5).
3. Background
Let V = ⟨CN, RN, IN⟩ be a vocabulary, with CN, RN, IN mutually disjoint finite sets of concept, role and individual names, respectively. Let also T and A be a terminology (TBox) and an assertional database (ABox), respectively, over V using a Description Logics (DL) dialect L, i.e. a set of axioms and assertions that use elements of V and constructors of L. The pair ⟨V, L⟩ is a DL-language, and K = ⟨T, A⟩ is a knowledge base (KB) over this language. The semantics of KBs are defined in the standard way using interpretations [27]. Given a domain Δ, an interpretation I = (Δ^I, ·^I) assigns a set C^I ⊆ Δ^I to each concept C, a set R^I ⊆ Δ^I × Δ^I to each role R, and an object a^I ∈ Δ^I to each individual a. I is a model of an ABox A iff a^I ∈ C^I for all C(a) ∈ A, and (a^I, b^I) ∈ R^I for all R(a, b) ∈ A. I is a model of a TBox T if it satisfies all axioms in T.
Given a vocabulary V, a conjunctive query (simply, a query) q is an expression { ⟨x_1, ..., x_k⟩ | ∃y_1 ... ∃y_m.(c_1 ∧ ... ∧ c_n) }, where k, m ≥ 0, n ≥ 1, x_i, y_j are variable names, each c_i is an atom C(u) or R(u, v), where C ∈ CN, R ∈ RN, u, v ∈ IN ∪ {x_i} ∪ {y_j}, and all x_i, y_j appear in at least one atom. The vector ⟨x_1, ..., x_k⟩ is the head of q, its elements are the answer variables, and {c_1, ..., c_n} is the body of q. If q has no answer variables it is boolean, and if its body is a singleton, it is atomic. In this paper we focus on non-boolean queries having one answer variable and in which all arguments of all c_i are variables; these are called instance queries. For simplicity we write instance queries as q = {c_1, ..., c_n}, always considering x as the answer variable. Given a KB K, a query q and an interpretation I of K, an individual a ∈ IN is an answer to q in I if there is a successful variable substitution that substitutes a^I for x. This is formally defined as a match, a mapping m : VN(q) → Δ^I such that m(u) ∈ C^I for all C(u) ∈ q, and (m(u), m(v)) ∈ R^I for all R(u, v) ∈ q. Furthermore, a is a certain answer iff it is an answer under all interpretations that are models of K. The set of certain answers to q is cert(q, K). In some DL dialects, determining the certain answers to a query can be reduced to determining the answers over a special model, called the canonical model [28, 29] C_{T,A} of K = ⟨T, A⟩. In the dialect we use for our experiments, Horn-SHIQ, this process is even simpler, since we can guarantee that the canonical model is finite.
Consider a vocabulary V = ⟨CN, RN, IN⟩, a knowledge base K = ⟨T, A⟩, where T is the TBox and A the ABox of K using a DL dialect L on V, and the set Q of all (conjunctive) queries over V. Let S ⊆ Q be a set of queries over K. With a slight abuse of notation, we write cert(S, K) to denote the set of all answers to the queries of S, i.e. cert(S, K) = ∪_{q∈S} cert(q, K). We can partially order Q using query subsumption: a query q_2 subsumes a query q_1 (we write q_1 ≤_s q_2) iff there is a substitution θ s.t. q_2θ ⊆ q_1. If q_1, q_2 are mutually subsumed, they are syntactically equivalent (q_1 ≡_s q_2).
Let q be an instance query and q′ ⊆ q. If q′ is a minimal subset of q s.t. q′ ≤_s q, then q′ is a condensation of q (cond(q)). If that minimal q′ is the same set as q, then q is condensed [30].
4. Explaining Opaque ML Classifiers
We are now ready to define and explain the operation of machine learning (ML) classifiers. In this section we only consider boolean classifiers in our definitions, which classify individuals into two classes; the generalization from boolean to multi-class classifiers is straightforward.
Definition 1. Given a set D ⊆ IN, a classifier F defined on D is a mapping from D to {0, 1}, i.e. F : D → {0, 1}. We call D the domain of the classifier and with F(D) we denote the set of all elements of D that are classified to 1, i.e. F(D) = {a ∈ D | F(a) = 1}.
Note that deep neural networks typically classify objects taking as input other sources of information (images, etc.) not included in their semantic description in K. On the other hand, knowledge graphs usually contain this information in the form of datatype properties of objects. Throughout the theoretical analysis we omit this information from K for simplicity; however, for the evaluation we compute F(D) from images of objects (see Section 6 for details).
Definition 2. Let D ⊆ IN, F be a classifier on D, and K a KB over a vocabulary V = ⟨CN, RN, IN⟩. A query q over V is a FO-explanation (first-order explanation), or simply an explanation, of F over K, iff cert(q, K) = F(D).
Explanations of ML classifiers do not always exist. Moreover, they are not always unique or helpful in practice. In the following, we introduce the notion of approximate explanations, in order to relax the demand to exactly match the output of the classifier. A query q is an approximate FO-explanation, or simply an approximate explanation, of F over K, iff cert(q, K) ∩ F(D) ≠ ∅.
Of special interest are approximate explanations that guarantee that all their answers are classified by F as positive. A query q is an under-FO-explanation, or simply an under-explanation, of F over K, iff it is an approximate explanation of F and cert(q, K) ⊆ F(D).
Of interest are also approximate explanations for which all individuals classified by F as positive are in their answer set. A query q is an over-FO-explanation, or simply an over-explanation, of F over K, iff it is an approximate explanation of F and F(D) ⊆ cert(q, K). In most cases multiple approximate explanations exist, so we need a way to evaluate them. In the following paragraph we propose a set of evaluation metrics for such explanations.
Explanation Evaluation Metrics
We can define similarity measures that take into account the syntactic form of the queries, their certain answers over any ABox, or their certain answers over a specific knowledge base. For the syntactic similarity of two queries, there is a lot of prior work in the area (see for example the work on query refinement and graph distance). For answer-based similarity over any ABox, it is not obvious how to define similarity measures, since it is difficult to consider all ABoxes or to capture all the TBox information by checking the syntax. The case of answer-based similarity for a specific knowledge base is more intuitive and can be addressed using either the well-known Jaccard similarity coefficient defined on sets, or other popular metrics from the area of machine learning like precision and recall. So, we define the following three similarity measures on Q for a query-explanation q of a classifier F over a specific knowledge base K and a set D ⊆ IN as follows:

Degree. A Jaccard-based similarity measure; the degree of an (exact) explanation is equal to 1.
deg(q, F) = |cert(q, K) ∩ F(D)| / |cert(q, K) ∪ F(D)|

Precision. A similarity measure inspired by the precision of ML classifiers; the precision of an under-explanation is equal to 1.
pre(q, F) = |cert(q, K) ∩ F(D)| / |cert(q, K)|

Recall. A similarity measure inspired by the recall of ML classifiers; the recall of an over-explanation is equal to 1.
rec(q, F) = |cert(q, K) ∩ F(D)| / |F(D)|
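Since all three measures are plain set computations over cert(q, K) and F(D), they can be sketched in a few lines of Python (the function and argument names below are ours, not the authors'):

```python
def explanation_metrics(cert, pos):
    """Degree, precision and recall of a query-explanation.

    cert: the set cert(q, K) of certain answers of the query q,
    pos:  the set F(D) of individuals the classifier maps to 1.
    Assumes both sets are non-empty.
    """
    inter = cert & pos
    degree = len(inter) / len(cert | pos)   # Jaccard similarity
    precision = len(inter) / len(cert)
    recall = len(inter) / len(pos)
    return degree, precision, recall
```

As the definitions state, an exact explanation (cert(q, K) = F(D)) gets degree 1, an under-explanation gets precision 1, and an over-explanation gets recall 1.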
5. Computing Explanations
In this section we describe in detail the proposed algorithms, and introduce the concept of an
Explanation Dataset, which plays a key role for effectively utilizing the proposed algorithms in a
practical setting.
5.1. Algorithms
In order to unify explanations we use the notion of the least common subsumer (LCS). Given two queries q_1, q_2, their least common subsumer LCS(q_1, q_2) is defined as the query q for which q_1, q_2 ≤_s q and, for every q′ with q_1, q_2 ≤_s q′, also q ≤_s q′. The least common subsumer is the most specific generalization of q_1, q_2.
In order to manipulate queries and compute explanations, we represent interpretations and queries as labeled graphs. For interpretations, each element a^I ∈ Δ^I is represented by a node whose label contains every concept C ∈ CN for which a^I ∈ C^I. Two nodes a^I, b^I are connected by an edge with label R if (a^I, b^I) ∈ R^I. This graph can be described by a triple (V, E, L), where V = Δ^I is the set of nodes a^I, E ⊆ Δ^I × RN × Δ^I is the set of edges (a^I, R, b^I), and L : V → 2^CN is the labeling function L(a^I) = {C, D, ...} of the nodes. The graph representation for queries is defined in the same way, but with nodes corresponding to variables, and labels and edges corresponding to the conjuncts containing those variables.
These representations are useful because many of the concepts we have presented so far can be rephrased in terms of homomorphisms between labeled graphs. Given two labeled graphs G_1 = (V_1, E_1, L_1) and G_2 = (V_2, E_2, L_2), a mapping h : V_1 → V_2 is called a homomorphism iff it respects the structure of G_1, or more formally: (i) ∀v ∈ V_1, L_1(v) ⊆ L_2(h(v)) and (ii) ∀u, v ∈ V_1, (u, R, v) ∈ E_1 ⟹ (h(u), R, h(v)) ∈ E_2. If such a mapping exists we say that G_1 is homomorphic to G_2 and we write G_1 → G_2.
A match m can now be rephrased as a homomorphism from the query graph of q to the interpretation graph of I, and a is an answer to q if there is such a homomorphism with h(x) = a^I, and a certain answer if there is a homomorphism to the canonical model with h(x) = a^{C_{T,A}}. Subsumption can also be described in terms of homomorphisms, since its definition implies that G_{q_1} → G_{q_2} ⟺ q_2 ≤_s q_1, where G_q is the graph of query q.
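For the small query graphs that arise here, the homomorphism check G_1 → G_2 can be illustrated by brute force over all candidate mappings (the (V, E, L) encoding and names are ours; this is exponential and only meant to make the definition concrete):

```python
from itertools import product

def homomorphic(G1, G2):
    """Check G1 -> G2 for labeled graphs (V, E, L): V a list of nodes,
    E a set of (u, role, v) edges, L a dict node -> set of labels."""
    (V1, E1, L1), (V2, E2, L2) = G1, G2
    for image in product(V2, repeat=len(V1)):      # every candidate mapping h
        h = dict(zip(V1, image))
        if (all(L1[v] <= L2[h[v]] for v in V1)     # (i) labels respected
                and all((h[u], r, h[v]) in E2      # (ii) edges respected
                        for (u, r, v) in E1)):
            return True
    return False
```

Query subsumption q_2 ≤_s q_1 then amounts to homomorphic(G_{q_1}, G_{q_2}).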
In order to calculate the LCS we use an extension of the Kronecker product of graphs to labeled graphs. Given two labeled graphs G_1 = (V_1, E_1, L_1) and G_2 = (V_2, E_2, L_2), their Kronecker product G = G_1 × G_2 is the labeled graph G = (V, E, L) with V = V_1 × V_2 (Cartesian product of sets), ((u_1, R, v_1) ∈ E_1 and (u_2, R, v_2) ∈ E_2) ⟺ ((u_1, u_2), R, (v_1, v_2)) ∈ E, and L((v_1, v_2)) = L_1(v_1) ∩ L_2(v_2).
As with the Kronecker product of unlabeled graphs, it holds that H → G_1, G_2 ⟺ H → G_1 × G_2. This implies that the graph of LCS(q_1, q_2) is G_{q_1} × G_{q_2}, with the node (x, x) becoming the new answer variable and the other nodes of G_{q_1} × G_{q_2} renamed arbitrarily. Calculating the LCS using the Kronecker product involves O(n²m²) operations, where n = |V_1|, m = |V_2|.
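A minimal sketch of the labeled Kronecker product in the same (V, E, L) representation (names ours):

```python
def kronecker(G1, G2):
    """Labeled Kronecker product: nodes are pairs, labels are intersected,
    and an edge exists iff both factors have an edge with the same role."""
    (V1, E1, L1), (V2, E2, L2) = G1, G2
    V = [(u, v) for u in V1 for v in V2]
    L = {(u, v): L1[u] & L2[v] for (u, v) in V}
    E = {((u1, u2), r1, (v1, v2))
         for (u1, r1, v1) in E1
         for (u2, r2, v2) in E2 if r1 == r2}
    return V, E, L
```

The graph of LCS(q_1, q_2) is then this product with the pair (x, x) taken as the answer variable; in practice one would keep only the connected component containing (x, x).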
Even though the Kronecker product of two queries computes the LCS, the query it produces is not minimal with respect to the number of variables. Since these queries are intended to be shown to humans as explanations, minimizing the number of variables is imperative. However, condensing a query is coNP-complete [30]. For this reason, we utilize Algorithm 1, which removes redundant conjuncts and variables, though without a guarantee of producing a fully condensed query.
For each pair of variables present in a query, Algorithm 1 checks if unifying the two variables is equivalent to deleting one of the two, in which case the variable and its conjuncts are deleted. Verifying the correctness of Algorithm 1 amounts to verifying the claim of line 6. By unifying variable v_i with v_j, all conjuncts of the form C(v_i) become C(v_j), and all conjuncts of the forms R(v_i, v_k), R(v_k, v_i), R(v_i, v_i) become respectively R(v_j, v_k), R(v_k, v_j), R(v_j, v_j). Line 6 checks that those conjuncts are already present in the query, and therefore that unifying v_i with v_j is equivalent to deleting v_i. Unifying two variables of q produces a query that is subsumed by q, while deleting a variable produces a query that subsumes q; therefore Algorithm 1 produces syntactically equivalent queries. For the complexity of Algorithm 1 refer to the appendix.
Algorithm 1: MINIMIZE
Input: Query q = (V, E, L) to be minimized.
Output: The minimized query q′.
1  n ← |V|
2  q′ ← q
3  do
4      q ← q′
5      foreach pair 0 < i, j ≤ n, i ≠ j do
6          Check if unifying v_i of q′ with v_j of q′ would be the same as deleting it:
7          if L(v_i) ⊆ L(v_j) and ((v_i, R, v_k) ∈ E ⟹ (v_j, R, v_k) ∈ E, k ≠ i) and
             ((v_k, R, v_i) ∈ E ⟹ (v_k, R, v_j) ∈ E, k ≠ i) and
             ((v_i, R, v_i) ∈ E ⟹ (v_j, R, v_j) ∈ E) then
8              Delete variable v_i from q′.
9          end
10     end
11 while q′ ≠ q
12 return q′
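The pseudocode above can be turned into a small Python routine over the (V, E, L) query-graph representation; the answer variable is passed explicitly so that it is never deleted. This is a sketch under our own encoding, not the authors' implementation:

```python
def minimize(q, answer='x'):
    """Heuristic condensation (after Algorithm 1): delete a variable v_i
    whenever unifying it with some v_j changes nothing, i.e. every
    conjunct of v_i is already present with v_j in its place."""
    V, E, L = list(q[0]), set(q[1]), dict(q[2])
    changed = True
    while changed:
        changed = False
        for vi in list(V):
            if vi == answer:
                continue                      # never delete the answer variable
            for vj in V:
                if vi == vj:
                    continue
                redundant = (
                    L[vi] <= L[vj]
                    and all((vj, r, w) in E for (u, r, w) in E
                            if u == vi and w != vi)      # outgoing edges
                    and all((u, r, vj) in E for (u, r, w) in E
                            if w == vi and u != vi)      # incoming edges
                    and all((vj, r, vj) in E for (u, r, w) in E
                            if u == vi and w == vi))     # self-loops
                if redundant:
                    V.remove(vi)
                    E = {(u, r, w) for (u, r, w) in E if vi not in (u, w)}
                    del L[vi]
                    changed = True
                    break
            if changed:
                break
    return V, E, L
```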
We can also use the graph representation of the canonical model to introduce the notion of the most specific query (MSQ) of an individual a. This is the query that contains as much information as possible about a: it is the least query in terms of subsumption that has a as a certain answer, and it can be constructed by converting the connected component of the graph of C_{T,A} that contains a into a query graph, with the node of a^{C_{T,A}} becoming the answer variable. Every query that has a as a certain answer must be homomorphic to that connected component, and therefore to the MSQ as well.
For generating explanations of black-box classifiers, we propose Algorithm 2, which generates candidate approximate explanations for sets of individuals, utilizing a heuristic for disjointness which is discussed later in this section. Algorithm 2 first populates a list with the MSQ of each individual. Then it removes the two least disjoint (according to the
heuristic) queries and replaces them with their LCS (computed as the Kronecker product) only if
it has fewer variables than a pre-set threshold, after it is minimized with Algorithm 1. The LCS
of the two least disjoint queries is also added to the set of candidate explanations to be returned.
This process is repeated until there are fewer than two queries left to check for disjointness.
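For intuition, here is the greedy loop of Algorithm 2 in the degenerate case where every query has only the answer variable, so a query is just a set of concept atoms C(x): the LCS reduces to set intersection, such a query is already condensed, and the disjointness estimate of Algorithm 3 reduces to the size of the symmetric difference (all names are ours):

```python
from itertools import combinations

def explain(individuals, msq, t):
    """Greedy merging loop of Algorithm 2 on single-variable queries.

    individuals: iterable of individual names,
    msq:         dict mapping each individual to its set of concept atoms,
    t:           maximum size (here, number of atoms) of a kept query.
    """
    explanations = set()
    queries = [frozenset(msq[a]) for a in individuals]
    while len(queries) >= 2:
        # least disjoint pair = smallest symmetric difference
        q1, q2 = min(combinations(queries, 2), key=lambda p: len(p[0] ^ p[1]))
        queries.remove(q1)
        queries.remove(q2)
        q = q1 & q2                  # LCS of single-variable queries
        if len(q) <= t:
            explanations.add(q)
            queries.append(q)
    return explanations
```

In the full algorithm, the intersection is replaced by minimize(LCS(q_1, q_2)) over query graphs and the threshold counts variables rather than atoms.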
For heuristically approximating disjointness between two queries we use Algorithm 3, which computes a rough estimate of how disjoint two queries are. It compares every variable v_1 of q_1 with every variable v_2 of q_2, counting how many concept and role conjuncts containing v_1 would certainly be removed if v_1 were unified with v_2. It keeps the best such count for v_1 and
Algorithm 2: EXPLAIN
Input: A set of individuals {a_1, a_2, ..., a_n} ⊆ IN and a threshold t of maximum query size.
Output: A set of queries as explanations of the individuals.
1  explanations ← ∅
2  queries ← {msq(a_i)} for i = 1, ..., n
3  while "queries" has two or more elements do
4      Find the least disjoint pair of queries:
5      q_1, q_2 ← arg min{disj(q, q′) | q, q′ ∈ queries}
6      Remove q_1, q_2 from "queries".
7      q ← minimize(LCS(q_1, q_2))
8      if the number of variables in q is ≤ t then
9          explanations ← explanations ∪ {q}
10         queries ← queries ∪ {q}
11     end
12 end
13 return explanations
adds it to the estimate of the disjointness. This process is then symmetrically repeated for the variables of q_2.
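Algorithm 3 translates almost line by line to the (V, E, L) query-graph representation used above (a sketch, with names ours):

```python
def disj(q1, q2):
    """Rough disjointness estimate between two labeled query graphs."""
    def one_way(Va, Ea, La, Vb, Eb, Lb):
        roles = {r for (_, r, _) in Ea | Eb}
        total = 0
        for v1 in Va:
            best = float('inf')
            for v2 in Vb:
                # labels of v1 that a unification with v2 would lose
                diff = len(La[v1] - Lb[v2])
                for r in roles:
                    out1 = sum(1 for (u, s, w) in Ea if u == v1 and s == r)
                    out2 = sum(1 for (u, s, w) in Eb if u == v2 and s == r)
                    in1 = sum(1 for (u, s, w) in Ea if w == v1 and s == r)
                    in2 = sum(1 for (u, s, w) in Eb if w == v2 and s == r)
                    diff += max(out1 - out2, 0) + max(in1 - in2, 0)
                best = min(best, diff)       # closest match for v1 in the other query
            total += best
        return total
    (V1, E1, L1), (V2, E2, L2) = q1, q2
    return one_way(V1, E1, L1, V2, E2, L2) + one_way(V2, E2, L2, V1, E1, L1)
```

Identical queries get estimate 0, and the estimate grows with the labels and edges that a merge would have to drop.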
5.2. Explanation Dataset
The connecting component between our work and machine learning classifiers which drives
this explanation framework is the explanation dataset. It is essentially a semantically enriched
dataset compatible with the classifier under investigation. It consists of data that can be fed to
the classifier (e.g. images for image classifiers) along with semantic descriptions of this data
expressed with description logics. The additional information of this enriched dataset allows us
to produce explanations for the classifiers exploiting the algorithms mentioned in Section 5.1.
There is no standard procedure for the semantic enrichment of data, so we need a way to create these explanation datasets. Thankfully, enriched datasets already exist within the machine learning community, such as the Visual Genome [31] or the CLEVR [32] datasets, which contain images along with metadata for each image, such as annotations or sets of question-answer pairs. It is usually easy to express these metadata in a description logic, in most cases mapping them to individuals, concepts, roles and axioms. Such an example is the creation of an explanation dataset from the CLEVR-Hans3 dataset [6], shown in Section 6.1.
6. Experiments and Discussion
6.1. Explanation Dataset Creation
CLEVR-Hans3 [6] is a confounded image classification dataset, designed to evaluate algorithms
that detect and fix biases of classifiers. It consists of CLEVR [32] images divided into three
Algorithm 3: DISJ
Input: A pair of queries q_1 = (V_1, E_1, L_1), q_2 = (V_2, E_2, L_2) for which to calculate a very rough estimate of how disjoint they are.
Output: The estimate of the disjointness.
1  disj ← 0
2  For every variable in q_1 find how much it differs in terms of labels and number of edges from its closest match in q_2:
3  foreach v_1 ∈ V_1 do
4      min_diff ← +∞
5      foreach v_2 ∈ V_2 do
6          diff ← |L_1(v_1) ∖ L_2(v_2)|
7          for R ∈ RN do
8              diff ← diff + max{|{(v_1, R, u_1) ∈ E_1, u_1 ∈ V_1}| − |{(v_2, R, u_2) ∈ E_2, u_2 ∈ V_2}|, 0}
9              diff ← diff + max{|{(u_1, R, v_1) ∈ E_1, u_1 ∈ V_1}| − |{(u_2, R, v_2) ∈ E_2, u_2 ∈ V_2}|, 0}
10         end
11         min_diff ← min(min_diff, diff)
12     end
13     disj ← disj + min_diff
14 end
15 Repeat the above but with q_1 and q_2 reversed.
16 return disj
classes, of which two are confounded. Membership in a class is based on combinations of objects' attributes and relations. Within the dataset, which consists of train, validation, and test splits, all train and validation images of the confounded classes are confounded with a specific attribute. The rules that the classes follow are the following, with the confounding factors in parentheses: (i) Large (Gray) Cube and Large Cylinder, (ii) Small Metal Cube and Small (Metal) Sphere, (iii) Large Blue Sphere and Small Yellow Sphere.
We created our explanation dataset using the test set of CLEVR-Hans3, consisting of 750 images for each class. Exploiting the description of the images provided in json files, we constructed a vocabulary ⟨CN, RN, IN⟩, with all the images and the objects therein as individuals (IN); the concepts defining the size, color, shape, and material of each object, as well as two indicative concepts Image and Object, as concept names (CN); and the role "contains(Image, Object)", indicating the existence of an object in a specific image, as the only role name (RN). We then created a knowledge base over this vocabulary, with the ABox containing the semantic description of all images and the respective objects, and the TBox containing certain rather trivial inclusion axioms. The sets CN, RN and the TBox of our knowledge base and the respective vocabulary are the following: CN = {Image, Object, Cube, Cylinder, Sphere, Metal, Rubber, Blue, Brown, Cyan, Gray, Green, Purple, Red, Yellow, Large, Small}, RN = {contains(Image, Object)}, T = {X ⊑ Object | X ∈ CN, X ∉ {Image, Object}}.
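As an illustration of this mapping, one entry of a CLEVR-style scene description could be turned into ABox assertions as follows (the object field names follow CLEVR's scene format; the helper and its output encoding are ours):

```python
def scene_to_abox(image_id, objects):
    """Turn one CLEVR-style scene into concept and role assertions.

    Concept assertions are (ConceptName, individual) pairs and role
    assertions are ('contains', image, object) triples.
    """
    assertions = [('Image', image_id)]
    for i, obj in enumerate(objects):
        oid = f'{image_id}_obj{i}'
        assertions.append(('Object', oid))
        assertions.append(('contains', image_id, oid))
        for attr in ('size', 'color', 'shape', 'material'):
            # e.g. 'gray' -> Gray(oid), 'cube' -> Cube(oid)
            assertions.append((obj[attr].capitalize(), oid))
    return assertions
```

The trivial TBox axioms (every attribute concept is subsumed by Object) are then added once, independently of the scenes.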
Figure 1: The three classes of CLEVR-Hans3 with their rules and the confounding factors in
parentheses.
6.2. Setting
For CLEVR-Hans3 we used the same classifier and training procedure as the one used by the
creators of the dataset in [33]. The classifier is a deep CNN, specifically ResNet34 [34]. The
performance of the classifier is shown in Table 1. After training the classifier we acquired its
predictions on the test set and generated explanations for each class with Algorithm 2. We also
loaded the explanation dataset into GraphDB (https://graphdb.ontotext.com/) to retrieve certain answers of queries and evaluate explanations.
Table 1
Performance of ResNet34 on CLEVR-Hans3.

             Test set metrics               Confusion matrix
True label   Precision  Recall  F1-score   Class 1  Class 2  Class 3
Class 1      0.94       0.16    0.27       118      511      121
Class 2      0.59       0.98    0.54       5        736      9
Class 3      0.85       1.00    0.92       2        0        748
6.3. Results
The explanations generated for the classifier on the test set of CLEVR-Hans3 are shown in Table
2, where we show the query, the value of each metric and the numbers of positive and negative
individuals. The term positive individuals refers to the certain answers of the query that are
classified to the respective class, while the term negative individuals refers to the rest of the
certain answers. In our representation of the queries in Table 2 we have omitted the answer variable x, along with all conjuncts of the form "x contains y" and conjuncts of the form Object(y), for brevity. The algorithm found an under-explanation (precision = 1) and an over-explanation (recall = 1) for each class, with the best explanation degree achieved for class 3, which lacks a confounding factor. Over-explanations and under-explanations are of particular interest since they can be translated into global rules which the classifier follows on the particular dataset. For instance, the under-explanation for class 1 is translated into the rule "If the image contains a Large Gray Cube, a Large Cylinder and a Large Metal Object then it is classified to class 1", while the over-explanation for the same class is translated into the rule "If the image does not contain a Large Cube then it is not classified to class 1". We can see that over-explanations tend to be more general, in order to include all the predictions of the classifier, while on the other side, under-explanations tend to be much more specific. That is why we find approximate explanations to be very useful for the general description of a classifier. Both over- and under-explanations
tend to have low degree, while in general explanations with high degree provide us with a
more accurate approximation of what the classifier has learned. So it is also useful to consider
approximate explanations of high degree, even though they cannot be translated to rules like over-
and under-explanations.
It is interesting to note that the over-explanation produced for class 1 contains a Large Cube
but not a Large Cylinder. This gives us the information that in the training process the classifier
learned to pay more attention to the presence of cubes rather than the presence of cylinders. The
elements of the under-explanation that differ from the true rule of class 1 can be a great starting
point for a closer inspection of the classifier. We expected the presence of a Gray Cube from the
confounding factor introduced in the training and validation sets, but in a real world scenario
similar insights can be reached by inspecting the queries. In our case, we further investigated the role that the Gray Cube and the Large Metal Object play in the under-explanation by removing either
of them from the query and examining its new performance. In Table 3 we can see that the gray
color was essential for the under-explanation while the Large Metal Object was not, and in fact
its removal improved the under-explanation and returned almost the entire class.
Another result that piqued our attention is the approximate explanation for class 3 which is the
actual rule that describes this class. This explanation returns two negative individuals which we
can also see in the confusion matrix of the classifier and we were interested to examine what sets
these two individuals apart. We found that both of these individuals are answers to the query "y1 is Large, Gray, Cube". This shows us once again the large effect the confounding factor of class 1
had on the classifier.
Our overall results show that the classifier tends to emphasize low level information such as
color and shape and ignores high level information such as texture and the combined presence
of multiple objects. This is the reason why the confounding factor of class 1 had an important effect on the way images are classified, while the confounding factor of class 2 seems to have had
Table 2
Optimal explanations with regard to the three metrics on CLEVR-Hans3.

Class 1
  Best Precision: y1 is Large, Cube, Gray. y2 is Large, Cylinder. y3 is Large, Metal.
    (Precision 1.00, Recall 0.66, Degree 0.66, Positives 83)
  Best Recall: y1 is Large, Cube.
    (Precision 0.09, Recall 1.00, Degree 0.09, Positives 125)
  Best Degree: y1 is Large, Cube, Gray. y2 is Large, Cylinder. y3 is Large, Metal.
    (Precision 1.00, Recall 0.66, Degree 0.66, Positives 83)

Class 2
  Best Precision: y1 is Small, Sphere. y2 is Large, Rubber. y3 is Small, Metal, Cube. y4 is Small, Brown. y5 is Small, Rubber, Cylinder.
    (Precision 1.00, Recall 0.09, Degree 0.09, Positives 116)
  Best Recall: y1 is Cube.
    (Precision 0.63, Recall 1.00, Degree 0.63, Positives 1247)
  Best Degree: y1 is Metal, Cube. y2 is Small, Metal.
    (Precision 0.78, Recall 0.80, Degree 0.65, Positives 1005)

Class 3
  Best Precision: y1 is Metal, Blue. y2 is Large, Blue, Sphere. y3 is Yellow, Small, Sphere. y4 is Small, Rubber. y5 is Metal, Sphere.
    (Precision 1.00, Recall 0.42, Degree 0.42, Positives 365)
  Best Recall: y1 is Large. y2 is Sphere.
    (Precision 0.42, Recall 1.00, Degree 0.42, Positives 878)
  Best Degree: y1 is Yellow, Small, Sphere. y2 is Large, Blue, Sphere.
    (Precision 0.99, Recall 0.85, Degree 0.85, Positives 748)
a much smaller one.
7. Conclusion
We introduced a theoretical framework for representing global explanations for ML classifiers
in the form of conjunctive queries. We also developed algorithms for computing explanations
for limited knowledge bases, and investigated some of their quality-related properties. Using
Table 3
Two modified versions of the class 1 under-explanation produced by removing conjuncts.

Query                                                          Positives  Negatives
y1 is Large, Cube. y2 is Large, Cylinder. y3 is Large, Metal.  108        547
y1 is Large, Cube, Gray. y2 is Large, Cylinder.                93         0
CLEVR-Hans3, we were able to generate multiple explanations for a deep learning classifier.
In several cases, we found the explanations to be useful for detecting potential biases, and
providing a more intuitive approach for evaluating classifiers besides performance metrics which
are typically used.
One of the conclusions we can draw from our experiments is the importance of developing methods for evaluating explanations, which we plan to explore in future work. The three metrics used in this paper are simple and intuitive; however, none takes into consideration the human-readability factor, which is crucial for explainability.
Finally, the quality and usefulness of explanations generated with the proposed methodology depend on the characteristics of the explanation dataset. Thus, in future work we also plan to investigate what constitutes a "good" explanation dataset, in addition to experimenting on more specialized domains in which explainability is critical, such as medical applications.
References
[1] M. Turek, Explainable artificial intelligence (XAI), Defense Advanced Research Projects Agency, 2018. URL: http://web.archive.org/web/20190728055815/https://www.darpa.mil/program/explainable-artificial-intelligence.
[2] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, B. Yu, Interpretable machine learning:
definitions, methods, and applications, CoRR abs/1901.04592 (2019).
[3] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of
methods for explaining black box models, ACM Comput. Surv. 51 (2019) 93:1โ93:42.
[4] A. B. Arrieta, N. D. Rodrรญguez, J. D. Ser, A. Bennetot, S. Tabik, A. Barbado, S. Garcรญa,
S. Gil-Lopez, D. Molina, R. Benjamins, R. Chatila, F. Herrera, Explainable artificial
intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible
AI, Inf. Fusion 58 (2020) 82โ115.
[5] A. Hogan, E. Blomqvist, M. Cochez, C. dโAmato, G. de Melo, C. Gutiรฉrrez, J. E. L. Gayo,
S. Kirrane, S. Neumaier, A. Polleres, R. Navigli, A. N. Ngomo, S. M. Rashid, A. Rula,
L. Schmelzeisen, J. F. Sequeda, S. Staab, A. Zimmermann, Knowledge graphs, CoRR
abs/2003.02320 (2020).
[6] W. Stammer, P. Schramowski, K. Kersting, Right for the right concept: Revising neuro-
symbolic concepts by interacting with their explanations, arXiv preprint arXiv:2011.12854
(2020).
[7] H. Yang, C. Rudin, M. I. Seltzer, Scalable bayesian rule lists, in: Proceedings of the 34th
International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11
August 2017, 2017, pp. 3921โ3930. URL: http://proceedings.mlr.press/v70/yang17h.html.
[8] Y. Ming, H. Qu, E. Bertini, Rulematrix: Visualizing and understanding classifiers with
rules, IEEE Trans. Vis. Comput. Graph. 25 (2019) 342โ352. URL: https://doi.org/10.1109/
TVCG.2018.2864812. doi:10.1109/TVCG.2018.2864812.
[9] M. W. Craven, J. W. Shavlik, Extracting tree-structured representations of trained net-
works, in: Advances in Neural Information Processing Systems 8, NIPS, Denver,
CO, USA, November 27-30, 1995, 1995, pp. 24โ30. URL: http://papers.nips.cc/paper/
1152-extracting-tree-structured-representations-of-trained-networks.
[10] Y. Zhou, G. Hooker, Interpreting models via single tree approximation, arXiv preprint
arXiv:1610.09036 (2016).
[11] M. K. Sarker, N. Xie, D. Doran, M. Raymer, P. Hitzler, Explaining trained neural networks
with semantic web technologies: First steps, in: NeSy, volume 2003 of CEUR Workshop
Proceedings, CEUR-WS.org, 2017.
[12] G. Ciravegna, F. Giannini, M. Gori, M. Maggini, S. Melacci, Human-driven FOL ex-
planations of deep learning, in: Proceedings of the Twenty-Ninth International Joint
Conference on Artificial Intelligence, IJCAI 2020, 2020, pp. 2234โ2240. URL: https:
//doi.org/10.24963/ijcai.2020/309. doi:10.24963/ijcai.2020/309.
[13] B. Kim, O. Koyejo, R. Khanna, Examples are not enough, learn to criticize! criticism
for interpretability, in: Advances in Neural Information Processing Systems 29: An-
nual Conference on Neural Information Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain, 2016, pp. 2280โ2288. URL: https://proceedings.neurips.cc/paper/2016/
hash/5680522b8e2bb01943234bce7bf84534-Abstract.html.
[14] K. S. Gurumoorthy, A. Dhurandhar, G. A. Cecchi, C. C. Aggarwal, Efficient data repre-
sentation by selecting prototypes with importance weights, in: 2019 IEEE International
Conference on Data Mining, ICDM 2019, Beijing, China, November 8-11, 2019, 2019, pp.
260โ269. URL: https://doi.org/10.1109/ICDM.2019.00036. doi:10.1109/ICDM.2019.
00036.
[15] W. W. Cohen, A. Borgida, H. Hirsh, Computing least common subsumers in description
logics, in: Proceedings of the 10th National Conference on Artificial Intelligence, San Jose,
CA, USA, July 12-16, 1992, 1992, pp. 754โ760. URL: http://www.aaai.org/Library/AAAI/
1992/aaai92-117.php.
[16] R. Kรผsters, R. Molitor, Structural subsumption and least common subsumers in a description
logic with existential and number restrictions, Stud Logica 81 (2005) 227โ259. URL:
https://doi.org/10.1007/s11225-005-3705-5. doi:10.1007/s11225-005-3705-5.
[17] F. Baader, B. Sertkaya, A. Turhan, Computing the least common subsumer w.r.t. a back-
ground terminology, J. Appl. Log. 5 (2007) 392โ420. URL: https://doi.org/10.1016/j.jal.
2006.03.002. doi:10.1016/j.jal.2006.03.002.
[18] F. M. Donini, S. Colucci, T. D. Noia, E. D. Sciascio, A tableaux-based method for computing
least common subsumers for expressive description logics, in: Proceedings of the 22nd
International Workshop on Description Logics (DL 2009), Oxford, UK, July 27-30, 2009,
2009. URL: http://ceur-ws.org/Vol-477/paper_22.pdf.
[19] D. Calvanese, G. D. Giacomo, D. Lembo, M. Lenzerini, R. Rosati, Tractable reasoning and
efficient query answering in description logics: The DL-Lite family, J. Autom. Reason. 39
(2007) 385โ429.
[20] B. C. Grau, B. Motik, G. Stoilos, I. Horrocks, Computing datalog rewritings beyond horn
ontologies, in: IJCAI, IJCAI/AAAI, 2013, pp. 832โ838.
[21] A. Chortaras, M. Giazitzoglou, G. Stamou, Inside the query space of DL knowledge bases,
in: Description Logics, volume 2373 of CEUR Workshop Proceedings, CEUR-WS.org,
2019.
[22] D. Trivela, G. Stoilos, A. Chortaras, G. Stamou, Resolution-based rewriting for horn-SHIQ
ontologies, Knowl. Inf. Syst. 62 (2020) 107โ143.
[23] M. Arenas, G. I. Diaz, E. V. Kostylev, Reverse engineering SPARQL queries, in: J. Bour-
deau, J. Hendler, R. Nkambou, I. Horrocks, B. Y. Zhao (Eds.), Proceedings of the 25th
International Conference on World Wide Web, WWW 2016, Montreal, Canada, April
11 - 15, 2016, ACM, 2016, pp. 239โ249. URL: https://doi.org/10.1145/2872427.2882989.
doi:10.1145/2872427.2882989.
[24] Q. T. Tran, C. Y. Chan, S. Parthasarathy, Query reverse engineering, VLDB J.
23 (2014) 721โ746. URL: https://doi.org/10.1007/s00778-013-0349-3. doi:10.1007/
s00778-013-0349-3.
[25] A. Petrova, E. V. Kostylev, B. C. Grau, I. Horrocks, Query-based entity comparison
in knowledge graphs revisited, in: C. Ghidini, O. Hartig, M. Maleshkova, V. Svรกtek,
I. F. Cruz, A. Hogan, J. Song, M. Lefranรงois, F. Gandon (Eds.), The Semantic Web -
ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand,
October 26-30, 2019, Proceedings, Part I, volume 11778 of Lecture Notes in Computer
Science, Springer, 2019, pp. 558โ575. URL: https://doi.org/10.1007/978-3-030-30793-6_32.
doi:10.1007/978-3-030-30793-6\_32.
[26] G. Diaz, M. Arenas, M. Benedikt, Sparqlbye: Querying rdf data by example, Proc.
VLDB Endow. 9 (2016) 1533โ1536. URL: https://doi.org/10.14778/3007263.3007302.
doi:10.14778/3007263.3007302.
[27] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, P. F. Patel-Schneider (Eds.), The
Description Logic Handbook: Theory, Implementation, and Applications, Cambridge
University Press, 2003.
[28] R. Kontchakov, M. Zakharyaschev, An Introduction to Description Logics and Query
Rewriting, Springer International Publishing, Cham, 2014, pp. 195โ244. URL: https://doi.
org/10.1007/978-3-319-10587-1_5. doi:10.1007/978-3-319-10587-1_5.
[29] F. Baader, I. Horrocks, C. Lutz, U. Sattler, An Introduction to Description Logic, Cambridge
University Press, 2017. doi:10.1017/9781139025355.
[30] G. Gottlob, C. G. Fermรผller, Removing redundancy from a clause, Artif. Intell. 61 (1993)
263โ289.
[31] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li,
D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced
dense image annotations, arXiv preprint arXiv:1602.07332 (2016).
[32] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, R. B. Gir-
shick, CLEVR: A diagnostic dataset for compositional language and elementary vi-
sual reasoning, CoRR abs/1612.06890 (2016). URL: http://arxiv.org/abs/1612.06890.
arXiv:1612.06890.
[33] W. Stammer, P. Schramowski, K. Kersting, Right for the right concept: Revising neuro-
symbolic concepts by interacting with their explanations, CoRR abs/2011.12854 (2020).
[34] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR
abs/1512.03385 (2015). URL: http://arxiv.org/abs/1512.03385. arXiv:1512.03385.
A. Complexity of Algorithms
A.1. Minimize
Given n as the number of variables present in a query, the loop at line 3 will be executed at
most n times, since at each iteration either a variable is deleted or the algorithm returns. The
loop at line 5 checks all pairs of variables (O(n²)), and the condition at line 7 requires O(n)
set comparisons and 2|RN| comparisons of rows and columns of adjacency matrices. Thus the
complexity of Algorithm 1 is O(n⁴).
A.2. Disj
If the numbers of variables present in the two queries are n and m, then the time complexity of
the algorithm is O(n · m) due to the two outer loops, while the inner loop and the other
computations require constant time, given some careful pre-processing of the queries so that the
edge counts are readily available.
A.3. Explain
Regarding the complexity of Algorithm 2, having computed the complexity of its components, let
N be the number of individuals, n be the maximum number of variables present in an individual's
MSQ, and t be the threshold for the number of variables, above which we discard the resulting
queries. Finding the connected component which corresponds to each individual's MSQ requires
O(n²) operations, thus initializing the list of queries can be done in O(Nn²). Then, at each
iteration, the algorithm removes two queries from the list and, depending on whether the
condition is satisfied, adds a query back to the list; thus the loop will be executed at most
N − 1 times. If we store the values of disj(q, q′) after computing them for the first time, then
we can find the pair of queries which minimizes it with O(N²) operations, while computing them
for the first time has a time complexity of O(n²) for each pair of queries, as shown in the
previous paragraph. Initially there are N(N − 1)/2 pairs, while on iteration i, by adding a new
query to the list, N − i new pairs are created. The maximum number of variables for created
queries is t ≥ n, thus all executions of finding the least disjoint pair of queries will require
O(N³ + N²t²) operations. Finally, computing the LCS can be done in O(t⁴), while the complexity
of minimization is also O(t⁴), resulting in a time complexity for Algorithm 2 of
O(N³ + N²t² + Nt⁴).
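The main loop analysed above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: queries are reduced to flat sets of concept atoms, disj is taken as the size of the symmetric difference, and the LCS as the intersection, standing in for the graph-based definitions used in the paper; all names and data are hypothetical.

```python
def disj(q1, q2):
    # Toy disjointness: number of atoms the two queries do not share.
    return len(q1 ^ q2)

def lcs(q1, q2):
    # Toy least common subsumer: keep only the shared atoms.
    return q1 & q2

def explain(msqs, t):
    """Greedy merging loop of Algorithm 2 (simplified sketch).

    Starts from one query per individual (its MSQ) and repeatedly replaces
    the least disjoint pair with their LCS, unless the merge exceeds the
    size threshold t; the loop runs at most N - 1 times."""
    queries = [frozenset(q) for q in msqs]
    results = []
    while len(queries) > 1:
        # O(N^2) search for the least disjoint pair (values could be cached).
        i, j = min(
            ((a, b) for a in range(len(queries)) for b in range(a + 1, len(queries))),
            key=lambda p: disj(queries[p[0]], queries[p[1]]),
        )
        q1, q2 = queries[j], queries[i]
        del queries[j], queries[i]  # delete the larger index first
        merged = lcs(q1, q2)
        if merged and len(merged) <= t:
            queries.append(merged)
        else:
            # Merge rejected: emit the two queries unchanged.
            results.extend([q1, q2])
    results.extend(queries)
    return results

msqs = [{"Large", "Cube", "Gray"}, {"Large", "Cube"}, {"Small", "Sphere"}]
print(explain(msqs, t=3))
```

On this toy input, the two Cube queries are merged into their shared atoms while the Sphere query survives unmerged, mirroring how the full algorithm generalizes similar MSQs and leaves dissimilar ones apart.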