Ontology-based explanation of classifiers

Federico Croce, Gianluca Cima, Maurizio Lenzerini, Tiziana Catarci
@diag.uniroma1.it
Sapienza — University of Rome

ABSTRACT
The rise of data mining and machine learning use in many applications has brought new challenges related to classification. Here, we deal with the following challenge: how to interpret and understand the reason behind a classifier's prediction. Indeed, understanding the behaviour of a classifier is widely recognized as a very important task for the wide and safe adoption of machine learning and data mining technologies, especially in high-risk domains, and in dealing with bias. We present preliminary work on a proposal of using the Ontology-Based Data Management paradigm for explaining the behavior of a classifier in terms of the concepts and the relations that are meaningful in the domain that is relevant for the classifier.

© 2020 Copyright held by the owner/author(s). Published in Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference, March 30-April 2, 2020 on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC BY 4.0.

1 INTRODUCTION
One of the problems in processing information ethically is the perpetuation and amplification of unfair biases existing in training data and in the outcome of classifiers. It is well known that many learning algorithms (data analytics, data mining, machine learning, ML) base their predictions on training data and improve them as such data grow. In a typical project, the creation and curation of training data sets is largely a human-based activity and involves several people: domain experts, data scientists, machine learning experts, etc. In other words, data-related human design decisions affect learning outcomes throughout the entire process pipeline, even if at a certain point these decisions seem to disappear in the black-box "magic" approach of ML algorithms. On the other hand, it is now gaining attention that humans typically suffer from conscious and unconscious biases, and that the historical data used in training sets very often incorporate such biases, thus perpetuating and amplifying existing inequalities and unfair choices.

While researchers from different areas (from philosophy to computer science, passing through social sciences and law) have begun a rich discourse on this problem, concrete solutions on how to address it by discovering and eliminating unintended unfair biases are still missing. A critical aspect in assessing and addressing bias is the lack of transparency, accountability and human-interpretability of the ML algorithms, which makes it overly difficult to fully understand the expected outcomes. A famous example is the COMPAS algorithm used by the Department of Corrections in Wisconsin, New York and Florida, which has led to harsher sentencing toward African Americans [1].

In this paper we address the problem of providing explanations for supervised classification. Supervised learning is the task of learning a function that maps an input to an output based on input-output pairs provided as examples. When applied to classification, the ultimate goal of supervised learning is to construct algorithms that are able to predict the target output (i.e., the class) of the proposed inputs. To achieve this, the learning algorithm is provided with some training examples that demonstrate the intended relation of input and output values. Then the learner is supposed to approximate the correct output, so as to be able to classify instances that have not been shown during training.

The rise of machine learning use in many applications has brought new challenges related to classification. Here, we deal with the following challenge: how to interpret and understand the reason behind a classifier's prediction. Indeed, understanding the behaviour of a classifier is recognized as a very important task for the wide and safe adoption of machine learning and data mining technologies, especially in high-risk domains, and, as we discussed above, in dealing with bias.

In this paper we present preliminary work on this subject, based on the use of semantic technologies. In particular, we assume that the classification task is performed in an organization that adopts an Ontology-Based Data Management (OBDM) approach [15, 16]. OBDM is a paradigm for accessing data using a conceptual representation of the domain of interest expressed as an ontology. The OBDM paradigm relies on a three-level architecture, consisting of the data layer, the ontology, and the mapping between the two.
• The ontology is a declarative and explicit representation of the domain of interest for the organization, formulated in a Description Logic (DL) [2, 7], so as to take advantage of various reasoning capabilities in accessing data.
• The data layer is constituted by the existing data sources that are relevant for the organization.
• The mapping is a set of declarative assertions specifying how the sources in the data layer relate to the ontology.

Consequently, an OBDM specification is a triple J = ⟨O, S, M⟩ which, together with an S-database D, forms a so-called OBDM system Σ = ⟨J, D⟩. Given such a system Σ, suppose that λ is the result of a classification task carried out by any actor, e.g., a human or a machine, and that the objects involved in the classification task are represented as tuples in the S-database D, which we assume relational.

In particular, in this work we consider a binary classifier, and therefore we regard λ as a partial function λ : dom(D)^n → {+1, −1}, where n ≥ 1 is an integer. We denote by λ^+ (resp., λ^−) the set of tuples that have been classified positively (resp., negatively), i.e., λ^+ = {t ∈ dom(D)^n | λ(t) = +1} (resp., λ^− = {t ∈ dom(D)^n | λ(t) = −1}).

We observe that another view of the partial function λ is that of a training set. In this case, λ^+ represents the tuples tagged positively during the classifier training, while λ^− represents the tuples tagged negatively.

Intuitively, our goal is to derive an expression over O that semantically describes the partial function λ in the best way w.r.t. Σ. In other words, the main task in our framework is searching for a "good" definition of λ using the concepts and the roles of the ontology. Without loss of generality, we consider such an expression to be a query q over O, and we formalize the notion of "semantically describing" λ by requiring that the certain answers to q w.r.t. Σ include all the tuples in λ^+ (or as many tuples in λ^+ as possible), and none of the tuples in λ^− (or as few tuples in λ^− as possible).

Following the terminology of some recent papers, the goal of our framework can be generally described as the reverse engineering task of finding a describing query from a set of examples in a database. The roots of this task can be found in the Query By Example (QBE) approach for classical relational databases [3, 4, 18, 19]. In a nutshell, such an approach allows a user to explore the database by providing a set of positive and negative examples to the system, implicitly referring to the query whose answers are all the positive examples and none of the negatives. This idea has also been studied by the Description Logics (DLs) community, with particular attention to the line of research of so-called concept learning. In particular, the work in [13] provides an interesting characterization of the complexity of learning an ontology concept, formulated in expressive DLs, from positive and negative examples. We also mention the concept learning tools in [5, 12, 17], which include several learning algorithms and support an extensive range of DLs, even expressive ones such as ALC and ALCQ. Finally, we consider the work in [14] to be related to ours. The authors study the problem of deriving (unions of) conjunctive queries, with ontologies formulated in Horn-ALCI, deriving algorithms and tight complexity bounds.

Our work is focused on the Ontology-Based Data Management (OBDM) paradigm [6, 11]. Having the layer for linking the data to the ontology is a non-trivial extension of the problem, with important consequences, as we will show in a following section of this paper. The goal of this paper is to present a general framework for explaining a classifier by means of an ontology, which can be adapted to several different contexts. For this reason, an important aspect of our framework is the possibility of defining a number of criteria on which one wants the output query to be optimized. This flexibility makes it possible to derive completely different solutions, depending on the specific criteria in use. Specifically, given an OBDM system and a set of positive and negative examples, the goal of the framework could be to find a query over the ontology whose answers include all the positive examples and none of the negatives. However, we consider it reasonable for some applications that one may want to relax this requirement, and allow the framework to find a query whose answers are as similar as possible to the positive examples, include only a small fraction of the negatives, and enjoy additional predefined criteria.

2 PRELIMINARIES
Given a schema S, an S-database D is a finite set of atoms s(c), where s is an n-ary predicate symbol of S, and c = (c_1, ..., c_n) is an n-tuple of constants.

As mentioned earlier, we distinguish between the specification of an OBDM system and the OBDM system itself (cf. Figure 1). An OBDM specification J determines the intensional level of the system, and is expressed as a triple ⟨O, S, M⟩, where O is an ontology, S is the schema of the data source, and M is the mapping between S and O. Specifically, M consists of a set of mapping assertions, each one relating a query over the source schema to a query over the ontology. An OBDM system Σ = ⟨J, D⟩ is obtained by adding to J an extensional level, which is given in terms of an S-database D, which represents the data at the source, and is structured according to the schema S.

[Figure 1: OBDM Specification and System]

The formal semantics of ⟨J, D⟩ is specified by the set Mod_D(J) of its models, which is the set of (logical) interpretations I for O such that I is a model of O, i.e., it satisfies all axioms in O, and ⟨D, I⟩ satisfies all the assertions in M. The satisfaction of a mapping assertion depends on its form, which is meant to represent semantic assumptions about the completeness of the source data with respect to the intended ontology models. Specifically, sound (resp., complete, exact) mappings capture sources containing a subset (resp., a superset, exactly the set) of the expected data.

In OBDM, the main service to be provided by the system is query answering. The user poses queries by referring only to the ontology, and is therefore masked from the implementation details and the idiosyncrasies of the data source. The fact that the semantics of ⟨J, D⟩ is defined in terms of a set of models makes the task of query answering involved. Indeed, query answering cannot be simply based on evaluating the query expression over a single interpretation, as in traditional databases. Rather, it amounts to computing the so-called certain answers, i.e., the tuples that satisfy the query in all interpretations in Mod_D(J), and has therefore the characteristics of a logical inference task. More formally, given an OBDM specification J = ⟨O, S, M⟩, a query q_O over O, and an S-database D, we define the certain answers of q_O w.r.t. J and D, denoted by cert^D_{q_O,J}, as the set of tuples t of S-constants such that t ∈ q_O^B, for every B ∈ Mod_D(J). Obviously, the computation of certain answers must take into account the semantics of the ontology, the knowledge expressed in the mapping, and the content of the data source.

Designing efficient query processing algorithms is one of the main challenges of OBDM. Indeed, an OBDM framework is characterized by three formalisms:
(1) the language used to express the ontology;
(2) the language used for queries;
(3) the language used to specify the mapping;
and the choices made for each of the three formalisms affect the semantic and computational properties of the system.

The axioms of the ontology allow one to enrich the information coming from the source with domain knowledge, and hence to infer additional answers to queries. The language used for the ontology deeply affects the computational characteristics of query answering. For this reason, instead of expressing the ontology in first-order logic (FOL), one adopts tailored languages, typically based on Description Logics (DLs), which ensure decidability and possibly efficiency of reasoning.

Also, the use of FOL (i.e., SQL) as a query language immediately leads to undecidability of query answering, even when the ontology consists only of an alphabet (i.e., it is a flat schema), and when the mapping is of the simplest possible form, i.e., it specifies a one-to-one correspondence between ontology elements and database tables. The language typically adopted is that of Unions of Conjunctive Queries (UCQs), i.e., FOL queries expressed as a union of select-project-join SQL queries.

With respect to mapping specification, the incompleteness of the source data is captured correctly by mappings that are sound. Moreover, mixing sound mapping assertions with complete or exact ones leads to undecidability of query answering, even when only CQs are used in queries and mapping assertions, and the ontology is simply a flat schema. As a consequence, all proposals for OBDM frameworks so far, including the one in this paper, assume that mappings are sound. In addition, the concern above on the use of FOL applies also to the ontology queries in the mapping. Note, instead, that the source queries in the mapping are directly evaluated over the source database, and hence are typically allowed to be arbitrary (efficiently) computable queries.

3 THE FRAMEWORK
As we said in the introduction, we consider the result of a binary classification task, or the characterization of a training set for a classifier, as a partial function λ : dom(D)^n → {+1, −1}, where n ≥ 1 is an integer. We remind the reader that we denote by λ^+ (resp., λ^−) the set of tuples that have been classified positively (resp., negatively), i.e., λ^+ = {t ∈ dom(D)^n | λ(t) = +1} (resp., λ^− = {t ∈ dom(D)^n | λ(t) = −1}).

Before formally defining when a query over O semantically describes λ, we introduce some preliminary notions.

Definition 3.1. Let W be a set of atoms. We say that an atom α is reachable from W if there exists an atom β ∈ W such that there is a constant c ∈ dom(D) that appears in both α and β. □

We now define which are the relevant atoms of an S-database D w.r.t. a tuple t ∈ dom(D)^n. To be as general as possible, we introduce a parametric notion of border of radius r, where the parameter r is a natural number whose intended meaning is to indicate how far one is interested in going for identifying an atom as relevant.

Definition 3.2. Let D be an S-database, and let t be a tuple in dom(D)^n. Consider the following definition:
• W_{t,0}(D) = {α ∈ D | α has a constant c appearing in t}
• W_{t,j+1}(D) = {α ∈ D | α is reachable from W_{t,j}}
Then, for a natural number r, the border of radius r of t in D, denoted by B_{t,r}(D), is:
B_{t,r}(D) = ⋃_{0 ≤ i ≤ r} W_{t,i}(D). □

We illustrate the notion of border of radius r with an example.

Example 3.3. Let the source database be D = {R(a,b), S(a,c), Z(c,d), W(d,e), W(e,h), R(f,g)}, and let t = ⟨a⟩. We have that:
• W_{t,0}(D) = {R(a,b), S(a,c)}
• W_{t,1}(D) = {Z(c,d)}
• W_{t,2}(D) = {W(d,e)}
Finally, the border of radius 2 of t in D is B_{t,2}(D) = {R(a,b), S(a,c), Z(c,d), W(d,e)}. □

With the above notion at hand, we now define when a query q_O over the ontology O matches (w.r.t. an OBDM specification J) a border B_{t,r}(D) for a radius r, a tuple t, and a source database D.

Definition 3.4. A query q_O J-matches a border B_{t,r}(D) of radius r of a tuple t in a source database D, if t ∈ cert^{B_{t,r}(D)}_{q_O,J}. □

The next proposition establishes how FOL queries behave when the radius r of a border B_{t,r}(D) increments.

Proposition 3.5. Let J = ⟨O, S, M⟩ be an OBDM specification, B_{t,r}(D) be a border of radius r of a tuple t in an S-database D, and q_O be a FOL query over O. If q_O J-matches B_{t,r}(D), then q_O J-matches B_{t,r+1}(D).

Proof. The proof is based on the following two observations: (i) cert^D_{q_O,J} ⊆ cert^{D'}_{q_O,J}, for any OBDM specification J = ⟨O, S, M⟩, FOL query q_O, and pair of S-databases D, D' such that D ⊆ D'; (ii) B_{t,r}(D) ⊆ B_{t,r+1}(D), for any r ≥ 0 and tuple t of a database D. □

Similarly to what is described in [3, 13, 14], one may be interested in finding a query q_O over O, expressed in a certain language L_O, that perfectly separates the set of tuples in λ^+ from the set of tuples in λ^−, that is, a query q_O ∈ L_O such that, for a given radius r, the following two conditions hold:
(1) for all t ∈ λ^+, q_O J-matches B_{t,r}(D);
(2) for all t ∈ λ^−, q_O does not J-match B_{t,r}(D).
However, the following example shows that, even in very simple cases, such a query is not guaranteed to exist.

Example 3.6. Consider the following database D, where λ^+ = {A10, B80, C12, D50} and λ^− = {E25}:

STUD   λ
A10    +1
B80    +1
C12    +1
D50    +1
E25    −1

LOC
Sap   Rome
TV    Rome
Pol   Milan

ENR
A10   Math      TV
B80   Math      Sap
C12   Science   Norm
D50   Science   TV
E25   Math      Pol

The corresponding borders of radius 1, for each tuple, are:
B_{A10,1}(D) = {STUD(A10), ENR(A10, Math, TV), LOC(TV, Rome)}
B_{B80,1}(D) = {STUD(B80), ENR(B80, Math, Sap), LOC(Sap, Rome)}
B_{C12,1}(D) = {STUD(C12), ENR(C12, Science, Norm)}
B_{D50,1}(D) = {STUD(D50), ENR(D50, Science, TV), LOC(TV, Rome)}
B_{E25,1}(D) = {STUD(E25), ENR(E25, Math, Pol), LOC(Pol, Milan)}

Moreover, let O = {studies ⊑ likes}, and let M be:
ENR(x, y, z) ⇝ studies(x, y)
ENR(x, y, z) ⇝ taughtIn(y, z)
LOC(x, y) ⇝ locatedIn(x, y)

Let L_O be the class of conjunctive queries (CQs). It is possible to show that there is no CQ over the ontology that perfectly separates the set of tuples in λ^+ from the set of tuples in λ^−. Nonetheless, observe that there are several CQs that reasonably describe λ. For example:
q_1(x) ← studies(x, y) ∧ taughtIn(y, z) ∧ locatedIn(z, 'Rome')
q_2(x) ← studies(x, 'Math')
q_3(x) ← likes(x, 'Science')

It is easy to verify that:
• q_1 Σ-matches B_{t,1}(D), for all t ∈ {A10, B80, D50}
• q_2 Σ-matches B_{t,1}(D), for all t ∈ {A10, B80, E25}
• q_3 Σ-matches B_{t,1}(D), for all t ∈ {C12, D50}

Looking at the above queries, one could ask which query is the best. The answer to this question, however, is not trivial, since q_2 Σ-matches 2 out of 4 of the borders B_{t,1}(D) for t in λ^+, and all borders B_{t,1}(D) for t in λ^−, whilst q_1 Σ-matches 3 out of 4 of the borders for t in λ^+, and no border for t in λ^−. Besides, q_3 Σ-matches 2 out of 4 of the borders for t in λ^+, and no border for t in λ^−. Finally, q_2 and q_3 have fewer atoms than q_1. □

The above example suggests that searching for a query aiming at semantically describing λ with the only constraint of satisfying conditions (1) and (2) may turn out to be unsatisfactory. For this reason, we propose a different approach that enriches the framework, so as to be potentially appealing in many different contexts.

In general, one is interested in a query q_O over O, expressed in a certain language L_O, that accomplishes in the best way a set Δ of criteria. We formalize this idea by introducing a set of functions F, one for each criterion δ ∈ Δ, and a mathematical expression Z having a variable z_δ for each criterion δ ∈ Δ. Specifically, for a certain criterion δ ∈ Δ, the value of the function f^{J,r}_{δ,λ}(q_O) represents how much the query q_O meets criterion δ for λ w.r.t. the OBDM system Σ = ⟨J, D⟩ and the considered radius r. Without loss of generality, we can obviously consider all such functions to have the same range of values as their codomain. Then, after instantiating each variable z_δ in Z with the corresponding value f^{J,r}_{δ,λ}(q_O), the total value of the obtained expression, denoted by Z_F(q_O), represents the Z-score of the query q_O under F.

Among the various possible queries in a certain query language L_O, it is reasonable to look for the ones that give us the highest possible score. This naturally leads to the following main definition of our framework:

Definition 3.7. A query q_O L_O-best describes λ w.r.t. an OBDM system Σ = ⟨J, D⟩, a radius r, a set of criteria Δ, a set of functions F, and an expression Z, if q_O ∈ L_O and there exists no query q'_O ∈ L_O such that Z_F(q'_O) > Z_F(q_O). □

As for the set of criteria to be considered, here we just list some interesting ones:
δ_1 = "Are there many tuples t ∈ λ^+ such that q_O J-matches B_{t,r}(D)?"
δ_2 = "Are there few tuples t ∈ λ^+ such that q_O does not J-match B_{t,r}(D)?"
δ_3 = "Are there many tuples t ∈ λ^− such that q_O does not J-match B_{t,r}(D)?"
δ_4 = "Are there few tuples t ∈ λ^− such that q_O J-matches B_{t,r}(D)?"

Furthermore, depending on the query language L_O considered, there may be many other meaningful criteria. For instance, when L_O = CQ, one may be interested in δ_5 = "Are there few atoms used by the query q_O?", and when L_O = UCQ one may be further interested in δ_6 = "Are there few disjuncts used by the query q_O?".

We conclude this section by applying the newly introduced framework to Example 3.6.

Example 3.8. We refer to J, r, λ, and the queries q_1, q_2, q_3 as in Example 3.6. Suppose one is interested in the set of criteria Δ = {δ_1, δ_4, δ_5}, with the following associated set of functions F:
• f_{δ_1}(q_O) = |{t ∈ λ^+ | q_O Σ-matches B_{t,r}(D)}| / |λ^+|
• f_{δ_4}(q_O) = 1 − |{t ∈ λ^− | q_O Σ-matches B_{t,r}(D)}| / |λ^−|
• f_{δ_5}(q_O) = 1 / |atoms appearing in q_O|

Now, consider the expression Z = (α·z_{δ_1} + β·z_{δ_4} + γ·z_{δ_5}) / (α + β + γ), i.e., the average of the evaluations of each function of F, weighted over the three parameters α, β, and γ. One can verify that the following queries best describe λ w.r.t. J, r, Δ, F, and Z, for each instantiation of Z:
(1) (α = β = γ = 1) → q_3
(2) (α = 3, β = 1, γ = 1) → q_1
In fact, let Z_1 be the instantiation of the parameters of the expression Z corresponding to (1); then Z_1(q_1) = 0.693, Z_1(q_2) = 0.333, Z_1(q_3) = 0.833. Similarly, let Z_2 be the instantiation of the parameters of the expression Z corresponding to (2); then Z_2(q_1) = 0.716, Z_2(q_2) = 0.5, Z_2(q_3) = 0.7. □

4 CONCLUSIONS
We have presented a framework for using the Ontology-Based Data Management paradigm in order to provide an explanation of the behavior of a classifier. Our short-term goal in this research is to provide techniques for deriving useful explanations in terms of queries over the ontology. Interestingly, the work in [8, 9] provides a ground basis for the reverse engineering process described in this paper, from the data sources to the ontology. Moreover, the work in [10] offers an interesting set of techniques for explaining query answers in the context of an OBDM system. Our future work will also include an evaluation of both the framework and the techniques presented in this paper in real-world settings.

5 ACKNOWLEDGEMENTS
This work has been partially supported by Sapienza under the PRIN 2017 project "HOPE" (prot. 2017MMJJRE), and by the European Research Council under the European Union's Horizon 2020 Programme through the ERC Advanced Grant WhiteMech (No. 834228).

REFERENCES
[1] Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias. Retrieved from https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, May 23, 2016.
[2] Baader, F., Calvanese, D., McGuinness, D., Nardi, D., and Patel-Schneider, P. F., Eds. The Description Logic Handbook: Theory, Implementation and Applications, 2nd ed. Cambridge University Press, 2007.
[3] Barceló, P., and Romero, M. The complexity of reverse engineering problems for conjunctive queries. Proceedings of the 16th International Semantic Web Conference, 2017 (2017), 17 pages.
[4] Bonifati, A., Ciucanu, R., and Staworko, S. Learning join queries from user examples. ACM Trans. Database Syst. 40, 4 (Jan. 2016), 24:1–24:38.
[5] Bühmann, L., Lehmann, J., Westphal, P., and Bin, S. DL-Learner structured machine learning on semantic web data. In Companion Proceedings of the The Web Conference 2018 (Republic and Canton of Geneva, Switzerland, 2018), WWW '18, International World Wide Web Conferences Steering Committee, pp. 467–471.
[6] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., and Rosati, R. Linking data to ontologies: The description logic DL-Lite_A.
In Proceedings of the Second International Workshop on OWL: Experiences and Directions (OWLED 2006) (2006), vol. 216 of CEUR Electronic Workshop Proceedings, http://ceur-ws.org/.
[7] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., and Rosati, R. Ontology-based database access. In Proceedings of the Fifteenth Italian Conference on Database Systems (SEBD 2007) (2007), pp. 324–331.
[8] Cima, G. Preliminary results on ontology-based open data publishing. In Proceedings of the Thirtieth International Workshop on Description Logics (DL 2017) (2017), vol. 1879 of CEUR Electronic Workshop Proceedings, http://ceur-ws.org/.
[9] Cima, G., Lenzerini, M., and Poggi, A. Semantic characterization of data services through ontologies. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019) (2019), pp. 1647–1653.
[10] Croce, F., and Lenzerini, M. A framework for explaining query answers in DL-Lite. In Proceedings of the Twenty-First International Conference on Knowledge Engineering and Knowledge Management (EKAW 2018) (2018).
[11] Daraio, C., Lenzerini, M., Leporelli, C., Naggar, P., Bonaccorsi, A., and Bartolucci, A. The advantages of an ontology-based data management approach: openness, interoperability and data quality. Scientometrics 108 (03 2016).
[12] Fanizzi, N., Rizzo, G., d'Amato, C., and Esposito, F. DL-Foil: class expression learning revisited. In Proceedings of the Twenty-First International Conference on Knowledge Engineering and Knowledge Management (EKAW 2018) (2018).
[13] Funk, M., Jung, J. C., Lutz, C., Pulcini, H., and Wolter, F. Learning description logic concepts: When can positive and negative examples be separated? In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19 (7 2019), International Joint Conferences on Artificial Intelligence Organization, pp. 1682–1688.
[14] Gutiérrez-Basulto, V., Jung, J. C., and Sabellek, L.
Reverse engineering queries in ontology-enriched systems: The case of expressive Horn description logic ontologies. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 (7 2018), International Joint Conferences on Artificial Intelligence Organization, pp. 1847–1853.
[15] Lenzerini, M. Ontology-based data management. In Proceedings of the Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (2011), pp. 5–6.
[16] Lenzerini, M. Managing data through the lens of an ontology. AI Magazine 39, 2 (2018), 65–74.
[17] Straccia, U., and Mucci, M. pFOIL-DL: Learning (fuzzy) EL concept descriptions from crisp OWL data using a probabilistic ensemble estimation. In Proceedings of the 30th Annual ACM Symposium on Applied Computing (New York, NY, USA, 2015), SAC '15, ACM, pp. 345–352.
[18] Tran, Q. T., Chan, C.-Y., and Parthasarathy, S. Query reverse engineering. The VLDB Journal 23, 5 (Oct. 2014), 721–746.
[19] Zloof, M. M. Query-by-example: The invocation and definition of tables and forms. In Proceedings of the 1st International Conference on Very Large Data Bases (New York, NY, USA, 1975), VLDB '75, ACM, pp. 1–24.
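The two computational ingredients of the framework, the border of radius r (Definition 3.2) and the weighted Z-score of Example 3.8, can be prototyped directly. The following Python sketch is ours, not part of the paper: it works on the source atoms only and hard-codes the Σ-matching facts reported in Example 3.6 instead of computing certain answers over the ontology and mapping; the names `border` and `z_score` are our own.

```python
# Illustrative sketch (ours, not from the paper): borders of radius r
# (Definition 3.2) over source atoms, and a weighted Z-score in the style
# of Example 3.8. J-matching is stubbed with the facts stated in the paper;
# certain-answer computation over ontology and mapping is NOT modelled.

def border(db, t, r):
    """B_{t,r}(D): atoms reachable from the constants of tuple t in <= r
    reachability steps. An atom is a tuple (predicate, c1, ..., cn)."""
    frontier = {a for a in db if set(t) & set(a[1:])}   # layer W_{t,0}(D)
    result = set(frontier)
    for _ in range(r):
        reached = {c for a in frontier for c in a[1:]}
        frontier = {a for a in db if set(a[1:]) & reached} - result
        result |= frontier                              # add layer W_{t,j+1}
    return result

# Example 3.3: only atoms within 2 steps of 'a' end up in the border.
D1 = {("R", "a", "b"), ("S", "a", "c"), ("Z", "c", "d"),
      ("W", "d", "e"), ("W", "e", "h"), ("R", "f", "g")}
assert border(D1, ("a",), 2) == {("R", "a", "b"), ("S", "a", "c"),
                                 ("Z", "c", "d"), ("W", "d", "e")}

def z_score(matches, n_atoms, pos, neg, borders, alpha, beta, gamma):
    """(alpha*f_d1 + beta*f_d4 + gamma*f_d5) / (alpha + beta + gamma)."""
    f1 = sum(matches(borders[t]) for t in pos) / len(pos)      # f_delta1
    f4 = 1 - sum(matches(borders[t]) for t in neg) / len(neg)  # f_delta4
    f5 = 1 / n_atoms                                           # f_delta5
    return (alpha * f1 + beta * f4 + gamma * f5) / (alpha + beta + gamma)

# Example 3.6/3.8 with the matching facts hard-coded from the paper:
# q1 matches A10, B80, D50; q2 matches A10, B80, E25; q3 matches C12, D50.
pos, neg = ["A10", "B80", "C12", "D50"], ["E25"]
matched = {"q1": {"A10", "B80", "D50"},
           "q2": {"A10", "B80", "E25"},
           "q3": {"C12", "D50"}}
atoms = {"q1": 3, "q2": 1, "q3": 1}
borders = {t: {t} for t in pos + neg}   # stand-in: each border is the tuple

def score(q, a, b, g):
    return z_score(lambda bd: not bd.isdisjoint(matched[q]),
                   atoms[q], pos, neg, borders, a, b, g)

best_equal = max(matched, key=lambda q: score(q, 1, 1, 1))  # q3
best_skew = max(matched, key=lambda q: score(q, 3, 1, 1))   # q1
```

With uniform weights the sketch selects q_3, and with α = 3 it selects q_1, reproducing the two best-describing queries reported in Example 3.8.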