=Paper=
{{Paper
|id=Vol-3741/paper34
|storemode=property
|title=Computing the Why-Provenance for Datalog Queries via SAT Solvers
|pdfUrl=https://ceur-ws.org/Vol-3741/paper34.pdf
|volume=Vol-3741
|authors=Marco Calautti,Ester Livshits,Andreas Pieris,Markus Schneider
|dblpUrl=https://dblp.org/rec/conf/sebd/CalauttiLPS24
}}
==Computing the Why-Provenance for Datalog Queries via SAT Solvers==
<pdf width="1500px">https://ceur-ws.org/Vol-3741/paper34.pdf</pdf>
<pre>
                                Computing the Why-Provenance for Datalog Queries
                                via SAT Solvers⋆
                                (Discussion Paper)

                                Marco Calautti1,** , Ester Livshits2 , Andreas Pieris2,3 and Markus Schneider2
                                1
                                  Department of Computer Science, University of Milan
                                2
                                  School of Informatics, University of Edinburgh
                                3
                                  Department of Computer Science, University of Cyprus


                                           Abstract
                                           Explaining an answer to a Datalog query is an essential task towards Explainable AI, especially nowadays
                                           where Datalog plays a critical role in the development of ontology-based applications. A well-established
                                           approach for explaining a query answer is the so-called why-provenance, which essentially collects all
                                           the subsets of the input database that can be used to obtain that answer via some derivation process,
                                           typically represented as a proof tree. It is well known, however, that computing the why-provenance for
                                           Datalog queries is computationally expensive, and thus, very few attempts can be found in the literature.
                                           The goal of this work is to demonstrate how off-the-shelf SAT solvers can be exploited towards an
                                           efficient computation of the why-provenance for Datalog queries. Interestingly, our SAT-based approach
                                           allows us to build the why-provenance in an incremental fashion, that is, one explanation at a time,
                                           which is much more useful in a practical context than the one-shot computation of the whole set of
                                           explanations as done by existing approaches.

                                           Keywords
                                           Datalog queries, explainability, why-provenance, computational complexity

                                1. Introduction
                                Datalog has emerged in the 1980s as a logic-based query language from Logic Programming and
                                has been extensively studied since then [2]. The name Datalog reflects the intention of devising
                                a counterpart of Prolog for data processing. It essentially extends the language of unions of
                                conjunctive queries, which corresponds to the select-project-join-union fragment of relational
                                algebra, with the important feature of recursion, much needed to express some natural queries.
                                   As for any other query language, explaining why a result to a Datalog query is obtained
                                is crucial towards transparent data-intensive applications. A well-established approach for
                                providing such explanations to query answers is the so-called why-provenance [3]. Its essence
                                is to collect all the subsets of the input database that as a whole can be used to derive a certain
                                answer. More precisely, for Datalog queries, the why-provenance of an answer tuple ¯𝑡 is obtained
                                by considering all the possible proof trees 𝑇 of the fact Ans(𝑡¯), with Ans being the answer
                                predicate of the Datalog query in question, and then collecting all the database facts that label
                                the leaves of 𝑇 . Recall that a proof tree of a fact 𝛼 w.r.t. a database 𝐷 and a set Σ of Datalog
                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
                                ⋆
                                 This is a short version of the paper [1]
                                **
                                   Corresponding author.
                                $ marco.calautti@unimi.it (M. Calautti); ester.livshits@ed.ac.uk (E. Livshits); apieris@inf.ed.ac.uk (A. Pieris);
                                m.schneider@ed.ac.uk (M. Schneider)
                                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
rules forms a tree-like representation of a way for deriving 𝛼 by starting from 𝐷 and executing
the rules occurring in Σ [2].
    Despite its wide acceptance, why-provenance for Datalog queries comes with two weaknesses:
it is computationally very expensive, and it may provide counterintuitive explanations. The
first weakness is manifested by the fact that, although why-provenance for Datalog queries
has been around for decades, only a couple of works have considered implementing it for
recursive queries [4, 5]. An attempt to change this state of affairs was made in [6] by focusing
on the more practical setting of computing the why-provenance of a given query answer
(a.k.a. on-demand why-provenance), instead of computing the why-provenance for all the query
answers. Concerning the second weakness, it has been observed that there are proof trees that
correspond to unnatural derivation processes, e.g., derivations where an atom is derived from
itself [7]. Now, an explanation witnessed via such an unnatural proof tree, might be classified
as a counterintuitive one as it does not correspond to an intuitive derivation process that can
be extracted from the proof tree; this is further discussed in Section 3.
    The main goal of this work is to tackle the two weaknesses of why-provenance for Datalog
queries discussed above. In particular, we place our work in the more practical setting of
on-demand why-provenance, and target an efficient implementation that provides conceptually
meaningful explanations for the given query answer.
2. Preliminaries
We consider the disjoint countably infinite sets C and V of constants and variables, respectively.
We may refer to constants and variables as terms. For brevity, given an integer 𝑛 > 0, we may
write [𝑛] for the set of integers {1, . . . , 𝑛}.
Relational Databases. A schema S is a finite set of relation names (or predicates) with
associated arity. We write 𝑅/𝑛 to say that 𝑅 has arity 𝑛 ≥ 0; we may write ar(𝑅) for 𝑛. A
(relational) atom 𝛼 over S is an expression of the form 𝑅(𝑡¯), where 𝑅/𝑛 ∈ S and ¯𝑡 is an 𝑛-tuple
of terms. By abuse of notation, we may treat tuples as the set of their elements. A fact is an
atom that mentions only constants. A database over S is a finite set of facts over S. The active
domain of a database 𝐷, denoted dom(𝐷), is the set of constants in 𝐷.
Syntax and Semantics of Datalog Programs. A (Datalog) rule 𝜎 over a schema S is an
expression of the form
                              ¯ 0 ) :– 𝑅1 (𝑥
                          𝑅0 (𝑥            ¯ 1 ), . . . , 𝑅𝑛 (𝑥
                                                              ¯𝑛)
for 𝑛 ≥ 1, where 𝑅𝑖 (𝑥¯ 𝑖 ) is a (constant-free) relational atom over S for 𝑖 ∈ {0, . . . , 𝑛}, and each
variable in 𝑥
            ¯ 0 occurs in 𝑥  ¯ 𝑘 for some 𝑘 ∈ [𝑛]. We refer to 𝑅0 (𝑥  ¯ 0 ) as the head of 𝜎, denoted
head(𝜎), and to the expression that appears on the right of the :– symbol as the body of 𝜎,
denoted body(𝜎), which we may treat as the set of its atoms.
   A Datalog program over a schema S is defined as a finite set Σ of Datalog rules over S. A
predicate 𝑅 occurring in Σ is called extensional if there is no rule in Σ having 𝑅 in its head, and
intensional if there exists at least one rule in Σ with 𝑅 in its head. The extensional (database)
schema of Σ, denoted edb(Σ), is the set of all extensional predicates in Σ, while the intensional
schema of Σ, denoted idb(Σ), is the set of all intensional predicates in Σ. Note that, by definition,
edb(Σ) ∩ idb(Σ) = ∅. The schema of Σ, denoted sch(Σ), is the set edb(Σ) ∪ idb(Σ), which is
in general a subset of S since some predicates of S may not appear in Σ.
   An elegant property of Datalog programs is that they have three equivalent semantics:
model-theoretic, fixpoint, and proof-theoretic [2]. We recall the proof-theoretic semantics
of Datalog programs since it is closer to the notion of why-provenance. To this end, we
need the key notion of proof tree of a fact. For a database 𝐷 and a Datalog program Σ, let
base(𝐷, Σ) = {𝑅(𝑡¯) | 𝑅 ∈ sch(Σ) and ¯𝑡 ∈ dom(𝐷)ar(𝑅) }, the facts that can be formed using
predicates of sch(Σ) and constants of dom(𝐷).
Definition 2.1. (Proof Tree) Consider a Datalog program Σ, a database 𝐷 over edb(Σ), and a
fact 𝛼 over sch(Σ). A proof tree of 𝛼 w.r.t. 𝐷 and Σ is a finite labeled rooted tree 𝑇 = (𝑉, 𝐸, 𝜆),
with 𝜆 : 𝑉 → base(𝐷, Σ), such that:
   1. If 𝑣 ∈ 𝑉 is the root, then 𝜆(𝑣) = 𝛼.
   2. If 𝑣 ∈ 𝑉 is a leaf, then 𝜆(𝑣) ∈ 𝐷.
   3. If 𝑣 ∈ 𝑉 is a node with 𝑛 ≥ 1 children 𝑢1 , . . . , 𝑢                   ⋃︀𝑛 , then there is a rule
      𝑅0 (𝑥¯ 0 ) :– 𝑅1 (𝑥                  ¯ 𝑛 ) ∈ Σ and a function ℎ : 𝑖∈[𝑛] 𝑥
                        ¯ 1 ), . . . , 𝑅𝑛 (𝑥                                          ¯ 𝑖 → C such that
      𝜆(𝑣) = 𝑅0 (ℎ(𝑥   ¯ 0 )), and 𝜆(𝑢𝑖 ) = 𝑅𝑖 (ℎ(𝑥  ¯ 𝑖 )) for each 𝑖 ∈ [𝑛].
   Essentially, a proof tree of a fact 𝛼 w.r.t. 𝐷 and Σ indicates that we can derive 𝛼 starting from
𝐷 end executing the rules of Σ. Now, given a Datalog program Σ and a database 𝐷 over sch(Σ),
the semantics of Σ on 𝐷 is Σ(𝐷) = {𝛼 | there is a proof tree of 𝛼 w.r.t. 𝐷 and Σ}. That is, the
set of facts that can be proven using 𝐷 and Σ.
Datalog Queries. It is now straightforward to recall the syntax and the semantics of Datalog
queries. A Datalog query is a pair 𝑄 = (Σ, 𝑅), where Σ is a Datalog program and 𝑅 a predicate
of idb(Σ). Now, for a database 𝐷 over edb(Σ), the answer to 𝑄 over 𝐷 is defined as the set
                                   {︁                               }︁
                       𝑄(𝐷) = ¯𝑡 ∈ dom(𝐷)ar(𝑅) | 𝑅(𝑡¯) ∈ Σ(𝐷) ,

that is, the set of tuples ¯𝑡 of dom(𝐷)ar(𝑅) such that the fact 𝑅(𝑡¯) can be derived using 𝐷 and Σ.
Why-Provenance for Datalog Queries. As discussed in the introduction, why-provenance is
a standard way of explaining query results. It essentially collects all the subsets of the database
(without unnecessary atoms) that allow to prove (or derive) a query result. We now formalize
this simple idea.
   Given a proof tree 𝑇 = (𝑉, 𝐸, 𝜆) (of some fact w.r.t. some database and Datalog program),
the support of 𝑇 is the set support(𝑇 ) = {𝜆(𝑣) | 𝑣 ∈ 𝑉 is a leaf of 𝑇 }, which is essentially the
set of facts that label the leaves of the proof tree 𝑇 . Note that support(𝑇 ) is a subset of the
underlying database since, by definition, the leaves of a proof tree are labeled with database
atoms. The formal definition of why-provenance for Datalog queries follows.
Definition 2.2. (Why-Provenance for Datalog) Consider a Datalog query 𝑄 = (Σ, 𝑅), a
database 𝐷 over edb(Σ), and a tuple ¯𝑡 ∈ dom(𝐷)ar(𝑅) . The why-provenance of ¯𝑡 w.r.t. 𝐷 and 𝑄
is defined as the family of sets of facts

           why(𝑡¯, 𝐷, 𝑄) = {support(𝑇 ) | 𝑇 is a proof tree of 𝑅(𝑡¯) w.r.t. 𝐷 and Σ}.
   Intuitively speaking, a set of facts 𝐷′ ⊆ 𝐷 that belongs to why(𝑡¯, 𝐷, 𝑄) should be understood
as a reason why the tuple ¯𝑡 is an answer to the query 𝑄 over the database 𝐷, i.e., 𝐷′ explains
why ¯𝑡 ∈ 𝑄(𝐷). In particular, all the facts of 𝐷′ are used in order to derive ¯𝑡 as an answer.


3. Unambiguous Proof Trees
The standard notion of why-provenance defined above relies on arbitrary proof trees. However,
as discussed in [7], there are proof trees that are counterintuitive. For example, such a proof tree
is one where a fact is derived from itself, that is, it contains two nodes labeled with the same
fact and one is a descendant of the other. Now, a member of why(𝑡¯, 𝐷, 𝑄), witnessed via such
an unnatural proof tree, might be classified as a counterintuitive explanation of ¯𝑡 as it does not
correspond to a natural derivation process that can be extracted from the proof tree. Therefore,
we need refined classes of proof trees that overcome the conceptual limitations of arbitrary proof
trees. Some refined classes of proof trees have been recently discussed in [7]: non-recursive proof
trees, minimal-depth proof trees, and hereditary minimal-depth proof trees. Roughly speaking, a
non-recursive proof tree is a proof tree that does not contain two nodes labeled with the same
fact and one is a descendant of the other, a minimal-depth proof tree is a proof tree that has the
minimum depth among all the proof trees of the same fact, and a hereditary minimal-depth proof
tree is minimizing the depth of each of its subtrees. Although non-recursive and (hereditary)
minimal-depth proof trees form well-justified notions that deserve our attention, there are still
proof trees from those classes that can be classified as counterintuitive. In fact, there are proof
trees that are non-recursive and (hereditary) minimal-depth, but they are ambiguous in the way
some facts are derived. Here is a simple example that illustrates this phenomenon.

Example 3.1. Consider the Datalog program Σ

                                  𝐴(𝑥) :– 𝑆(𝑥)
                                  𝐴(𝑥) :– 𝐴(𝑦), 𝐴(𝑧), 𝑇 (𝑦, 𝑧, 𝑥)

that encodes the path accessibility problem [8]. The predicate 𝑆 represents source nodes, 𝐴
represents nodes that are accessible from the source nodes, and 𝑇 represents accessibility
conditions, that is, 𝑇 (𝑦, 𝑧, 𝑥) means that if both 𝑦 and 𝑧 are accessible from the source nodes,
then so is 𝑥. We further consider the database

                      𝐷 = {𝑆(𝑎), 𝑆(𝑏), 𝑇 (𝑎, 𝑎, 𝑐), 𝑇 (𝑏, 𝑏, 𝑐), 𝑇 (𝑐, 𝑐, 𝑑)}.

The following is a proof tree of the fact 𝐴(𝑑) w.r.t. 𝐷 and Σ that is non-recursive and (hereditary)
minimal-depth, but it suffers from the ambiguity issue described above:
                                                        𝐴(𝑑)

                                   𝐴(𝑐)                         𝐴(𝑐)         𝑇 (𝑐, 𝑐, 𝑑)

                          𝐴(𝑎)    𝐴(𝑎)    𝑇 (𝑎, 𝑎, 𝑐)   𝐴(𝑏)   𝐴(𝑏)    𝑇 (𝑏, 𝑏, 𝑐)

                           𝑆(𝑎)   𝑆(𝑎)                  𝑆(𝑏)   𝑆(𝑏)
Indeed, there are two nodes labeled with the fact 𝐴(𝑐), but their subtrees differ, and thus, it is
ambiguous how 𝐴(𝑐) is derived. Hence, the database 𝐷, which belongs to the why-provenance
of (𝑑) w.r.t. 𝐷 and 𝑄 due to the above proof tree, might be classified as a counterintuitive
explanation since it does not correspond to an intuitive derivation process where each fact is
derived once due to a unique reason. Indeed, the intuitive explanations that one expects are the
following: 𝑑 is accessible from 𝑎 via 𝑐 (i.e., the subset {𝑆(𝑎), 𝑇 (𝑎, 𝑎, 𝑐), 𝑇 (𝑐, 𝑐, 𝑑)} of 𝐷), or 𝑑
is accessible from 𝑏 via 𝑐 (i.e., the subset {𝑆(𝑏), 𝑇 (𝑏, 𝑏, 𝑐), 𝑇 (𝑐, 𝑐, 𝑑)} of 𝐷).

   This leads to the class of unambiguous proof trees, where all occurrences of a fact must be
proved via the same derivation. Two rooted trees 𝑇 = (𝑉, 𝐸, 𝜆) and 𝑇 ′ = (𝑉 ′ , 𝐸 ′ , 𝜆′ ) are
isomorphic, denoted 𝑇 ≈ 𝑇 ′ , if there exists a bijection ℎ : 𝑉 → 𝑉 ′ such that, for each 𝑣 ∈ 𝑉 ,
𝜆(𝑣) = 𝜆′ (ℎ(𝑣)), and for each 𝑢, 𝑣 ∈ 𝑉 , (𝑢, 𝑣) ∈ 𝐸 iff (ℎ(𝑢), ℎ(𝑣)) ∈ 𝐸 ′ . Let 𝑇 [𝑣] be the
subtree of the proof tree 𝑇 rooted at node 𝑣. The formal definition of unambiguous proof trees
follows.
Definition 3.2. (Unambiguous Proof Tree) Consider a Datalog program Σ, a database 𝐷
over edb(Σ), and a fact 𝛼 over sch(Σ). An unambiguous proof tree of 𝛼 w.r.t. 𝐷 and Σ is a
proof tree 𝑇 = (𝑉, 𝐸, 𝜆) of 𝛼 w.r.t. 𝐷 and Σ such that, for all 𝑣, 𝑢 ∈ 𝑉 , 𝜆(𝑣) = 𝜆(𝑢) implies
𝑇 [𝑣] ≈ 𝑇 [𝑢].
   Why-provenance relative to unambiguous proof trees is defined in the obvious way: for
a Datalog query 𝑄 = (Σ, 𝑅), a database 𝐷 over edb(Σ), and a tuple ¯𝑡 ∈ dom(𝐷)ar(𝑅) , the
why-provenance of ¯𝑡 w.r.t. 𝐷 and 𝑄 relative to unambiguous proof trees is the family of sets of
facts
          {support(𝑇 ) | 𝑇 is an unambiguous proof tree of 𝑅(𝑡¯) w.r.t. 𝐷 and Σ},
denoted whyUN (𝑡¯, 𝐷, 𝑄). The main concern of this work is to efficiently compute the why-
provenance of a tuple relative to unambiguous proof trees. To this end, we are going to exploit
off-the-shelf SAT solvers and report encouraging results; this is the subject of the next two
sections. To the best of our knowledge, this is the first time that SAT solvers are used for
computing the why-provenance. Let us stress that focusing on unambiguous proof trees, apart
from their conceptual advantage discussed above, was crucial towards our encouraging results
as it is unclear how a SAT-based implementation can be made practical for arbitrary proof trees.
4. From Why-Provenance to SAT
In this section, we show that the why-provenance of a tuple relative to unambiguous proof
trees can be extracted from the satisfying truth assignments of a Boolean formula.
Compactly Representing Unambiguous Proof Trees. The construction of the Boolean
formula relies on a characterization of the existence of an unambiguous proof tree of a fact
𝛼 w.r.t. 𝐷 and Σ via the existence of a so-called compressed directed acyclic graph (DAG) of 𝛼
w.r.t. 𝐷 and Σ, which, intuitively speaking, is a compact representation of an unambiguous
proof tree of 𝛼 w.r.t. 𝐷 and Σ; this is needed, as an unambiguous proof tree can be exponentially
large in the size of 𝐷. We proceed to formalize the notion of compressed DAG and give the
characterization in question. Recall that a DAG 𝐺 is rooted if it has exactly one node, the root,
with no incoming edges. A node of 𝐺 is a leaf if it has no outgoing edges.
Definition 4.1. (Compressed DAG) Consider a Datalog program Σ, a database 𝐷 over edb(Σ),
and a fact 𝛼 over sch(Σ). A compressed DAG of 𝛼 w.r.t. 𝐷 and Σ is a rooted DAG 𝐺 = (𝑉, 𝐸),
with 𝑉 ⊆ base(𝐷, Σ), such that:
   1. The root of 𝐺 is 𝛼.
   2. If 𝛽 ∈ 𝑉 is a leaf node, then 𝛽 ∈ 𝐷.
   3. If 𝛽 ∈ 𝑉 has 𝑛 ≥ 1 outgoing edges (𝛽, 𝛾1 ), . . . , (𝛽, ⋃︀          𝛾𝑛 ), then there is a rule
      𝑅0 (𝑥¯ 0 ) :– 𝑅1 (𝑥                    ¯ 𝑚 ) ∈ Σ and a function ℎ : 𝑖∈[𝑚] 𝑥
                          ¯ 1 ), . . . , 𝑅𝑚 (𝑥                                    ¯ 𝑖 → C such that
      𝛽 = 𝑅0 (ℎ(𝑥   ¯ 0 )) and {𝛾𝑖 }𝑖∈[𝑛] = {𝑅𝑖 (ℎ(𝑥   ¯ 𝑖 )) | 𝑖 ∈ [𝑚]}.

   For a compressed DAG 𝐺 = (𝑉, 𝐸), the support of 𝐺 is defined analogously to the support
of a proof tree, that is, support(𝐺) = {𝑣 ∈ 𝑉 | 𝑣 is a leaf of 𝐺}. The desired characterization
follows.

Proposition 4.2. For a Datalog program Σ, a database 𝐷 over edb(Σ), a fact 𝛼 over sch(Σ), and
a database 𝐷′ ⊆ 𝐷, the following are equivalent:
   1. There exists an unambiguous proof tree 𝑇 of 𝛼 w.r.t. 𝐷 and Σ such that support(𝑇 ) = 𝐷′ .
   2. There exists a compressed DAG 𝐺 of 𝛼 w.r.t. 𝐷 and Σ such that support(𝐺) = 𝐷′ .

   Note that the above characterization relies on the fact that we focus on unambiguous proof
trees. More precisely, unambiguity allows us to use a single node in the compressed DAG as a
representative for all the (possibly exponentially many) nodes in the proof tree labelled with
the same fact.
The Boolean Formula. Fix a Datalog query 𝑄 = (Σ, 𝑅), a database 𝐷 over edb(Σ), and
a tuple ¯𝑡 ∈ dom(𝐷)ar(𝑅) . We construct in polynomial time in 𝐷 a Boolean formula 𝜑(𝑡¯,𝐷,𝑄)
such that the why-provenance of ¯𝑡 w.r.t. 𝐷 and 𝑄 relative to unambiguous proof trees can be
computed from the truth assignments that make 𝜑(𝑡¯,𝐷,𝑄) true.
   Among the Boolean variables of 𝜑(𝑡¯,𝐷,𝑄) , we also have the disjoint sets of variables 𝑉𝑁 and
𝑉𝐸 , where each variable in 𝑉𝑁 corresponds to a possible node of some compressed DAG of 𝑅(𝑡¯)
w.r.t. 𝐷 and Σ, while each variable in 𝑉𝐸 corresponds to a possible edge between two nodes
of some compressed DAG of 𝑅(𝑡¯) w.r.t. 𝐷 and Σ. The key idea is that the variables of 𝑉𝑁 and
𝑉𝐸 that become true via a satisfying truth assignment of 𝜑(𝑡¯,𝐷,𝑄) , induce the nodes and the
edges of a compressed DAG 𝐺 for 𝑅(𝑡¯) w.r.t. 𝐷 and Σ, which, by Proposition 4.2, implies that
support(𝐺) ∈ whyUN (𝑡¯, 𝐷, 𝑄). The Boolean formula 𝜑(𝑡¯,𝐷,𝑄) is a conjunction of the form

                             𝜑graph ∧ 𝜑root ∧ 𝜑proof ∧ 𝜑acyclic ,

where 𝜑graph is in charge of guaranteeing consistency between the truth assignments of the
variables in 𝑉𝑁 and the variables in 𝑉𝐸 , i.e., if an edge between two nodes is part of 𝐺, then the
two nodes must belong to 𝐺 as well. The formula 𝜑root guarantees that the atom 𝑅(𝑡¯) is a node
of 𝐺, is the root of 𝐺, and no other node 𝑣 of 𝐺 can be the root (i.e., 𝑣 must have at least one
incoming edge). The formula 𝜑proof is in charge of guaranteeing that, whenever an intensional
atom 𝛼 is a node of 𝐺, then it must have the correct children in 𝐺. Finally, 𝜑acyclic is in charge
of checking that 𝐺, namely the graph whose edges correspond to the true variables in 𝑉𝐸 , is
acyclic. We can prove the following crucial result about the above Boolean formula:
       Scenario                            Databases                          Recursive   #Rules
    TClosure [9, 10]              𝐷bitcoin (235K), 𝐷facebook (88.2K)             ✓           2
      Doctors [6]                             𝐷1 (100K)                          ✗           6
       Galen [6]            𝐷1 (26.5K), 𝐷2 (30.5K), 𝐷3 (67K), 𝐷4 (82K)           ✓          14
     Andersen [11]     𝐷1 (68K), 𝐷2 (340K), 𝐷3 (680K), 𝐷4 (3.4M), 𝐷5 (6.8M)      ✓           2
      CSDA [11]            𝐷httpd (10M), 𝐷postgresql (34.8M), 𝐷linux (44M)       ✓           2
Table 1
Experimental scenarios. For each database 𝐷, the number of tuples occurring in 𝐷 is in parenthesis.


 Proposition 4.3. Consider a Datalog query 𝑄 = (Σ, 𝑅), a database 𝐷 over edb(Σ), and a tuple
¯𝑡 ∈ dom(𝐷)ar(𝑅) . It holds that:

    • 𝜑(𝑡¯,𝐷,𝑄) is in CNF and can be constructed in polynomial time w.r.t. 𝐷.
    • There is a one-to-one correspondence between the members of whyUN (𝑡¯, 𝐷, 𝑄) and the
      satisfying assignments of 𝜑(𝑡¯,𝐷,𝑄) .
    • Each member of whyUN (𝑡¯, 𝐷, 𝑄) can be recovered from a satisfying assignment of 𝜑(𝑡¯,𝐷,𝑄)
      in polynomial time w.r.t. the size of 𝜑(𝑡¯,𝐷,𝑄) .
5. Implementation and Experimental Evaluation
Proposition 4.3 provides a way for computing the why-provenance of a tuple relative to unam-
biguous proof trees via off-the-shelf SAT solvers. But how does this machinery behave when
applied in a practical context? In particular, we are interested in the incremental computation
of the why-provenance by enumerating its members, which is more useful in practice than
computing the whole set at once.
   This is achieved by adapting a standard technique from the SAT literature for enumerating
the satisfying assignments of a Boolean formula, called blocking clause. After asking the
SAT solver for an arbitrary satisfying assignment 𝜏 of 𝜑(𝑡¯,𝐷,𝑄) , we output the member of
whyUN (𝑡¯, 𝐷, 𝑄), denoted db(𝜏 ), corresponding to 𝜏 , and then construct a “blocking” clause
𝐶db(𝜏 ) which expresses that no other satisfying assignment 𝜏 ′ should give rise to the same
member of the why-provenance. This will exclude the previously computed explanations from
the computation. We keep adding such blocking clauses each time we get a new member of the
why-provenance until the formula is unsatisfiable.
   We now proceed to experimentally evaluate the SAT-based approach discussed above. To
this end, we consider a variety of scenarios from the Datalog literature consisting of a query
𝑄 = (Σ, 𝑅) and a family of databases 𝒟 over edb(Σ). All the scenarios are summarized in
Table 1.
Experimental Setup. For each scenario 𝑠 consisting of the query 𝑄 = (Σ, 𝑅) and the family of
databases 𝒟, and for each 𝐷 ∈ 𝒟, we have computed 𝑄(𝐷) using DLV [12], version 2.1.1, and
selected 100 tuples ¯𝑡1𝑠,𝐷 , . . . , ¯𝑡100
                                       𝑠,𝐷 from 𝑄(𝐷) uniformly at random. Then, for each 𝑖 ∈ [100],
we constructed the Boolean formula 𝜑(𝑡¯𝑖 ,𝐷,𝑄) via a C++ implementation. Finally, we ran
                                                 𝑠,𝐷
the state-of-the-art SAT solver Glucose [13], version 4.2.1, with the above formula as input, to
enumerate the members of whyUN (𝑡¯𝑖𝑠,𝐷 , 𝐷, 𝑄). All the experiments have been conducted on
a laptop with an Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, and 32GB of RAM, running
Fedora Linux 37.
Experimental Results and Take-Home Messages. Concerning the construction of the
formula, in most of the scenarios, the runtime is in the order of some seconds, and we have
observed that almost all the time is spent for computing the intermediate data structures, needed
to build the formula. Considering the more demanding scenarios Andersen and CSDA, the total
time is in the order of some minutes. Note, however, that it would be similarly demanding even
for query answering.
   For the incremental computation of the why-provenance, we considered the time between
the current member of the why-provenance and the next (the delay). All delays are in the order
of milliseconds. Hence, once the formula is built, incrementally computing the members of the
why-provenance is extremely fast.
   We also compared our SAT-based approach with the one of [6], which is the only one
in the literature for constructing the why-provenance of a given tuple. We point out that
for a query 𝑄, a database 𝐷, and a tuple ¯𝑡 ∈ 𝑄(𝐷), the technique of [6], which we call
rule-based, constructs the set why(𝑡¯, 𝐷, 𝑄) via a rewriting of 𝑄 into a set of existential rules.
Since whyUN (𝑡¯, 𝐷, 𝑄) ⊆ why(𝑡¯, 𝐷, 𝑄), one may think that computing why(𝑡¯, 𝐷, 𝑄) is more
demanding. However, computing whyUN (𝑡¯, 𝐷, 𝑄) requires checking for unambiguity. We use
VLog [14], version 0.9.0, for the rule-based implementation, and we perform the comparison
over the Galen and Doctors-based scenarios as these are the only ones considered in [6], and
thus, we have access to the rewritten set of existential rules. For the comparison, since the
approach of [6] does not support incremental computation, we considered the end-to-end
runtime for constructing whyUN (𝑡¯, 𝐷, 𝑄) and why(𝑡¯, 𝐷, 𝑄) using the SAT-based and the rule-
based implementation, respectively; we set a 5 minutes timeout for both approaches.
   We observed that our SAT-based implementation consistently outperforms the rule-based
one. In particular, for the Galen scenario, in most cases, the rule-based implementation does not
finish in less than 5 minutes, with the worst case occurring with the largest database, where 41
out of the 100 runs time out.


6. Future Steps
From our analysis, it is clear that our future efforts should focus on improving the construction
of the Boolean formula. Since proof trees describe a finite reasoning process, it will be interesting
to understand how they can be adapted to other rule-based formalisms with finite reasoning, in
order to apply our SAT-based approach, such as logic programs with finite stable models [15, 16,
17], and ontology-mediated queries that guarantee the termination of the chase (e.g., see [18,
19, 20, 21, 22, 23]). Finally, it would be interesting to see how to adapt our approach to explain
answers obtained over uncertain data, such as inconsistent databases (e.g., see [24, 25, 26, 27, 28]).


Acknowledgments
This work was funded by the European Union - Next Generation EU under the MUR PRIN-PNRR
grant P2022KHTX7 “DISTORT”, and by the EPSRC grant EP/S003800/1 “EQUID”.
References
 [1] M. Calautti, E. Livshits, A. Pieris, M. Schneider, Computing the why-provenance for
     datalog queries via SAT solvers, in: AAAI, 2024, pp. 10459–10466.
 [2] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison-Wesley, 1995.
 [3] P. Buneman, S. Khanna, W. C. Tan, Why and where: A characterization of data provenance,
     in: ICDT, 2001, pp. 316–330.
 [4] D. Zhao, P. Subotic, B. Scholz, Debugging large-scale datalog: A scalable provenance
     evaluation strategy, ACM Trans. Program. Lang. Syst. 42 (2020) 7:1–7:35.
 [5] J. Esparza, M. Luttenberger, M. Schlund, Fpsolve: A generic solver for fixpoint equations
     over semirings, in: CIAA, 2014, pp. 1–15.
 [6] A. Elhalawati, M. Krötzsch, S. Mennicke, An existential rule framework for computing
     why-provenance on-demand for datalog, in: RuleML+RR, 2022.
 [7] C. Bourgaux, P. Bourhis, L. Peterfreund, M. Thomazo, Revisiting semiring provenance for
     datalog, in: KR, 2022.
 [8] S. A. Cook, An observation on time-storage trade off, J. Comput. Syst. Sci. 9 (1974) 308–316.
 [9] M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, C. E. Leiserson,
     Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for
     financial forensics, CoRR abs/1908.02591 (2019). URL: http://arxiv.org/abs/1908.02591.
[10] J. McAuley, J. Leskovec, Learning to discover social circles in ego networks, in: NIPS, 2012,
     p. 539–547.
[11] Z. Fan, S. Mallireddy, P. Koutris, Towards better understanding of the performance and
     design of datalog systems, in: Datalog 2.0, 2022, pp. 166–180.
[12] N. Leone, G. Pfeifer, W. Faber, T. Eiter, G. Gottlob, S. Perri, F. Scarcello, The DLV system
     for knowledge representation and reasoning, ACM Trans. Comput. Log. 7 (2006) 499–562.
[13] G. Audemard, L. Simon, On the glucose SAT solver, Int. J. Artif. Intell. Tools 27 (2018)
     1840001:1–1840001:25.
[14] J. Urbani, C. Jacobs, M. Krötzsch, Column-oriented datalog materialization for large
     knowledge graphs, in: AAAI, 2016, pp. 258–264.
[15] M. Calautti, S. Greco, F. Spezzano, I. Trubitsyna, Checking termination of bottom-up
     evaluation of logic programs with function symbols, TPLP 15 (2015) 854–889.
[16] M. Calautti, S. Greco, I. Trubitsyna, Detecting decidable classes of finitely ground logic
     programs with function symbols, in: PPDP, 2013, pp. 239–250.
[17] M. Calautti, S. Greco, C. Molinaro, I. Trubitsyna, Logic program termination analysis using
     atom sizes, in: IJCAI, 2015, pp. 2833–2839.
[18] T. Gogacz, J. Marcinkowski, A. Pieris, Uniform restricted chase termination, SIAM J.
     Comput. 52 (2023) 641–683.
[19] M. Calautti, A. Pieris, Semi-oblivious chase termination: The sticky case, ToCS 65 (2021)
     84–121.
[20] M. Calautti, G. Gottlob, A. Pieris, Non-uniformly terminating chase: Size and complexity,
     in: PODS, 2022, pp. 369–378.
[21] M. Krötzsch, M. Marx, S. Rudolph, The power of the terminating chase (invited talk), in:
     ICDT, volume 127, 2019, pp. 3:1–3:17.
[22] M. Calautti, M. Milani, A. Pieris, Semi-oblivious chase termination for linear existential
     rules: An experimental study, VLDB 16 (2023) 2858–2870.
[23] M. Calautti, G. Gottlob, A. Pieris, Chase termination for guarded existential rules, in:
     PODS, 2015, pp. 91–103.
[24] M. Arenas, L. E. Bertossi, J. Chomicki, Consistent query answers in inconsistent databases,
     in: PODS, 1999, pp. 68–79.
[25] M. Calautti, M. Console, A. Pieris, Counting database repairs under primary keys revisited,
     in: PODS, 2019, pp. 104–118.
[26] M. Calautti, S. Greco, C. Molinaro, I. Trubitsyna, Preference-based inconsistency-tolerant
     query answering under existential rules, AI 312 (2022) 103772.
[27] M. Calautti, L. Caroprese, S. Greco, C. Molinaro, I. Trubitsyna, E. Zumpano, Existential
     active integrity constraints, ESWA 168 (2021) 114297.
[28] M. Calautti, M. Console, A. Pieris, Benchmarking approximate consistent query answering,
     in: PODS, 2021, pp. 233–246.

</pre>