Computing the Why-Provenance for Datalog Queries via SAT Solvers⋆ (Discussion Paper) Marco Calautti1,** , Ester Livshits2 , Andreas Pieris2,3 and Markus Schneider2 1 Department of Computer Science, University of Milan 2 School of Informatics, University of Edinburgh 3 Department of Computer Science, University of Cyprus Abstract Explaining an answer to a Datalog query is an essential task towards Explainable AI, especially nowadays where Datalog plays a critical role in the development of ontology-based applications. A well-established approach for explaining a query answer is the so-called why-provenance, which essentially collects all the subsets of the input database that can be used to obtain that answer via some derivation process, typically represented as a proof tree. It is well known, however, that computing the why-provenance for Datalog queries is computationally expensive, and thus, very few attempts can be found in the literature. The goal of this work is to demonstrate how off-the-shelf SAT solvers can be exploited towards an efficient computation of the why-provenance for Datalog queries. Interestingly, our SAT-based approach allows us to build the why-provenance in an incremental fashion, that is, one explanation at a time, which is much more useful in a practical context than the one-shot computation of the whole set of explanations as done by existing approaches. Keywords Datalog queries, explainability, why-provenance, computational complexity 1. Introduction Datalog has emerged in the 1980s as a logic-based query language from Logic Programming and has been extensively studied since then [2]. The name Datalog reflects the intention of devising a counterpart of Prolog for data processing. It essentially extends the language of unions of conjunctive queries, which corresponds to the select-project-join-union fragment of relational algebra, with the important feature of recursion, much needed to express some natural queries. As for any other query language, explaining why a result to a Datalog query is obtained is crucial towards transparent data-intensive applications. A well-established approach for providing such explanations to query answers is the so-called why-provenance [3]. Its essence is to collect all the subsets of the input database that as a whole can be used to derive a certain answer. More precisely, for Datalog queries, the why-provenance of an answer tuple ¯𝑑 is obtained by considering all the possible proof trees 𝑇 of the fact Ans(𝑑¯), with Ans being the answer predicate of the Datalog query in question, and then collecting all the database facts that label the leaves of 𝑇 . Recall that a proof tree of a fact 𝛼 w.r.t. a database 𝐷 and a set Ξ£ of Datalog SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy ⋆ This is a short version of the paper [1] ** Corresponding author. $ marco.calautti@unimi.it (M. Calautti); ester.livshits@ed.ac.uk (E. Livshits); apieris@inf.ed.ac.uk (A. Pieris); m.schneider@ed.ac.uk (M. Schneider) Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings rules forms a tree-like representation of a way for deriving 𝛼 by starting from 𝐷 and executing the rules occurring in Ξ£ [2]. Despite its wide acceptance, why-provenance for Datalog queries comes with two weaknesses: it is computationally very expensive, and it may provide counterintuitive explanations. The first weakness is manifested by the fact that, although why-provenance for Datalog queries has been around for decades, only a couple of works have considered implementing it for recursive queries [4, 5]. An attempt to change this state of affairs was made in [6] by focusing on the more practical setting of computing the why-provenance of a given query answer (a.k.a. on-demand why-provenance), instead of computing the why-provenance for all the query answers. Concerning the second weakness, it has been observed that there are proof trees that correspond to unnatural derivation processes, e.g., derivations where an atom is derived from itself [7]. Now, an explanation witnessed via such an unnatural proof tree, might be classified as a counterintuitive one as it does not correspond to an intuitive derivation process that can be extracted from the proof tree; this is further discussed in Section 3. The main goal of this work is to tackle the two weaknesses of why-provenance for Datalog queries discussed above. In particular, we place our work in the more practical setting of on-demand why-provenance, and target an efficient implementation that provides conceptually meaningful explanations for the given query answer. 2. Preliminaries We consider the disjoint countably infinite sets C and V of constants and variables, respectively. We may refer to constants and variables as terms. For brevity, given an integer 𝑛 > 0, we may write [𝑛] for the set of integers {1, . . . , 𝑛}. Relational Databases. A schema S is a finite set of relation names (or predicates) with associated arity. We write 𝑅/𝑛 to say that 𝑅 has arity 𝑛 β‰₯ 0; we may write ar(𝑅) for 𝑛. A (relational) atom 𝛼 over S is an expression of the form 𝑅(𝑑¯), where 𝑅/𝑛 ∈ S and ¯𝑑 is an 𝑛-tuple of terms. By abuse of notation, we may treat tuples as the set of their elements. A fact is an atom that mentions only constants. A database over S is a finite set of facts over S. The active domain of a database 𝐷, denoted dom(𝐷), is the set of constants in 𝐷. Syntax and Semantics of Datalog Programs. A (Datalog) rule 𝜎 over a schema S is an expression of the form Β― 0 ) :– 𝑅1 (π‘₯ 𝑅0 (π‘₯ Β― 1 ), . . . , 𝑅𝑛 (π‘₯ ¯𝑛) for 𝑛 β‰₯ 1, where 𝑅𝑖 (π‘₯Β― 𝑖 ) is a (constant-free) relational atom over S for 𝑖 ∈ {0, . . . , 𝑛}, and each variable in π‘₯ Β― 0 occurs in π‘₯ Β― π‘˜ for some π‘˜ ∈ [𝑛]. We refer to 𝑅0 (π‘₯ Β― 0 ) as the head of 𝜎, denoted head(𝜎), and to the expression that appears on the right of the :– symbol as the body of 𝜎, denoted body(𝜎), which we may treat as the set of its atoms. A Datalog program over a schema S is defined as a finite set Ξ£ of Datalog rules over S. A predicate 𝑅 occurring in Ξ£ is called extensional if there is no rule in Ξ£ having 𝑅 in its head, and intensional if there exists at least one rule in Ξ£ with 𝑅 in its head. The extensional (database) schema of Ξ£, denoted edb(Ξ£), is the set of all extensional predicates in Ξ£, while the intensional schema of Ξ£, denoted idb(Ξ£), is the set of all intensional predicates in Ξ£. Note that, by definition, edb(Ξ£) ∩ idb(Ξ£) = βˆ…. The schema of Ξ£, denoted sch(Ξ£), is the set edb(Ξ£) βˆͺ idb(Ξ£), which is in general a subset of S since some predicates of S may not appear in Ξ£. An elegant property of Datalog programs is that they have three equivalent semantics: model-theoretic, fixpoint, and proof-theoretic [2]. We recall the proof-theoretic semantics of Datalog programs since it is closer to the notion of why-provenance. To this end, we need the key notion of proof tree of a fact. For a database 𝐷 and a Datalog program Ξ£, let base(𝐷, Ξ£) = {𝑅(𝑑¯) | 𝑅 ∈ sch(Ξ£) and ¯𝑑 ∈ dom(𝐷)ar(𝑅) }, the facts that can be formed using predicates of sch(Ξ£) and constants of dom(𝐷). Definition 2.1. (Proof Tree) Consider a Datalog program Ξ£, a database 𝐷 over edb(Ξ£), and a fact 𝛼 over sch(Ξ£). A proof tree of 𝛼 w.r.t. 𝐷 and Ξ£ is a finite labeled rooted tree 𝑇 = (𝑉, 𝐸, πœ†), with πœ† : 𝑉 β†’ base(𝐷, Ξ£), such that: 1. If 𝑣 ∈ 𝑉 is the root, then πœ†(𝑣) = 𝛼. 2. If 𝑣 ∈ 𝑉 is a leaf, then πœ†(𝑣) ∈ 𝐷. 3. If 𝑣 ∈ 𝑉 is a node with 𝑛 β‰₯ 1 children 𝑒1 , . . . , 𝑒 ⋃︀𝑛 , then there is a rule 𝑅0 (π‘₯Β― 0 ) :– 𝑅1 (π‘₯ Β― 𝑛 ) ∈ Ξ£ and a function β„Ž : π‘–βˆˆ[𝑛] π‘₯ Β― 1 ), . . . , 𝑅𝑛 (π‘₯ Β― 𝑖 β†’ C such that πœ†(𝑣) = 𝑅0 (β„Ž(π‘₯ Β― 0 )), and πœ†(𝑒𝑖 ) = 𝑅𝑖 (β„Ž(π‘₯ Β― 𝑖 )) for each 𝑖 ∈ [𝑛]. Essentially, a proof tree of a fact 𝛼 w.r.t. 𝐷 and Ξ£ indicates that we can derive 𝛼 starting from 𝐷 end executing the rules of Ξ£. Now, given a Datalog program Ξ£ and a database 𝐷 over sch(Ξ£), the semantics of Ξ£ on 𝐷 is Ξ£(𝐷) = {𝛼 | there is a proof tree of 𝛼 w.r.t. 𝐷 and Ξ£}. That is, the set of facts that can be proven using 𝐷 and Ξ£. Datalog Queries. It is now straightforward to recall the syntax and the semantics of Datalog queries. A Datalog query is a pair 𝑄 = (Ξ£, 𝑅), where Ξ£ is a Datalog program and 𝑅 a predicate of idb(Ξ£). Now, for a database 𝐷 over edb(Ξ£), the answer to 𝑄 over 𝐷 is defined as the set {︁ }︁ 𝑄(𝐷) = ¯𝑑 ∈ dom(𝐷)ar(𝑅) | 𝑅(𝑑¯) ∈ Ξ£(𝐷) , that is, the set of tuples ¯𝑑 of dom(𝐷)ar(𝑅) such that the fact 𝑅(𝑑¯) can be derived using 𝐷 and Ξ£. Why-Provenance for Datalog Queries. As discussed in the introduction, why-provenance is a standard way of explaining query results. It essentially collects all the subsets of the database (without unnecessary atoms) that allow to prove (or derive) a query result. We now formalize this simple idea. Given a proof tree 𝑇 = (𝑉, 𝐸, πœ†) (of some fact w.r.t. some database and Datalog program), the support of 𝑇 is the set support(𝑇 ) = {πœ†(𝑣) | 𝑣 ∈ 𝑉 is a leaf of 𝑇 }, which is essentially the set of facts that label the leaves of the proof tree 𝑇 . Note that support(𝑇 ) is a subset of the underlying database since, by definition, the leaves of a proof tree are labeled with database atoms. The formal definition of why-provenance for Datalog queries follows. Definition 2.2. (Why-Provenance for Datalog) Consider a Datalog query 𝑄 = (Ξ£, 𝑅), a database 𝐷 over edb(Ξ£), and a tuple ¯𝑑 ∈ dom(𝐷)ar(𝑅) . The why-provenance of ¯𝑑 w.r.t. 𝐷 and 𝑄 is defined as the family of sets of facts why(𝑑¯, 𝐷, 𝑄) = {support(𝑇 ) | 𝑇 is a proof tree of 𝑅(𝑑¯) w.r.t. 𝐷 and Ξ£}. Intuitively speaking, a set of facts 𝐷′ βŠ† 𝐷 that belongs to why(𝑑¯, 𝐷, 𝑄) should be understood as a reason why the tuple ¯𝑑 is an answer to the query 𝑄 over the database 𝐷, i.e., 𝐷′ explains why ¯𝑑 ∈ 𝑄(𝐷). In particular, all the facts of 𝐷′ are used in order to derive ¯𝑑 as an answer. 3. Unambiguous Proof Trees The standard notion of why-provenance defined above relies on arbitrary proof trees. However, as discussed in [7], there are proof trees that are counterintuitive. For example, such a proof tree is one where a fact is derived from itself, that is, it contains two nodes labeled with the same fact and one is a descendant of the other. Now, a member of why(𝑑¯, 𝐷, 𝑄), witnessed via such an unnatural proof tree, might be classified as a counterintuitive explanation of ¯𝑑 as it does not correspond to a natural derivation process that can be extracted from the proof tree. Therefore, we need refined classes of proof trees that overcome the conceptual limitations of arbitrary proof trees. Some refined classes of proof trees have been recently discussed in [7]: non-recursive proof trees, minimal-depth proof trees, and hereditary minimal-depth proof trees. Roughly speaking, a non-recursive proof tree is a proof tree that does not contain two nodes labeled with the same fact and one is a descendant of the other, a minimal-depth proof tree is a proof tree that has the minimum depth among all the proof trees of the same fact, and a hereditary minimal-depth proof tree is minimizing the depth of each of its subtrees. Although non-recursive and (hereditary) minimal-depth proof trees form well-justified notions that deserve our attention, there are still proof trees from those classes that can be classified as counterintuitive. In fact, there are proof trees that are non-recursive and (hereditary) minimal-depth, but they are ambiguous in the way some facts are derived. Here is a simple example that illustrates this phenomenon. Example 3.1. Consider the Datalog program Ξ£ 𝐴(π‘₯) :– 𝑆(π‘₯) 𝐴(π‘₯) :– 𝐴(𝑦), 𝐴(𝑧), 𝑇 (𝑦, 𝑧, π‘₯) that encodes the path accessibility problem [8]. The predicate 𝑆 represents source nodes, 𝐴 represents nodes that are accessible from the source nodes, and 𝑇 represents accessibility conditions, that is, 𝑇 (𝑦, 𝑧, π‘₯) means that if both 𝑦 and 𝑧 are accessible from the source nodes, then so is π‘₯. We further consider the database 𝐷 = {𝑆(π‘Ž), 𝑆(𝑏), 𝑇 (π‘Ž, π‘Ž, 𝑐), 𝑇 (𝑏, 𝑏, 𝑐), 𝑇 (𝑐, 𝑐, 𝑑)}. The following is a proof tree of the fact 𝐴(𝑑) w.r.t. 𝐷 and Ξ£ that is non-recursive and (hereditary) minimal-depth, but it suffers from the ambiguity issue described above: 𝐴(𝑑) 𝐴(𝑐) 𝐴(𝑐) 𝑇 (𝑐, 𝑐, 𝑑) 𝐴(π‘Ž) 𝐴(π‘Ž) 𝑇 (π‘Ž, π‘Ž, 𝑐) 𝐴(𝑏) 𝐴(𝑏) 𝑇 (𝑏, 𝑏, 𝑐) 𝑆(π‘Ž) 𝑆(π‘Ž) 𝑆(𝑏) 𝑆(𝑏) Indeed, there are two nodes labeled with the fact 𝐴(𝑐), but their subtrees differ, and thus, it is ambiguous how 𝐴(𝑐) is derived. Hence, the database 𝐷, which belongs to the why-provenance of (𝑑) w.r.t. 𝐷 and 𝑄 due to the above proof tree, might be classified as a counterintuitive explanation since it does not correspond to an intuitive derivation process where each fact is derived once due to a unique reason. Indeed, the intuitive explanations that one expects are the following: 𝑑 is accessible from π‘Ž via 𝑐 (i.e., the subset {𝑆(π‘Ž), 𝑇 (π‘Ž, π‘Ž, 𝑐), 𝑇 (𝑐, 𝑐, 𝑑)} of 𝐷), or 𝑑 is accessible from 𝑏 via 𝑐 (i.e., the subset {𝑆(𝑏), 𝑇 (𝑏, 𝑏, 𝑐), 𝑇 (𝑐, 𝑐, 𝑑)} of 𝐷). This leads to the class of unambiguous proof trees, where all occurrences of a fact must be proved via the same derivation. Two rooted trees 𝑇 = (𝑉, 𝐸, πœ†) and 𝑇 β€² = (𝑉 β€² , 𝐸 β€² , πœ†β€² ) are isomorphic, denoted 𝑇 β‰ˆ 𝑇 β€² , if there exists a bijection β„Ž : 𝑉 β†’ 𝑉 β€² such that, for each 𝑣 ∈ 𝑉 , πœ†(𝑣) = πœ†β€² (β„Ž(𝑣)), and for each 𝑒, 𝑣 ∈ 𝑉 , (𝑒, 𝑣) ∈ 𝐸 iff (β„Ž(𝑒), β„Ž(𝑣)) ∈ 𝐸 β€² . Let 𝑇 [𝑣] be the subtree of the proof tree 𝑇 rooted at node 𝑣. The formal definition of unambiguous proof trees follows. Definition 3.2. (Unambiguous Proof Tree) Consider a Datalog program Ξ£, a database 𝐷 over edb(Ξ£), and a fact 𝛼 over sch(Ξ£). An unambiguous proof tree of 𝛼 w.r.t. 𝐷 and Ξ£ is a proof tree 𝑇 = (𝑉, 𝐸, πœ†) of 𝛼 w.r.t. 𝐷 and Ξ£ such that, for all 𝑣, 𝑒 ∈ 𝑉 , πœ†(𝑣) = πœ†(𝑒) implies 𝑇 [𝑣] β‰ˆ 𝑇 [𝑒]. Why-provenance relative to unambiguous proof trees is defined in the obvious way: for a Datalog query 𝑄 = (Ξ£, 𝑅), a database 𝐷 over edb(Ξ£), and a tuple ¯𝑑 ∈ dom(𝐷)ar(𝑅) , the why-provenance of ¯𝑑 w.r.t. 𝐷 and 𝑄 relative to unambiguous proof trees is the family of sets of facts {support(𝑇 ) | 𝑇 is an unambiguous proof tree of 𝑅(𝑑¯) w.r.t. 𝐷 and Ξ£}, denoted whyUN (𝑑¯, 𝐷, 𝑄). The main concern of this work is to efficiently compute the why- provenance of a tuple relative to unambiguous proof trees. To this end, we are going to exploit off-the-shelf SAT solvers and report encouraging results; this is the subject of the next two sections. To the best of our knowledge, this is the first time that SAT solvers are used for computing the why-provenance. Let us stress that focusing on unambiguous proof trees, apart from their conceptual advantage discussed above, was crucial towards our encouraging results as it is unclear how a SAT-based implementation can be made practical for arbitrary proof trees. 4. From Why-Provenance to SAT In this section, we show that the why-provenance of a tuple relative to unambiguous proof trees can be extracted from the satisfying truth assignments of a Boolean formula. Compactly Representing Unambiguous Proof Trees. The construction of the Boolean formula relies on a characterization of the existence of an unambiguous proof tree of a fact 𝛼 w.r.t. 𝐷 and Ξ£ via the existence of a so-called compressed directed acyclic graph (DAG) of 𝛼 w.r.t. 𝐷 and Ξ£, which, intuitively speaking, is a compact representation of an unambiguous proof tree of 𝛼 w.r.t. 𝐷 and Ξ£; this is needed, as an unambiguous proof tree can be exponentially large in the size of 𝐷. We proceed to formalize the notion of compressed DAG and give the characterization in question. Recall that a DAG 𝐺 is rooted if it has exactly one node, the root, with no incoming edges. A node of 𝐺 is a leaf if it has no outgoing edges. Definition 4.1. (Compressed DAG) Consider a Datalog program Ξ£, a database 𝐷 over edb(Ξ£), and a fact 𝛼 over sch(Ξ£). A compressed DAG of 𝛼 w.r.t. 𝐷 and Ξ£ is a rooted DAG 𝐺 = (𝑉, 𝐸), with 𝑉 βŠ† base(𝐷, Ξ£), such that: 1. The root of 𝐺 is 𝛼. 2. If 𝛽 ∈ 𝑉 is a leaf node, then 𝛽 ∈ 𝐷. 3. If 𝛽 ∈ 𝑉 has 𝑛 β‰₯ 1 outgoing edges (𝛽, 𝛾1 ), . . . , (𝛽, ⋃︀ 𝛾𝑛 ), then there is a rule 𝑅0 (π‘₯Β― 0 ) :– 𝑅1 (π‘₯ Β― π‘š ) ∈ Ξ£ and a function β„Ž : π‘–βˆˆ[π‘š] π‘₯ Β― 1 ), . . . , π‘…π‘š (π‘₯ Β― 𝑖 β†’ C such that 𝛽 = 𝑅0 (β„Ž(π‘₯ Β― 0 )) and {𝛾𝑖 }π‘–βˆˆ[𝑛] = {𝑅𝑖 (β„Ž(π‘₯ Β― 𝑖 )) | 𝑖 ∈ [π‘š]}. For a compressed DAG 𝐺 = (𝑉, 𝐸), the support of 𝐺 is defined analogously to the support of a proof tree, that is, support(𝐺) = {𝑣 ∈ 𝑉 | 𝑣 is a leaf of 𝐺}. The desired characterization follows. Proposition 4.2. For a Datalog program Ξ£, a database 𝐷 over edb(Ξ£), a fact 𝛼 over sch(Ξ£), and a database 𝐷′ βŠ† 𝐷, the following are equivalent: 1. There exists an unambiguous proof tree 𝑇 of 𝛼 w.r.t. 𝐷 and Ξ£ such that support(𝑇 ) = 𝐷′ . 2. There exists a compressed DAG 𝐺 of 𝛼 w.r.t. 𝐷 and Ξ£ such that support(𝐺) = 𝐷′ . Note that the above characterization relies on the fact that we focus on unambiguous proof trees. More precisely, unambiguity allows us to use a single node in the compressed DAG as a representative for all the (possibly exponentially many) nodes in the proof tree labelled with the same fact. The Boolean Formula. Fix a Datalog query 𝑄 = (Ξ£, 𝑅), a database 𝐷 over edb(Ξ£), and a tuple ¯𝑑 ∈ dom(𝐷)ar(𝑅) . We construct in polynomial time in 𝐷 a Boolean formula πœ‘(𝑑¯,𝐷,𝑄) such that the why-provenance of ¯𝑑 w.r.t. 𝐷 and 𝑄 relative to unambiguous proof trees can be computed from the truth assignments that make πœ‘(𝑑¯,𝐷,𝑄) true. Among the Boolean variables of πœ‘(𝑑¯,𝐷,𝑄) , we also have the disjoint sets of variables 𝑉𝑁 and 𝑉𝐸 , where each variable in 𝑉𝑁 corresponds to a possible node of some compressed DAG of 𝑅(𝑑¯) w.r.t. 𝐷 and Ξ£, while each variable in 𝑉𝐸 corresponds to a possible edge between two nodes of some compressed DAG of 𝑅(𝑑¯) w.r.t. 𝐷 and Ξ£. The key idea is that the variables of 𝑉𝑁 and 𝑉𝐸 that become true via a satisfying truth assignment of πœ‘(𝑑¯,𝐷,𝑄) , induce the nodes and the edges of a compressed DAG 𝐺 for 𝑅(𝑑¯) w.r.t. 𝐷 and Ξ£, which, by Proposition 4.2, implies that support(𝐺) ∈ whyUN (𝑑¯, 𝐷, 𝑄). The Boolean formula πœ‘(𝑑¯,𝐷,𝑄) is a conjunction of the form πœ‘graph ∧ πœ‘root ∧ πœ‘proof ∧ πœ‘acyclic , where πœ‘graph is in charge of guaranteeing consistency between the truth assignments of the variables in 𝑉𝑁 and the variables in 𝑉𝐸 , i.e., if an edge between two nodes is part of 𝐺, then the two nodes must belong to 𝐺 as well. The formula πœ‘root guarantees that the atom 𝑅(𝑑¯) is a node of 𝐺, is the root of 𝐺, and no other node 𝑣 of 𝐺 can be the root (i.e., 𝑣 must have at least one incoming edge). The formula πœ‘proof is in charge of guaranteeing that, whenever an intensional atom 𝛼 is a node of 𝐺, then it must have the correct children in 𝐺. Finally, πœ‘acyclic is in charge of checking that 𝐺, namely the graph whose edges correspond to the true variables in 𝑉𝐸 , is acyclic. We can prove the following crucial result about the above Boolean formula: Scenario Databases Recursive #Rules TClosure [9, 10] 𝐷bitcoin (235K), 𝐷facebook (88.2K) βœ“ 2 Doctors [6] 𝐷1 (100K) βœ— 6 Galen [6] 𝐷1 (26.5K), 𝐷2 (30.5K), 𝐷3 (67K), 𝐷4 (82K) βœ“ 14 Andersen [11] 𝐷1 (68K), 𝐷2 (340K), 𝐷3 (680K), 𝐷4 (3.4M), 𝐷5 (6.8M) βœ“ 2 CSDA [11] 𝐷httpd (10M), 𝐷postgresql (34.8M), 𝐷linux (44M) βœ“ 2 Table 1 Experimental scenarios. For each database 𝐷, the number of tuples occurring in 𝐷 is in parenthesis. Proposition 4.3. Consider a Datalog query 𝑄 = (Ξ£, 𝑅), a database 𝐷 over edb(Ξ£), and a tuple ¯𝑑 ∈ dom(𝐷)ar(𝑅) . It holds that: β€’ πœ‘(𝑑¯,𝐷,𝑄) is in CNF and can be constructed in polynomial time w.r.t. 𝐷. β€’ There is a one-to-one correspondence between the members of whyUN (𝑑¯, 𝐷, 𝑄) and the satisfying assignments of πœ‘(𝑑¯,𝐷,𝑄) . β€’ Each member of whyUN (𝑑¯, 𝐷, 𝑄) can be recovered from a satisfying assignment of πœ‘(𝑑¯,𝐷,𝑄) in polynomial time w.r.t. the size of πœ‘(𝑑¯,𝐷,𝑄) . 5. Implementation and Experimental Evaluation Proposition 4.3 provides a way for computing the why-provenance of a tuple relative to unam- biguous proof trees via off-the-shelf SAT solvers. But how does this machinery behave when applied in a practical context? In particular, we are interested in the incremental computation of the why-provenance by enumerating its members, which is more useful in practice than computing the whole set at once. This is achieved by adapting a standard technique from the SAT literature for enumerating the satisfying assignments of a Boolean formula, called blocking clause. After asking the SAT solver for an arbitrary satisfying assignment 𝜏 of πœ‘(𝑑¯,𝐷,𝑄) , we output the member of whyUN (𝑑¯, 𝐷, 𝑄), denoted db(𝜏 ), corresponding to 𝜏 , and then construct a β€œblocking” clause 𝐢db(𝜏 ) which expresses that no other satisfying assignment 𝜏 β€² should give rise to the same member of the why-provenance. This will exclude the previously computed explanations from the computation. We keep adding such blocking clauses each time we get a new member of the why-provenance until the formula is unsatisfiable. We now proceed to experimentally evaluate the SAT-based approach discussed above. To this end, we consider a variety of scenarios from the Datalog literature consisting of a query 𝑄 = (Ξ£, 𝑅) and a family of databases π’Ÿ over edb(Ξ£). All the scenarios are summarized in Table 1. Experimental Setup. For each scenario 𝑠 consisting of the query 𝑄 = (Ξ£, 𝑅) and the family of databases π’Ÿ, and for each 𝐷 ∈ π’Ÿ, we have computed 𝑄(𝐷) using DLV [12], version 2.1.1, and selected 100 tuples ¯𝑑1𝑠,𝐷 , . . . , ¯𝑑100 𝑠,𝐷 from 𝑄(𝐷) uniformly at random. Then, for each 𝑖 ∈ [100], we constructed the Boolean formula πœ‘(𝑑¯𝑖 ,𝐷,𝑄) via a C++ implementation. Finally, we ran 𝑠,𝐷 the state-of-the-art SAT solver Glucose [13], version 4.2.1, with the above formula as input, to enumerate the members of whyUN (𝑑¯𝑖𝑠,𝐷 , 𝐷, 𝑄). All the experiments have been conducted on a laptop with an Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz, and 32GB of RAM, running Fedora Linux 37. Experimental Results and Take-Home Messages. Concerning the construction of the formula, in most of the scenarios, the runtime is in the order of some seconds, and we have observed that almost all the time is spent for computing the intermediate data structures, needed to build the formula. Considering the more demanding scenarios Andersen and CSDA, the total time is in the order of some minutes. Note, however, that it would be similarly demanding even for query answering. For the incremental computation of the why-provenance, we considered the time between the current member of the why-provenance and the next (the delay). All delays are in the order of milliseconds. Hence, once the formula is built, incrementally computing the members of the why-provenance is extremely fast. We also compared our SAT-based approach with the one of [6], which is the only one in the literature for constructing the why-provenance of a given tuple. We point out that for a query 𝑄, a database 𝐷, and a tuple ¯𝑑 ∈ 𝑄(𝐷), the technique of [6], which we call rule-based, constructs the set why(𝑑¯, 𝐷, 𝑄) via a rewriting of 𝑄 into a set of existential rules. Since whyUN (𝑑¯, 𝐷, 𝑄) βŠ† why(𝑑¯, 𝐷, 𝑄), one may think that computing why(𝑑¯, 𝐷, 𝑄) is more demanding. However, computing whyUN (𝑑¯, 𝐷, 𝑄) requires checking for unambiguity. We use VLog [14], version 0.9.0, for the rule-based implementation, and we perform the comparison over the Galen and Doctors-based scenarios as these are the only ones considered in [6], and thus, we have access to the rewritten set of existential rules. For the comparison, since the approach of [6] does not support incremental computation, we considered the end-to-end runtime for constructing whyUN (𝑑¯, 𝐷, 𝑄) and why(𝑑¯, 𝐷, 𝑄) using the SAT-based and the rule- based implementation, respectively; we set a 5 minutes timeout for both approaches. We observed that our SAT-based implementation consistently outperforms the rule-based one. In particular, for the Galen scenario, in most cases, the rule-based implementation does not finish in less than 5 minutes, with the worst case occurring with the largest database, where 41 out of the 100 runs time out. 6. Future Steps From our analysis, it is clear that our future efforts should focus on improving the construction of the Boolean formula. Since proof trees describe a finite reasoning process, it will be interesting to understand how they can be adapted to other rule-based formalisms with finite reasoning, in order to apply our SAT-based approach, such as logic programs with finite stable models [15, 16, 17], and ontology-mediated queries that guarantee the termination of the chase (e.g., see [18, 19, 20, 21, 22, 23]). Finally, it would be interesting to see how to adapt our approach to explain answers obtained over uncertain data, such as inconsistent databases (e.g., see [24, 25, 26, 27, 28]). Acknowledgments This work was funded by the European Union - Next Generation EU under the MUR PRIN-PNRR grant P2022KHTX7 β€œDISTORT”, and by the EPSRC grant EP/S003800/1 β€œEQUID”. References [1] M. Calautti, E. Livshits, A. Pieris, M. Schneider, Computing the why-provenance for datalog queries via SAT solvers, in: AAAI, 2024, pp. 10459–10466. [2] S. Abiteboul, R. Hull, V. Vianu, Foundations of Databases, Addison-Wesley, 1995. [3] P. Buneman, S. Khanna, W. C. Tan, Why and where: A characterization of data provenance, in: ICDT, 2001, pp. 316–330. [4] D. Zhao, P. Subotic, B. Scholz, Debugging large-scale datalog: A scalable provenance evaluation strategy, ACM Trans. Program. Lang. Syst. 42 (2020) 7:1–7:35. [5] J. Esparza, M. Luttenberger, M. Schlund, Fpsolve: A generic solver for fixpoint equations over semirings, in: CIAA, 2014, pp. 1–15. [6] A. Elhalawati, M. KrΓΆtzsch, S. Mennicke, An existential rule framework for computing why-provenance on-demand for datalog, in: RuleML+RR, 2022. [7] C. Bourgaux, P. Bourhis, L. Peterfreund, M. Thomazo, Revisiting semiring provenance for datalog, in: KR, 2022. [8] S. A. Cook, An observation on time-storage trade off, J. Comput. Syst. Sci. 9 (1974) 308–316. [9] M. Weber, G. Domeniconi, J. Chen, D. K. I. Weidele, C. Bellei, T. Robinson, C. E. Leiserson, Anti-money laundering in bitcoin: Experimenting with graph convolutional networks for financial forensics, CoRR abs/1908.02591 (2019). URL: http://arxiv.org/abs/1908.02591. [10] J. McAuley, J. Leskovec, Learning to discover social circles in ego networks, in: NIPS, 2012, p. 539–547. [11] Z. Fan, S. Mallireddy, P. Koutris, Towards better understanding of the performance and design of datalog systems, in: Datalog 2.0, 2022, pp. 166–180. [12] N. Leone, G. Pfeifer, W. Faber, T. Eiter, G. Gottlob, S. Perri, F. Scarcello, The DLV system for knowledge representation and reasoning, ACM Trans. Comput. Log. 7 (2006) 499–562. [13] G. Audemard, L. Simon, On the glucose SAT solver, Int. J. Artif. Intell. Tools 27 (2018) 1840001:1–1840001:25. [14] J. Urbani, C. Jacobs, M. KrΓΆtzsch, Column-oriented datalog materialization for large knowledge graphs, in: AAAI, 2016, pp. 258–264. [15] M. Calautti, S. Greco, F. Spezzano, I. Trubitsyna, Checking termination of bottom-up evaluation of logic programs with function symbols, TPLP 15 (2015) 854–889. [16] M. Calautti, S. Greco, I. Trubitsyna, Detecting decidable classes of finitely ground logic programs with function symbols, in: PPDP, 2013, pp. 239–250. [17] M. Calautti, S. Greco, C. Molinaro, I. Trubitsyna, Logic program termination analysis using atom sizes, in: IJCAI, 2015, pp. 2833–2839. [18] T. Gogacz, J. Marcinkowski, A. Pieris, Uniform restricted chase termination, SIAM J. Comput. 52 (2023) 641–683. [19] M. Calautti, A. Pieris, Semi-oblivious chase termination: The sticky case, ToCS 65 (2021) 84–121. [20] M. Calautti, G. Gottlob, A. Pieris, Non-uniformly terminating chase: Size and complexity, in: PODS, 2022, pp. 369–378. [21] M. KrΓΆtzsch, M. Marx, S. Rudolph, The power of the terminating chase (invited talk), in: ICDT, volume 127, 2019, pp. 3:1–3:17. [22] M. Calautti, M. Milani, A. Pieris, Semi-oblivious chase termination for linear existential rules: An experimental study, VLDB 16 (2023) 2858–2870. [23] M. Calautti, G. Gottlob, A. Pieris, Chase termination for guarded existential rules, in: PODS, 2015, pp. 91–103. [24] M. Arenas, L. E. Bertossi, J. Chomicki, Consistent query answers in inconsistent databases, in: PODS, 1999, pp. 68–79. [25] M. Calautti, M. Console, A. Pieris, Counting database repairs under primary keys revisited, in: PODS, 2019, pp. 104–118. [26] M. Calautti, S. Greco, C. Molinaro, I. Trubitsyna, Preference-based inconsistency-tolerant query answering under existential rules, AI 312 (2022) 103772. [27] M. Calautti, L. Caroprese, S. Greco, C. Molinaro, I. Trubitsyna, E. Zumpano, Existential active integrity constraints, ESWA 168 (2021) 114297. [28] M. Calautti, M. Console, A. Pieris, Benchmarking approximate consistent query answering, in: PODS, 2021, pp. 233–246.