-

Discovering Dichotomies for Problems in Database Theory

Neha Makhija

0 0 Supervised by Wolfgang Gatterbauer. Northeastern University , 360 Huntington Ave, Boston Massachusetts 02118 , USA

Dichotomy theorems, which characterize the conditions under which a problem can be solved eficiently, have helped identify important tractability borders for probabilistic query evaluation, view maintenance, query containment (among many more problems). However, dichotomy theorems for many such problems remain elusive under key settings such as bag semantics or for queries with self-joins. This work aims to unearth dichotomies for fundamental problems in reverse data management and knowledge representation. We use a novel approach to discovering dichotomies: instead of creating dedicated algorithms for easy (PTIME) and hard cases (NP-complete), we devise unified algorithms that are guaranteed to terminate in PTIME for easy cases. Using this approach, we discovered new tractable cases for the problem of minimal factorization of provenance formulas as well as dichotomies under bag semantics for the problems of resilience and causal responsibility.

eol>Reverse Data Management Factorization Resilience Causal Responsibility Dichotomy Bag Semantics Self-Join

Problem 1 (Resilience). How surprising is it for an Oscar winning actor to act in a movie directed by their spouse? We can quantify this by calculating the resilience of the query △ :− Oscar(actor), ActsIn(actor, movie), DirectedBy(movie, dir), Spouse(actor, dir). Resilience does not equate to simply the number of output rows but rather asks for the minimum number of changes in the world needed to have no satisfying output. For example, if we do not include the spouse pair of Frances McDormand and Joel Coen, the single deletion would take away 9 rows from the output. Intuitively, if the resilience is small, there have been a small number of events that have led to an Oscar winning actor being in a movie directed by their spouse.

Interestingly, the resilience for this query can be calculated in PTIME under set semantics, but not bag semantics (such as when accounting for multiple Oscar wins). If we now change the query to remove the constraint of the actor having won an Oscar, then finding the resilience of the resulting query △ :− ActsIn(actor, movie), DirectedBy(movie, dir), Spouse(actor, dir) is NPC! 1. Introduction Our goal is to understand the complexity of solving three distinct, yet related, problems: resilience, causal responsibility and minimal factorization. They underlie many practical problems such as reverse data management [3], (including view maintenance [4, 5, 6], explanations and diagnostics [7, 8, 9]), knowledge representation as boolean formulas [10], and probabilistic inference [11].

We treat all problems with the same novel method: Rather than deriving a dedicated PTIME algorithm for certain queries (and proving hardness for the rest), we propose a unified Integer Linear Program (ILP) formulation for all conjunctive queries under all problem variants. We then show that, for all PTIME queries, the Linear Program (LP) relaxation of our ILP has the same optimal value, proving that existing ILP solvers are guaranteed to solve problems for those queries in PTIME. Through this method, we are able to uncover new tractable cases and show the first dichotomy under bag semantics in this space [2] (Fig. 1). In addition, we provide a unified hardness criterion by proving a prior conjecture [12] that states that if a query can form an “Independent Join Path”, then it must be hard [2]. This unified hardness criterion helps us prove the hardness of some queries with selfjoins, whose complexities were previously unknown.

Resilience. What is the minimal number of tuples to delete from a database to eliminate all query answers?

Resilience was introduced [13] to capture the essence of all reverse data management questions: What is a minimum set of changes to a database to produce a certain change in the output of a query? The same work gave a dichotomy [13] of the complexity for resilience for self-join free queries under set semantics. While a dichotomy for the general self-join case remains open, VLDB 2023 PhD Workshop, co-located with the 49th International Freire et al. [12] gave partial complexity results in this Conference on Very Large Data Bases (VLDB 2023), August 28, 2023, space, and conjectured that the notion of Independent Vancouver, Canada Join Paths (IJPs) is a suficient and necessary condition $ makhija.n@northeastern.edu (N. Makhija) for hardness of resilience. We proved this conjecture for n00e0h0a-m00a0k3h-i0ja2.2g1it-h68u3b6.io(N(N.M.Makahkihjaij)a) self-join free queries (with a slight fix of the original state© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License ment) [2], and proved that IJPs are a suficient condition CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) for hardness in queries with self-joins as well.

Fully Deactivated Deactivated Co-Deactivated

Triads

PTIME PTIME

NPC NPC ? #P-hard

Triads

NPC NPC ? #P-hard

PTIME PTIME: dominating (A) NPC: dominated (R, S, T) PTIME

Triad

NPC NPC NPC #P-hard #P-hard

Active Triad

NPC NPC NPC NPC NPC 9

Hierarchical Queries

1 2 2-MQP Queries 1 2 3

RES(Sets): RSP(Sets): RES(Bags): RSP(Bags): FACT(Sets): PROB(Sets): PTIME PTIME PTIME PTIME PTIME PTIME PTIME PTIME PTIME PTIME PTIME #P-hard

Linear

Queries 1 2 3 1 2

However, this tuple still has “partial” responsibility.

By measuring how far we are from a world where the tuple is counterfactual, we can get a notion of its responsibility to the output.a Interestingly, due to our new fine-grained complexity results, we can find the responsibility of a particular Oscar win in PTIME, but ifnding the responsibility of a tuple from the ActsIn, DirectedBy or Spouse Table is NPC. aThe responsibility is inversely proportional to the minimum number of tuples to be deleted | | and is given by 1/(1 + | |).

This notion of causal responsibility is due to foundational work by Halpern, Pearl, et al. [14] which defined causal responsibility based on minimal interventions in the causal graph. Meliou et al. [15] adapted this concept to define causal responsibility for database queries and showed a dichotomy under set semantics for self-join free queries. Our work proves a dichotomy under bag semantics, as well as more fine-grained result based on the relation of the tuple we find responsibility of [2].

Minimal Factorization. Given the provenance formula for a Boolean query, what is its minimal size equivalent expression? Problem 3 (Minimal Factorization). For the same query △ , how could we representing its output eficiently? Imagine the output was due to a single Oscar new. RES denotes resilience, RSP causal responsibility, FACT minimal factorization, while PROB denotes probabilistic query winning actor 1, who was married once represented by 1, and acted in two movies 1, 2, while their spouse also directed these same movies according to tuples 1, 2. We could then represent the output of this query with the boolean provenance formula 1111+ 1122. However, notice that a factorized representation 11(11 + 22) is also equivalent (and minimal in this case). We showed that the minimal factorization for △ can be found in PTIME [1], however the complexity of self-join free queries remains open in general.

Minimal factorization solves the Minimum Equivalent Expression (MEE) [16, 10] problem for provenance formulas. Previously work [17] on this problem has lead to work on factorized databases [18]. However, the dichotomy for finding the exact minimal factorization is open. Minimal factorizations of provenance can be used to obtain probabilistic inference bounds. Prior approaches for approximate probabilistic inference are either incomplete i.e. focus on just PTIME cases [11, 19, 20], or do not solve all PTIME cases exactly [11, 21]. Using minimal factorization achieves the best of both worlds i.e. it applies to easy and hard cases while recovering all known PTIME cases exactly. While we do not yet prove a dichotomy for minimal factorization, the quest for a dichotomy has helped prove a large tractable class: queries with 2 Minimal Query Plans (2-MQP).1 We also place the set of tractable queries for minimal factorization between resilience under set semantics and probabilistic inference in the self-join free case (Fig. 1) 2. Unified ILP-Based Framework For all three problems, we can construct ILPs whose size (number of constraints and variables) is polynomial in 1We hypothesize that 2-MQP Queries are a subset of linear queries. the size of the database instance. While solving ILPs is Result 2: Dichotomies under Bag Semantics. RealNPC in general, we proved that the PTIME relaxations of world databases consist of bags instead of sets i.e. a relaour programs are correct for all easy cases, thus creating tion may have multiple copies of the same tuple. Howa unified framework that solves all known easy cases in ever, the complexity of the many problems under bag PTIME. We now provide intuition for these ILPs. For semantics is not well understood. We show that both more details of construction and examples, please see the resilience and responsibility are easy under bag semanfull papers [1, 2]. tics if and only if the query is linear [2]. Notice that

Resilience. Intuitively, each tuple in the input is asso- the tractability frontier for bag semantics difers from ciated with a binary decision variable, which is set to 1 set semantics, where responsibility is a strictly harder if the tuple should be deleted else set to 0. The objective problem than resilience (Fig. 1). We show that switching of the ILP is to minimize the number of tuples, which is from set semantics to bag semantics need only change the the same as minimizing the sum of the variables. For bag objective function of the ILP, and the constraint matrix semantics, one can simply use a weighted sum, where the remains the same. weight is the number of copies of a tuple in the database. Result 3: New Tractable Cases. We are able to We have a constraint for each output row - that it must show that for any query with ≤ 2 minimal query plans, be deleted. An output row is deleted if any of the tuples minimal factorization can be solved in PTIME as the recontributing to it are deleted. Thus, for each output row, laxation of the minimal factorization ILP is always corthe sum of the variables of the tuple in the row must be rect [1]. This large class of queries includes hierarchical ≥ 1 i.e. at least one of the tuples must be deleted. queries as a special case. In addition, a fine-grained analy

Causal Responsibility. To make a tuple counterfac- sis of the complexity of causal responsibility based on the tual, one must delete all output rows it does not con- relation of the tuple we find the responsibility of, reveals tribute to (which is identical to the resilience problem), additional tractable cases (Fig. 1). but with the additional constraint that some output row Result 4: Instance-Based Tractability. We prove must be preserved. Thus, the causal responsibility ILP has cases such that our unified algorithm is guaranteed to the variables and constraints of the resilience ILP, plus terminate in PTIME, even for hard queries. For example, additional variables to indicate if an output row is deleted, we show that when the provenance of a query output is as well as one additional counterfactual constraint that read-once [20], all three problems can be solved in PTIME. enforces that all output rows cannot be deleted. The interesting aspect is that our unified algorithm does

Minimal Factorization. The ILP for Minimal Factor- not need to know about these conditions as input, it just ization is based on the idea that diferent factorizations automatically leverages those during query time. of a provenance formulas are equivalent to evaluating Result 5: Unified Hardness Criterion. Freire et diferent output rows with diferent query plans. Thus, al. [12] conjectured that the ability to construct Indethe ILP assigns a query plan to each output row.2 The pendent Join Paths (IJPs) is a necessary and suficient objective, equal to the length of the factorized expres- criterion to prove hardness of resilience for a query. sion, is calculated by summing up projections from the We proved this conjecture for the self-join free case, diferent query plans a tuple may be used in. and the suficiency criterion for all cases 3, and use it as a unified way to show hardness for all three problems. With this unified hardness criterion, we obtain 3. Key Results considerably simplified proofs, and prove hardness for a query with self-join whose complexity was previously unknown [2]. We give a Disjunctive Logic Program (DLP) formulation that can computationally derive such hardness certificates, and thus has shown several queries whose complexity was previously unknown to be hard.

We also show, with the existence of an IJP for query (), (, , ), (, , ), (, , ) that the tractable cases for minimal factorization are a strict subset of the tractable cases for resilience under set semantics.

Result 1: ILP Relaxations recover all known PTIME cases. We prove that for all prior known and newly found PTIME cases for all three problems, our ILPs are solved in guaranteed PTIME by standard solvers. For the problems of resilience [2] and minimal factorization [1], we use the LP relaxation, i.e. all integrality constraints on all decision variables are removed. However, for causal responsibility, we relax all variables except those that track if an output row is preserved, creating a Mixed Integer Linear Program (MILP). While MILPs take exponential time in general, we show that the proposed MILP relaxation for causal responsibility can be solved in PTIME if the query is known to be in PTIME [2]. 4. Conclusion and Future Work 2We show that considering only the “minimal query plans” sufices to obtain the minimal factorization.

This overview gives the intuition for a novel way of determining the complexity of problems such as resilience, 3With a slight generalization of the definition of IJP [2]