<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering Dichotomies for Problems in Database Theory</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Neha Makhija</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Supervised by Wolfgang Gatterbauer. Northeastern University</institution>
          ,
          <addr-line>360 Huntington Ave, Boston Massachusetts 02118</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Dichotomy theorems, which characterize the conditions under which a problem can be solved eficiently, have helped identify important tractability borders for probabilistic query evaluation, view maintenance, query containment (among many more problems). However, dichotomy theorems for many such problems remain elusive under key settings such as bag semantics or for queries with self-joins. This work aims to unearth dichotomies for fundamental problems in reverse data management and knowledge representation. We use a novel approach to discovering dichotomies: instead of creating dedicated algorithms for easy (PTIME) and hard cases (NP-complete), we devise unified algorithms that are guaranteed to terminate in PTIME for easy cases. Using this approach, we discovered new tractable cases for the problem of minimal factorization of provenance formulas as well as dichotomies under bag semantics for the problems of resilience and causal responsibility.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Reverse Data Management</kwd>
        <kwd>Factorization</kwd>
        <kwd>Resilience</kwd>
        <kwd>Causal Responsibility</kwd>
        <kwd>Dichotomy</kwd>
        <kwd>Bag Semantics</kwd>
        <kwd>Self-Join</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Problem 1 (Resilience). How surprising is it for an
Oscar winning actor to act in a movie directed by their
spouse? We can quantify this by calculating the resilience
of the query △ :− Oscar(actor), ActsIn(actor, movie),
DirectedBy(movie, dir), Spouse(actor, dir). Resilience does
not equate to simply the number of output rows but rather
asks for the minimum number of changes in the world
needed to have no satisfying output. For example, if we
do not include the spouse pair of Frances McDormand and
Joel Coen, the single deletion would take away 9 rows
from the output. Intuitively, if the resilience is small,
there have been a small number of events that have led
to an Oscar winning actor being in a movie directed by
their spouse.</p>
      <p>Interestingly, the resilience for this query can be
calculated in PTIME under set semantics, but not bag
semantics (such as when accounting for multiple Oscar wins). If
we now change the query to remove the constraint of the
actor having won an Oscar, then finding the resilience of
the resulting query △ :− ActsIn(actor, movie),
DirectedBy(movie, dir), Spouse(actor, dir) is NPC!
1. Introduction
Our goal is to understand the complexity of solving three
distinct, yet related, problems: resilience, causal
responsibility and minimal factorization. They underlie many
practical problems such as reverse data management [3],
(including view maintenance [4, 5, 6], explanations and
diagnostics [7, 8, 9]), knowledge representation as boolean
formulas [10], and probabilistic inference [11].</p>
      <p>We treat all problems with the same novel method:
Rather than deriving a dedicated PTIME algorithm for
certain queries (and proving hardness for the rest), we
propose a unified Integer Linear Program (ILP)
formulation for all conjunctive queries under all problem
variants. We then show that, for all PTIME queries, the Linear
Program (LP) relaxation of our ILP has the same optimal
value, proving that existing ILP solvers are guaranteed
to solve problems for those queries in PTIME. Through
this method, we are able to uncover new tractable cases
and show the first dichotomy under bag semantics in
this space [2] (Fig. 1). In addition, we provide a unified
hardness criterion by proving a prior conjecture [12] that
states that if a query can form an “Independent Join Path”,
then it must be hard [2]. This unified hardness criterion
helps us prove the hardness of some queries with
selfjoins, whose complexities were previously unknown.</p>
      <p>Resilience. What is the minimal number of tuples to
delete from a database to eliminate all query answers?</p>
      <p>Resilience was introduced [13] to capture the essence
of all reverse data management questions: What is a
minimum set of changes to a database to produce a
certain change in the output of a query? The same work
gave a dichotomy [13] of the complexity for resilience
for self-join free queries under set semantics. While a
dichotomy for the general self-join case remains open,
VLDB 2023 PhD Workshop, co-located with the 49th International Freire et al. [12] gave partial complexity results in this
Conference on Very Large Data Bases (VLDB 2023), August 28, 2023, space, and conjectured that the notion of Independent
Vancouver, Canada Join Paths (IJPs) is a suficient and necessary condition
$ makhija.n@northeastern.edu (N. Makhija) for hardness of resilience. We proved this conjecture for
 n00e0h0a-m00a0k3h-i0ja2.2g1it-h68u3b6.io(N(N.M.Makahkihjaij)a) self-join free queries (with a slight fix of the original
state© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License ment) [2], and proved that IJPs are a suficient condition
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org) for hardness in queries with self-joins as well.</p>
      <p>Fully Deactivated Deactivated Co-Deactivated</p>
      <p />
    </sec>
    <sec id="sec-2">
      <title>Triads</title>
      <p />
      <sec id="sec-2-1">
        <title>PTIME</title>
      </sec>
      <sec id="sec-2-2">
        <title>PTIME</title>
        <p>NPC
NPC
?
#P-hard</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Triads</title>
      <p>NPC
NPC
?
#P-hard</p>
      <sec id="sec-3-1">
        <title>PTIME</title>
      </sec>
      <sec id="sec-3-2">
        <title>PTIME: dominating (A)</title>
      </sec>
      <sec id="sec-3-3">
        <title>NPC: dominated (R, S, T)</title>
      </sec>
      <sec id="sec-3-4">
        <title>PTIME</title>
        <p />
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Triad</title>
      <p>NPC
NPC
NPC



#P-hard
#P-hard</p>
    </sec>
    <sec id="sec-5">
      <title>Active</title>
    </sec>
    <sec id="sec-6">
      <title>Triad</title>
      <p>NPC
NPC
NPC
NPC
NPC


9</p>
    </sec>
    <sec id="sec-7">
      <title>Hierarchical Queries</title>
      <p>1
 2

2-MQP
Queries
 
 
 1  2  3</p>
    </sec>
    <sec id="sec-8">
      <title>RES(Sets):</title>
    </sec>
    <sec id="sec-9">
      <title>RSP(Sets):</title>
    </sec>
    <sec id="sec-10">
      <title>RES(Bags):</title>
    </sec>
    <sec id="sec-11">
      <title>RSP(Bags):</title>
    </sec>
    <sec id="sec-12">
      <title>FACT(Sets):</title>
    </sec>
    <sec id="sec-13">
      <title>PROB(Sets):</title>
      <sec id="sec-13-1">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-2">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-3">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-4">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-5">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-6">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-7">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-8">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-9">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-10">
        <title>PTIME</title>
      </sec>
      <sec id="sec-13-11">
        <title>PTIME #P-hard</title>
        <p>Linear</p>
        <p>Queries
 1
 2
 3
 1
 2</p>
        <p>However, this tuple still has “partial” responsibility.</p>
        <p>By measuring how far we are from a world where
the tuple is counterfactual, we can get a notion of its
responsibility to the output.a Interestingly, due to our
new fine-grained complexity results, we can find the
responsibility of a particular Oscar win in PTIME, but
ifnding the responsibility of a tuple from the ActsIn,
DirectedBy or Spouse Table is NPC.
aThe responsibility is inversely proportional to the minimum
number of tuples to be deleted | | and is given by 1/(1 + | |).</p>
        <p>This notion of causal responsibility is due to
foundational work by Halpern, Pearl, et al. [14] which defined
causal responsibility based on minimal interventions in
the causal graph. Meliou et al. [15] adapted this concept
to define causal responsibility for database queries and
showed a dichotomy under set semantics for self-join
free queries. Our work proves a dichotomy under bag
semantics, as well as more fine-grained result based on
the relation of the tuple we find responsibility of [2].</p>
        <p>Minimal Factorization. Given the provenance
formula for a Boolean query, what is its minimal size
equivalent expression?
Problem 3 (Minimal Factorization). For the same
query △ , how could we representing its output
eficiently? Imagine the output was due to a single Oscar
new. RES denotes resilience, RSP causal responsibility, FACT minimal factorization, while PROB denotes probabilistic query
winning actor 1, who was married once represented
by 1, and acted in two movies 1, 2, while their
spouse also directed these same movies according to
tuples 1, 2. We could then represent the output of this
query with the boolean provenance formula 1111+
1122. However, notice that a factorized
representation 11(11 + 22) is also equivalent (and minimal
in this case). We showed that the minimal factorization
for △ can be found in PTIME [1], however the
complexity of self-join free queries remains open in general.</p>
        <p>Minimal factorization solves the Minimum
Equivalent Expression (MEE) [16, 10] problem for provenance
formulas. Previously work [17] on this problem has
lead to work on factorized databases [18]. However, the
dichotomy for finding the exact minimal factorization
is open. Minimal factorizations of provenance can be
used to obtain probabilistic inference bounds. Prior
approaches for approximate probabilistic inference are
either incomplete i.e. focus on just PTIME cases [11, 19, 20],
or do not solve all PTIME cases exactly [11, 21]. Using
minimal factorization achieves the best of both worlds
i.e. it applies to easy and hard cases while recovering all
known PTIME cases exactly. While we do not yet prove
a dichotomy for minimal factorization, the quest for a
dichotomy has helped prove a large tractable class: queries
with 2 Minimal Query Plans (2-MQP).1 We also place the
set of tractable queries for minimal factorization between
resilience under set semantics and probabilistic inference
in the self-join free case (Fig. 1)
2. Unified ILP-Based Framework
For all three problems, we can construct ILPs whose size
(number of constraints and variables) is polynomial in
1We hypothesize that 2-MQP Queries are a subset of linear queries.
the size of the database instance. While solving ILPs is Result 2: Dichotomies under Bag Semantics.
RealNPC in general, we proved that the PTIME relaxations of world databases consist of bags instead of sets i.e. a
relaour programs are correct for all easy cases, thus creating tion may have multiple copies of the same tuple.
Howa unified framework that solves all known easy cases in ever, the complexity of the many problems under bag
PTIME. We now provide intuition for these ILPs. For semantics is not well understood. We show that both
more details of construction and examples, please see the resilience and responsibility are easy under bag
semanfull papers [1, 2]. tics if and only if the query is linear [2]. Notice that</p>
        <p>Resilience. Intuitively, each tuple in the input is asso- the tractability frontier for bag semantics difers from
ciated with a binary decision variable, which is set to 1 set semantics, where responsibility is a strictly harder
if the tuple should be deleted else set to 0. The objective problem than resilience (Fig. 1). We show that switching
of the ILP is to minimize the number of tuples, which is from set semantics to bag semantics need only change the
the same as minimizing the sum of the variables. For bag objective function of the ILP, and the constraint matrix
semantics, one can simply use a weighted sum, where the remains the same.
weight is the number of copies of a tuple in the database. Result 3: New Tractable Cases. We are able to
We have a constraint for each output row - that it must show that for any query with ≤ 2 minimal query plans,
be deleted. An output row is deleted if any of the tuples minimal factorization can be solved in PTIME as the
recontributing to it are deleted. Thus, for each output row, laxation of the minimal factorization ILP is always
corthe sum of the variables of the tuple in the row must be rect [1]. This large class of queries includes hierarchical
≥ 1 i.e. at least one of the tuples must be deleted. queries as a special case. In addition, a fine-grained
analy</p>
        <p>Causal Responsibility. To make a tuple counterfac- sis of the complexity of causal responsibility based on the
tual, one must delete all output rows it does not con- relation of the tuple we find the responsibility of, reveals
tribute to (which is identical to the resilience problem), additional tractable cases (Fig. 1).
but with the additional constraint that some output row Result 4: Instance-Based Tractability. We prove
must be preserved. Thus, the causal responsibility ILP has cases such that our unified algorithm is guaranteed to
the variables and constraints of the resilience ILP, plus terminate in PTIME, even for hard queries. For example,
additional variables to indicate if an output row is deleted, we show that when the provenance of a query output is
as well as one additional counterfactual constraint that read-once [20], all three problems can be solved in PTIME.
enforces that all output rows cannot be deleted. The interesting aspect is that our unified algorithm does</p>
        <p>Minimal Factorization. The ILP for Minimal Factor- not need to know about these conditions as input, it just
ization is based on the idea that diferent factorizations automatically leverages those during query time.
of a provenance formulas are equivalent to evaluating Result 5: Unified Hardness Criterion. Freire et
diferent output rows with diferent query plans. Thus, al. [12] conjectured that the ability to construct
Indethe ILP assigns a query plan to each output row.2 The pendent Join Paths (IJPs) is a necessary and suficient
objective, equal to the length of the factorized expres- criterion to prove hardness of resilience for a query.
sion, is calculated by summing up projections from the We proved this conjecture for the self-join free case,
diferent query plans a tuple may be used in. and the suficiency criterion for all cases 3, and use it
as a unified way to show hardness for all three
problems. With this unified hardness criterion, we obtain
3. Key Results considerably simplified proofs, and prove hardness for
a query with self-join whose complexity was previously
unknown [2]. We give a Disjunctive Logic Program (DLP)
formulation that can computationally derive such
hardness certificates, and thus has shown several queries
whose complexity was previously unknown to be hard.</p>
        <p>We also show, with the existence of an IJP for query
(), (, , ), (, , ),  (, , ) that the tractable
cases for minimal factorization are a strict subset of the
tractable cases for resilience under set semantics.</p>
        <p>Result 1: ILP Relaxations recover all known PTIME
cases. We prove that for all prior known and newly
found PTIME cases for all three problems, our ILPs are
solved in guaranteed PTIME by standard solvers. For the
problems of resilience [2] and minimal factorization [1],
we use the LP relaxation, i.e. all integrality constraints on
all decision variables are removed. However, for causal
responsibility, we relax all variables except those that
track if an output row is preserved, creating a Mixed
Integer Linear Program (MILP). While MILPs take
exponential time in general, we show that the proposed
MILP relaxation for causal responsibility can be solved
in PTIME if the query is known to be in PTIME [2].
4. Conclusion and Future Work
2We show that considering only the “minimal query plans” sufices
to obtain the minimal factorization.</p>
        <p>This overview gives the intuition for a novel way of
determining the complexity of problems such as resilience,
3With a slight generalization of the definition of IJP [2]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>