=Paper= {{Paper |id=Vol-3409/paper2 |storemode=property |title=Diversity of Answers to Conjunctive Queries (short paper) |pdfUrl=https://ceur-ws.org/Vol-3409/paper2.pdf |volume=Vol-3409 |authors=Timo Merkl,Reinhard Pichler,Sebastian Skritek |dblpUrl=https://dblp.org/rec/conf/amw/MerklPS23 }} ==Diversity of Answers to Conjunctive Queries (short paper)== https://ceur-ws.org/Vol-3409/paper2.pdf
Diversity of Answers to Conjunctive Queries
(Extended Abstract)
Timo Camillo Merkl1 , Reinhard Pichler1 and Sebastian Skritek1
1
    TU Wien, Austria


                                         Abstract
                                         Enumeration problems aim at outputting, without repetition, the set of solutions to a given problem
                                         instance. However, outputting the entire solution set may be prohibitively expensive if it is too big. In
                                         this case, outputting a small, sufficiently diverse subset of the solutions would be preferable. This leads
                                         to the Diverse-version of the original enumeration problem, where the goal is to achieve a certain level d
                                         of diversity by selecting k solutions.
                                             In this paper, we look at the Diverse-version of the query answering problem for Conjunctive Queries
                                         and extensions thereof. That is, we study the problem if it is possible to achieve a certain level d of
                                         diversity by selecting k answers to the given query and, in the positive case, to actually compute such k
                                         answers.

                                         Keywords
                                         Query Answering, Diversity of Solutions, Complexity, Algorithms




1. Introduction
Answering database queries is one of the most fundamental tasks in computer science and
has thus been the topic of countless theoretical analyses from numerous different perspectives.
Classically, the focus of complexity studies by the database theory community has been on
decision problems such as deciding if the query result is non-empty or if a given tuple is
contained in the query result. In recent times, the enumeration problem (i.e., outputting all
answers to a query) has played a prominent role on the research agenda, see e.g., [1, 2, 3].
   It is well known that even seemingly simple problems, such as answering an acyclic Conjunc-
tive Query, can have a huge number of solutions. Consequently, specific notions of tractability
were introduced right from the beginning of research on enumeration problems [4] to separate
the computational intricacy of a problem from the mere size of the solution space. However,
even with these refined notions of tractability, the usefulness of flooding the user with tons of
solutions (many of them possibly differing only minimally) may be questionable. If the solution
space gets too big, it would be more useful to provide an overview by outputting a โ€œmeaningfulโ€
subset of the solutions.
   For instance, consider a variation of the car dealership example from [5]. Suppose that ๐ผ
models the preferences of a customer and ๐’ฎ (๐ผ ) are all cars that match these restrictions. Now,
AMW 2023: 15th Alberto Mendelzon International Workshop on Foundations of Data Management
Envelope-Open timo.merkl@tuwien.ac.at (T. C. Merkl); reinhard.pichler@tuwien.ac.at (R. Pichler);
sebastian.skritek@tuwien.ac.at (S. Skritek)
Orcid 0009-0003-7206-2518 (T. C. Merkl); 0000-0002-1760-122X (R. Pichler); 0000-0003-3054-7683 (S. Skritek)
                                       ยฉ 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
in a large dealership, presenting all cars in ๐’ฎ (๐ผ ) to the customer would be infeasible. Instead, it
would be better to go through a rather small list of cars that are significantly different from
each other. With this, the customer can point at those cars which the further discussion with
the clerk should concentrate on.
   This example suggests that we should focus on finding a small diverse set of answers to
database queries. Due to the inherent hardness of computing maximally diverse solutions [5],
the database community โ€“ apart from limited exceptions [6] โ€“ mostly focused on heuristic and
approximation methods to find diverse answers (see [7] for an extensive survey). The present
work takes a step forward to fill this void, broadening our understanding of the theoretical
boundaries of diverse query answering while developing complementary exact algorithms.
More specifically, we analyze diversity problems related to answering Conjunctive Queries
(CQs) and extensions thereof.
Problem statement. To formalize the problems, we roughly follow the approach from [8] by
defining the diversity of a set of answers (tuples) to be their aggregated pairwise distances.
Furthermore, in the spirit of classical relational database theory, we consider the data untyped
and use the Hamming distance ฮ”(๐›พ , ๐›พ โ€ฒ ) as the measure of distance between two answer tuples
๐›พ , ๐›พ โ€ฒ . As far as the choice of an aggregator ๐‘“ is concerned, we impose the general restriction
that it must be computable in polynomial time. Formally, for a class ๐’ฌ of queries we study the
following problem Diverse-๐’ฌ:

                                              Diverse-๐’ฌ
        Instance:      A database instance ๐ผ, query ๐‘„ โˆˆ ๐’ฌ, and integers ๐‘˜ and ๐‘‘.
        Question:      Do there exist pairwise distinct answers ๐›พ1 , โ€ฆ , ๐›พ๐‘˜ โˆˆ ๐‘„(๐ผ ) (diversity
                       set) such that ๐‘“ ((ฮ”(๐›พ๐‘– , ๐›พ๐‘— ))1โ‰ค๐‘–<๐‘—โ‰ค๐‘˜ ) โ‰ฅ ๐‘‘?
   That is, we ask if a certain level ๐‘‘ of diversity can be achieved by choosing ๐‘˜ pairwise distinct
answers to a given query ๐‘„ over the database instance ๐ผ. Since we are intuitively looking for small
diversity sets, we usually consider ๐‘˜ to be a problem parameter in the sense of parameterized
complexity.
   For proving upper bounds on the complexity in terms of membership, we aim for the most
general setting, i.e., the most general class of aggregators ๐‘“ and the biggest query class ๐’ฌ. On
the other hand, for lower bounds in terms of hardness proofs, we aim for the most restrictive,
i.e., a concrete (but natural) aggregator ๐‘“ and the smallest query class ๐’ฌ. In particular, the
natural aggregators minimum and sum will be used to prove lower bounds while, for the
upper bound, we either impose no further restriction on ๐‘“ or require them to be monotone
in addition, i.e., ๐‘“ ((๐‘‘๐‘–,๐‘— )1โ‰ค๐‘–<๐‘—โ‰ค๐‘˜ ) โ‰ค ๐‘“ ((๐‘‘๐‘–,๐‘—โ€ฒ )1โ‰ค๐‘–<๐‘—โ‰ค๐‘˜ ) whenever ๐‘‘๐‘–,๐‘— โ‰ค ๐‘‘๐‘–,๐‘—โ€ฒ holds for every 1 โ‰ค ๐‘– <
๐‘— โ‰ค ๐‘˜. The corresponding diversity problems are denoted by Diversemin -๐’ฌ, Diversesum -๐’ฌ, and
Diversemon -๐’ฌ, respectively.
   As for the query classes ๐’ฌ, we start our analysis with the class CQ of Conjunctive Queries and
then extend our studies to the classes UCQ and CQยฌ of unions of CQs and CQs with negation.
As the most general query class, we will also look at the class FO of all first-order queries. Recall
that, even the question if an answer tuple exists at all is NP-complete for CQs [9]. We therefore
mostly restrict our study to CQs with bounded generalized hypertree width (ghw). For CQsยฌ ,
query answering remains NP-complete even if we only consider queries with ghw โ‰ค 1 [10].
Hence, for CQยฌ we impose bounded treewidth (tw), a stricter structural restriction. Lastly, for
arbitrary FO queries, we assume them to be fixed or, equivalently, bounded in size.


2. Preliminaries
Queries. Formally, we can consider our various query classes as fragments of first-order logic
with the corresponding database being a matching finite first-order structure and answers being
assignments on the free variables. CQs are formulae solely using the connectives {โˆƒ, โˆง}, while
UCQs additionally allow {โˆจ} as a top-level connective and CQsยฌ additionally allow negated
atoms.
Acyclicity and widths. (Generalized hyper-)tree decompositions and their corresponding widths
are staple tools to express the acyclicity of CQs (with negation) and define large tractable
query classes. A tree decompositions of a CQยฌ ๐‘„ is a tuple โŸจ๐‘‡ , ๐œ’ โŸฉ such that ๐‘‡ is a tree and
๐œ’ โˆถ ๐‘‰ (๐‘‡ ) โ†’ 2๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘๐‘™๐‘’๐‘ (๐‘„) . Furthermore, for each atom ๐ด, all its variables have to appear in some
๐œ’ (๐‘ฃ) together, and for each variable ๐‘ฅ, the vertices ๐‘ฃ where ๐‘ฅ appears in ๐œ’ (๐‘ฃ) have to induce a
connected subtree in ๐‘‡. A generalized hypertree decomposition extends a tree decomposition by
a labeling ๐œ† โˆถ ๐‘‰ (๐‘‡ ) โ†’ 2๐‘Ž๐‘ก๐‘œ๐‘š๐‘ (๐‘„) such that, for every node ๐‘ฃ in ๐‘‡, every variable of ๐œ’ (๐‘ฃ) appears
in some atom of ๐œ†(๐‘ฃ). Then the treewidth of ๐‘„ and the generalized hypertree width of ๐‘„ are
defined as follows:

                tw(๐‘„) = min max |๐œ’ (๐‘ฃ)| โˆ’ 1,            ghw(๐‘„) = min max |๐œ†(๐‘ฃ)|.
                          โŸจ๐‘‡ ,๐œ’ โŸฉ   ๐‘ฃ                               โŸจ๐‘‡ ,๐œ’ ,๐œ†โŸฉ   ๐‘ฃ

CQs ๐‘„ are called acyclic (ACQs) if ghw(๐‘„) = 1 and UCQs ๐‘„ = โˆจ๐‘– ๐‘„๐‘– are called acyclic (UACQs) if
all conjunctions ๐‘„๐‘– are acyclic.
Parameterized Complexity. An instance of a parameterized problem is given as a pair (๐‘ฅ, ๐‘˜), where
๐‘ฅ is the actual problem instance and ๐‘˜ is a parameter. A problem is in FPT (fixed-parameter
tractable), if it can be solved in time ๐’ช(๐‘“ (๐‘˜) โ‹… |๐‘ฅ|๐‘ ) (for some computable function ๐‘“ and constant ๐‘)
and in XP, if it can be solved in time ๐’ช(|๐‘ฅ|๐‘“ (๐‘˜) ). Furthermore, there is a class W[1] in between
FPT and XP, i.e., FPT โІ W[1] โІ XP. It is a generally accepted assumption in parameterized
complexity theory that FPT โ‰  W[1] holds โ€“ similar but slightly stronger than the famous
P โ‰  NP assumption in classical complexity theory.


3. Main Results
The main novel lower bounds of our work are the following hardness results:

Theorem 1. On queries of bounded tw,                the problems Diversesum -CQ and
Diversemin -CQ are ๐˜ž[1]-hard, and on queries of bounded size, the same problems remain ๐˜• ๐˜—-
hard in the unparameterized case.

  The theorem follows by reduction from the (parameterized) Independent Set problem. In the
case where arbitrarily large, albeit almost acyclic, queries are allowed, the reduction preserves
the parameter and implies W[1]-hardness. On the other hand, getting by with bounded size
queries requires a more involved reduction, which then no longer preserves the parameter and
thus, only implies unparameterized NP-hardness.
  Somewhat surprisingly, dealing with UCQs is significantly harder than dealing with CQs.

Theorem 2. On UACQs and when the size of the sought-after diversity set ๐‘˜ is at most two, the
problems Diversesum -UCQ and Diversemin -UCQ remain ๐˜• ๐˜—-hard.

   Intuitively, this comes down to the fact that even if all conjuncts of a query are acyclic, the
interconnectedness of two or more conjuncts can be arbitrarily cyclic. Furthermore, when
searching for diverse answers to UCQs, it is not sufficient to solve the conjuncts independently.
   Complementing the W[1]-hardness result, we establish the following memberships:

Theorem 3. On queries of bounded ghw, the problem Diverse-CQ, and on queries of bounded tw,
the problem Diverse-CQยฌ is in ๐˜Ÿ ๐˜—.

   This runtime can be achieved by dynamic programming algorithms which maintain data
structures storing all possible ๐‘˜-tuples of partial solutions together with all possible pairwise
Hamming distances. This data structure is then propagated along the (generalized hyper-)tree
decompositions. In the end, a witnessing diversity set can be extracted from the root. Crucially,
the size of the data structure is exponential only in ๐‘˜ (and the (generalized hyper-)tree width).
   For queries of bounded size, an even faster algorithm is developed.

Theorem 4. On queries of bounded size, the problem Diversemon -๐˜ ๐˜– is in ๐˜ ๐˜— ๐˜›.

   Such a runtime is achieved by a kernelization algorithm. Essentially, first, all (polynomially
many) query answers are computed. Then, the answers are grouped by all possible subsets of
the columns, and whenever a group exceeds a certain threshold, insufficiently diverse members
of the group are discarded.
   Table 1 summarizes the discussed results.
                         Query class        Lower Bound     Upper Bound
                       bounded ghw CQ          W[1]             XP
                       bounded tw CQยฌ          W[1]             XP
                           UACQ             NP* for ๐‘˜ = 2
                       bounded size FO          NP*              FPT
Table 1
Classification of Diverse-๐’ฌ. * unparameterized.




4. Conclusion and Future Work
In this work, we took a fresh look at the Diversity problem of query answering. Looking at
different classes of queries from CQs to FO queries, we provide a fairly complete picture of
complexity results. Nevertheless, numerous directions for future work exist, including:
Restricting the database. We primarily focused on restrictions on the queries. However, restrict-
ing the database is also a viable option with far-reaching implications in related areas. Our full
paper contains some preliminary results [11].
Further parameterization. Exploring further parameterizations besides the size of the diversity
set seems interesting. For instance, choosing the diversity threshold ๐‘‘ lead to intriguing results
in [12].
Alternative distance measure. The Hamming distance might not a suitable distance measure
in all situation. It might thus be interesting to consider replacing it by more specific distance
measures derived from domain specific metrics.


Acknowledgments
This work has been funded by the Vienna Science and Technology Fund (WWTF) [10.47379/ICT2201,
10.47379/VRG18013, 10.47379/NXT22018]; and the Christian Doppler Research Association
(CDG) JRC LIVE.


References
 [1] A. Amarilli, L. Jachiet, M. Muรฑoz, C. Riveros, Efficient enumeration for annotated gram-
     mars, in: PODS, ACM, 2022, pp. 291โ€“300.
 [2] Y. Kobayashi, K. Kurita, K. Wasa, Linear-delay enumeration for minimal steiner problems,
     in: PODS, ACM, 2022, pp. 301โ€“313.
 [3] C. Lutz, M. Przybylko, Efficiently enumerating answers to ontology-mediated queries, in:
     PODS, ACM, 2022, pp. 277โ€“289.
 [4] D. S. Johnson, C. H. Papadimitriou, M. Yannakakis, On generating all maximal independent
     sets, Inf. Process. Lett. 27 (1988) 119โ€“123.
 [5] E. Hebrard, B. Hnich, B. Oโ€™Sullivan, T. Walsh, Finding diverse and similar solutions in
     constraint programming, in: AAAI, AAAI Press / The MIT Press, 2005, pp. 372โ€“377.
 [6] T. Deng, W. Fan, On the complexity of query result diversification, ACM Trans. Database
     Syst. 39 (2014) 15:1โ€“15:46.
 [7] K. Zheng, H. Wang, Z. Qi, J. Li, H. Gao, A survey of query result diversification, Knowl.
     Inf. Syst. 51 (2017) 1โ€“36.
 [8] L. Ingmar, M. G. de la Banda, P. J. Stuckey, G. Tack, Modelling diversity of solutions, in:
     AAAI, AAAI Press, 2020, pp. 1528โ€“1535.
 [9] A. K. Chandra, P. M. Merlin, Optimal implementation of conjunctive queries in relational
     data bases, in: STOC, ACM, 1977, pp. 77โ€“90.
[10] M. Samer, S. Szeider, Algorithms for propositional model counting, J. Discrete Algorithms
     8 (2010) 50โ€“64.
[11] T. C. Merkl, R. Pichler, S. Skritek, Diversity of answers to conjunctive queries, in: ICDT,
     volume 255 of LIPIcs, 2023, pp. 10:1โ€“10:19.
[12] F. V. Fomin, P. A. Golovach, L. Jaffke, G. Philip, D. Sagunov, Diverse pairs of matchings, in:
     ISAAC, 2020, pp. 26:1โ€“26:12.