
Accessing Answers to Conjunctive Queries with Ideal Time Guarantees

Nofar Carmeli

Inria, LIRMM, University of Montpellier, CNRS, France

2023


When can we answer conjunctive queries with ideal time guarantees? We will start this talk by examining different kinds of query-answering tasks and the connections between them. These tasks include enumerating all answers, sampling answers without repetitions, and simulating a sorted array of the answers. From a data complexity point of view, the ideal time guarantees for these tasks are constant time per answer following a linear preprocessing (required to read the database input). Our goal is to avoid the polynomial preprocessing required to produce all answers. We will then have an example-based discussion of the complexity landscape for these tasks. In particular, we will see how self-joins, constraints and unions can play a crucial role in determining the complexity and designing efficient algorithms.

Keywords: conjunctive queries, fine-grained complexity, constant-delay enumeration

Q2(actor, year) ← Cast(movie, actor), Release(movie, year)

Q3(p1, p2, room, grade) ← Seating(p1, room), Seating(p2, room), Grade(p1, grade), Grade(p2, grade)

Q4(post, p1, p2) ← Posts(post, p1), Follows(p1, p2)

Q5(post, p1, p2) ← Posts(post, p0), Follows(p0, p1), Friends(p1, p2)

[...] require that the free variables are connected in a non-standard way. The hardness side of this dichotomy assumes the hardness of Boolean matrix multiplication and hyperclique detection in hypergraphs. For more details about this dichotomy, see [3]. Consider for example the first two queries defined in Figure 1. These two queries are classified as hard according to Theorem 1, as Q1 is cyclic and Q2 is acyclic but not free-connex. If the variable year were removed from the head of Q2, this query would become acyclic free-connex and be classified as easy.
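The classification of the example queries can be checked mechanically. The following is a minimal sketch, not taken from the talk: it assumes the standard GYO reduction as the acyclicity test and the common characterization of free-connex queries (the query is acyclic, and it remains acyclic after adding one extra hyperedge that contains exactly the free variables). The function names `is_acyclic` and `is_free_connex` are hypothetical.

```python
def is_acyclic(hyperedges):
    """GYO reduction: the hypergraph is acyclic iff it reduces to nothing."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Rule 1: drop every variable that occurs in exactly one hyperedge.
        for e in edges:
            for v in list(e):
                if sum(v in f for f in edges) == 1:
                    e.discard(v)
                    changed = True
        # Rule 2: drop one hyperedge that is contained in another hyperedge.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    # Acyclic iff nothing but (at most one) empty hyperedge is left.
    return all(not e for e in edges)


def is_free_connex(head, atoms):
    """Acyclic, and still acyclic with the head added as a hyperedge."""
    edges = [set(a) for a in atoms]
    return is_acyclic(edges) and is_acyclic(edges + [set(head)])


# Q2(actor, year) <- Cast(movie, actor), Release(movie, year)
q2_atoms = [("movie", "actor"), ("movie", "year")]
print(is_acyclic(q2_atoms))                         # True
print(is_free_connex(("actor", "year"), q2_atoms))  # False
print(is_free_connex(("actor",), q2_atoms))         # True

# Q3: the atoms form a cycle p1 - room - p2 - grade - p1.
q3_atoms = [("p1", "room"), ("p2", "room"), ("p1", "grade"), ("p2", "grade")]
print(is_acyclic(q3_atoms))                         # False
```

The sketch only looks at the variable sets of the atoms, so it ignores relation names and cannot take self-joins or constraints into account.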

2. Additional Tractable Cases

Can we find CQs that can be answered with ideal guarantees even though they are not free-connex acyclic? At first glance, we might think that the answer is 'no' due to the negative side of Theorem 1. However, we will see that such cases exist.

As a first example, consider Q3. This query is cyclic, but it can be answered with linear preprocessing and constant delay due to the self-joins it contains. Theorem 1 only applies to CQs in which every atom refers to a different relation. If this is not the case, we say that the query contains self-joins, and we can sometimes use these to design more efficient algorithms [4]. The complete classification of CQs with self-joins is not yet known.

Next, let us consider Q2 again. This query asks for a list of actors and the years in which they were active. If we just consider the structure of the query and assume the input database can be general, this query is classified as hard according to Theorem 1 as it is not free-connex. However, considering the semantics of this query, we can assume that every movie has one release year. This information can be modeled using the functional dependency Release : movie → year. If we can assume that the input database conforms to this constraint, the problem becomes easier, and it can now be solved with ideal guarantees [5]. The complete classification of CQs with general functional dependencies is not yet known. We do have a classification for unary functional dependencies, where one variable implies another, for CQs without self-joins. This classification is done by extending the query according to the dependencies (in the case of this example, adding the variable year in all places that contain the variable movie), and checking whether the extended query is acyclic free-connex.
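The extension step can be illustrated on the example. This is a simplified sketch under stated assumptions: a unary dependency is represented as a pair of variable names `(x, y)` meaning x → y, and the hypothetical helper `fd_extend` adds y to the head and to every atom that contains x, repeating until nothing changes. The formal definition in [5] is stated over relation positions and is more careful than this.

```python
def fd_extend(head, atoms, fds):
    """Extend the head and every atom according to unary dependencies.

    head:  tuple of free variables
    atoms: list of (relation_name, variable_tuple)
    fds:   list of (x, y) pairs, each read as "x determines y"
    """
    def close(variables):
        out = list(variables)
        changed = True
        while changed:  # repeat so that chains x -> y -> z are followed
            changed = False
            for x, y in fds:
                if x in out and y not in out:
                    out.append(y)
                    changed = True
        return tuple(out)

    return close(head), [(rel, close(vs)) for rel, vs in atoms]


# Q2(actor, year) <- Cast(movie, actor), Release(movie, year)
head = ("actor", "year")
atoms = [("Cast", ("movie", "actor")), ("Release", ("movie", "year"))]

new_head, new_atoms = fd_extend(head, atoms, [("movie", "year")])
print(new_head)   # ('actor', 'year')
print(new_atoms)  # [('Cast', ('movie', 'actor', 'year')), ('Release', ('movie', 'year'))]
```

In the extended query, the atom Cast contains both free variables, so the head is covered by a single atom and the extended query is acyclic free-connex, in line with the text.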

Finally, assume the CQ is used as part of a larger query, and consider the union of CQs Q4 ∪ Q5. Even though Q5 is not free-connex, the entire union can be answered with ideal guarantees since we can intertwine the computation of Q5 with that of Q4. In fact, even a union that contains only non-free-connex CQs can become tractable in a similar way. The complete classification of unions of CQs is not yet known, but we have a classification that applies to some subclasses [6, 7].

3. Doing Even Better in the Tractable Cases

Sometimes the final goal of the user is not just to produce all answers to a CQ. They may be interested in different query-answering tasks, such as computing the best answer or a median answer according to some ranking function. In other cases, the user may want to compute some statistics over the answers and be willing to compromise on the accuracy of these statistics in favor of speed. In such a case, a random sample of the answers can be useful. Figure 2 shows a summary of some query-answering tasks and indicates when an efficient algorithm for one task implies an efficient algorithm for another.

The previous section discusses enumeration without order requirements. Clearly, ranked enumeration and random-ordered enumeration are stricter requirements. However, ranked enumeration allows efficiently computing the best answer or, in general, the top k answers for any k. Random-ordered enumeration is, in other words, sampling without repetitions, and so it is stronger than the task of sampling with repetitions. Direct access is the task of simulating an array containing the query answers in a way that allows random access to answers at arbitrary indices. It also requires outputting an error message if the requested index is too large, and this can be used together with binary search to determine the number of answers (a task called counting) with only a logarithmic number of access calls. Direct access can also be used to achieve random-ordered enumeration by generating a random permutation of the indices and accessing the corresponding answers on the fly [8]. If the simulated array is required to be ordered, we call this task ranked access, and it can be used to directly access the middle answer for computing the median or to similarly compute any other quantile.
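The two reductions from direct access can be sketched as follows. This is an illustration under assumptions rather than the construction of [8]: the direct-access routine is modeled as a hypothetical function `access(i)` that raises `IndexError` when the index is too large, counting uses a doubling phase to find an upper bound before the binary search, and the random permutation is produced lazily with a Fisher-Yates shuffle that stores only the positions it has touched.

```python
import random


def count_answers(access):
    """Count the answers with O(log n) calls to access(i)."""
    def exists(i):
        try:
            access(i)
            return True
        except IndexError:
            return False

    if not exists(0):
        return 0
    # Doubling phase: find lo, hi with exists(lo) true and exists(hi) false.
    lo, hi = 0, 1
    while exists(hi):
        lo, hi = hi, hi * 2
    # Binary search for the first index that is out of range.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    return hi


def random_order(access, n, rng=random):
    """Yield every answer exactly once, in uniformly random order."""
    moved = {}  # position -> index currently stored there (identity if absent)
    for i in range(n):
        j = rng.randrange(i, n)   # pick a position among those not yet output
        at_i = moved.get(i, i)
        at_j = moved.get(j, j)
        moved[j] = at_i           # swap: position j now holds what was at i
        moved.pop(i, None)        # position i is never read again
        yield access(at_j)


# A plain list stands in for the simulated array of query answers.
answers = ["a", "b", "c", "d", "e"]

def access(i):
    return answers[i]

n = count_answers(access)
print(n)                                                   # 5
print(sorted(random_order(access, n)) == sorted(answers))  # True
```

With ranked access, the same interface gives quantiles directly: once n is known, the median is the answer at index n // 2 of the ordered array.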

Assume we are interested in ordering the query answers in a lexicographic order. That is, we order the answers by the assignment to one variable, then break ties according to another variable, and so on. The free-connex acyclic CQs, which we know have extremely efficient enumeration algorithms, also have efficient algorithms for quantile computation, direct access and ranked enumeration [9, 10], with linear preprocessing and logarithmic time per outputted answer. Regarding ranked enumeration, this depends on the specific lexicographic order [9].

[1] Brault-Baron, De la pertinence de l'énumération: complexité en logiques propositionnelle et du premier ordre, Ph.D. thesis, Université de Caen, 2013.

[2] Bagan, Durand, E. Grandjean, On acyclic conjunctive queries and constant delay enumeration, in: International Workshop on Computer Science Logic, Springer, 2007, pp. 208-222.

[3] Berkholz, Gerhardt, Schweikardt, Constant delay enumeration for conjunctive queries: a tutorial, ACM SIGLOG News 7 (2020) 4-33.

[4] Carmeli, L. Segoufin, Conjunctive queries with self-joins, towards a fine-grained complexity analysis, in: PODS'23, 2023.

[5] Carmeli, Kröll, Enumeration complexity of conjunctive queries with functional dependencies, Theory of Computing Systems 64 (2020) 828-860.

[6] Carmeli, Kröll, On the enumeration complexity of unions of conjunctive queries, ACM Transactions on Database Systems (TODS) 46 (2021) 1-41.

[7] Bringmann, Carmeli, Unbalanced triangle detection and enumeration hardness for unions of conjunctive queries, arXiv preprint arXiv:2210.11996 (2022).

[8] Carmeli, Zeevi, Berkholz, Conte, Kimelfeld, Schweikardt, Answering (unions of) conjunctive queries using random access and random-order enumeration, ACM Transactions on Database Systems (TODS) 47 (2022) 1-49.

[9] Carmeli, Tziavelis, Gatterbauer, Kimelfeld, Riedewald, Tractable orders for direct access to ranked answers of conjunctive queries, ACM Transactions on Database Systems 48 (2023) 1-45.

[10] Tziavelis, Gatterbauer, Riedewald, Beyond equi-joins: Ranking, enumeration and factorization, in: Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 14, NIH Public Access, 2021, p. 2599.