-

1613-0073

Guarded Aggregate Queries

Matthias Lanzinger

matthias.lanzinger@tuwien.ac.at 0

Reinhard Pichler

reinhard.pichler@tuwien.ac.at 0

Alexander Selzer

alexander.selzer@tuwien.ac.at 0 0 TU Wien , Vienna , Austria

Workshop Proceedings Despite significant progress in query optimization techniques over the past decades and the advent of powerful computer infrastructure (including cluster environments), DBMSs still struggle with the potential explosion of intermediate results, even if the final output is small. This is particularly glaring in the context of analytical queries which often combine data from many tables to ultimately only produce comparatively tiny aggregate results.

CEUR ceur-ws.org

1. Introduction

guarded acyclic aggregate queries, Yannakakis-style query execution can actually be naturally integrated into standard SQL execution engines. As a result, intermediate materialisation of join results can often be drastically reduced or even avoided entirely when executing aggregate queries in relational DBMSs. We have implemented our approach in Spark SQL and tested it on various standard benchmarks. The experimental results reported in Section 3 show significant improvements can be achieved for queries that fall into our targeted class. Full details of this work are provided in [ 8 ].

2. Guarded Aggregate Queries

Recently, in [ 9 ] and [ 10 ], a particularly favourable class of ACQs with aggregates has been presented: the class of 0MA (short for “zero-materialisation answerable”) queries. These are acyclic queries that can be evaluated by executing only the first bottom-up traversal of Yannakakis’ algorithm. That is, we only need to perform the comparatively cheap semi-joins and can completely skip the typically significantly more expensive join phase. A query of the form Q = 1,…, , 1( 1),…, ( )( ) with = ( 1 ⋈ ⋯ ⋈ )) is 0MA if it satisfies the following conditions: • Guardedness, meaning that all grouping attributes 1, … , and all attributes 1, … , occurring in the aggregate expressions 1( 1), … , ( ) occur in some relation . • Set-safety of the aggregate functions 1 … , , meaning that duplicate elimination applied to the inner expression ( 1 ⋈ ⋯ ⋈ ) does not alter the result of the grouping and aggregate expression 1,…, , 1( 1),…, ( ).

It is known (see, [ 9 ]) that for queries in the class 0MA, materialisation of join results can be completely avoided. However, it is limited to a subset of the available SQL aggregate functions. Heavily used analytical functions such as COUNT without DISTINCT as well as SUM, AVG, or MEDIAN depend on full joins. In [ 4 ], it was shown how Yannakakis’ algorithm can be extended to ACQs with a COUNT(*) aggregate on top. We adapt this approach to integrate it into the logical QEP of relational DBMSs and we further extend it to related aggregates such as SUM, AVG, and MEDIAN. To this end, we keep the requirements of acyclicity and guardedness, while the set-safety requirement is now dropped. Our technique also works for COUNT(*), which we can view as counting the number of non-null entries grouped by the empty set of attributes; hence, it is trivially guarded. As in the 0MA-case, we start by checking acyclicity and, simultaneously, computing a join tree of the join query . If is acyclic and guarded, we take the node labelled by the “guard” as the root of the join tree. Of course, in case of COUNT(*), we may choose any node as the root of the join tree.

The key idea is to propagate frequencies up the join tree rather than duplicating tuples (cf. [ 4 ]). This propagation is realised by recursively constructing extended Relational Algebra expressions Freq() for every node of the join tree as follows: Every relation of the join query is extended by an additional attribute where we store frequency information for each tuple. We write to denote this additional attribute for the relation at node and we write ̄ for the remaining attributes of that relation. If is a leaf node of the join tree labelled by relation , then we initialise the attribute to 1. Formally, we thus have Freq() = × {(1)} . . Then we define Freq0() := × {(1)} Freq () := ̄ , Freq() := ←SUM(

−1 ⋅ ← (Freq ())

Now consider an internal node of the join tree with child nodes 1, … , . Again, we assume that is labelled by some relation

with attributes ̄ and we write for the additional attribute used for keeping track of frequencies. The extended Relational Algebra expression Freq() is constructed iteratively by defining subexpressions Freq () with ∈ {0, … , } . To avoid confusion, we refer to the frequency attribute of such a subexpression Freq () as each relation Freq () consists of the same attributes ̄ plus the additional frequency attribute . That is, Freq () for every ∈ {0, … , } and, ultimately, Freq() as follows:

)(Freq−1 () ⋈ Freq( )) Intuitively, after initialising

0 to 1 in Freq0() , the frequency values 1, … , are obtained by product, i.e.,

= = Π=1 . grouping over the attributes ̄ of and computing the number of possible extensions of each tuple ̄ in to the relations labelling the nodes in the subtrees rooted at 1, … , . By the connectedness condition of join trees, these extensions are independent of each other, i.e., they share no attributes outside ̄ . Moreover, the frequency attributes 1, … , are functionally dependent on the attributes ̄ . Hence, by distributivity, the value of obtained by iterated summation and multiplication for given tuple ̄ of is equal to computing, for every ∈ {1, … , } the sum of the frequencies of all join partners of ̄ in Freq( ) and then computing their

It remains to modify the aggregate functions COUNT(∗), COUNT() , SUM() , and AVG() so that they can operate on tuples with frequencies. For instance, let denote the frequency attribute in Freq( ) and let be an attribute of the relation labelling the root node of the join tree. Then, in SQL-notation, we can rewrite common aggregate expressions as follows. Note that the PERCENTILE function is not part of the ANSI SQL standard, but it is provided in Spark SQL, which we used for our prototype implementation.

• COUNT(∗) → SUM( ) • COUNT() → • SUM() → • AVG() → • MEDIAN() →

SUM(IF(ISNULL(), 0, )) SUM( ⋅ ) SUM( ⋅ )/COUNT()

PERCENTILE(0.5, , )

3. Implementation and Results

The approach outlined in Section 2 can be fully realised by rewriting logical query plans, even in systems that only execute query plans based on classical two-way join trees. We have implemented these optimisations for guarded aggregate queries in the form of standard logical optimisation rules in Spark SQL. Recall that the Freq operator allowed us to drastically reduce the number of joins needed for aggregate evaluation. In our Spark SQL implementation, we have managed to completely eliminate the need for joins in case of guarded aggregation by introducing a new physical operator FreqJoin. Intuitively, it does the work of Freq in the style of semi-joins by first (say in relation ) summing up the frequencies of all tuples in with the same join partner ( , ) in the other relation (say ) and then multiplying the frequency of with this sum.

Query path-03 path-04 path-05 path-06 path-07 path-08 tree-01 tree-02 tree-03

Query STATS-CEB e2e

JOB Q 2 JOB Q 3 JOB Q 5 JOB Q 17

JOB Q 20 TPC-H Q 2 SF200 TPC-H Q 11 SF200 TPC-H V.1 SF200 LSQB Q 1 SF300 LSQB Q 4 SF300

Ref 1558±7.3 5.6±1.21 5.2±0.32 1.23±1.22 118.5±32 23.98±1.60 179.4±6.5 361.0±13.3 168.4±4.4 3096±232 602±37

We evaluate the performance of the optimisation over 5 benchmark databases / datasets with diferent characteristics: Join Order Benchmark (JOB) [ 11 ], STATS / STATS-CEB [12], Large-Scale Subgraph Query Benchmark (LSQB) [13], SNAP (Stanford Network Analysis Project [14]), and TPC-H [15]. For further details on the experiments, see [ 8 ].

The overall performance of our proposed optimisations on the applicable queries is summarised in Table 1b. The numbers reported there are mean times over 5 runs of the same query with standard deviations given after ±. Our experiments on the SNAP graphs specifically are summarised in Table 1a. The fastest execution time achieved for each case is printed in boldface. In both tables, we refer to the reference performance of Spark SQL without any alterations as Ref. The results obtained by applications of the logical QEP optimisations are referred to as Opt+. The speed-up achieved by Opt+ over Ref is explicitly stated in Table 1b in the column Opt+ Speedup. Since our optimisations apply to all 146 queries of STATS-CEB, we report the end-to-end time of executing all queries in the benchmark.

Avoiding Materialisation. To validate that the significant speedups are a result of decreased materialisation we perform additional experiments for the STATS-CEB queries where we record the peak number of materialised tuples during query execution. We see that the reduction in materialisation is immense, especially in the queries that are most challenging for original SparkSQL. In particular, in the 20 hardest queries (by runtime) for SparkSQL, we observe a reduction of peak materialised tuples by at least 3 orders of magnitude when using Opt+.

4. Conclusion

We propose the integration of optimisation techniques for certain kinds of aggregate queries into relational DBMSs to eliminate the materialisation of intermediate join results. We have shown how queries with COUNT and related aggregates (such as SUM, AVG, and MEDIAN) can be evaluated more eficiently while operating fully within the standard two-way join plan paradigm of query evaluation engines. We have implemented these optimisations into Spark SQL and our experimental evaluation confirms that these techniques can provide significant performance improvements and avoid large amounts of unnecessary materialisation in a variety of settings.

Acknowledgments

This work has been supported by the Vienna Science and Technology Fund (WWTF) [10.47379/ICT2201, 10.47379/VRG18013, 10.47379/NXT22018]; and the Christian Doppler Research Association (CDG) JRC LIVE. We also acknowledge the assistance of the TU.it dataLAB Big Data team at TU-Wien. [12] Y. Han, Z. Wu, P. Wu, R. Zhu, J. Yang, L. W. Tan, K. Zeng, G. Cong, Y. Qin, A. Pfadler, Z. Qian, J. Zhou, J. Li, B. Cui, Cardinality estimation in DBMS: A comprehensive benchmark evaluation, Proc. VLDB Endow. 15 (2021) 752–765. URL: https://www.vldb.org/pvldb/ vol15/p752-zhu.pdf. doi:10.14778/3503585.3503586. [13] A. Mhedhbi, M. Lissandrini, L. Kuiper, J. Waudby, G. Szárnyas, LSQB: a large-scale subgraph query benchmark, in: Proceedings GRADES, ACM, 2021, pp. 8:1–8:11. URL: https://doi.org/10.1145/3461837.3464516. doi:10.1145/3461837.3464516. [14] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, http: //snap.stanford.edu/data, 2014. [15] TPC-H, Tpc-h benchmark, https://www.tpc.org/tpch/, 2022.

[1]

Atserias ,

Grohe ,

Marx , Size bounds and query plans for relational joins , SIAM J. Comput . 42 ( 2013 ) 1737 - 1767 . URL: https://doi.org/10.1137/110859440. doi: 10 .1137/ 110859440.

[2]

Mancini ,

Karthik ,

Chandra ,

Mageirakos ,

Ailamaki , Eficient Massively Parallel Join Optimization for Large Queries , in: Proceedings SIGMOD, ACM , 2022 , pp. 122 - 135 . URL: https://doi.org/10.1145/3514221.3517871. doi: 10 .1145/3514221.3517871.

[3]

Yannakakis , Algorithms for acyclic database schemes , in: Proceedings VLDB, VLDB , 1981 , pp. 82 - 94 .

[4]

Pichler ,

Skritek , Tractable counting of the answers to conjunctive queries , J. Comput. Syst. Sci . 79 ( 2013 ) 984 - 1001 . URL: https://doi.org/10.1016/j.jcss. 2013 . 01 .012. doi: 10 .1016/ j.jcss. 2013 . 01 .012.

[5]

Durand ,

Mengel , Structural tractability of counting of solutions to conjunctive queries , Theory Comput. Syst . 57 ( 2015 ) 1202 - 1249 . URL: https://doi.org/10.1007/s00224-014-9543-y. doi: 10 .1007/S00224-014-9543-Y.

[6]

M. A.

Khamis ,

H. Q.

Ngo ,

Rudra , FAQ: questions asked frequently , in: Proceedings PODS, ACM , 2016 , pp. 13 - 28 . URL: https://doi.org/10.1145/2902251.2902280. doi: 10 .1145/ 2902251.2902280.

[7]

M. R.

Joglekar ,

Puttagunta , C. Ré, AJAR: aggregations and joins over annotated relations , in: Proceedings PODS, ACM , 2016 , pp. 91 - 106 . URL: https://doi.org/10.1145/2902251. 2902293. doi: 10 .1145/2902251.2902293.

[8]

Lanzinger ,

Pichler ,

Selzer , Avoiding materialisation for guarded aggregate queries , CoRR abs/2406 .17076 ( 2024 ). URL: https://doi.org/10.48550/arXiv.2406.17076. doi: 10 .48550/ARXIV.2406.17076. arXiv: 2406 . 17076 .

[9]

Gottlob ,

Lanzinger ,

D. M.

Longo ,

Okulmus ,

Pichler ,

Selzer , Structure-guided query evaluation: Towards bridging the gap from theory to practice , CoRR abs/2303 .02723 ( 2023 ). URL: https://doi.org/10.48550/arXiv.2303.02723. doi: 10 .48550/arXiv.2303.02723. arXiv: 2303 . 02723 .

[10]

Gottlob ,

Lanzinger ,

D. M.

Longo ,

Okulmus ,

Pichler ,

Selzer , Reaching back to move forward: Using old ideas to achieve a new level of query optimization (short paper) , in: Proceedings AMW , volume 3409 of CEUR Workshop Proceedings, CEUR-WS.org , 2023 . URL: https://ceur-ws. org/ Vol- 3409 /paper6.pdf.

[11]

Leis ,

Gubichev ,

Mirchev ,

P. A.

Boncz ,

Kemper , T. Neumann, How good are query optimizers, really? , Proc. VLDB Endow . 9 ( 2015 ) 204 - 215 . URL: http://www.vldb. org/pvldb/vol9/p204-leis.pdf. doi:10.14778/2850583 .2850594.