<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Guarded Aggregate Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthias Lanzinger</string-name>
          <email>matthias.lanzinger@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reinhard Pichler</string-name>
          <email>reinhard.pichler@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Selzer</string-name>
          <email>alexander.selzer@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Workshop Proceedings Despite significant progress in query optimization techniques over the past decades and the advent of powerful computer infrastructure (including cluster environments), DBMSs still struggle with the potential explosion of intermediate results, even if the final output is small. This is particularly glaring in the context of analytical queries which often combine data from many tables to ultimately only produce comparatively tiny aggregate results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        guarded acyclic aggregate queries, Yannakakis-style query execution can actually be
naturally integrated into standard SQL execution engines. As a result, intermediate
materialisation of join results can often be drastically reduced or even avoided entirely when
executing aggregate queries in relational DBMSs. We have implemented our approach in Spark
SQL and tested it on various standard benchmarks. The experimental results reported in Section
3 show significant improvements can be achieved for queries that fall into our targeted class.
Full details of this work are provided in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Guarded Aggregate Queries</title>
      <p>
        Recently, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a particularly favourable class of ACQs with aggregates has been
presented: the class of 0MA (short for “zero-materialisation answerable”) queries. These
are acyclic queries that can be evaluated by executing only the first bottom-up traversal of
Yannakakis’ algorithm. That is, we only need to perform the comparatively cheap semi-joins
and can completely skip the typically significantly more expensive join phase. A query of
the form Q =   1,…,  ,  1( 1),…,  (  )(  ) with   =   ( 1 ⋈ ⋯ ⋈   )) is 0MA if it satisfies the
following conditions:
• Guardedness, meaning that all grouping attributes  1, … ,   and all attributes  1, … ,  
occurring in the aggregate expressions  1( 1), … ,   (  ) occur in some relation   .
• Set-safety of the aggregate functions  1 … ,   , meaning that duplicate elimination applied
to the inner expression   ( 1 ⋈ ⋯ ⋈   ) does not alter the result of the grouping and
aggregate expression   1,…,  ,  1( 1),…,  (  ).
      </p>
      <p>
        It is known (see, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) that for queries in the class 0MA, materialisation of join results can be
completely avoided. However, it is limited to a subset of the available SQL aggregate functions.
Heavily used analytical functions such as COUNT without DISTINCT as well as SUM, AVG, or
MEDIAN depend on full joins. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], it was shown how Yannakakis’ algorithm can be extended
to ACQs with a COUNT(*) aggregate on top. We adapt this approach to integrate it into the
logical QEP of relational DBMSs and we further extend it to related aggregates such as SUM, AVG,
and MEDIAN. To this end, we keep the requirements of acyclicity and guardedness, while the
set-safety requirement is now dropped. Our technique also works for COUNT(*), which we can
view as counting the number of non-null entries grouped by the empty set of attributes; hence,
it is trivially guarded. As in the 0MA-case, we start by checking acyclicity and, simultaneously,
computing a join tree of the join query   . If   is acyclic and guarded, we take the node
labelled by the “guard” as the root of the join tree. Of course, in case of COUNT(*), we may
choose any node as the root of the join tree.
      </p>
      <p>
        The key idea is to propagate frequencies up the join tree rather than duplicating tuples
(cf. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). This propagation is realised by recursively constructing extended Relational Algebra
expressions Freq() for every node  of the join tree as follows: Every relation of the join query
  is extended by an additional attribute where we store frequency information for each tuple.
We write   to denote this additional attribute for the relation at node  and we write  ̄ for the
remaining attributes of that relation. If  is a leaf node of the join tree labelled by relation  ,
then we initialise the attribute   to 1. Formally, we thus have Freq() =  × {(1)} .


 . Then we define
Freq0() :=  × {(1)}
Freq () :=   ̄ ,
Freq() :=  
 ←SUM(
      </p>
      <p>−1 ⋅
 ←  (Freq ())</p>
      <p>Now consider an internal node  of the join tree with child nodes  1, … ,   . Again, we assume
that  is labelled by some relation</p>
      <p>with attributes  ̄ and we write   for the additional
attribute used for keeping track of frequencies. The extended Relational Algebra expression
Freq() is constructed iteratively by defining subexpressions Freq () with  ∈ {0, … , } . To avoid
confusion, we refer to the frequency attribute of such a subexpression Freq () as  
each relation Freq () consists of the same attributes  ̄ plus the additional frequency attribute
 . That is,
Freq () for every  ∈ {0, … , } and, ultimately, Freq() as follows:</p>
      <p>)(Freq−1 () ⋈ Freq(  ))
Intuitively, after initialising</p>
      <p>0 to 1 in Freq0() , the frequency values  1, … ,   are obtained by
product, i.e.,</p>
      <p>=   = Π=1   .
grouping over the attributes  ̄ of  and computing the number of possible extensions of each
tuple  ̄ in  to the relations labelling the nodes in the subtrees rooted at  1, … ,   . By the
connectedness condition of join trees, these extensions are independent of each other, i.e., they
share no attributes outside  ̄ . Moreover, the frequency attributes  1, … ,   are functionally
dependent on the attributes  ̄ . Hence, by distributivity, the value of   obtained by iterated
summation and multiplication for given tuple  ̄ of  is equal to computing, for every  ∈ {1, … , }
the sum   of the frequencies of all join partners of  ̄ in Freq(  ) and then computing their</p>
      <p>It remains to modify the aggregate functions COUNT(∗), COUNT() , SUM() , and AVG() so that
they can operate on tuples with frequencies. For instance, let   denote the frequency attribute
in Freq( ) and let  be an attribute of the relation  labelling the root node  of the join tree.
Then, in SQL-notation, we can rewrite common aggregate expressions as follows. Note that
the PERCENTILE function is not part of the ANSI SQL standard, but it is provided in Spark SQL,
which we used for our prototype implementation.</p>
      <p>• COUNT(∗) → SUM(  )
• COUNT() →
• SUM() →
• AVG() →
• MEDIAN() →</p>
      <p>SUM(IF(ISNULL(), 0,   ))
SUM( ⋅   )
SUM( ⋅   )/COUNT()</p>
      <p>PERCENTILE(0.5, ,   )</p>
    </sec>
    <sec id="sec-4">
      <title>3. Implementation and Results</title>
      <p>The approach outlined in Section 2 can be fully realised by rewriting logical query plans, even
in systems that only execute query plans based on classical two-way join trees. We have
implemented these optimisations for guarded aggregate queries in the form of standard logical
optimisation rules in Spark SQL. Recall that the Freq operator allowed us to drastically reduce
the number of joins needed for aggregate evaluation. In our Spark SQL implementation, we
have managed to completely eliminate the need for joins in case of guarded aggregation by
introducing a new physical operator FreqJoin. Intuitively, it does the work of Freq in the style
of semi-joins by first (say in relation  ) summing up the frequencies of all tuples in  with the
same join partner ( ,   ) in the other relation (say  ) and then multiplying the frequency   of 
with this sum.</p>
      <p>Query
path-03
path-04
path-05
path-06
path-07
path-08
tree-01
tree-02
tree-03</p>
      <p>Query
STATS-CEB e2e</p>
      <p>JOB Q 2
JOB Q 3
JOB Q 5
JOB Q 17</p>
      <p>JOB Q 20
TPC-H Q 2 SF200
TPC-H Q 11 SF200
TPC-H V.1 SF200
LSQB Q 1 SF300
LSQB Q 4 SF300</p>
      <p>Ref
1558±7.3
5.6±1.21
5.2±0.32
1.23±1.22
118.5±32
23.98±1.60
179.4±6.5
361.0±13.3
168.4±4.4
3096±232
602±37</p>
      <p>
        We evaluate the performance of the optimisation over 5 benchmark databases / datasets with
diferent characteristics: Join Order Benchmark (JOB) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], STATS / STATS-CEB [12], Large-Scale
Subgraph Query Benchmark (LSQB) [13], SNAP (Stanford Network Analysis Project [14]), and
TPC-H [15]. For further details on the experiments, see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>The overall performance of our proposed optimisations on the applicable queries is
summarised in Table 1b. The numbers reported there are mean times over 5 runs of the same query
with standard deviations given after ±. Our experiments on the SNAP graphs specifically are
summarised in Table 1a. The fastest execution time achieved for each case is printed in boldface.
In both tables, we refer to the reference performance of Spark SQL without any alterations as
Ref. The results obtained by applications of the logical QEP optimisations are referred to as
Opt+. The speed-up achieved by Opt+ over Ref is explicitly stated in Table 1b in the column
Opt+ Speedup. Since our optimisations apply to all 146 queries of STATS-CEB, we report the
end-to-end time of executing all queries in the benchmark.</p>
      <p>Avoiding Materialisation. To validate that the significant speedups are a result of decreased
materialisation we perform additional experiments for the STATS-CEB queries where we record
the peak number of materialised tuples during query execution. We see that the reduction
in materialisation is immense, especially in the queries that are most challenging for original
SparkSQL. In particular, in the 20 hardest queries (by runtime) for SparkSQL, we observe a
reduction of peak materialised tuples by at least 3 orders of magnitude when using Opt+.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>We propose the integration of optimisation techniques for certain kinds of aggregate queries
into relational DBMSs to eliminate the materialisation of intermediate join results. We have
shown how queries with COUNT and related aggregates (such as SUM, AVG, and MEDIAN) can be
evaluated more eficiently while operating fully within the standard two-way join plan paradigm
of query evaluation engines. We have implemented these optimisations into Spark SQL and our
experimental evaluation confirms that these techniques can provide significant performance
improvements and avoid large amounts of unnecessary materialisation in a variety of settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the Vienna Science and Technology Fund (WWTF)
[10.47379/ICT2201, 10.47379/VRG18013, 10.47379/NXT22018]; and the Christian Doppler
Research Association (CDG) JRC LIVE. We also acknowledge the assistance of the TU.it dataLAB
Big Data team at TU-Wien.
[12] Y. Han, Z. Wu, P. Wu, R. Zhu, J. Yang, L. W. Tan, K. Zeng, G. Cong, Y. Qin, A. Pfadler,
Z. Qian, J. Zhou, J. Li, B. Cui, Cardinality estimation in DBMS: A comprehensive benchmark
evaluation, Proc. VLDB Endow. 15 (2021) 752–765. URL: https://www.vldb.org/pvldb/
vol15/p752-zhu.pdf. doi:10.14778/3503585.3503586.
[13] A. Mhedhbi, M. Lissandrini, L. Kuiper, J. Waudby, G. Szárnyas, LSQB: a large-scale
subgraph query benchmark, in: Proceedings GRADES, ACM, 2021, pp. 8:1–8:11. URL:
https://doi.org/10.1145/3461837.3464516. doi:10.1145/3461837.3464516.
[14] J. Leskovec, A. Krevl, SNAP Datasets: Stanford large network dataset collection, http:
//snap.stanford.edu/data, 2014.
[15] TPC-H, Tpc-h benchmark, https://www.tpc.org/tpch/, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Atserias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grohe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <article-title>Size bounds and query plans for relational joins</article-title>
          ,
          <source>SIAM J. Comput</source>
          .
          <volume>42</volume>
          (
          <year>2013</year>
          )
          <fpage>1737</fpage>
          -
          <lpage>1767</lpage>
          . URL: https://doi.org/10.1137/110859440. doi:
          <volume>10</volume>
          .1137/ 110859440.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mancini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karthik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chandra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mageirakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ailamaki</surname>
          </string-name>
          ,
          <article-title>Eficient Massively Parallel Join Optimization for Large Queries</article-title>
          ,
          <source>in: Proceedings SIGMOD, ACM</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>122</fpage>
          -
          <lpage>135</lpage>
          . URL: https://doi.org/10.1145/3514221.3517871. doi:
          <volume>10</volume>
          .1145/3514221.3517871.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yannakakis</surname>
          </string-name>
          ,
          <article-title>Algorithms for acyclic database schemes</article-title>
          ,
          <source>in: Proceedings VLDB, VLDB</source>
          ,
          <year>1981</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skritek</surname>
          </string-name>
          ,
          <article-title>Tractable counting of the answers to conjunctive queries</article-title>
          ,
          <source>J. Comput. Syst. Sci</source>
          .
          <volume>79</volume>
          (
          <year>2013</year>
          )
          <fpage>984</fpage>
          -
          <lpage>1001</lpage>
          . URL: https://doi.org/10.1016/j.jcss.
          <year>2013</year>
          .
          <volume>01</volume>
          .012. doi:
          <volume>10</volume>
          .1016/ j.jcss.
          <year>2013</year>
          .
          <volume>01</volume>
          .012.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Durand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mengel</surname>
          </string-name>
          ,
          <article-title>Structural tractability of counting of solutions to conjunctive queries</article-title>
          ,
          <source>Theory Comput. Syst</source>
          .
          <volume>57</volume>
          (
          <year>2015</year>
          )
          <fpage>1202</fpage>
          -
          <lpage>1249</lpage>
          . URL: https://doi.org/10.1007/s00224-014-9543-y. doi:
          <volume>10</volume>
          .1007/S00224-014-9543-Y.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khamis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. Q.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rudra</surname>
          </string-name>
          ,
          <article-title>FAQ: questions asked frequently</article-title>
          ,
          <source>in: Proceedings PODS, ACM</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>28</lpage>
          . URL: https://doi.org/10.1145/2902251.2902280. doi:
          <volume>10</volume>
          .1145/ 2902251.2902280.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Puttagunta</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Ré, AJAR: aggregations and joins over annotated relations</article-title>
          ,
          <source>in: Proceedings PODS, ACM</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>106</lpage>
          . URL: https://doi.org/10.1145/2902251. 2902293. doi:
          <volume>10</volume>
          .1145/2902251.2902293.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Selzer</surname>
          </string-name>
          ,
          <article-title>Avoiding materialisation for guarded aggregate queries</article-title>
          ,
          <source>CoRR abs/2406</source>
          .17076 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2406.17076. doi:
          <volume>10</volume>
          .48550/ARXIV.2406.17076. arXiv:
          <volume>2406</volume>
          .
          <fpage>17076</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Longo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Okulmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Selzer</surname>
          </string-name>
          ,
          <article-title>Structure-guided query evaluation: Towards bridging the gap from theory to practice</article-title>
          ,
          <source>CoRR abs/2303</source>
          .02723 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.02723. doi:
          <volume>10</volume>
          .48550/arXiv.2303.02723. arXiv:
          <volume>2303</volume>
          .
          <fpage>02723</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Longo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Okulmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pichler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Selzer</surname>
          </string-name>
          ,
          <article-title>Reaching back to move forward: Using old ideas to achieve a new level of query optimization (short paper)</article-title>
          ,
          <source>in: Proceedings AMW</source>
          , volume
          <volume>3409</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3409</volume>
          /paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Leis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gubichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          , T. Neumann,
          <article-title>How good are query optimizers, really?</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>9</volume>
          (
          <year>2015</year>
          )
          <fpage>204</fpage>
          -
          <lpage>215</lpage>
          . URL: http://www.vldb. org/pvldb/vol9/p204-leis.pdf.
          <source>doi:10.14778/2850583</source>
          .2850594.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>