Query Rewriting Based on Meta-Granular
                   Aggregation

                     Piotr Wiśniewski1 and Krzysztof Stencel2
                 1
                     Faculty of Mathematics and Computer Science
                           Nicolaus Copernicus University
                                    Toruń, Poland
                            pikonrad@mat.uni.torun.pl
                              2
                                 Institute of Informatics
                              The University of Warsaw
                                   Warsaw, Poland
                                stencel@mimuw.edu.pl


      Abstract. Analytical database queries are exceptionally time consum-
      ing. Decision support systems employ various execution techniques in
      order to accelerate such queries and reduce their resource consumption.
      Probably the most important of them consists in materialization of par-
      tial results. However, any introduction of additional derived objects into
      the database schema increases the cost of software development, since
      programmers must take care of their usage and synchronization. In this
      paper we propose novel query rewriting methods that build queries us-
      ing partial aggregations materialized in additional tables. These methods
      are based on the concept of meta-granules that represent the informa-
      tion on grouping and used aggregations. Meta-granules have a natural
      partial order that guides the optimisation process. We also present an
      experimental evaluation of the proposed rewriting method.


1   Introduction
In the article [1] we presented the idea of materializing partial aggregations in
order to accelerate analytical queries. The inherent cost of this idea is attributed
to the need of database triggers that keep the materializations up to date in
real time. In particular we showed an application programming interface that
facilitates defining and using partial aggregations. We designed and implemented
appropriate mechanisms that automatically create necessary triggers.
    The solutions presented in [1] suffer from a noteworthy deficiency. The ap-
plication programmer is obliged to cater for additional database objects that
store materialized data. He/she not only has to create them, but also has to
address them through API methods in order to use them in analytical queries.
In particular, it is impossible to use these facilities through queries formulated
in HQL, i.e. the standard query language of Hibernate ORM [2].
    In this paper we address the abovementioned deficiency. We present an algo-
rithm to rewrite HQL queries so that the usage of materialized aggregations is
transparent to application programmers.
458     P. Wiśniewski, K. Stencel

    This algorithm is based on analyses of the grouping granularity and opportu-
nities to reconstruct necessary data from partial aggregations that are persisted
in the database. In order to control the complexity of the space of aggregations
that can possibly be materialized, we introduce the notion of a meta-granule. It
represents a potentially interesting aggregation level. The set of meta-granules
is partially ordered. The rewrite method efficiently analyses and traverses the
graph of this partial order.
    Our solution is based on the idea of materialized views. A recent example of
an implementation of such views are FlexViews [3] within MySQL based on the
results described in [4, 5]. FlexViews rely on applying changes that have been
written to the change log.
    This article is organized as follows. In Section 2 we recall the idea of partial
aggregations and introduce the running example used thorough the paper. We
also discuss the integration of the prototype with Hibernate, a major object-
relational mapping system. In Section 3 we formalize meta-granules and their
partial order. In Section 4 we introduce the query rewrite algorithm that utilizes
meta-granules. Section 5 summarizes the results of our experimental evaluation
of possible gains triggered by the proposed optimization algorithm. Section 6
concludes.


2     Partial aggregation


                 customer
                                        invoice
                 cid
                                        invid
                 fname
                                        date
                 sname
                                        cid
                 cgid


                                       invoiceline
                                       invlinid
                                       qty                    product
              customergroup
                                       price                   pid
               cgid                    invid                   name
               name                    pid

       Fig. 1. The original schema of the database on customers and invoices.


Example 1. Let us consider a database schema on customers and their invoices
as show on Figure 1. Assume that the company database often has to answer
queries that follow the pattern of the SELECT statement presented below.

SELECT invoiceline.pid, invoice.date,
       sum(invoiceline.qty)
  FROM invoice JOIN invoiceline USING (invid)
                       Query Rewriting Based on Meta-Granular Aggregation       459

 GROUP BY invoice.date, invoiceline.pid
HAVING date BETWEEN ’2011-07-16’ AND ’2011-07-22’

   In order to serve such queries efficiently we extend the database schema from
Figure 1 by adding the derived table dw invline value by customer date as
shown on Figure 2. These tables will store partial sums that are needed to quickly
answer the query shown above.


        dw_invoice_value_by_customer_date
         cid
         value
         date

        customer
                                 invoice
        cid
                                invid
        fname
                                date
        sname
                                cid
        cgid                                 dw_invline_value_by_product_date
                                             pid
                                             value
                               invoiceline   date
                               invlinid
                               qty                   product
     customergroup
                               price                 pid
     cgid                      invid                 name
     name                      pid

Fig. 2. The schema of the database on customers and invoices extended with derived
tables that store materialized aggregations.


   In our research we exploit the possibility to hide internals of optimization
algorithms in the layers of the object-relational mapping [6]. In our opinion this
additional layer of abstraction is a perfect place to put disparate peculiarities of
optimisation algorithms. The database server is not the only place to implement
query rewriting. Moreover, this ORM option is sometimes the only one, e.g. when
the database system is not open-source. It also facilitates writing reusable code
that does not depend on the SQL dialect of a particular DBMS. We applied this
approach successfully to recursive querying [7] and index selection [8].
   Thanks to the generators hidden in the object-relational mapping layer, the
extra tables from Figure 2 will be created automatically. For the application
programmer it is enough to augment the declaration of the Java entity class
InvoiceLine with the annotation shown on Listing 1.1.
Listing 1.1. Java class InvoiceLine with annotations that cause generation of mate-
rialized aggregations
@Entity
public c l a s s I n v o i c e l i n e {
    ...
460     P. Wiśniewski, K. Stencel

      @DWDim(Dim = ” d a t e ” )
      private I n v o i c e i n v o i c e ;
      private Long i n v l i n i d ;
      @DWDim
      private Product p r o d u c t ;
      @DWAgr( f u n c t i o n=”SUM” )
      private I n t e g e r qty ;
       ...
    When the materialized aggregations are computed and stored in the database,
DBMS can execute the following query using derived storage objects instead of
the original user query. The modified query will be served significantly faster
since it addresses pre-aggregated data.

SELECT pid, date, value
 FROM dw_invline_value_by_customer_date
 WHERE date BETWEEN ’2011-07-16’ AND ’2011-07-22’

    If we add an appropriate annotation to the entity class Invoice, the mate-
rialized aggregation dw invoice value by customer date can be automatically
generated. Figure 2 shows this derived table as well. It allows for a notable ac-
celeration of preparation of reports on sales partitioned by dates and customers.


3     Granularity and meta-granules

In this paper granularity is the partitioning implied by the grouping clause,
while meta-granules are schema items that unambiguously define this parti-
tion. Basic meta-granules are tables with data we want to summarize, e.g.
invoice line, unit is a meta-granule that represents individual rows of the
table invoice line. The meta-granule invoice line, product, date stands
for the partitioning formalized in our example query. Let us assume that we want
to get the total sales for a particular day. For an application programmer it is
obvious that instead of base data we can use existing materialized aggregations.
    A similar meta-granule invoice line, customer, date describes grouping
by the customer and the date of sale. For this meta-granule we created a ma-
terialized aggregation as shown of Figure 2. If a user now poses a query for the
total sales on a given day, we can choose among two meta-granules that can
accelerate his/her query.
    Let us introduce a partial order of meta-granules. A meta-granule g1 is smaller
or equal than a meta-granule g2 , if and only if each row in g2 can be computed
by aggregating some rows of g1 .
    In the analysis of our running examples we will use the following symbols of
meta-granules:

                       gil = invoice line: unit
                      gpd = invoice line: product, date
                     Query Rewriting Based on Meta-Granular Aggregation       461

                     gcd = invoice line: customer, date
                     gd = invoice line: date
Then, the following inequalities are satisfied:
                                  gil ≤ gpd ≤ gd
                                  gil ≤ gcd ≤ gd
On the other hand, the meta-granules gpd and gcd are incomparable. Thus, the
partial order of meta-granules is not linear.
   Let us analyze another example queries that allow identifying other meta-
granules.
SELECT invoice.date, cg.name
       sum(invoiceline.qty)
  FROM customergroup cg JOIN customer USING(cgid)
    JOIN invoice USING(cid)
    JOIN invoiceline USING (invid)
  GROUP BY invoice.date, cg.name
  HAVING date = ’2011-07-16’
This query induces the following meta-granule:
            gcgm = invoice line: customer group, month(date)
SELECT month(invoice.date), product.pid
       sum(invoiceline.qty)
  FROM invoice JOIN invoiceline USING (invid)
    JOIN product USING (pid)
  GROUP BY month(invoice.date), product.pid
This query induces the following meta-granule:
                gpm = invoice line: product, month(date)
SELECT invid, sum(invoiceline.qty)
  FROM invoice JOIN invoiceline USING (invid)
  GROUP BY invid
  HAVING date = ’2011-07-16’
This query induces the meta-granule:
                          gi = invoice line: invoice
    Our extended schema presented on Figure 2 does not contain these meta-
granules. If a meta-granule is associated with a materialized aggregation stored
in the database, this meta-granule will be called proper. Otherwise, the meta-
granule is called virtual.
    Figure 3 shows the partial order of all meta-granules enumerated in presented
examples. Virtual meta-granules are depicted as rectangles, while proper met-
granules are portrayed as ovals. Observe that each meta-granule is bigger or equal
to the proper meta-granule of basic facts, i.e. gil . Data in all meta-granules is
derived from gil by some aggregation.
462     P. Wiśniewski, K. Stencel

             cgm cgid, month           d date           pm pid, month


                      cd cid, date

                                                   pd pid, date

                         i invid


                                     il inv line

                     Fig. 3. The partial order of meta-granules


4     The rewriting algorithm

The optimization method based on meta-granules is composed of the following
steps. Assume that a query has been posed.

1. We identify the needed meta-granule. We check if the query has internal
   WHERE clause. Next, we analyze the grouping used and we compute the
   meta-granule required to compute the answer to the query.
2. If this meta-granule is proper, the query will get rewritten so that it uses
   the materialization associated with the meta-granule instead of the base fact
   table.
3. If this meta-granule is virtual, we will find the maximal proper meta-granule
   not greater than the meta-granule of the query. This maximal meta-granule
   will be used in the rewriting of the original query.

    Since the set of all meta-granules is finite, the set of meta-granules smaller
that the given virtual meta-granule always contains maximal elements. This set
is never empty, since there exists basic fact meta-granule that is smaller than
any mete-granule. However, there can be more than one maximal meta-granule
in this set. Let us consider the following example query:

SELECT invoice.date, sum(invoiceline.qty)
  FROM invoice JOIN invoiceline USING (invid)
  GROUP BY invoice.date, product.pid

This query induces the metagranule:

                           gd = invoice line: date

For this virtual meta-granule we have two maximal proper meta-granules. They
are gcd and gpd . The usage of any of them means a significant acceleration of the
execution of this query.
                     Query Rewriting Based on Meta-Granular Aggregation        463

5     Experimental evaluation

In this Section we show the potential gains of using the optimization algorithm
proposed in this paper. We used a computer with Pentium G2120 3.1 GHz (Ivy
Brigde), 8 GB RAM. The disk was 120 GB SDD SATA III for the system and
Raid 0 on 2x Caviar Black 1 TB 7400rpm for the database storage. We used plain
PostgreSQL 9.1 installed on Ubuntu 13.04. No upfront optimization or tuning
has been performed. The tested database have the schema shown on Figure 1.
The volumes of data are summarized by Table 1.


             Table 1. Row counts of tables from the example schema

                            Table name   Row count
                          customer          4 999 000
                          customergroup        50 000
                          invoice          99 973 000
                          invoiceline   1 049 614 234
                          product               9 000


    We use three databases instances with different sets of proper meta-granules
from Figure 3. The database plain contains only the basic meta-granule with
invoice lines. The database med additionally has proper meta-granules cd and
pd. The database full materializes all meta-granules from Figure 3. Table 2
recaps proper meta-granules of all these three databases. It also shows their
sizes in gigabytes.


      Table 2. Proper meta-granules and volumes of tested database instances

                 Database name  Proper meta-granules       Volume
                     plain     il                          101 GB
                      med      cd, il, pd                  110 GB
                      full     cd, cgm, d, i, il, pd, pm   120 GB


5.1   Rewriting queries

Let us start from a simple query that returns 10 customer groups with biggest
sales in March 2006. This query can be formulated as follows assuming the
schema from Figure 2.

SELECT cgid, sum(price * qty) as sum_val,
       extract(year FROM date) as year,
       extract(month FROM date) as month
464    P. Wiśniewski, K. Stencel

  FROM invline JOIN inv USING (invid)
               JOIN cust USING (cid)
  GROUP BY cgid, year, month
  HAVING extract(year FROM date) = 2006
     AND extract(month FROM date) = 3
  ORDER BY sum_val DESC
  LIMIT 10;

    Since in the database plain no non-trivial proper meta-granule is available,
out rewriting algorithm cannot modify this query. When run in this form, it
finishes in 1 067.5 seconds.
    However, in the database med we have the meta-granule cd at our disposal. It
contains data pre-aggregated by cid and date. Therefore, our algorithm rewrites
the query to use this proper meta-granule:

SELECT cgid, sum(sum_val) as cgmsum_val,
       extract(year FROM date) as year,
       extract(month FROM date) as month
  FROM aggr_cd JOIN cust USING (cid)
  GROUP BY cgid, year, month
  HAVING extract(year FROM date) = 2006
     AND extract(month FROM date) = 3
  ORDER BY cgmsum_val DESC
  LIMIT 10;

     Now the query will run 41.6 seconds. We have accelerated this query 25
times. Although, the database med does not contain all possible meta-granules,
the running time has been significantly reduced. Of course further increase of
efficiency is possible if we have even more proper meta-granules as in the database
full. In this case, the algorithm will choose using the meta-granule cgm to get
the following query that runs in half a second.

SELECT cgid, sum_val, year, month
  FROM aggr_cgm
  WHERE year = 2006
    AND month = 3
  ORDER BY sum_val DESC
  LIMIT 10

   In the second example query we ask for 15 best sale days in 2005. This
query can be formulated as shown below assuming the schema from Figure 2.
Only in this form it can be run against the database plain that has no proper
meta-granules. It takes 3 475.1 seconds to complete its execution.

SELECT date, sum(price * qty) as sum_val
  FROM invline JOIN inv USING (invid)
  GROUP BY date
                    Query Rewriting Based on Meta-Granular Aggregation     465

  ORDER BY sum_val DESC
  LIMIT 15

    With the database med the optimiser has two options. It can use either the
meta-granule cd or pd as presented below. It takes 56.2 sec to completion with
cd and 19.9 sec with pd. Since both queries are plain SQL, the cost model of
the underlying database should be used to determine the query to be run. In
this case we have two maximal meta-granules to choose.

SELECT date, sum(sum_val) as dsum_val
  FROM aggr_cd
  GROUP BY date
  ORDER BY dsum_val DESC
  LIMIT 15;

  SELECT date, sum(sum_val) as dsum_val
  FROM aggr_pd
  GROUP BY date
  ORDER BY dsum_val DESC
  LIMIT 15;

   When we have all possible proper meta-granules and in the database full we
can use the meta-granules d and run the following query in 41 milliseconds.

SELECT date, sum_val
  FROM aggr_d
  ORDER BY sum_val DESC
  LIMIT 15;

    The third query is to list 10 best selling products together with the sold
volume from September to December 2006. In absence of proper meta-granules
it can be formulated as follows. In this form for the database plain this query
runs for 1 074 seconds.

SELECT pid, sum(qty) as sum_qty,
       sum (price * qty) as sum_val
  FROM invline JOIN inv USING (invid)
  WHERE extract(year FROM date) = 2006
     AND extract(month FROM date) between 9 and 12
  GROUP BY pid
  ORDER BY sum_val DESC
  LIMIT 10;

    In the database med we can employ the proper meta-granule pd and get the
following query that completes in 38.8 seconds.
466     P. Wiśniewski, K. Stencel

SELECT pid, sum(sum_qty) as pmsum_qty,
       sum (sum_val) as pmsum_val
  FROM aggr_pd
  WHERE extract(year FROM date) = 2006
     AND extract(month FROM date) between 9 and 12
  GROUP BY pid
  ORDER BY pmsum_val DESC
  LIMIT 10;

    The abundant meta-granules of the database full allow executing the follow-
ing query instead. It finishes in 1282 miliseconds.

SELECT pid, sum(sum_qty) as pmsum_qty,
       sum (sum_val) as pmsum_val
  FROM aggr_pm
  WHERE year = 2006
     AND month between 9 and 12
  GROUP BY pid
  ORDER BY pmsum_val DESC
  LIMIT 10;

   The results of the tests are summarized in Table 3. Obviously, the usage
appropriate proper meta-granules significantly accelerates the queries. However,
even when only limited subset of meta-granules is proper, our rewriting algorithm
can notably boost the query execution. This is the case of the database med.
Although the optimization algorithm did not have the optimal meta-granule, it
could successfully employ the meta-granules that were at its disposal.


      Table 3. Summary of query execution times for tested database instances

                   Database Query 1      Query 2      Query 3
                    plain   1 067.5 s      3 475.1 s 1 074.2 s
                     med       41.6 s 56.2 s 19.9 s     38.8 s
                     full       0.5 s         0.04 s   0.001 s


5.2   Preparing meta-granules
As usual, keeping derived data structures in sync with the base data induces a
significant overhead. In this Section we show experiments that assess this over-
head. We performed inserting a number of invoices into each database instance.
On average each invoice contained 10 lines. We performed two subsequent runs
with 10 000 invoices and one run with 15 000 invoices and one with 20 000
invoices. Before each run (but the second with 10 000 invoices) the database
management system was shutdown in order to make the database buffer cold
                     Query Rewriting Based on Meta-Granular Aggregation         467

initially. Table 4 summarizes the run times. As we can see, avoiding creation of
some proper meta-granules (compare med with full ) spares a noteworthy amount
of time.

Table 4. Time spent on inserting new invoices and synchronizing proper meta-granules

           Invoices Buffer      plain          med             full
            10 000   cold     2m 58.335s    29m 06.670s     37m 56.663s
            10 000   hot      3m 03.308s     9m 05.212s     13m 01.924s
            15 000   cold     4m 30.261s    20m 59.395s     36m 30.021s
            20 000   cold     6m 32.502s    29m 59.064s     44m 38.321s


6   Conclusions

In this paper we presented a novel method to select materialized data in ana-
lytical query execution. A fact table can be pre-aggregated for numerous sets
of its dimensions. We call this sets of dimensions meta-granules and introduce
their partial order. “Bigger” meta-granules are more aggregated, i.e., contain
less specific data. Whenever an ad-hoc query is posed, the database system can
choose using some of the stored meta-granules as a means to accelerate the
query. Sometimes, DBMS can find a perfect meta-granule. If such meta-granule
does not exists, DBMS will not give up. According to the presented rewriting
algorithm, DBMS will use the maximal suitable meta-granule, i.e. the one that
is least coarse but still fits the query. Our solution has two benefits. First, the
database administrator does not have to create all imaginable materializations.
Second, even if some reasonable materialization has been forgotten, the database
system can still use exiting imperfect meta-granules to boost the query.
    We have also shown results of the experimental evaluation of out method. It
proves that even if some ideal meta-granules lack, the database system can still
offer satisfactory performance. The experiments also attest that the overhead
caused by the need to keep materialized data in sync is acceptable.


References

1. Gawarkiewicz, M., Wiśniewski, P.: Partial aggregation using Hibernate. In: FGIT.
   Volume 7105 of LNCS. (2011) 90–99
2. O’Neil, E.J.: Object/relational mapping 2008: Hibernate and the Entity Data Model
   (EDM). In Wang, J.T.L., ed.: SIGMOD Conference, ACM (2008) 1351–1356
3. Flexviews: Incrementally refreshable materialized views for MySQL (2012)
4. Mumick, I.S., Quass, D., Mumick, B.S.: Maintenance of data cubes and summary
   tables in a warehouse. In Peckham, J., ed.: SIGMOD Conference, ACM Press (1997)
   100–111
468     P. Wiśniewski, K. Stencel

5. Salem, K., Beyer, K., Lindsay, B., Cochrane, R.: How to roll a join: asynchronous
   incremental view maintenance. SIGMOD Rec. 29 (2000) 129–140
6. Melnik, S., Adya, A., Bernstein, P.A.: Compiling mappings to bridge applications
   and databases. ACM Trans. Database Syst. 33 (2008)
7. Szumowska, A., Burzańska, M., Wiśniewski, P., Stencel, K.: Efficient implemen-
   tation of recursive queries in major object relational mapping systems. In: FGIT.
   (2011) 78–89
8. Boniewicz, A., Gawarkiewicz, M., Wiśniewski, P.: Automatic selection of functional
   indexes for object relational mappings system. International Journal of Software
   Engineering and Its Applications 7 (2013)