<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization of Percentage Cube Queries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ladjel Bellatreche LIAS/ISAE-ENSMA Poitiers</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>France</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carlos Ordonez University of Houston USA</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Houston</institution>
          ,
          <addr-line>Houston, TX 77204</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Javier García-García C3/UNAM</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Yiqun Zhang University of Houston USA</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>OLAP cubes are a powerful database technology to join tables and aggregate data to discover interesting trends. However, OLAP cubes exhibit limitations to uncover fractional relationships on a measure aggregated at multiple granularity levels. One prominent example is the percentage, an intuitive probablistic metric, used in practically every analytic application. With such motivation in mind, we introduce the percentage cube, a generalized data cube that takes percentages as the target aggregated measure. Speci cally, the percentage cube shows the fractional relationship on a measure in every cuboid between fact table rows grouped by a set of columns (detail individual groups) and their rolled-up aggregation by a subset of those grouping columns (total group). We inroduce minimal query syntax and we carefully study query optimization to compute percentage cubes. It turns out that percentage cubes are signi cantly more difficult to evaluate than standard data cubes because, in addition to the exponential number of cuboids, there exists a doubly exponential number of grouping column pairs (grouping columns at the individual level and at the total level) on which percentages are computed. Fortunately, it is feasible to prune the search space with a threshold similar to iceberg queries. Experiments on a DBMS compare our novel query optimizations against existing SQL OLAP window functions. Our benchmark results show that our proposed SQL extension is more abstract, more intuitive and faster than existing SQL functions to compute percentages on the cube.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Companies today rely heavily on decision support
systems for help on analytical tasks to stay competitive. Those
systems can identify important or interesting trends by
retrieving decision support information via cube queries. Data
cube, rst introduced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], generalized the standard \GROUP
BY" operator to compute aggregations for every
combination of the grouping columns. Building data cubes has been
well recognized as one of the most important and essential
operations in OLAP. Research on building data cubes is
extensive and many methods have been proposed to compute
data cubes efficiently from relational data [
        <xref ref-type="bibr" rid="ref2 ref7 ref9">2, 9, 7</xref>
        ]. However,
the aggregation applied on the cube measure that most of
the research has been studying on never goes further than
the standard ones: sum(); avg(); count(); max() and min().
A very important and special aggregation that is missing
from the list is the percentage.
      </p>
      <p>
        Percentages are essential to data analysis. They can
express the proportionate relationship between two amounts
summarized by different levels. Sometimes, percentages are
less deceiving than absolute values so they are very
suitable for comparisons. Furthermore, percentages can also be
used as an intermediate step in some applications for more
complex analysis. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed ways to compute percentage
aggregation in RDBMS. But getting percentage values from
data cube is syntactically much more complicated and not
as straightforward as standard aggregations. In this paper,
we introduce a special form of data cube taking percentages
as the aggregated measure and we call it percentage cube.
A percentage cube shows the fractional relationship in
every cuboid between each aggregated measure and its further
summed-up measures aggregated by less detailed grouping
columns.
      </p>
      <p>Here we give an example of a percentage cube. Assume
we have a fact table about the sales amount of a company
in the rst two quarters of 2016 at some states as shown in
Table 1.</p>
      <p>The fact table F has two dimensions state and quarter,
and one measure salesAmt. In order to develop an insight of
the sales amount, we can build a multi-dimensional OLAP
cube as shown in Table 2.</p>
      <p>From Table 2 it is easy to nd out the sum of the sales
amount grouped by any possible combination of the
dimensions in the fact table. However, sometiems we are
interested in the fractional relationship like how much is the sales
amount in Q1 accounted for the sales amount of the rst half
of the year. With the ordinary OLAP cube, we have to rst
query the sales amount in Q1 (128M), then query the sales
amount of the rst half of the year (226M), and nally do
the division to get the answer (57%). It does not seem to be
very complicated when looking at this single question. But
data analysts explore the OLAP cube with a lot of cube
operations (roll-up, drill-down, slicing, and dicing etc.), the
effort of identifying the individual / total group and
issuing additional queries every time when we need to see the
percentage adds up quickly as a burden in the analysis.</p>
      <p>Instead, Table 3 shows a percentage cube we could build
on top of the fact table F .</p>
      <p>With this percentage cube table, we will be able to answer
the question (how much is the sales amount in Q1 accounted
for the sales amount of the rst half of the year) at the rst
glance of the table (the row in bold font). Compared to
the OLAP cube table (Table 2), each cuboid in the
percentage cube is expanded quite a bit. For example, for cuboid
fstate; quarterg, in the OLAP cube we only have four rows
of data showing the sales amount in every state and every
quarter. But in the percentage cube, depending on which
dimensions we are \total by" and \break down by", we have 12
rows of data showing the percentage taking different columns
as total amount (denominator) and individual amount
(numerator).</p>
      <p>In reality, analysts may not even need to look at this
percentage cube table in order to get the answer they want. It
is natural that percentages can be easily visualized as pie
charts. A percentage cube in this sense is no more than a
collection of pie charts. A user may even be able to navigate
through those pie charts in a hierachical manner, adding
grouping granularity step by step. Once the computation of
a percentage cube completes, the exploration on the cube
visualization will not need to go under any additional
computations therefore it has no additional costs. Figure 1 shows
an example.</p>
      <p>Unfortunately, existing SQL aggregate functions, OLAP
window functions, and Multidimensional Expressions (MDX)
are all insufficient for computing percentage cubes. The
computation is too complicated to express using existing
syntax, also the exponential number of grouping column
pairs added a lot of complexities. In this paper we introduce
simple percentage aggregate functions and percentage cube
syntax, as well as important recommendations to efficiently
evaluate them.</p>
      <p>This paper is organized as follows. Section 2 presents
definitions related to OLAP aggregations and the percentage
cube. Section 3 introduces the function to compute
percentage aggregation and then extends it to work efficiently
to evaluate a percentage cube. Section 4 contains
experimental evaluation. Section 5 discusses related approaches
and the uniqueness of our work. Section 6 concludes the
paper.
2.</p>
    </sec>
    <sec id="sec-2">
      <title>DEFINITION</title>
      <p>
        Let F be a relation having a primary key represented by
a row identi er i, d categorical attributes and one numerical
attribute: F (i; D1; : : : ; Dd; A). Relation F is represented in
SQL as a table having a primary key, d categorical columns
and one numerical column. Categorical attributes
(dimensions) are used to group rows to aggregate the numerical
attribute (measure). Attribute A represents some
mathematical expression involving measures. In general F can
be a temporary table resulting from some query or a view.
We will use F to generate a percentage cube with all the
d dimensions and one measure [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Percentage
computation involves two levels of aggregations: the individual level
that will appear as numerator in the percentage
computation, and the total level that will show as the denominator.
In the DBMS, we name the resultant table of the
individual level aggregation \Findv" and the total level aggregation
\Ftotal". Both levels of aggregations aggregate the measure
A by different sets of grouping columns.
      </p>
      <p>Percentage computation in a percentage cube happens in
the unit of cuboid. When talking about a cuboid, we use G
to represent its grouping columns, that is, G represents the
dimensions in that cuboid which are not \ALL"s. Also we
let g = jGj to represent the number of the grouping columns
in a cuboid. In each cuboid, we use L = fL1; : : : ; Ljg
to represent the grouping columns used in the total level
aggregation. When computing percentages, measures
aggregated by L will serve as the total amount
(denominator). Therefore, columns in L can also be called the \total
by" columns. The total amounts then can be further
broken down to individual amounts using additional grouping
columns R = fR1; : : : ; Rkg; L \ R = ∅. Columns in R are
called \break down by" columns. Overall, the individual
level aggregation will use L [ R as its grouping columns.
Note that set L can be empty, in that case the percentages
are computed with respect to the total sum of A for all rows.
The total level and individual level has to differ, therefore
R ̸= ∅. In each cuboid where the two levels of aggregation
happen, L [ R = G. The percentage will be the quotient of
each aggregated measure from the individual level and its
corresponding value from the total level. All the individual
percentage values derived from the same total level group
will add up to 100%.</p>
    </sec>
    <sec id="sec-3">
      <title>FROM PERCENTAGE AGGREGATIONS</title>
    </sec>
    <sec id="sec-4">
      <title>TO THE PERCENTAGE CUBE</title>
      <p>In this section we will rst introduce our syntax for
percentage aggregations to the standard SQL, then we will show
two major methods to evaluate it and optimizations in a
columnar DBMS. We will show how we build the
percentage cube using the percentage aggregations. And to further
optimize, we will introduce a group frequency threshold ϕ to
our methods and we propose two strategies to do fast group
pruning.
3.1</p>
    </sec>
    <sec id="sec-5">
      <title>Percentage Aggregation</title>
      <p>The basics of the percentage cube is percentage
aggregation. By far there is no syntax in standard SQL for
percentage aggregations, so in this section we will rst propose our
pct() function to compute them.</p>
      <p>pct(A TOTAL BY L1; : : : ; Lj</p>
      <p>BREAKDOWN BY R1; : : : ; Rk).</p>
      <p>The rst argument is the expression to aggregate represented
by A. The next two arguments represent the list of
grouping columns used in the total level aggregation and the
additional grouping columns to break the total amounts down
to the individual amounts. The following SQL statement
shows one typical pct() call:
SELECT L1; : : : ; Lj; R1; : : : ; Rk,
pct(A TOTAL BY L1; : : : ; Lj</p>
      <p>BREAKDOWN BY R1; : : : ; Rk)</p>
      <sec id="sec-5-1">
        <title>FROM F</title>
        <p>GROUP BY L1; : : : ; Lj; R1; : : : ; Rk;</p>
        <p>When using the pct() aggregate function, several rules
shall be enforced:
1. The \GROUP BY" clause is required because we need
to perform two-level aggregations.
2. Since set L can be empty, the \TOTAL BY" clause
inside the function call is optional, but the
\BREAKDOWN BY" clause is required because R ̸= ∅. Any
columns appeared in either of those two clauses must
be listed in the \GROUP BY" clause. In particular,
the \TOTAL BY" clause can have as many as d 1
columns.
3. Percentage aggregations can be applied on any queries
along with other aggregations based on the same GROUP
BY clause in the same statement. But for the
simpli cation and exposition purposes, we do not apply
percentage aggregations on queries having joins.
4. When having more than one pct() calls in one single
query, each of them can be used with different
subgrouping columns, but still all of their columns have
to be present in the \GROUP BY" clause.</p>
        <p>
          The pct() function computes one percentage per row and
has a similar behavior to the standard aggregate functions
sum(); avg(); count(); max() and min() that have only one
argument. The order of rows in the result table does not
have any impact on the correctness, but usually we return
the rows in the order given by the \GROUP BY" clause
because rows belong to the same group (i.e. rows making up
100%) can be displayed together. The pct() function returns
a real number in the range of [
          <xref ref-type="bibr" rid="ref1">0,1</xref>
          ] or NULL when dividing
by zero or doing operations with null values. If there are null
values, the sum() aggregate function determines the sums
to be used. That is, pct() preserves the semantics of sum(),
which skips null values.
        </p>
        <sec id="sec-5-1-1">
          <title>Example</title>
          <p>We still use our fact table shown in Table 1. The following
SQL statement shows one speci c example that computes
the percentage of the sales amount of each city out of every
quarter's total.</p>
          <p>SELECT quarter; city,
pct(salesAmt TOTAL BY quarter</p>
          <p>BREAKDOWN BY city)</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>FROM F</title>
        <p>GROUP BY quarter; city;
In this example, at the total level we rst group the total
sums by quarter, then we further break each group down to
individual level by city. The result table is shown in Table
4.</p>
        <p>Comparing Table 4 and Table 3 we will nd that a
percentage cube is no more than a collection of percentage
aggregation results.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>SQL Programming</title>
      <p>The pct() function call can be converted to standard SQL.
The general idea can be described as the following two steps:
1. Evaluate the two levels of aggregations respectively.
2. Compute the quotient of the aggregated measures as
the individual percentage value from Findv and Ftotal
where both of them match in their L columns.</p>
      <p>In practice, how do we compute the two levels of
aggregations is the key factor to distinguish between different
methods. In this section we discuss two methods, one is the
OLAP window function method, the other is the
GROUPBY method using standard aggregations.</p>
      <sec id="sec-6-1">
        <title>The OLAP window function Method</title>
        <p>Queries with OLAP functions can apply aggregations on
window partitions speci ed by the \OVER" clauses. There
can be several window partitions with different grouping
columns in each OLAP query. That makes this method the
only way we can get Findv and Ftotal from the fact table
with only one single query without joins. The problem with
OLAP window function is that although the aggregate
function is calculated with respect to all the rows in each
partition, the result is applied to each row. Therefore in our case
the result table may have duplicated rows with the same
percentage values. The following example shows the SQL
query to compute the percentage of the sales amount for
each state out of every quarter's total using the raw OLAP
window function method:
SELECT quarter; state, (CASE WHEN Y &lt;&gt; 0 THEN X=Y</p>
        <p>ELSE NULL END) AS pct</p>
        <sec id="sec-6-1-1">
          <title>FROM</title>
          <p>(SELECT quarter; state,
sum(salesAmt) OVER (PARTITION BY quarter; state)
AS X,
sum(salesAmt) OVER (PARTITION BY quarter) AS Y
FROM F ) foo;</p>
          <p>To get the results correct, we can get rid of the duplicates
with the following two methods:
(1) Use \DISTINCT" keyword.
The problem with this method is that the use of
\DISTINCT" keyword will introduce sorting to the query
execution plan. The sorting can be very costly if no axillary
structures (indexes, projections) are applied. Then we came
up with an alternative approach:
(2) Use \row number()"
row number() is another OLAP function that can assign a
sequential number to each row within a window partition
(starting at 1 for the rst row). When using this method,
we assign row identifers in each partition de ned by L [ R,
then we just need to select one of such tuples per group to
eliminate the duplicates.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>The GROUP-BY Method</title>
        <p>In this method using standard aggregations, the two levels
of aggregations are pre-computed and stored in temporary
tables Ftotal and Findv respectively. The percentage value
is evaluated in the last step by joining Findv and Ftotal on
the L columns and computing Findv:A=Ftotal:A. It is always
important to check before computing that Ftotal:A can not
be zero as the denominator.</p>
        <p>Now we discuss the evaluation of Ftotal and Findv. It is
obvious that Findv can only be computed from the fact table
F .</p>
        <p>SELECT L1,. . . ,Lj,R1,. . . ,Rk, sum(A)
INTO Findv FROM F
GROUP BY L1,. . . ,Lj,R1,. . . ,Rk;
Ftotal, however, as a more brief summary of the fact table
with fewer grouping columns (only columns in L) than Findv,
can be derived either also from the fact table F , or directly
from Findv. Evaluating aggregate functions requires the
input table to be fully scanned, therefore the size of the input
table will have an impact on the performance. When the
size of F is much larger than Findv, in which case the
cardinality of the grouping columns is relatively small, getting
Ftotal from Findv will be much faster than from F because
fewer rows are scanned.</p>
        <p>SELECT L1,L2,. . . ,Lj ,sum(A)
INTO Ftotal
FROM Findv j F
GROUP BY L1,L2,. . . ,Lj ;
The nal result can be either inserted to another temporary
table or in-place updated on Findv itself. The in-place
updating way can avoid creating the temporary table with the
same size as Findv. That would be very helpful in the case
when disk space is limited.</p>
        <sec id="sec-6-2-1">
          <title>INSERT INTO Fpct</title>
          <p>SELECT Findv:L1,. . . ,Findv:Lj,Findv:R1,. . . ,Findv:Rk,
(CASE WHEN Ftotal:A &lt;&gt; 0 THEN Findv:A=Ftotal:A
ELSE NULL END) AS pct
FROM Ftotal JOIN Findv
ON Ftotal:L1 = Findv:L1,. . . ,Ftotal:Lj = Findv:Lj;</p>
          <p>Compared to the two methods we introduced just now,
we argue that our syntax for percentage is a lot simpler and
do not require join semantics.
3.3</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Percentage Cube</title>
      <p>Recall that percentage cubes further develop ordinary data
cubes. Even though they share a similarity that can give us
an insight of data in a hierarchical manner, they are indeed
very different. A data cube has a multidimensional
structure that summarizes the measures over cube dimensions
grouped at all different levels of details. A data cube whose
dimensionality is d will have 2d different cuboids. While a
percentage cube, in addition to summarizing the measure
in cuboids like a data cube does, it categorizes the
dimensions in each cuboid into set L and R in all possible ways.
Then vertical percentage aggregation is evaluated based on
each L and R key setting. The computational complexity
of the percentage cube can be summarized in the following
properties:
Property 1: The number of different grouping column
combinations in a cuboid with g grouping columns is 2g 1.
Property 2: The total number of all different grouping
column combinations in a percentage cube with d dimensions
d
is ∑ (id)(2i 1) = O(22d).</p>
      <p>i=1</p>
      <p>So a percentage cube can be much larger than an ordinary
data cube in size and it is a lot more difficult to evaluate.
Figure 2 shows a speci c example when d = 2, the data cube
will have 4 cuboids while the percentage cube will have in
total 5 different grouping column combinations (The last
cuboid f ; g will not be included in the percentage cube
because the set R cannot be empty). The gap is not so
big because the d we show is small due to space limit, since
both the number of cuboids in the cube and the number of
possible grouping column combinations in one cuboid grow
exponentially as d increases, this gap will become
surprisingly large when d gets large.</p>
      <p>Due to the similarity of the representation of percentage
cubes and percentage aggregations, it is not surprising the
problem of building a percentage cube can be broken down
to evaluating multiple percentage aggregations and they can
share similar SQL syntax. Below we propose our SQL
syntax to create a percentage cube on the fact table we showed
in Table 1. When creating a percentage cube, the pct()
function call will no longer require a \TOTAL BY" or
\BREAKDOWN BY" clause.</p>
      <p>SELECT quarter; state; pct(salesAmt) FROM F
GROUP BY quarter; state
WITH PERCENTAGE CUBE;</p>
      <p>We describe the algorithm to evaluate a percentage cube
using percentage queries in Algorithm 1. The outer loop in
Algorithm 1 iterates over each cuboid. Then we will exhaust
all the possible ways in the inner loop to allocate the columns
in set G that are available in the current cuboid to set L and
R (columns in set L are the grouping columns for the total
level aggregation and columns in set R are the additional
grouping columns to break down the total amounts). For
each L and R allocation, we evaluate a percentage
aggregation and union all the aggregation results together to be the
nal percentage cube table.</p>
      <p>Data: fact table F , measure A, cube dimension list</p>
      <p>M = fD1; : : : ; Ddg
Result: d-dimension percentage cube
Result table RT = ∅ ;
for each G M; G ̸= ∅ do
for each L G do</p>
      <p>R = G n L
RTtemp = pct(A TOTAL BY L</p>
      <p>BREAKDOWN BY R);
end</p>
      <p>RT = RT [ RTtemp ;
end
return RT ;
Algorithm 1: Algorithm to evaluate percentage cube.</p>
      <p>There is one small difference in the output schema between
an individual percentage aggregation and a percentage cube.
In a percentage cube, we add two more columns called \total
by" and \break down by" to keep track of the total and the
individual level setting (See Table 3). This is because unlike
individual percentage queries having only one total and the
individual level setting in the output, the percentage cube
explores all the potential combinations. An entry having
column fA; Bg may be \total by" A and \break down by" B
or the opposite, or even \break down by" both A and B.</p>
      <p>We also need to point out that for each cuboid, no matter
how the grouping column setting L and R change, the
individual level aggregation Findv will stay the same. This is
because the Findv is grouped by L and R meanwhile L[R = G
which will stay the same in one cuboid. Based on this
observation, unlike in vertical percentage aggregations that we
compute Findv from F in each pct() call, here we only
compute Findv once for every cuboid, then the result will be
materialized for the rest of the L and R combinations in the
same cuboid to avoid duplicated computations of Findv.
3.4</p>
    </sec>
    <sec id="sec-8">
      <title>Group Frequency Threshold</title>
      <p>Since the data cube with d dimensions will have 2d cuboids
as well as numerous entries, it is already very difficult to
evaluate such a huge amount of output. Moreover, as we
have discussed in section 3.3, a percentage cube is much
larger than an ordinary data cube because in addition to
2d cuboids, there are a lot more potential grouping column
combinations in each cuboid. Thus evaluating percentage
cubes will be much more demanding than evaluating data
cubes.</p>
      <p>To take a closer look at the problem, not all percentage
groups (i.e., groups formed by the total level aggregation)
are bene cial to users. Although in some groups we can see
entries with remarkable percentage values, the group itself
may be small in group frequency or the aggregated measure.
Discoveries on such groups do not make much sense and
can offer very limited help for analysis. Predictably it is
common to have such small groups in the percentage cube
especially when d is large. If we could avoid computing
those meaningless groups, the overall evaluation time can
be correspondingly reduced.</p>
      <p>
        On the data cube side, a similar problem of eliminating
GROUP-BY partitions with an aggregate value (e.g., count)
below some support thresholds can be addressed by
computing Iceberg cubes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For Iceberg queries it is proven that a
threshold is needed, for percentage cube it is even more
necessary. Similarly we introduce a threshold to prune groups
under a speci ed size. We call this threshold group
threshold, represented by ϕ. In percentage cubes, all the groups are
generated by the total level aggregation (Ftotal). Therefore,
unlike in Iceberg cubes we prune the partitions, in
percentage cubes we prune groups under a speci ed size ϕ, that is,
to lter the aggregated count() of groups formed by all the
possible L sets through this frequency threshold.
      </p>
      <p>
        Previous studies have developed two major approaches
to compute Iceberg cubes, top-down [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and bottom-up[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The most important difference between those two methods is
that the bottom-up algorithm can take advantage of
Apriori pruning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Such pruning strategy can also apply on
percentage cubes. In this section, we discuss two pruning
strategies: direct pruning and bottom-up cascaded pruning.
      </p>
      <sec id="sec-8-1">
        <title>Direct pruning</title>
        <p>Direct pruning further develops the Algorithm 1. It
validates the threshold on all possible grouping column
combinations directly without sharing pruning results between
computations at difference grouping levels. In order to let
the computation of Ftotal continue to reuse the result of
Findv table that comes from the coarser level of details, we
also put count(1) in Findv results. When computing Ftotal
from F , the group frequency is evaluated by count().
However, when utilizing Findv to get Ftotal, the group frequency
is evaluated by summing up the counts in Findv. The
threshold is enforced in Ftotal query by specifying the limit in the
\HAVING" clause.</p>
        <sec id="sec-8-1-1">
          <title>SELECT L1,L2,. . . ,Lj,sum(A),count(1) AS cnt</title>
          <p>INTO Ftotal
FROM F
GROUP BY L1,L2,. . . ,Lj
HAVING count(1)&gt;ϕ;
SELECT L1,L2,. . . ,Lj,sum(A),sum(cnt) AS cnt
INTO Ftotal
FROM Findv
GROUP BY L1,L2,. . . ,Lj
HAVING sum(cnt)&gt;ϕ;</p>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>Cascaded pruning</title>
        <p>In order to take full advantage of previous pruning results,
we propose a new algorithm that iterates on all the L sets
that the d dimensions can possibly have from bottom to top.
If the count() of any group formed by one L set failed to
meet the minimum threshold then the group will be pruned
and cease to go deeper down through checks with other
L sets that has more dimensions included. On the other
hand, quali ed groups will be materialized in temporary
tables with their grouping column values, the count() and the
aggregated measures so that this materialized table can be
later used as Ftotal in percentage computation, or for group
checks with more detailed L sets.</p>
        <p>A L set may appear in multiple cuboids. For example,
taking fA; B; C; Dg as the base cuboid, a L set fA; Bg can
be valid in cuboid fA; B; Cg, fA; B; Dg and fA; B; C; Dg.
Therefore the materialized table for each L set can be reused
as Ftotal in multiple cuboids. The Findv in this pruning
strategy will be computed in the the middle of percentage
computation. Recall that in Algorithm 1, we rst
determine the cuboid to get S, the cuboid's dimension list, in the
outermost loop, then we get each legal L and R from S in
inner loops. Keep in mind that the Findv is also grouped
by S, therefore in this computational order, every Findv
is computed, intensively utilized and discarded. However,
in cascaded pruning L is determined rst in the outermost
loop, so Findv will have a scattered utilization. It becomes
very important that we keep the Findv results after it is rst
computed for future usage. A collateral change is that when
computing Ftotal, now it is not always true that Ftotal can
be computed based on Findv because the Findv it needs may
have not been computed yet.</p>
        <p>Since cascaded pruning has a more complex logic, we use
a Java program as the query generator and issuer. Adopting
Java program has its own trade-offs. The program will
participate the evaluation throughout the entire process. We
will have to suffer the communication cost of JDBC and the
running cost of the Java program itself. This problem can
be meliorated in the future by integrating the algorithm into
the DBMS.</p>
        <p>In order to label the L set on each materialized table, we
assign each cube dimension with an integer identi er
according to the position it is standing in the dimension list. The
identi er for the i th dimension will be 2i 1. With the
dimension identi er, any L set or R set or cuboid
dimension list can be represented by a representation code that
comes from the bitwise OR operation on all the dimensions'
identi er in the set, and the code for its parent set can be
evaluated by eliminating the highest non-zero bit. All
materialized Findv and ltered Ftotal will be named as Findvi and
Ftotali where i stands for the dimension representation code.
Figure 3 shows the relation between the representation code
and the dimension set it represents. Also by eliminating
the highest non-zero bit, it can be linked with its parent
dimension set.</p>
        <p>We show the cascaded pruning algorithm to evaluate the
percentage cube with group frequency threshold ϕ in
Algorithm 2.
Set Up</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>EXPERIMENTS</title>
    </sec>
    <sec id="sec-10">
      <title>Experimental Environment</title>
      <p>We conducted our experiments on a 2-node cluster of Intel
dual core workstations running at a 2.13 GHz clock rate.
Each node has 2GB main memory and 160GB disk storage
with SATA II interface. The nodes are running Linux
CentOS release 5.10 and are connected by a 1 Gbps switched
Ethernet network.</p>
      <p>The HP Vertica database management system was
installed and working under a two-node parallel con guration
for obtaining the query execution times shown in this
section's tables. The reported times in the table are rounded
down to integral numbers and are the average of seven
identical runs after removing the best and worst total elapsed
time.</p>
      <sec id="sec-10-1">
        <title>Data Set</title>
        <p>The data set we used to evaluate the aggregation queries
with different optimization strategies is the synthetic data
sets generated by the TPC-H data generator. In most cases
we use the fact table transactionLine as input and the column
\quantity" as measure. Table 5 shows the speci c columns
from the TPC-H fact table that we used as left key and
right key to evaluate percentage queries. jL1j and jR1j are
the cardinalities of L1 and R1 respectively. Table 6 shows
the candidate dimensions and their cardinalities we used to
evaluate percentage cubes. Table 7 shows the dimensions
we chose to generate the Percentage Cube at various cube
dimensionality. In this paper we vary the dimensionality of
the cube d from 2 to 6. Very large d does not make much
sense for our computation, because the groups will be too
small.</p>
        <p>Percentage cube evaluation got the idea from pct() for
percentage aggregations. So we rst compare the two
methods to implement pct() as we discussed in Section 3.2, i.e.
GROUP-BY and OLAP. We performed this comparison on
a replicated fact table whose scale factor = 4. In this
comparison we included optimization options for both methods.
For the GROUP-BY method, we evaluate Ftotal from F
and Findv separately, while for the OLAP window function
method we eliminate duplicates by using \DISTINCT" and
row number() separately. Table 8 shows the result of the
comparison. For each grouping column combination in each
row, we highlight the fastest con guration with bold font.
Generally speaking evaluation by the GROUP-BY method
is much faster than by the OLAP window function method.
For the OLAP method, using row number() instead of
\DISTINCT" keyword may accelerate the evaluation speed by
about 2 times. But it is still obviously slower than the
GROUP-BY method. For the GROUP-BY method, we can
see the cardinality of the right key jR1j, has a direct
impact on the performance of two strategies to generate Ftotal.
When jR1j is relatively small, say jR1j = 7, for all different
jL1j we have tested, generating Ftotal from Findv is about
2-3 times faster than from F . However as jR1j gets larger,
say jR1j = 25, there is no very big difference where Ftotal is
generated from.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Cube Generation: GROUP-BY vs. OLAP</title>
      <p>To one step further, we now put the pct() calls together
using the Algorithm 1 to get the entire percentage cube. Since
evaluating a percentage cube is much more demanding than
evaluating individual percentage queries, when comparing
those two methods in cube generation, we choose the best
performing parameter based on the observation we made in
section 4.3. That is, for GROUP-BY method, we generate
Ftotal from Findv, and for OLAP window function method,
we use row number() OLAP function to eliminate
duplicated rows. Table 9 shows the result of this comparison. The
result shows that our GROUP-BY method is about 10 times
faster than the OLAP window function method for all d's.
When d gets larger, it will take the OLAP window function
method hours to nish. The two methods show enormous
difference in their performances because our GROUP-BY
method takes advantage of the materialized Findv. So we
can not only avoid duplicated computation of Findv for each
group in the same cuboid, but also bene t the generation of
Ftotal for different L set.
4.5</p>
    </sec>
    <sec id="sec-12">
      <title>Comparing Pruning Strategies</title>
      <p>Even though the GROUP-BY method can offer a decent
performance when evaluating the percentage cube, but it is
still not enough especially for fact tables with large d and
N . As we discussed in section 3.4, we want to further
optimize this evaluation process by introducing group frequency
threshold to prune the groups with low count(). In this
section, we compare the cube generation with various group
threshold using direct pruning on Algorithm 1 and cascaded
pruning in Algorithm 2. We choose the value of the
threshold as certain percentages of total row number of the fact
table N . Here we choose the count() threshold as 10% of
N , 8% of N and 6% of N as well as no threshold applied.
The result is shown in Table 10. We rst look at
evaluation times for each algorithm separately. It shows although
the time increases when we decrease the threshold, the
difference is not so big. This is because the data distribution
of the data set we used is almost uniform. A more skewed
distribution of values will result in a more obvious increase
in evaluation times for the decreasing thresholds. But on
the other hand we can see from the result the evaluation
time increases greatly for runs with no threshold.
Therefore it proves to us that it is necessary to prune the groups
with small size because not only users have few interest in
them but also by pruning them the evaluation time can be
shortened a lot. Then we compare the evaluation time
between algorithms. When d continues to grow, bottom-up
algorithm shows much better performance for cases when
thresholds are applied. But two algorithms shows almost
no difference when threshold ϕ = 0. Direct pruning can
run without the support the Java program, however for the
cascaded pruning, the Java program has to participate the
processing throughout evaluation. So the cost of
communication via JDBC and the running the Java program can
compromise the true performance of the algorithm.</p>
      <p>Data: fact table F , measure A, cube dimension list</p>
      <p>M = fD1; : : : ; Dpg, group threshold ϕ
Result: d-dimension percentage cube
Result table RT = ∅ ;
Ftotal0 = count&gt;ϕ( count(1);sum(A)(F )) ;
for each L M; L ̸= ∅ do
i = getRepresentationCode(L) ;
pCode = getP arentCode(i) ;
if pCode = 0 then</p>
      <p>Ftotali =</p>
      <p>count&gt;ϕ( L;count(1);sum(A)(F )) ;
else
if FtotalpCode does not exist then</p>
      <p>continue next L ;
else
if Findvi exists then</p>
      <p>Ftotali =
sum(count)&gt;ϕ( L;sum(count);sum(A)(
FtotalpCode &gt;&lt;FtotalpCode :L=Findvi :L Findvi ))
;
else
end</p>
      <p>Ftotali = count(1)&gt;ϕ( L;count(1);sum(A)(</p>
      <p>FtotalpCode &gt;&lt;FtotalpCode :L=F:L F )) ;
end
end
if jFtotali j ̸= 0 then</p>
      <p>Materialize Ftotali ;
end
for each R (M n L) do</p>
      <p>S = L [ R ;
sCode = getRepresentationCode(S) ;
if FindvsCode does not exist then</p>
      <p>Materialize FindvsCode = S;sum(A)(F ) ;
end
end
RTtemp =
;
RT = RT [ RTtemp ;</p>
      <p>S;FindvsCode :A=Ftotali :A(</p>
      <p>Ftotali 1Ftotali :L=FindvsCode :L FindvsCode )
end
return RT ;
Algorithm 2: Cascaded pruning algorithm to evaluate
percentage cube with threshold.
5.</p>
    </sec>
    <sec id="sec-13">
      <title>RELATED WORK</title>
      <p>
        Some SQL extensions to help data mining tasks are
proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These include a primitive to compute samples
and another one to transpose the columns of a table. SQL
extensions to perform spreadsheet-like operations with
array capabilities are introduced in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Those extensions are
not adequate to compute percentage aggregations because
they have the purpose of avoiding joins to express
formulas, but are not optimized to handle two-level aggregations
or perform transposition. The optimizations and proposed
code generation framework discussed in this work can be
combined with that approach.
      </p>
      <p>
        Percentage aggregation function was rst proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
However we clarify the de nition of the left keys and right
keys in the function calls with the introduction of
\BREAKDOWN BY" and \TOTAL BY" clauses. Our notation is
clearer with no confusions. We also improved the OLAP
evaluation method with the row number() approach. This
row number() approach is 2 times faster than the previous
\DISTINCT" approach. Moreover, we extended the
problem of evaluating percentage aggregations to the
percentage cube. Percentage cubes have wide applicability. To the
best of our knowledge, percentage cubes have never been
explored in any other research works. Its evaluation is much
more difficult than individual percentage queries and has
different optimizations like group pruning.
6.
      </p>
    </sec>
    <sec id="sec-14">
      <title>CONCLUSION</title>
      <p>We proposed a generalized form of data cube, the
percentage cube, that takes percentages as the aggregated measure.
Percentage cubes show the fractional relationship within
each cube cuboid between a measure aggregated by all given
grouping columns and the same measure aggregated by
every proper subset of the grouping columns. Since such kind
of aggregation has not been explored before and it is
missing from standard SQL, we introduced minimal SQL syntax
changes, following standard cube syntax. We introduced
the pct() aggregate function and we studied query
processing in SQL, using standard OLAP window functions and a
GROUP-BY bottom-up method based on standard
aggregation functions. We studied query optimization on both
methods including efficiently reusing intermediate results,
avoiding external sorts from the query plan and
exploiting indexing. Experimental results showed that our novel
GROUP-BY method works much faster than existing OLAP
window functions. We also introduced direct and cascaded
strategies based a percentage threshold to prune the cube
exploration search space and decrease processing time of the
entire percentage cube.</p>
      <p>There are many opportunities for future research. First,
the percentage cube explores all possible grouping column
combinations and the results have to be computed through
joins. Due to the large amount of grouping column
combinations it has been difficult to take advantage of projections in
columnar DBMSs. How to improve performance of
percentage cube with projections remains to be studied. Second,
our current cascaded pruning strategy relies on a Java
program throughout the evaluation process whose performance
is compromised due to the overhead of JDBC
communication. We expect better performance will be achieved by a
tighter integration of the cube exploration algorithm into
the DBMS.</p>
    </sec>
    <sec id="sec-15">
      <title>ACKNOWLEDGMENTS</title>
      <p>The third author (Javier Garc a-Garc a) was sponsored by
the information technology project PEI 2016, No. 231476
\Nube-colaborativa (SIG) multisensorial para el manejo de
riesgos en yacimientos no convencionales" and the
Laboratorio Nacional de Ciencias de la Complejidad.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          , and et al.
          <article-title>Fast algorithms for mining association rules</article-title>
          .
          <source>In Proc. 20th int. conf. very large data bases, VLDB</source>
          , volume
          <volume>1215</volume>
          , pages
          <fpage>487</fpage>
          {
          <fpage>499</fpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Beyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          .
          <article-title>Bottom-up computation of sparse and iceberg cube</article-title>
          .
          <source>In ACM SIGMOD Record</source>
          , volume
          <volume>28</volume>
          , pages
          <fpage>359</fpage>
          {
          <fpage>370</fpage>
          . ACM,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Clear</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dunn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Harvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.L.</given-names>
            <surname>Heytens</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Lohman</surname>
          </string-name>
          .
          <article-title>Non-stop SQL/MX primitives for knowledge discovery</article-title>
          .
          <source>In ACM KDD Conference</source>
          , pages
          <volume>425</volume>
          {
          <fpage>429</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosworth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Layman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Pirahesh</surname>
          </string-name>
          .
          <article-title>Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-total</article-title>
          .
          <source>In ICDE Conference</source>
          , pages
          <volume>152</volume>
          {
          <fpage>159</fpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>et</surname>
            <given-names>al. M.</given-names>
          </string-name>
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Shivakumar</surname>
          </string-name>
          .
          <article-title>Computing iceberg queries efficiently</article-title>
          .
          <source>In Internaational Conference on Very Large Databases (VLDB'98)</source>
          , New York,
          <year>August 1998</year>
          . Stanford InfoLab,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          .
          <article-title>Vertical and horizontal percentage aggregations</article-title>
          .
          <source>In Proc. ACM SIGMOD Conference</source>
          , pages
          <volume>866</volume>
          {
          <fpage>871</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          et al.
          <article-title>Scalable data cube analysis over big data</article-title>
          .
          <source>arXiv preprint arXiv:1311.5663</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Witkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bellamkonda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bozkaya</surname>
          </string-name>
          , G. Dorman,
          <string-name>
            <given-names>N.</given-names>
            <surname>Folkert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          .
          <article-title>Spreadsheets in RDBMS for OLAP</article-title>
          .
          <source>In Proc. ACM SIGMOD Conference</source>
          , pages
          <volume>52</volume>
          {
          <fpage>63</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Xin</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.W.</given-names>
            <surname>Wah</surname>
          </string-name>
          .
          <article-title>Computing iceberg cubes by top-down and bottom-up integration: The starcubing approach</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <volume>111</volume>
          {
          <fpage>126</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.M.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.F.</given-names>
            <surname>Naughton</surname>
          </string-name>
          .
          <article-title>An array-based algorithm for simultaneous multidimensional aggregates</article-title>
          .
          <source>In ACM SIGMOD Record</source>
          , volume
          <volume>26</volume>
          , pages
          <fpage>159</fpage>
          {
          <fpage>170</fpage>
          . ACM,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>