=Paper= {{Paper |id=Vol-28/paper-4 |storemode=property |title=Efficient query processing with associated horizontal class partitioning in an object relational data warehousing environment |pdfUrl=https://ceur-ws.org/Vol-28/paper4.pdf |volume=Vol-28 |dblpUrl=https://dblp.org/rec/conf/dmdw/GopalkrishnanLK00 }} ==Efficient query processing with associated horizontal class partitioning in an object relational data warehousing environment== https://ceur-ws.org/Vol-28/paper4.pdf
    Efficient Query Processing with Associated Horizontal Class
Partitioning in an Object Relational Data Warehousing Environment
                                Vivekanand Gopalkrishnan                                                     Qing Li
                             Department of Computer Science,                                     Department of Computer Science,
                              City University of Hong Kong,                                       City University of Hong Kong,
                               Kowloon, Hong Kong, PRC                                             Kowloon, Hong Kong, PRC
                                  vivek@cs.cityu.edu.hk                                                 csqli@cityu.edu.hk
                                                                                Kamalakar Karlapalem
                                                                          Department of Computer Science,
                                                                         University of Science and Technology
                                                                          Clear Water Bay, Hong Kong, PRC
                                                                                   kamal@cs.ust.hk

                                         Abstract                                      consistently and instantaneously. Maintaining the
                                                                                       integrity of these Indexes and Views [GM95], [MK99]
     In an Object Relational Data Warehousing                                          imposes a challenging problem when the source data
     (ORDW) environment, the semantics of data and                                     changes frequently, when the size of the DW keeps
     queries can be explicitly captured, represented,                                  growing, and/or when the user queries become
     and utilized based on is-a and class composition                                  increasingly complex. An extensible framework that can
     hierarchies, thereby resulting in more efficient                                  accommodate dynamic warehousing [Dayal99] of
     OLAP query processing. In this paper, we show                                     changing data gracefully, and have adaptive handles for
     the efficacy in building semantic-rich hybrid                                     processing OLAP queries efficiently is needed.
     class partitions by incorporating the Associated
     Horizontal Class Partitioning (AHCP) technique                                    In [VLK98], we showed that besides establishing a
     on the ORDW schema. Given a set of queries,                                       semantically richer framework for multi-dimension
     we use primary and derived partitioning                                           hierarchies, the Object Relational View (ORV) model
     algorithms to select (near) optimal AHCPs,                                        provides excellent support for complex object retrieval. In
     thereby embedding query semantics into the                                        [VLK99] we presented the Object Relational Data
     partitioned framework. Finally, by a cost model,                                  Warehousing (ORDW) approach to address some of the
     we analyze the effectiveness of our approach vis-                                 issues discussed in [VLK98] on data warehousing. More
     a-vis the unpartitioned approach.                                                 specifically, we devised a translation mechanism from
                                                                                       the star/snowflake schema to an object oriented (O-O)
                                                                                       representation. In [VLK00], we advocated a query
1 Introduction                                                                         processing strategy implementing the Structural Join
Data warehouse (DW) equips users with more effective                                   Index Hierarchy (SJIH) on ORDW.
decision support tools by integrating enterprise-wide
corporate data into a single repository from which                                     In this paper, we show the efficacy in building semantic-
business end-users can run reports and perform ad hoc                                  rich hybrid class partitions by incorporating the
data analysis [CD97]. As DWs contain enormous amount                                   Associated Horizontal Class Partitioning (AHCP)
of data, often from different sources, we need highly                                  technique on the ORDW schema. Given a set of queries,
efficient indexing structures [Sar97], [GHRU97],                                       we use primary and derived partitioning algorithms to
[VLK00], materialized (stored) Views [Rou97], and                                      select (near) optimal AHCPs, thereby embedding query
query processing techniques [VLK99] to efficiently                                     semantics into the partitioned framework. Finally, by a
answer on-line analytical processing (OLAP) queries.                                   cost model, we analyze the effectiveness of our approach
Materialized Views represent integrated data based on                                  vis-a-vis the unpartitioned approach.
complex aggregate queries, and should be available
                                                                                      To put our research in perspective, we review some
                                                                                      related work and briefly outline our previous work in the
 The copyright of this paper belongs to the paper’s authors. Permission to copy field of ORDW and Class Partitioning on OODBs in
 without fee all or part of this material is granted provided that the copies are not
 made or distributed for direct commercial advantage.
                                                                                      section 2. We further motivate our study by presenting on
                                                                                      the ORDW schema some sample queries whose patterns
 Proceedings of the International Workshop on Design and
                                                                                      are classified based on DW operations and by OO
 Management of Data Warehouses (DMDW'2000)
                                                                                      concepts. Obtaining an optimal partitioning scheme to
 Stockholm, Sweden, June 5-6, 2000                                                    process this set of queries is the focus of section 3, where
 (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.)                                    we employ a hill-climbing heuristic algorithm to select a
 http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/




Vivekanand G., Q. Li, K. Karlapalem                                                                                                           4-1
(near) optimal AHCPs. This algorithm is profiling                   other composition links. Moreover, we also demonstrate
driven, and can be further extended to incorporate other            is-a hierarchies (denoted by dotted lines) obtained by
semantics. In section 4, we compare results of retrieval            horizontally partitioning the ORDW schema.
costs using the AHCPs vs. the unpartitioned case. Section
5 concludes the paper with a brief look at future work.
                                                                                                 Sales                 Order


2 Background and Motivating Example                                  Product    Retailer                                       Date


                                                                                                                    Month             Week
2.1 Related Work                                                                 City            Customer
                                                                    Category
Partitioning has been vastly researched in Relational and                        State     Teenager         Adult   Season      Quarter
OO database systems. Excellent work has been done in                                                     Customer
Vertical Partitioning (VP) and Horizontal Partitioning                Type      Country                                        Year
(HP) in both systems, but the unique features of OO                  Product    Address                                               Time

systems have made it possible to experiment with
different variations such as Derived Horizontal Class                              Fig. 1. The ORDW Schema.
Partitioning (DHCP), Associated Horizontal Class                    The figure shows the class composition hierarchy for the
Partitioning (AHCP), Path Partitioning (PP) and Method              Time dimension, and the is-a hierarchy (shaded area) for
Induced Partitioning (MIP). [KL00] presents a                       the Customer dimension.
comprehensive framework for devising partitioning
schemes based on different types of methods and their               A Fact is the “subject” of the OLAP queries, and is
classification. The issue of fragmentation transparency is          quantified by its dimensions and “values”. Dimensions
addressed by considering           appropriate     method           can be hierarchical and composite in nature, whereas
transformation techniques. While those methods were                 Values are numerical data. When a Dimension is
extremely successful in the transactional environment of            complex enough to contain various other components
an OODB, to the best of our knowledge, no work has                  that can themselves be classified as Dimensions and
been done in partitioning of an Object Relational DB or             Values then that Dimension can be a “Fact” of another
Object Relational Data Warehouse (ORDW).                            OLAP query. In this case, we can consider the schema to
                                                                    be that of “Nested Fact” or “Fact within Fact”.
Recently, we have conducted some preliminary studies on
developing an ORDW framework. In [VLK98], we                        Whereas when two (or more) Facts share (one or more)
showed that the ORV (Object Relational View) model                  Dimensions, then the OLAP queries can be considered as
offers inherent features that are conducive to managing a           “Inter-Fact”. To support OLAP applications, we define a
data warehouse. We listed the various issues that arise             group of OLAP queries OQG (modified from [VLK00]
during the design of an OR-DWMS (Object Relational                  by adding predicates), which are invoked as a set (not
Data Warehouse Management Systems). Here, OR means                  necessarily in an order). Query patterns in the OQG may
an object-oriented front-end or views to underlying                 not be restricted to a particular composition hierarchy or
relational data sources. Based on the issues discussed in           inheritance hierarchy. The access paths may involve
[VLK98], we put forward a three-phase design approach               multiple paths emanating from the same complex object,
in [VLK99], which also provided a query-driven                      as well as interact with entities in completely unrelated
translation mechanism from the star/snowflake schema to             complex objects.
an object oriented (O-O) representation. Some query
processing strategies utilizing Structural Join Index               For this discussion, queries involving Nested Facts can be
Hierarchy (SJIH) techniques for complex queries on                  considered as subsets of inter-Fact queries. They are
composite objects were addressed in [VLK00].                        distinguished by the presence of a semantic disjointness
                                                                    between the Facts involved. It must be noted though that
2.2 Motivating Query Examples
                                                                    this disjointedness does not preclude the Facts from
To further motivate our subsequent discussions, let us              sharing the same component objects. A query-processing
consider the sample ORDW schema as shown in Figure                  scheme built on separate Facts will inadvertently need
1, taken from [VLK99]. This schema is a simple single-              costly joins. This inefficiency is amplified for queries
star/snowflake schema, for a sales application. As seen in          with low selectivity and high frequency. This calls for a
the figure, dimension classes (DCs) are connected by                need for a partitioning scheme that transcends Facts and
solid arrows (composition hierarchies) to the main fact             is not restricted by the hierarchies mentioned. It must be
class (FC). In addition, there are other inter-dimension            noted that such a partitioning scheme may well be
hierarchies (as obtained by vertical partitioning1), and            overlapping and hence will suffer on the storage space.

                                                                    Based on classifications by DW operations & by OO
1
    This paper deals only with Associated Horizontal Partitioning   concepts, we consider the following queries listed in
     techniques, and the above schema has been arrived at using     Table 1 as our sample OQG for subsequent discussions.
     other techniques.



Vivekanand G., Q. Li, K. Karlapalem                                                                                                          4-2
         Table 1. Sample OLAP queries - OQG                    frequency to determine the selection of minimal complete
                                                               set of partition fragments for optimal storage,
 No.                Query                    Query type        maintenance and retrieval costs.
 Q1     Sales by Prod by State in US     Only along cch
                                         (pivot)               3.1.1 Primary Horizontal Partitions (PHP)
 Q2     Sales by Prod by State by Year   -> Drill-down
        in US                                                  Classes in the ORDW schema can be denoted as Cip,
 Q3     Sales by Prod for Categ=Elec     -> Roll-up            which indicates the i th class in the p th path. The root
 Q4     Sales by Prod by City for        Only along cch,       class (FC) is denoted as C0. Primary Horizontal Partitions
        Categ=Elec                       Drill-down            on these classes can be denoted as sub-classes and placed
 Q5     Sales by Prod by Country for     Only along cch,       in the is-a hierarchy under the original partitioned class.
        Categ=Elec                       Roll-up               Note that the (sub) is-a hierarchy in our examples is
 Q6     Sales by Prod to Teenagers by    Only along cch,       denoted by the subscript i.j , which denotes the j th sub-
        State for Categ=Elec & in US     Slice_and_dice        class of the i th class (in the p th path). The Primary
 Q7     Sales of Prod 1 compared with    Only along cch,       Horizontal Partitioning (PHP) operation is denoted as :
        Sales of Prod 2 to Teenagers     Drill-down,
        for Categ=Elec                   Slice_and_dice                       PHP(Cip )p1 → { Ci.1p , Ci.2p ,..., Ci.np }
 Q8     % increase in Sales to           Combination of is-
                                                               where (Cip ) is the Class that is Primary Horizontally
        Teenagers over Sales to          a & cch, Drill-
        Adults, of Prod 1 / 2 for        down,
                                                               Partitioned according to a predicate (p1), resulting in n
        Categ=Elec & in US               Slice_and_dice        fragments which are treated as classes {Ci.np}. Note
                                                               however, that since FC is the only root in the realm of
Some OLAP queries could be on the entire range of Sales        our OLAP OQGs, any primary partition of the root need
and would need to access multiple dimensions for the           not display the path suffix; ie. (C0.10 = C0.1).
"Group BY". However, as seen above, some queries could                                                                                                 LEGEND
have a predicate range such as "Categ=Elec" or                                                                                                                       Is-a
                                                                                        C 0.11 2          C 0.12 2      C 0.13 2                                 CCH
"Country=US". In such cases, the search space on the                                                                                                C i.jp       AHCP
Fact "Sales" is reduced by a factor equal to the selectivity                                                                                                 Fragment
                                                                                                                                                             ation Join
of the predicate. However, this does not help during
                                                                   C 0.21 1        C 0.22 1                                        C 0.1 0             C 0.2 0
query processing (normal unpartitioned case), as the
entire FC is processed while searching for relevant                                                             C0
tuples. Even in cases where indexes are built [VLK00],
the benefit could be reduced, as index creation takes up
more time due to the enormity of the FC. Further, as the                        C 11                           C 12                          C 13
OLAP queries involve multiple paths (multiple selections
and group bys), the size of the Forward and Reverse Joins                                       C 1.1 2                  C 1.3 2
is dependent on the size of the Root (FC). This calls for                       C 21                          C 1.2 2                        C 23
the need to partition the FC according to the query
characteristics.
                                                                   C 2.1 1                    C 2.2 1                   C 3 2.21 1                      C 3 2.22 1

3 AHCP Selection Methodology
The Associated Horizontal Class Partitioning (AHCP)             Fig. 2. An AHCP example on the Fact and Dimensions
methodology creates semantic-rich hybrid class partitions      The example in figure 2 shows classes C0, C21 and C12 in
for efficient query processing. It is a technique by which
                                                               the class composition hierarchy (CCH). Some of the
several classes can be partitioned according to the
semantics of another class in its aggregation hierarchy.       PHPs are { C1.12 , C1.22 and C1.32 }, connected by dashed
We employ the AHCP on our ORDW schema, and                     lines (is-a) to the super-class C12 which was partitioned.
propose to extend its applicability from class composition
hierarchies to also include is-a hierarchies and links         As the PHPs can be considered as subclasses of the class
quantified by partial participation, thereby encompassing      on which the PHP were performed, they are placed in the
the Complete Warehouse Schema (CWS) in the ORDW.               is-a hierarchy of the schema. We can have any no. of
                                                               PHPs on a single class based on a number of predicates.
3.1 AHCP fundamentals                                          For a single simple predicate, the PHPs are disjoint, i.e.
The total cost of the AHCP framework can be broadly            they don't share any objects. PHP schemes based on
categorized as partition storage cost, retrieval cost and      multiple or complex predicates on the same class, may
maintenance cost. In this paper, we also incorporate           induce overlapping fragments, however we do not
query-centric information including selectivity and            consider such schemes in this work to avoid complexity.



Vivekanand G., Q. Li, K. Karlapalem                                                                                                                                   4-3
3.1.2 Associated Horizontal Class Partitions (AHCP)               3.2.1 Storage Cost
After the PHPs are created, the AHCP operation may be             The Storage cost (SC) has two components : Primary
performed on some other classes in the schema. As noted           Horizontal Partition (PHP), and the Associated
above, most queries in the OQG access the root (FC) for           Horizontal Class Partition (AHCP). It can be stated as :
its value based attributes, and hence this paper deals
                                                                  SC = SCPHP + SCAHCP
primarily with AHCP of the root class. The AHCP
operation can be denoted as follows :                                 They are given as follows :
 AHCP (Cjq, PHP(Cip )p1 ) → { Cqj.i1p , Cqj.i2p ,..., Cqj.imp }   1. SCPHP (C1) :
where (Cjq ) is the Class that is Associate Horizontally          We assume that in most cases, and especially in this
Class Partitioned (AHCP) according to the PHP on class            paper, we consider only one PHP per class. This ensures
Cip, resulting in m fragments which are treated as classes        that the partitions are disjoint for simple predicates. In
{ Cqj.imp }. Here again since FC is the only root in the          such cases, there's negligible overhead for storage cost as
realm of our OLAP OQGs, any AHCP of the root need                 SCPHP (C1) = | C1 | (no. of pages occupied by the class C1
not display the path suffix; ie. (C00.112 = C0.112).              + catalog entries for the no. of PHPs of C1. These catalog
                                                                  entries give details of the partitioned Class structure,
As seen in the figure, the examples indicate that two sets        extent and qualifying rules. Hence, they're very small and
of AHCPs are created from the root C0. They can be                can easily be accommodated in memory (in both the
created by :                                                      medium and large memory hypothesis.
  AHCP (C0, PHP(C12 )p1 ) → { C0.112 , C0.122 , C0.132 } ;
                                                                  In case of multiple complex predicates on a Dimension
     AHCP (C0, PHP(C21 )p1 ) → { C0.211 , C0.221 }                (C1), resulting in overlapping fragments, we propose not
These partition fragments are denoted as subclasses in            to replicate the entire class extent, but rather only
the figure by means of the shaded boxes to indicate               replicate the Class OIDs (and some frequently accessed
Associate Partitioning.                                           attributes) in the separate Partitions.

The AHCP operation can also be preformed on classes               In this case, the storage overhead can be estimated as :
other than the root, ie. the Dimension Classes. For               SCPHP(C1) =        || C1 || x NoAttr x (sizeof(Attr)) x NoPHP
example, as seen in the figure, C23can be AHCPed based
on the PHPs of C21.                                                   where NoAttr = No.of Attributes replicated.
                                                                      where NoPHP = No.of Partition schemes.
      AHCP (C23, PHP(C21 )p1 ) → { C3.211 , C3.221 }
The result is shown in shaded boxes under C23 in the              Given a maximum of 2 replicated attributes or 20% of
figure.                                                           the class structure, and a uniform size of attributes, we
                                                                  can accommodate up to 5 different Partitioning schemes
An important point to be noted here is that while the             for an increase of 100% in SCPHP (C1).
Fragments obtained by any single AHCP operation on the            2. SCAHCP (C0) :
root are always disjoint, the same cannot be said about
Fragments obtained by AHCP on any other (Dimension)               This is by far the biggest increment for storage cost in the
class. This indicates the storage overhead to be incurred         AHCP ORDW. As noted above, the root (C0) would be
while performing AHCPs on the Dimension classes, and              the widely used as the candidate for performing AHCP.
must be taken into account by the cost model.                     Since any predicate on a single dimension can only
                                                                  induce disjoint partitions in the root, the partitioning
3.2 AHCP cost model                                               overhead is negligible for multiple partitioning schemes
In an ORDW, partitioning can be implemented by means              in a single DC.
of Method Induced Partitioning techniques [KL00].                       SCAHCP (C0) = | C0 | + NoPHP x SizeCat(PHPi).
Moreover, due to the structural and cardinal differences
inherent between Dimension Classes (DC) and the Fact                where SizeCat = Catalog entry size (structure, extent,
Class (FC), we can assume that the DCs need not be                                   qualifying rules).
physically partitioned as they may be wholly or partially
stored in memory (under both medium and large memory              However, as we incorporate multiple predicates on
hypothesis). Hence, the cost of the traditional join              different dimension classes, SCAHCP (C0) grows linearly
between the PHP fragments and the AHCP fragments can              as the no. of dimensions (assuming only single complex
be ignored. This join can be achieved by employing the            predicates on each dimension). This can be a large
methods of the FC.



Vivekanand G., Q. Li, K. Karlapalem                                                                                          4-4
overhead, as C0 as the FC, is very large (~order of         composed of AHCP loading cost. Since this is smaller
Gigabytes).                                                 than the complete FC by a factor of min (selpi ), where
Hence, we intend to reduce this overhead by means of a      selpi indicates the selectivity of the predicates on query Qi
Multiple Partition Processing Plan (MP3), based on          , we achieve a considerable savings in retrieval cost.
MVPP [YKL97]. This would entail a compromise
between duplication and efficiency of the partitions, as    This saving is also obtained when indexing schemes like
sub-fragments will have to be created to support the        the SJIH [VLK00] are built on top of the AHCPs, and
AHCPs. The Join needed to produce the final result from     also when aggregate views have to be developed.
these sub-fragments constitutes the increase in retrieval
cost.                                                       3.3 AHCP selection procedure
                                                            We approach the problem of performing AHCP in the
3.2.2 Maintenance cost                                      ORDW in a different manner from the case of DHCP in a
As noted above, since inter-fragment join is avoided        normal OODB [BKS98]. IN [BKS98], various techniques
between the PHP and AHCP fragments, maintenance cost        (candidates) were considered to decide the best PHP
is considerably simplified due to the AHCP operation.       candidate for performing DHCP. Here we consider all the
                                                            PHP candidates, and our AHCP algorithm generates a
As the ORDW is a read mostly and append only                optimal combination of complete and minimal set of
environment, we can safely estimate the maintenance         AHCPs.
cost even though the schema is vastly enhanced (and
complicated) by semantics. For example, once the            3.3.1 AHCP Algorithm (also called MP3 algorithm)
Warehouse has achieved full functionality, in each          The algorithm can be broken into three parts :
update cycle of the ORDW, we can expect up to 0.5%
addition of the FC (this is a very conservative estimate    1.   Generating an exhaustive set of AHCPs based on
based on our same DW, maintaining 10 years worth of              query characteristics (selectivity, fan-out) obtained
"Sales" data and updated daily). The updates to DCs can          from the entire query space.
be ignored mainly because their percentage will be even              1.1 For each query Qi in the OQG, generate
smaller and also because most of the DCs will be in                       logical associated fragments {C0.jp} from
memory anyway. Only these 0.5% FC objects have to be                      {Cjp} that satisfies sub-expressions of Qi
processed in order to maintain the Partitioning scheme.                   completely.
                                                                     1.2 Perform an intersection of the C0.jp
The Maintenance cost for the AHCP partitioning scheme                     fragments for all Qi . This creates the
(MC) can be defined as the extra cost of maintaining the                  complete disjoint AHCP set, on which the
AHCPs and the PHPs catalogs.                                              queries will be based.
          MC = MCCat(AHCPi) + MCCat(PHPi)
                                                            2.   Assigning query weights depending on priority and
Since MCCat(PHPi) is negligible as the PHPs are in               importance (frequency).
memory, the main cost is on the AHCP maintenance,                2.1 For each query Qi , evaluate the minimal set of
which is comprised of maintaining catalog entries of the             query processing fragments : QPFi = { C0.j1p1 ,
AHCP, Generally this meta-information is small enough                C0.j2p2 , … , C0.jmpm }
to be stored completely in memory.                               2.2 Create query plans for each Qi having nodes
                                                                     involving unions of fragments, which exist in
3.2.3 Retrieval cost                                                 multiple QPFi .
To determine retrieval cost, we break up the complex             2.3 Assign cumulative weights to the nodes
queries into smaller atomic sub-query expressions. We                depending on their utility to consecutive Qi
denote this by means of a MQO (Multiple Query                        (based on frequency and cardinality).
Optimization) graph in the MP3, which is further
explained in section 3.3.                                   3.  Selecting a minimal complete set of AHCPs based on
                                                                the query weights, subject to storage and
The Retrieval Cost (RC) is the cost of parsing the              maintenance cost.
catalog, accessing the relevant AHCPs (as union) and the        3.1 For each Qi , perform top-down evaluation of
cost of the join with corresponding PHPs.                            nodes in its query plan.
                                                                3.2 Select lower nodes (breakup) if the retrieval cost
         RC = RCCat + RCAHCP + RCPHP + RCjoin
                                                                     is lesser.
However, as we store the PHPs and the join in memory,       This part is similar to that of the Algorithm for selecting
and the Catalog is relatively small, RC is mainly           views to be materialized given an MVPP [YKL97].



Vivekanand G., Q. Li, K. Karlapalem                                                                                  4-5
               Q1       Q2            Q3              Q4        Q5             Q6                           LEGEND                      We see that there are 4 main predicates by
                                                                                                                        Is-a
                                                                                                                       CCH     which the Dimensions are partitioned,
 Steps 2-3
                                                                                                              materialized
                                                                                                               fragments
                                                                                                                               viz. p1 : "Country = US", p2a : "Customer = Teen " ,
                                                                                                                               p2b : "Customer = Adult", p3a : "Product = P1 ", p3b :
                                                                                                    MQO -
                                                                                              Intermediate nodes               "Product = P2", and p4 : "Categ = Elec". Performing the
                                                                                                                               AHCP function with respect to these PHPs as shown in
                                                                                                                               section 3.1, we arrive at an exhaustive set of AHCPs of
                                                                                                                               the Sales Class (FC).
       C 0.21.11.1 120 C 0.21.12.1 120       C 0.21.13.1 120        C 0.21.11.2 120       C 0.21.12.2 120   C 0.21.13.2 120
       C 0.22.11.1 120 C 0.22.12.1 120       C 0.22.13.1 120        C 0.22.11.2 120       C 0.22.12.2 120   C 0.22.13.2 120    Step 1.2 : Perform an intersection of the C0.jp fragments
 Step 1.2
                                                                                                                               for all Qi .
                                                 Complete
                                               disjoint AHCP
                                                                                                                                        Consequently, by intersection as shown in table
                                                                                                                               2, we see that a complete set of 16 different AHCPs of the
                                                                                                                               FC (Sales) can be created based on these 4 predicates,
                             AHCP - 2      C 0.11 2        C 0.12 2            C 0.13 2
 Step 1.1                                                                                                                      encompassing all possible and non-empty fragments :
            C 0.21 1       C 0.22 1                                                          C 0.1 0               C 0.2 0
                                                                                                                                      Table 2. Fragments obtained after intersection
            AHCP - 1                                                                                          PHP - 1
                                                               C0
                                                                                                                                 F1    p1 ^ p2a ^ p3a ^       F9    !p1 ^ p2a ^ p3a ^
                                                                                                                                       p4                           p4
       Fig.3. Multiple Partition Processing Plan (MP3)                                                                           F2    p1 ^ p2a ^ p3b ^       F10   !p1 ^ p2a ^ p3b ^
                                                                                                                                       p4                           p4
Figure 3 shows examples of AHCPs (AHCP-1, AHCP-2)                                                                                F3    p1 ^ p2a ^ !p3a ^      F11   !p1 ^ p2a ^ !p3a ^
and PHPs (PHP-1) on the Fact Class (C0). These                                                                                         !p3b ^ p4                    !p3b ^ p4
fragements can then be merged by intersecting them to                                                                            F4    p1 ^ p2a ^ !p4         F12   !p1 ^ p2a ^ !p4
generate a complete disjoint set of Partitions. It must be                                                                       F5    p1 ^ p2b ^ p3a ^       F13   !p1 ^ p2b ^ p3a ^
noted that this is obtained from the query characteristics,                                                                            p4                           p4
and the Partitions are very exhaustive. Due to this reason,                                                                      F6    p1 ^ p2b ^ p3b ^       F14   !p1 ^ p2b ^ p3b ^
it may not be feasible to materialize the fragments all,                                                                               p4                           p4
and hence the MP3 is used to determine which should be                                                                           F7    p1 ^ p2b ^ !p3a ^      F15   !p1 ^ p2b ^ !p3a ^
materialized and which should be kept virtual [VLK98].                                                                                 !p3b ^ p4                    !p3a ^ p4
The cost model is based on the MVPP [YKL97] and                                                                                  F8    p1 ^ p2b ^ !p4         F16   !p1 ^ p2b ^ !p4
incorporates SC, MC and RC. As shown in the figure,
the shaded classes are materialized.                                                                                           Step 2.1: For each query Qi of the OQG, evaluate the
                                                                                                                               minimal set of query processing fragments.
                                                                                                                                       The query processing fragments (QPF ) are
4 AHCP evaluation                                                                                                              shown in table 3:
In this section, we analyze the fragment retrieval cost for
processing queries in OQG using AHCP. A comparison                                                                                    Table 3. Query Processing Fragments (QPF)
of the results with that of plain query processing                                                                               QPF1 , QPF2               F1, F2, F3, F4, F5, F6, F7,
approach using pointer chasing is then conducted.                                                                                                          F8
                                                                                                                                 QPF3 , QPF4, QPF5         F1, F2, F3, F4, F5, F6, F7,
4.1 Fragment retrieval cost
                                                                                                                                                           F9, F10, F11, F13, F14, F15
In order to evaluate the AHCP methodology, we use the                                                                            QPF6                      F1, F2, F3
sample ORDW schema and queries as detailed in section                                                                            QPF7                      F1, F2, F9, F10
2. Here we note that there are eight queries in the OQG,                                                                         QPF8                      F1, F2, F5, F6
and we assume them all to be of equal importance.
                                                                                                                               Step 2.2 : Create query plans for each Qi having nodes
Running our example through the algorithm given in                                                                             involving unions of fragments which exist in multiple
section 3.3 :                                                                                                                  QPFi .
                                                                                                                                        The intermediate nodes are created by a
Step 1.1 : For each query Qi in the OQG, generate                                                                              combination of fragments noting their affinity in the
logical associated fragments {C0.jp} from {Cjp} that                                                                           QPFs. For the sake of completeness, we also create
satisfies sub-expressions of Qi completely.                                                                                    unaccessed nodes as shown in table 4, for example, N12
                                                                                                                               (F12 U F16), though these fragments are not accessed by




Vivekanand G., Q. Li, K. Karlapalem                                                                                                                                                      4-6
any query in the OQG.                                            Initially all top -level nodes can be considered marked for
                                                                 materialization.
               Table 4. Intermediate Nodes
 Node     Definition          Node      Definition               Step 3.2 : Select lower nodes (breakup) if the retrieval
 N1       F1 U F2             N7        N2 U N6 U N9             cost is lesser.
 N2       N1 U F3             N8        F9 U F10                           This is a recursive step, in which the node is
 N3       F5 U F6             N9        N1 U N8                  unmarked (for materialization) if any node under it has a
                                                                 weight higher than itself. In that case, the lower nodes
 N4       N2 U N3             N10       F11 U F13 U F14
                                                                 are considered marked for materialization, and the
                                        U F15
                                                                 process is repeated with them.
 N5       N3 U F7             N11       N5 U N8 U N10
 N6       N5 U F4 U F8        N12       F12 U F16
                                                                 For example, processing for Q8, we mark N4 as it’s the
                                                                 first node :
Step 2.3 : Assign cumulative weights to the nodes
                                                                 but the weights are : N4 : 1, N2 : 2, N3 : 3.
depending on their utility to consecutive Qi (based on
                                                                 hence N4 is discarded for N2 and N3.
frequency and cardinality).
                                                                 Now N2 : 2, N1 : 4, F3 : 2.
        For each of the queries Qi, we know the optimal
                                                                 So N2 is discarded for N1 and F3.
query processing plan opi , which is an ordered list of
                                                                 .. and so on.
nodes and fragments. We also know the frequency (fqi ) of
each query, and the selectivity (selpj) of the clause that its
                                                                 Repeating this process for all the queries, the following
(sub-query) is based on. Depending on those parameters,
                                                                 nodes are materialized :
we give weights to the nodes in the opi of each query.
                                                                          F3, F4, F7, F8, N1, N3, N8, N10.
                                                                 This is our optimal minimal AHCP set.
For example, processing for Q1, we have :
 =                                          traversal cost
∴ the weights for all these nodes (and fragments) is
                                                                 In this section, we evaluate our AHCP scheme for its
f1 * sel1 .
                                                                 performance gain over the un-partitioned case during
Processing for Q6, we have :
                                                                 query retrieval. As noted in the previous section, we have
 = 
                                                                 derived an optimal complete minimal AHCP set of the
∴ the weights for all these nodes (and fragments) is
                                                                 Sales FC.
f6 * sel6 .
.. and so on.
                                                                 The DCs and associated joins are in memory and
                                                                 evaluating a query branch dealing with them would
For simplification, we consider equal frequencies and
                                                                 involve CPU cost. This is ignored here, as the disk I/O
100% selectivity in the fragments, hence at the end of
                                                                 cost is the major component of response time in most
this step, we have weights as shown in table 5:
                                                                 query retrieval costs.
      Table 5. Weights for the Fragments and Nodes
                                                                 The following study shows disk i/o cost ratios for varying
 Frag     Weight       Frag    Weight      Node     Weight       relative frequencies of queries in the OQG.
 F1       4            F11     1           N5       2
 F2       4            F12     0           N6       1            cost ratio (CR) = cost of disk i/o for unpartitioned case
 F3       2            F13     1           N7       1                              cost of disk i/o after AHCP
 F4       1            F14     1           N8       3
 F5       3            F15     1           N9       2            The query frequencies are varied from 10% to 90%. As
 F6       3            F16     0           N10      1            these are relative frequencies, it must be noted that the
 F7       1            N1      4           N11      1            frequencies of the other queries in OQG are modified
 F8       1            N2      2           N12      0            equally in each case. The parameters for the study are
 F9       3            N3      3                                 stated in the Appendix.
 F10      3            N4      1
                                                                 As can be seen from table 6, there's always a minimum
Step 3.1 : For each Qi , perform top-down evaluation of          gain obtained when the ORDW is partitioned; the range
nodes in its query plan.                                         of the gain varies from 1% to 50% in this case study.
         As the  are ordered (tree structured), for        Note that these results appear to exhibit a linear relation
each Qi, we can traverse the list in a top-down manner.          between the selectivity of the query and the cost gain




Vivekanand G., Q. Li, K. Karlapalem                                                                                      4-7
obtained from the AHCP operation. However this should        instead of the entire Fact table. We are currently in the
be interpreted only as the best-case scenario, because in    process of combining these complementary approaches
real-world cases some level of data replication is           into a single framework. We are building an
expected which can cause redundant data access. This         experimental ORDW prototype system that will be
may lead to higher cost for the partitioned case than what   validated by empirical studies based on TPC-H
                                                             benchmark queries.
this example indicates, although the difference will not
be too significant.
                                                             Acknowledgement - This work has been supported
            Table 6. Cost Ratio observations                 by City University of Hong Kong under grant# 7100078.
     relative      10%     30%     50%     70%     90%
    frequency                                                References
        Q1          0.05    0.15    0.25    0.35    0.45
        Q2          0.05    0.15    0.25    0.35    0.45     [BKS98] L. Bellatreche, K. Karlapalem and A. Simonet,
                                                               “Algorithms and Support for Horizontal Class
        Q3          0.03    0.09    0.15    0.21    0.27
                                                               Partitioning    in    Object-Oriented     Databases”,
        Q4          0.03    0.09    0.15    0.21    0.27       Distributed and Parallel Databases, Kluwer
        Q5          0.03    0.09    0.15    0.21    0.27       Academic Publishers, accepted in 1998 (to appear).
        Q6          0.02    0.06     0.1    0.14    0.18
        Q7          0.01    0.03    0.05    0.07    0.09     [CD97] Surajit Chaudhuri and Umeshwar Dayal, “An
        Q8          0.01    0.03    0.05    0.07    0.09       Overview of Data Warehousing and OLAP
                                                               Technology”, ACM SIGMOD Record, 26(1), March
                                                               1997, pp. 65-74.
5 Conclusions and future work                                [CCS93] E.F. Codd, S.B. Codd, and C.T. Salley,
In this paper, we have presented a methodology towards         “Providing OLAP (on-line analytical processing) to
efficient query processing in an object-relational data        user-analysts: An IT mandate”, Tech. Report, 1993.
warehousing (ORDW) environment, through devising
                                                             [Dayal99]  Umeshwar        Dayal,  “Dynamic     Data
and incorporating Associated Horizontal Class
                                                               Warehousing”, Proc. First International Conference
Partitioning (AHCP) techniques over the ORDW schema.           on Data Warehousing and Knowledge Discovery
Our methodology starts with a given set of data                (DaWaK), Florence, Italy, 1999.
warehouse queries, comes up a near-optimal AHCP
scheme for the queries, and selects AHCP fragments as        [Fun98] Chi-wai Fung, “Vertical Class Partitioning and
materialized views to facilitate efficient evaluation of       Complex Object Retrieval in Object Oriented
these queries. Through an initial analytical study, we are     Databases”, PhD. Thesis, Department of Computer
already able to demonstrate the gains of our approach          Science, HKUST, Dec 1998.
vis-a-vis the unpartitioned approach in terms of disk I/O
in the ORDW environment.                                     [FKL98] Chi-wai Fung, Kamalakar Karlapalem, Qing
                                                               Li, “Structural Join Index Hierarchy: A Mechanism
Note that the work we have described in this paper             for Efficient Complex Object Retrieval”, Proc. FODO
(hence the result obtained) should be only regarded as an      Conference 1998, pp. 127-136.
intermediate stage towards efficient ORDW query
processing; further advanced techniques and mechanisms       [KL00] K. Karlapalem and Q. Li, “A Framework for
should and can be naturally added. In particular, an           Class Partitioning in Object-Oriented Databases”,
adaptive and extensible indexing framework is currently        Distributed and Parallel Databases, 8(3):317-350,
being developed, so as to better accommodate the               Kluwer Academic Publishers, 2000 (in press).
requirements of dynamic data warehousing [Dayal99]
which demands to incorporate more semantics into the         [GGT96] G. Gardarin, J.R. Gruser, Z.H. Tang, “Cost-
data warehouse schemata. As shown in [VLK00], a                based Selection of Path Expression Processing
query-driven indexing mechanism built on the SJIH              Algorithms in Object-Oriented Databases”, Proc.
(structural join index hierarchy) [FKL98] seems to be          VLDB 1996, pp. 390-401.
very effective, and is supplementary to the work
described in this paper. Moreover, the creation and          [GHRU97] H. Gupta, V. Harinarayanan, A. Rajaraman,
maintenance algorithms of materialized views and OLAP          and J.D. Ullman, “Index Selection for OLAP”, Proc.
cubes benefit from the reduced search space obtained due       ICDE 1997, pp. 208-219.
to the AHCP scheme. Since OLAP queries involve
multiple paths (multiple selections and group bys), the      [GM95] A. Gupta, and I. S. Mumick, “Maintenance of
Forward and Reverse Joins are considerably reduced by          Materialized Views: Problems, Techniques, and
employing them on a subset of the AHCP fragments               Applications”, IEEE Data Eng. Bulletin, June 1995.



Vivekanand G., Q. Li, K. Karlapalem                                                                               4-8
[MK99] Mukesh Mohania and Y. Kambayashi, “Making          Appendix
  Aggregate Views Self-Maintainable”, Data and
                                                          Table A. Query Parameters
  Knowledge Engineering, accepted in 1999 (to appear).
                                                          fo = fan-out
                                                          R - reference (reverse links)
[OQ97] P. O'Neil, D. Quass, “Improved query
                                                          ||Ci|| - cardinality
  performance with variant indexes”, Proc. ACM
  SIGMOD '97, pp. 38-49.
                                                           Reference (i→j)       fo       R     ||Ci||   ||Cj||
[Rou97] Nick Roussopoulos, “Materialized Views and         Sales→Product         1        100   50M      .5M
  Data Warehouses”, Proc. KRDB 1997, pp. 12.1-12.6.        Sales→Customer        1        50    50M      1M
                                                           Sales→Teenager        1        250   50M      .2M
[Sar97] Sunita Sarawagi, “Indexing OLAP Data”, IEEE        Sales→Date            1        500   50M      36.5K
   Data Engineering Bulletin, 1997, 20:36-43.              Prod→Category         1        10    .5M      1K
                                                           Product→Retailer      50       100   .5M      50K
[VLK98] Vivekanand Gopalkrishnan, Qing Li,                 Category→Type         100      5     1000     10
  Kamalakar Karlapalcem, “Issues of Object-Relational      Retailer→City         1        4     50,K     12.5K
  View Design in Data Warehousing Environment”,            Customer→City         1        80    1M       12.5K
  Proc. IEEE SMC Conference 1998, pp. 2732-2737.           Year→Mon              12       1     10       120
                                                           Mon→Date              30       1     120      3.6K
[VLK99] Vivekanand Gopalkrishnan, Qing Li,                 Year→Date             365      1     10       3.6K
  Kamalakar Karlapalem, “Star/Snow-flake Schema            Country→State         25       1     10       250
  Driven Object-Relational Data Warehouse Design and
                                                           State→City            5        1     250      1.2K
  Query Processing Strategies”, Proc. First Int'l
                                                           Country→City          125      1     10       1.2K
  Conference on Data Warehousing and Knowledge
  Discovery (DaWaK), LNCS 1676, pp. 11-22, Florence,
  Italy, 1999.                                            Table B. Selectivity (%) :
[VLK00] Vivekanand Gopalkrishnan, Qing Li,                 Country = 'US'          50
  Kamalakar Karlapalem, “Efficient Query Processing
                                                           Category = 'Elec'       30
  with Structural Join Indexing in an Object Relational
  Data Warehousing Environment”, Proc. 11th                Product = 'P1'          5
  Information Resources Management Association Int'l       Product = 'P2'          5
  Conference (IRMA'00), Anchorage, Alaska, May 21-         Customer = 'Teen'       20
  24, 2000 (to appear).

[YKL97] Jian Yang, Kamalakar Karlapalem, Qing Li,
  “Algorithms for Materialized View Design in Data
  Warehousing Environment”, Proc. VLDB 1997, pp.
  136-145.




Vivekanand G., Q. Li, K. Karlapalem                                                                               4-9