Efficient Query Processing with Associated Horizontal Class Partitioning in an Object Relational Data Warehousing Environment Vivekanand Gopalkrishnan Qing Li Department of Computer Science, Department of Computer Science, City University of Hong Kong, City University of Hong Kong, Kowloon, Hong Kong, PRC Kowloon, Hong Kong, PRC vivek@cs.cityu.edu.hk csqli@cityu.edu.hk Kamalakar Karlapalem Department of Computer Science, University of Science and Technology Clear Water Bay, Hong Kong, PRC kamal@cs.ust.hk Abstract consistently and instantaneously. Maintaining the integrity of these Indexes and Views [GM95], [MK99] In an Object Relational Data Warehousing imposes a challenging problem when the source data (ORDW) environment, the semantics of data and changes frequently, when the size of the DW keeps queries can be explicitly captured, represented, growing, and/or when the user queries become and utilized based on is-a and class composition increasingly complex. An extensible framework that can hierarchies, thereby resulting in more efficient accommodate dynamic warehousing [Dayal99] of OLAP query processing. In this paper, we show changing data gracefully, and have adaptive handles for the efficacy in building semantic-rich hybrid processing OLAP queries efficiently is needed. class partitions by incorporating the Associated Horizontal Class Partitioning (AHCP) technique In [VLK98], we showed that besides establishing a on the ORDW schema. Given a set of queries, semantically richer framework for multi-dimension we use primary and derived partitioning hierarchies, the Object Relational View (ORV) model algorithms to select (near) optimal AHCPs, provides excellent support for complex object retrieval. In thereby embedding query semantics into the [VLK99] we presented the Object Relational Data partitioned framework. Finally, by a cost model, Warehousing (ORDW) approach to address some of the we analyze the effectiveness of our approach vis- issues discussed in [VLK98] on data warehousing. More a-vis the unpartitioned approach. specifically, we devised a translation mechanism from the star/snowflake schema to an object oriented (O-O) representation. In [VLK00], we advocated a query 1 Introduction processing strategy implementing the Structural Join Data warehouse (DW) equips users with more effective Index Hierarchy (SJIH) on ORDW. decision support tools by integrating enterprise-wide corporate data into a single repository from which In this paper, we show the efficacy in building semantic- business end-users can run reports and perform ad hoc rich hybrid class partitions by incorporating the data analysis [CD97]. As DWs contain enormous amount Associated Horizontal Class Partitioning (AHCP) of data, often from different sources, we need highly technique on the ORDW schema. Given a set of queries, efficient indexing structures [Sar97], [GHRU97], we use primary and derived partitioning algorithms to [VLK00], materialized (stored) Views [Rou97], and select (near) optimal AHCPs, thereby embedding query query processing techniques [VLK99] to efficiently semantics into the partitioned framework. Finally, by a answer on-line analytical processing (OLAP) queries. cost model, we analyze the effectiveness of our approach Materialized Views represent integrated data based on vis-a-vis the unpartitioned approach. complex aggregate queries, and should be available To put our research in perspective, we review some related work and briefly outline our previous work in the The copyright of this paper belongs to the paper’s authors. Permission to copy field of ORDW and Class Partitioning on OODBs in without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage. section 2. We further motivate our study by presenting on the ORDW schema some sample queries whose patterns Proceedings of the International Workshop on Design and are classified based on DW operations and by OO Management of Data Warehouses (DMDW'2000) concepts. Obtaining an optimal partitioning scheme to Stockholm, Sweden, June 5-6, 2000 process this set of queries is the focus of section 3, where (M. Jeusfeld, H. Shu, M. Staudt, G. Vossen, eds.) we employ a hill-climbing heuristic algorithm to select a http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-28/ Vivekanand G., Q. Li, K. Karlapalem 4-1 (near) optimal AHCPs. This algorithm is profiling other composition links. Moreover, we also demonstrate driven, and can be further extended to incorporate other is-a hierarchies (denoted by dotted lines) obtained by semantics. In section 4, we compare results of retrieval horizontally partitioning the ORDW schema. costs using the AHCPs vs. the unpartitioned case. Section 5 concludes the paper with a brief look at future work. Sales Order 2 Background and Motivating Example Product Retailer Date Month Week 2.1 Related Work City Customer Category Partitioning has been vastly researched in Relational and State Teenager Adult Season Quarter OO database systems. Excellent work has been done in Customer Vertical Partitioning (VP) and Horizontal Partitioning Type Country Year (HP) in both systems, but the unique features of OO Product Address Time systems have made it possible to experiment with different variations such as Derived Horizontal Class Fig. 1. The ORDW Schema. Partitioning (DHCP), Associated Horizontal Class The figure shows the class composition hierarchy for the Partitioning (AHCP), Path Partitioning (PP) and Method Time dimension, and the is-a hierarchy (shaded area) for Induced Partitioning (MIP). [KL00] presents a the Customer dimension. comprehensive framework for devising partitioning schemes based on different types of methods and their A Fact is the “subject” of the OLAP queries, and is classification. The issue of fragmentation transparency is quantified by its dimensions and “values”. Dimensions addressed by considering appropriate method can be hierarchical and composite in nature, whereas transformation techniques. While those methods were Values are numerical data. When a Dimension is extremely successful in the transactional environment of complex enough to contain various other components an OODB, to the best of our knowledge, no work has that can themselves be classified as Dimensions and been done in partitioning of an Object Relational DB or Values then that Dimension can be a “Fact” of another Object Relational Data Warehouse (ORDW). OLAP query. In this case, we can consider the schema to be that of “Nested Fact” or “Fact within Fact”. Recently, we have conducted some preliminary studies on developing an ORDW framework. In [VLK98], we Whereas when two (or more) Facts share (one or more) showed that the ORV (Object Relational View) model Dimensions, then the OLAP queries can be considered as offers inherent features that are conducive to managing a “Inter-Fact”. To support OLAP applications, we define a data warehouse. We listed the various issues that arise group of OLAP queries OQG (modified from [VLK00] during the design of an OR-DWMS (Object Relational by adding predicates), which are invoked as a set (not Data Warehouse Management Systems). Here, OR means necessarily in an order). Query patterns in the OQG may an object-oriented front-end or views to underlying not be restricted to a particular composition hierarchy or relational data sources. Based on the issues discussed in inheritance hierarchy. The access paths may involve [VLK98], we put forward a three-phase design approach multiple paths emanating from the same complex object, in [VLK99], which also provided a query-driven as well as interact with entities in completely unrelated translation mechanism from the star/snowflake schema to complex objects. an object oriented (O-O) representation. Some query processing strategies utilizing Structural Join Index For this discussion, queries involving Nested Facts can be Hierarchy (SJIH) techniques for complex queries on considered as subsets of inter-Fact queries. They are composite objects were addressed in [VLK00]. distinguished by the presence of a semantic disjointness between the Facts involved. It must be noted though that 2.2 Motivating Query Examples this disjointedness does not preclude the Facts from To further motivate our subsequent discussions, let us sharing the same component objects. A query-processing consider the sample ORDW schema as shown in Figure scheme built on separate Facts will inadvertently need 1, taken from [VLK99]. This schema is a simple single- costly joins. This inefficiency is amplified for queries star/snowflake schema, for a sales application. As seen in with low selectivity and high frequency. This calls for a the figure, dimension classes (DCs) are connected by need for a partitioning scheme that transcends Facts and solid arrows (composition hierarchies) to the main fact is not restricted by the hierarchies mentioned. It must be class (FC). In addition, there are other inter-dimension noted that such a partitioning scheme may well be hierarchies (as obtained by vertical partitioning1), and overlapping and hence will suffer on the storage space. Based on classifications by DW operations & by OO 1 This paper deals only with Associated Horizontal Partitioning concepts, we consider the following queries listed in techniques, and the above schema has been arrived at using Table 1 as our sample OQG for subsequent discussions. other techniques. Vivekanand G., Q. Li, K. Karlapalem 4-2 Table 1. Sample OLAP queries - OQG frequency to determine the selection of minimal complete set of partition fragments for optimal storage, No. Query Query type maintenance and retrieval costs. Q1 Sales by Prod by State in US Only along cch (pivot) 3.1.1 Primary Horizontal Partitions (PHP) Q2 Sales by Prod by State by Year -> Drill-down in US Classes in the ORDW schema can be denoted as Cip, Q3 Sales by Prod for Categ=Elec -> Roll-up which indicates the i th class in the p th path. The root Q4 Sales by Prod by City for Only along cch, class (FC) is denoted as C0. Primary Horizontal Partitions Categ=Elec Drill-down on these classes can be denoted as sub-classes and placed Q5 Sales by Prod by Country for Only along cch, in the is-a hierarchy under the original partitioned class. Categ=Elec Roll-up Note that the (sub) is-a hierarchy in our examples is Q6 Sales by Prod to Teenagers by Only along cch, denoted by the subscript i.j , which denotes the j th sub- State for Categ=Elec & in US Slice_and_dice class of the i th class (in the p th path). The Primary Q7 Sales of Prod 1 compared with Only along cch, Horizontal Partitioning (PHP) operation is denoted as : Sales of Prod 2 to Teenagers Drill-down, for Categ=Elec Slice_and_dice PHP(Cip )p1 → { Ci.1p , Ci.2p ,..., Ci.np } Q8 % increase in Sales to Combination of is- where (Cip ) is the Class that is Primary Horizontally Teenagers over Sales to a & cch, Drill- Adults, of Prod 1 / 2 for down, Partitioned according to a predicate (p1), resulting in n Categ=Elec & in US Slice_and_dice fragments which are treated as classes {Ci.np}. Note however, that since FC is the only root in the realm of Some OLAP queries could be on the entire range of Sales our OLAP OQGs, any primary partition of the root need and would need to access multiple dimensions for the not display the path suffix; ie. (C0.10 = C0.1). "Group BY". However, as seen above, some queries could LEGEND have a predicate range such as "Categ=Elec" or Is-a C 0.11 2 C 0.12 2 C 0.13 2 CCH "Country=US". In such cases, the search space on the C i.jp AHCP Fact "Sales" is reduced by a factor equal to the selectivity Fragment ation Join of the predicate. However, this does not help during C 0.21 1 C 0.22 1 C 0.1 0 C 0.2 0 query processing (normal unpartitioned case), as the entire FC is processed while searching for relevant C0 tuples. Even in cases where indexes are built [VLK00], the benefit could be reduced, as index creation takes up more time due to the enormity of the FC. Further, as the C 11 C 12 C 13 OLAP queries involve multiple paths (multiple selections and group bys), the size of the Forward and Reverse Joins C 1.1 2 C 1.3 2 is dependent on the size of the Root (FC). This calls for C 21 C 1.2 2 C 23 the need to partition the FC according to the query characteristics. C 2.1 1 C 2.2 1 C 3 2.21 1 C 3 2.22 1 3 AHCP Selection Methodology The Associated Horizontal Class Partitioning (AHCP) Fig. 2. An AHCP example on the Fact and Dimensions methodology creates semantic-rich hybrid class partitions The example in figure 2 shows classes C0, C21 and C12 in for efficient query processing. It is a technique by which the class composition hierarchy (CCH). Some of the several classes can be partitioned according to the semantics of another class in its aggregation hierarchy. PHPs are { C1.12 , C1.22 and C1.32 }, connected by dashed We employ the AHCP on our ORDW schema, and lines (is-a) to the super-class C12 which was partitioned. propose to extend its applicability from class composition hierarchies to also include is-a hierarchies and links As the PHPs can be considered as subclasses of the class quantified by partial participation, thereby encompassing on which the PHP were performed, they are placed in the the Complete Warehouse Schema (CWS) in the ORDW. is-a hierarchy of the schema. We can have any no. of PHPs on a single class based on a number of predicates. 3.1 AHCP fundamentals For a single simple predicate, the PHPs are disjoint, i.e. The total cost of the AHCP framework can be broadly they don't share any objects. PHP schemes based on categorized as partition storage cost, retrieval cost and multiple or complex predicates on the same class, may maintenance cost. In this paper, we also incorporate induce overlapping fragments, however we do not query-centric information including selectivity and consider such schemes in this work to avoid complexity. Vivekanand G., Q. Li, K. Karlapalem 4-3 3.1.2 Associated Horizontal Class Partitions (AHCP) 3.2.1 Storage Cost After the PHPs are created, the AHCP operation may be The Storage cost (SC) has two components : Primary performed on some other classes in the schema. As noted Horizontal Partition (PHP), and the Associated above, most queries in the OQG access the root (FC) for Horizontal Class Partition (AHCP). It can be stated as : its value based attributes, and hence this paper deals SC = SCPHP + SCAHCP primarily with AHCP of the root class. The AHCP operation can be denoted as follows : They are given as follows : AHCP (Cjq, PHP(Cip )p1 ) → { Cqj.i1p , Cqj.i2p ,..., Cqj.imp } 1. SCPHP (C1) : where (Cjq ) is the Class that is Associate Horizontally We assume that in most cases, and especially in this Class Partitioned (AHCP) according to the PHP on class paper, we consider only one PHP per class. This ensures Cip, resulting in m fragments which are treated as classes that the partitions are disjoint for simple predicates. In { Cqj.imp }. Here again since FC is the only root in the such cases, there's negligible overhead for storage cost as realm of our OLAP OQGs, any AHCP of the root need SCPHP (C1) = | C1 | (no. of pages occupied by the class C1 not display the path suffix; ie. (C00.112 = C0.112). + catalog entries for the no. of PHPs of C1. These catalog entries give details of the partitioned Class structure, As seen in the figure, the examples indicate that two sets extent and qualifying rules. Hence, they're very small and of AHCPs are created from the root C0. They can be can easily be accommodated in memory (in both the created by : medium and large memory hypothesis. AHCP (C0, PHP(C12 )p1 ) → { C0.112 , C0.122 , C0.132 } ; In case of multiple complex predicates on a Dimension AHCP (C0, PHP(C21 )p1 ) → { C0.211 , C0.221 } (C1), resulting in overlapping fragments, we propose not These partition fragments are denoted as subclasses in to replicate the entire class extent, but rather only the figure by means of the shaded boxes to indicate replicate the Class OIDs (and some frequently accessed Associate Partitioning. attributes) in the separate Partitions. The AHCP operation can also be preformed on classes In this case, the storage overhead can be estimated as : other than the root, ie. the Dimension Classes. For SCPHP(C1) = || C1 || x NoAttr x (sizeof(Attr)) x NoPHP example, as seen in the figure, C23can be AHCPed based on the PHPs of C21. where NoAttr = No.of Attributes replicated. where NoPHP = No.of Partition schemes. AHCP (C23, PHP(C21 )p1 ) → { C3.211 , C3.221 } The result is shown in shaded boxes under C23 in the Given a maximum of 2 replicated attributes or 20% of figure. the class structure, and a uniform size of attributes, we can accommodate up to 5 different Partitioning schemes An important point to be noted here is that while the for an increase of 100% in SCPHP (C1). Fragments obtained by any single AHCP operation on the 2. SCAHCP (C0) : root are always disjoint, the same cannot be said about Fragments obtained by AHCP on any other (Dimension) This is by far the biggest increment for storage cost in the class. This indicates the storage overhead to be incurred AHCP ORDW. As noted above, the root (C0) would be while performing AHCPs on the Dimension classes, and the widely used as the candidate for performing AHCP. must be taken into account by the cost model. Since any predicate on a single dimension can only induce disjoint partitions in the root, the partitioning 3.2 AHCP cost model overhead is negligible for multiple partitioning schemes In an ORDW, partitioning can be implemented by means in a single DC. of Method Induced Partitioning techniques [KL00]. SCAHCP (C0) = | C0 | + NoPHP x SizeCat(PHPi). Moreover, due to the structural and cardinal differences inherent between Dimension Classes (DC) and the Fact where SizeCat = Catalog entry size (structure, extent, Class (FC), we can assume that the DCs need not be qualifying rules). physically partitioned as they may be wholly or partially stored in memory (under both medium and large memory However, as we incorporate multiple predicates on hypothesis). Hence, the cost of the traditional join different dimension classes, SCAHCP (C0) grows linearly between the PHP fragments and the AHCP fragments can as the no. of dimensions (assuming only single complex be ignored. This join can be achieved by employing the predicates on each dimension). This can be a large methods of the FC. Vivekanand G., Q. Li, K. Karlapalem 4-4 overhead, as C0 as the FC, is very large (~order of composed of AHCP loading cost. Since this is smaller Gigabytes). than the complete FC by a factor of min (selpi ), where Hence, we intend to reduce this overhead by means of a selpi indicates the selectivity of the predicates on query Qi Multiple Partition Processing Plan (MP3), based on , we achieve a considerable savings in retrieval cost. MVPP [YKL97]. This would entail a compromise between duplication and efficiency of the partitions, as This saving is also obtained when indexing schemes like sub-fragments will have to be created to support the the SJIH [VLK00] are built on top of the AHCPs, and AHCPs. The Join needed to produce the final result from also when aggregate views have to be developed. these sub-fragments constitutes the increase in retrieval cost. 3.3 AHCP selection procedure We approach the problem of performing AHCP in the 3.2.2 Maintenance cost ORDW in a different manner from the case of DHCP in a As noted above, since inter-fragment join is avoided normal OODB [BKS98]. IN [BKS98], various techniques between the PHP and AHCP fragments, maintenance cost (candidates) were considered to decide the best PHP is considerably simplified due to the AHCP operation. candidate for performing DHCP. Here we consider all the PHP candidates, and our AHCP algorithm generates a As the ORDW is a read mostly and append only optimal combination of complete and minimal set of environment, we can safely estimate the maintenance AHCPs. cost even though the schema is vastly enhanced (and complicated) by semantics. For example, once the 3.3.1 AHCP Algorithm (also called MP3 algorithm) Warehouse has achieved full functionality, in each The algorithm can be broken into three parts : update cycle of the ORDW, we can expect up to 0.5% addition of the FC (this is a very conservative estimate 1. Generating an exhaustive set of AHCPs based on based on our same DW, maintaining 10 years worth of query characteristics (selectivity, fan-out) obtained "Sales" data and updated daily). The updates to DCs can from the entire query space. be ignored mainly because their percentage will be even 1.1 For each query Qi in the OQG, generate smaller and also because most of the DCs will be in logical associated fragments {C0.jp} from memory anyway. Only these 0.5% FC objects have to be {Cjp} that satisfies sub-expressions of Qi processed in order to maintain the Partitioning scheme. completely. 1.2 Perform an intersection of the C0.jp The Maintenance cost for the AHCP partitioning scheme fragments for all Qi . This creates the (MC) can be defined as the extra cost of maintaining the complete disjoint AHCP set, on which the AHCPs and the PHPs catalogs. queries will be based. MC = MCCat(AHCPi) + MCCat(PHPi) 2. Assigning query weights depending on priority and Since MCCat(PHPi) is negligible as the PHPs are in importance (frequency). memory, the main cost is on the AHCP maintenance, 2.1 For each query Qi , evaluate the minimal set of which is comprised of maintaining catalog entries of the query processing fragments : QPFi = { C0.j1p1 , AHCP, Generally this meta-information is small enough C0.j2p2 , … , C0.jmpm } to be stored completely in memory. 2.2 Create query plans for each Qi having nodes involving unions of fragments, which exist in 3.2.3 Retrieval cost multiple QPFi . To determine retrieval cost, we break up the complex 2.3 Assign cumulative weights to the nodes queries into smaller atomic sub-query expressions. We depending on their utility to consecutive Qi denote this by means of a MQO (Multiple Query (based on frequency and cardinality). Optimization) graph in the MP3, which is further explained in section 3.3. 3. Selecting a minimal complete set of AHCPs based on the query weights, subject to storage and The Retrieval Cost (RC) is the cost of parsing the maintenance cost. catalog, accessing the relevant AHCPs (as union) and the 3.1 For each Qi , perform top-down evaluation of cost of the join with corresponding PHPs. nodes in its query plan. 3.2 Select lower nodes (breakup) if the retrieval cost RC = RCCat + RCAHCP + RCPHP + RCjoin is lesser. However, as we store the PHPs and the join in memory, This part is similar to that of the Algorithm for selecting and the Catalog is relatively small, RC is mainly views to be materialized given an MVPP [YKL97]. Vivekanand G., Q. Li, K. Karlapalem 4-5 Q1 Q2 Q3 Q4 Q5 Q6 LEGEND We see that there are 4 main predicates by Is-a CCH which the Dimensions are partitioned, Steps 2-3 materialized fragments viz. p1 : "Country = US", p2a : "Customer = Teen " , p2b : "Customer = Adult", p3a : "Product = P1 ", p3b : MQO - Intermediate nodes "Product = P2", and p4 : "Categ = Elec". Performing the AHCP function with respect to these PHPs as shown in section 3.1, we arrive at an exhaustive set of AHCPs of the Sales Class (FC). C 0.21.11.1 120 C 0.21.12.1 120 C 0.21.13.1 120 C 0.21.11.2 120 C 0.21.12.2 120 C 0.21.13.2 120 C 0.22.11.1 120 C 0.22.12.1 120 C 0.22.13.1 120 C 0.22.11.2 120 C 0.22.12.2 120 C 0.22.13.2 120 Step 1.2 : Perform an intersection of the C0.jp fragments Step 1.2 for all Qi . Complete disjoint AHCP Consequently, by intersection as shown in table 2, we see that a complete set of 16 different AHCPs of the FC (Sales) can be created based on these 4 predicates, AHCP - 2 C 0.11 2 C 0.12 2 C 0.13 2 Step 1.1 encompassing all possible and non-empty fragments : C 0.21 1 C 0.22 1 C 0.1 0 C 0.2 0 Table 2. Fragments obtained after intersection AHCP - 1 PHP - 1 C0 F1 p1 ^ p2a ^ p3a ^ F9 !p1 ^ p2a ^ p3a ^ p4 p4 Fig.3. Multiple Partition Processing Plan (MP3) F2 p1 ^ p2a ^ p3b ^ F10 !p1 ^ p2a ^ p3b ^ p4 p4 Figure 3 shows examples of AHCPs (AHCP-1, AHCP-2) F3 p1 ^ p2a ^ !p3a ^ F11 !p1 ^ p2a ^ !p3a ^ and PHPs (PHP-1) on the Fact Class (C0). These !p3b ^ p4 !p3b ^ p4 fragements can then be merged by intersecting them to F4 p1 ^ p2a ^ !p4 F12 !p1 ^ p2a ^ !p4 generate a complete disjoint set of Partitions. It must be F5 p1 ^ p2b ^ p3a ^ F13 !p1 ^ p2b ^ p3a ^ noted that this is obtained from the query characteristics, p4 p4 and the Partitions are very exhaustive. Due to this reason, F6 p1 ^ p2b ^ p3b ^ F14 !p1 ^ p2b ^ p3b ^ it may not be feasible to materialize the fragments all, p4 p4 and hence the MP3 is used to determine which should be F7 p1 ^ p2b ^ !p3a ^ F15 !p1 ^ p2b ^ !p3a ^ materialized and which should be kept virtual [VLK98]. !p3b ^ p4 !p3a ^ p4 The cost model is based on the MVPP [YKL97] and F8 p1 ^ p2b ^ !p4 F16 !p1 ^ p2b ^ !p4 incorporates SC, MC and RC. As shown in the figure, the shaded classes are materialized. Step 2.1: For each query Qi of the OQG, evaluate the minimal set of query processing fragments. The query processing fragments (QPF ) are 4 AHCP evaluation shown in table 3: In this section, we analyze the fragment retrieval cost for processing queries in OQG using AHCP. A comparison Table 3. Query Processing Fragments (QPF) of the results with that of plain query processing QPF1 , QPF2 F1, F2, F3, F4, F5, F6, F7, approach using pointer chasing is then conducted. F8 QPF3 , QPF4, QPF5 F1, F2, F3, F4, F5, F6, F7, 4.1 Fragment retrieval cost F9, F10, F11, F13, F14, F15 In order to evaluate the AHCP methodology, we use the QPF6 F1, F2, F3 sample ORDW schema and queries as detailed in section QPF7 F1, F2, F9, F10 2. Here we note that there are eight queries in the OQG, QPF8 F1, F2, F5, F6 and we assume them all to be of equal importance. Step 2.2 : Create query plans for each Qi having nodes Running our example through the algorithm given in involving unions of fragments which exist in multiple section 3.3 : QPFi . The intermediate nodes are created by a Step 1.1 : For each query Qi in the OQG, generate combination of fragments noting their affinity in the logical associated fragments {C0.jp} from {Cjp} that QPFs. For the sake of completeness, we also create satisfies sub-expressions of Qi completely. unaccessed nodes as shown in table 4, for example, N12 (F12 U F16), though these fragments are not accessed by Vivekanand G., Q. Li, K. Karlapalem 4-6 any query in the OQG. Initially all top -level nodes can be considered marked for materialization. Table 4. Intermediate Nodes Node Definition Node Definition Step 3.2 : Select lower nodes (breakup) if the retrieval N1 F1 U F2 N7 N2 U N6 U N9 cost is lesser. N2 N1 U F3 N8 F9 U F10 This is a recursive step, in which the node is N3 F5 U F6 N9 N1 U N8 unmarked (for materialization) if any node under it has a weight higher than itself. In that case, the lower nodes N4 N2 U N3 N10 F11 U F13 U F14 are considered marked for materialization, and the U F15 process is repeated with them. N5 N3 U F7 N11 N5 U N8 U N10 N6 N5 U F4 U F8 N12 F12 U F16 For example, processing for Q8, we mark N4 as it’s the first node : Step 2.3 : Assign cumulative weights to the nodes but the weights are : N4 : 1, N2 : 2, N3 : 3. depending on their utility to consecutive Qi (based on hence N4 is discarded for N2 and N3. frequency and cardinality). Now N2 : 2, N1 : 4, F3 : 2. For each of the queries Qi, we know the optimal So N2 is discarded for N1 and F3. query processing plan opi , which is an ordered list of .. and so on. nodes and fragments. We also know the frequency (fqi ) of each query, and the selectivity (selpj) of the clause that its Repeating this process for all the queries, the following (sub-query) is based on. Depending on those parameters, nodes are materialized : we give weights to the nodes in the opi of each query. F3, F4, F7, F8, N1, N3, N8, N10. This is our optimal minimal AHCP set. For example, processing for Q1, we have : = traversal cost ∴ the weights for all these nodes (and fragments) is In this section, we evaluate our AHCP scheme for its f1 * sel1 . performance gain over the un-partitioned case during Processing for Q6, we have : query retrieval. As noted in the previous section, we have = derived an optimal complete minimal AHCP set of the ∴ the weights for all these nodes (and fragments) is Sales FC. f6 * sel6 . .. and so on. The DCs and associated joins are in memory and evaluating a query branch dealing with them would For simplification, we consider equal frequencies and involve CPU cost. This is ignored here, as the disk I/O 100% selectivity in the fragments, hence at the end of cost is the major component of response time in most this step, we have weights as shown in table 5: query retrieval costs. Table 5. Weights for the Fragments and Nodes The following study shows disk i/o cost ratios for varying Frag Weight Frag Weight Node Weight relative frequencies of queries in the OQG. F1 4 F11 1 N5 2 F2 4 F12 0 N6 1 cost ratio (CR) = cost of disk i/o for unpartitioned case F3 2 F13 1 N7 1 cost of disk i/o after AHCP F4 1 F14 1 N8 3 F5 3 F15 1 N9 2 The query frequencies are varied from 10% to 90%. As F6 3 F16 0 N10 1 these are relative frequencies, it must be noted that the F7 1 N1 4 N11 1 frequencies of the other queries in OQG are modified F8 1 N2 2 N12 0 equally in each case. The parameters for the study are F9 3 N3 3 stated in the Appendix. F10 3 N4 1 As can be seen from table 6, there's always a minimum Step 3.1 : For each Qi , perform top-down evaluation of gain obtained when the ORDW is partitioned; the range nodes in its query plan. of the gain varies from 1% to 50% in this case study. As the are ordered (tree structured), for Note that these results appear to exhibit a linear relation each Qi, we can traverse the list in a top-down manner. between the selectivity of the query and the cost gain Vivekanand G., Q. Li, K. Karlapalem 4-7 obtained from the AHCP operation. However this should instead of the entire Fact table. We are currently in the be interpreted only as the best-case scenario, because in process of combining these complementary approaches real-world cases some level of data replication is into a single framework. We are building an expected which can cause redundant data access. This experimental ORDW prototype system that will be may lead to higher cost for the partitioned case than what validated by empirical studies based on TPC-H benchmark queries. this example indicates, although the difference will not be too significant. Acknowledgement - This work has been supported Table 6. Cost Ratio observations by City University of Hong Kong under grant# 7100078. relative 10% 30% 50% 70% 90% frequency References Q1 0.05 0.15 0.25 0.35 0.45 Q2 0.05 0.15 0.25 0.35 0.45 [BKS98] L. Bellatreche, K. Karlapalem and A. Simonet, “Algorithms and Support for Horizontal Class Q3 0.03 0.09 0.15 0.21 0.27 Partitioning in Object-Oriented Databases”, Q4 0.03 0.09 0.15 0.21 0.27 Distributed and Parallel Databases, Kluwer Q5 0.03 0.09 0.15 0.21 0.27 Academic Publishers, accepted in 1998 (to appear). Q6 0.02 0.06 0.1 0.14 0.18 Q7 0.01 0.03 0.05 0.07 0.09 [CD97] Surajit Chaudhuri and Umeshwar Dayal, “An Q8 0.01 0.03 0.05 0.07 0.09 Overview of Data Warehousing and OLAP Technology”, ACM SIGMOD Record, 26(1), March 1997, pp. 65-74. 5 Conclusions and future work [CCS93] E.F. Codd, S.B. Codd, and C.T. Salley, In this paper, we have presented a methodology towards “Providing OLAP (on-line analytical processing) to efficient query processing in an object-relational data user-analysts: An IT mandate”, Tech. Report, 1993. warehousing (ORDW) environment, through devising [Dayal99] Umeshwar Dayal, “Dynamic Data and incorporating Associated Horizontal Class Warehousing”, Proc. First International Conference Partitioning (AHCP) techniques over the ORDW schema. on Data Warehousing and Knowledge Discovery Our methodology starts with a given set of data (DaWaK), Florence, Italy, 1999. warehouse queries, comes up a near-optimal AHCP scheme for the queries, and selects AHCP fragments as [Fun98] Chi-wai Fung, “Vertical Class Partitioning and materialized views to facilitate efficient evaluation of Complex Object Retrieval in Object Oriented these queries. Through an initial analytical study, we are Databases”, PhD. Thesis, Department of Computer already able to demonstrate the gains of our approach Science, HKUST, Dec 1998. vis-a-vis the unpartitioned approach in terms of disk I/O in the ORDW environment. [FKL98] Chi-wai Fung, Kamalakar Karlapalem, Qing Li, “Structural Join Index Hierarchy: A Mechanism Note that the work we have described in this paper for Efficient Complex Object Retrieval”, Proc. FODO (hence the result obtained) should be only regarded as an Conference 1998, pp. 127-136. intermediate stage towards efficient ORDW query processing; further advanced techniques and mechanisms [KL00] K. Karlapalem and Q. Li, “A Framework for should and can be naturally added. In particular, an Class Partitioning in Object-Oriented Databases”, adaptive and extensible indexing framework is currently Distributed and Parallel Databases, 8(3):317-350, being developed, so as to better accommodate the Kluwer Academic Publishers, 2000 (in press). requirements of dynamic data warehousing [Dayal99] which demands to incorporate more semantics into the [GGT96] G. Gardarin, J.R. Gruser, Z.H. Tang, “Cost- data warehouse schemata. As shown in [VLK00], a based Selection of Path Expression Processing query-driven indexing mechanism built on the SJIH Algorithms in Object-Oriented Databases”, Proc. (structural join index hierarchy) [FKL98] seems to be VLDB 1996, pp. 390-401. very effective, and is supplementary to the work described in this paper. Moreover, the creation and [GHRU97] H. Gupta, V. Harinarayanan, A. Rajaraman, maintenance algorithms of materialized views and OLAP and J.D. Ullman, “Index Selection for OLAP”, Proc. cubes benefit from the reduced search space obtained due ICDE 1997, pp. 208-219. to the AHCP scheme. Since OLAP queries involve multiple paths (multiple selections and group bys), the [GM95] A. Gupta, and I. S. Mumick, “Maintenance of Forward and Reverse Joins are considerably reduced by Materialized Views: Problems, Techniques, and employing them on a subset of the AHCP fragments Applications”, IEEE Data Eng. Bulletin, June 1995. Vivekanand G., Q. Li, K. Karlapalem 4-8 [MK99] Mukesh Mohania and Y. Kambayashi, “Making Appendix Aggregate Views Self-Maintainable”, Data and Table A. Query Parameters Knowledge Engineering, accepted in 1999 (to appear). fo = fan-out R - reference (reverse links) [OQ97] P. O'Neil, D. Quass, “Improved query ||Ci|| - cardinality performance with variant indexes”, Proc. ACM SIGMOD '97, pp. 38-49. Reference (i→j) fo R ||Ci|| ||Cj|| [Rou97] Nick Roussopoulos, “Materialized Views and Sales→Product 1 100 50M .5M Data Warehouses”, Proc. KRDB 1997, pp. 12.1-12.6. Sales→Customer 1 50 50M 1M Sales→Teenager 1 250 50M .2M [Sar97] Sunita Sarawagi, “Indexing OLAP Data”, IEEE Sales→Date 1 500 50M 36.5K Data Engineering Bulletin, 1997, 20:36-43. Prod→Category 1 10 .5M 1K Product→Retailer 50 100 .5M 50K [VLK98] Vivekanand Gopalkrishnan, Qing Li, Category→Type 100 5 1000 10 Kamalakar Karlapalcem, “Issues of Object-Relational Retailer→City 1 4 50,K 12.5K View Design in Data Warehousing Environment”, Customer→City 1 80 1M 12.5K Proc. IEEE SMC Conference 1998, pp. 2732-2737. Year→Mon 12 1 10 120 Mon→Date 30 1 120 3.6K [VLK99] Vivekanand Gopalkrishnan, Qing Li, Year→Date 365 1 10 3.6K Kamalakar Karlapalem, “Star/Snow-flake Schema Country→State 25 1 10 250 Driven Object-Relational Data Warehouse Design and State→City 5 1 250 1.2K Query Processing Strategies”, Proc. First Int'l Country→City 125 1 10 1.2K Conference on Data Warehousing and Knowledge Discovery (DaWaK), LNCS 1676, pp. 11-22, Florence, Italy, 1999. Table B. Selectivity (%) : [VLK00] Vivekanand Gopalkrishnan, Qing Li, Country = 'US' 50 Kamalakar Karlapalem, “Efficient Query Processing Category = 'Elec' 30 with Structural Join Indexing in an Object Relational Data Warehousing Environment”, Proc. 11th Product = 'P1' 5 Information Resources Management Association Int'l Product = 'P2' 5 Conference (IRMA'00), Anchorage, Alaska, May 21- Customer = 'Teen' 20 24, 2000 (to appear). [YKL97] Jian Yang, Kamalakar Karlapalem, Qing Li, “Algorithms for Materialized View Design in Data Warehousing Environment”, Proc. VLDB 1997, pp. 136-145. Vivekanand G., Q. Li, K. Karlapalem 4-9