New Objective Function for Vertical Partitioning in Database System. © Thanh Hung Ngo Don State Technical University nthungla@yahoo.com PhD supervisor: Michael V. Grankov Abstract. we use an attribute usage matrix (AUM) as the input data. We will consider that number of sub-relations In this paper we introduce the objective equaling to two, and both of them being located in the function for vertical partitioning in database same site. systems. It has been built with the new evaluative criterion: cache hit probability. 2 Cache system and database system Testing the validity of the derived evaluative formula via program simulation shows its high specification accuracy. Let’s consider a table T in a database system. Tuples of Index terms: vertical partitioning, objective T have M attributes, and cache memory for its caching function, evaluation criterion, cache hit is limited to L bytes. probability. Assume that each transaction queries a subset of attributes of a tuple by its record number in the table. It means that a transaction is a sequence of three 1 Introduction operations: select table, select a subset of attributes and select a tuple by record number in table. Then a Vertical partitioning of a table splits it into two or more tables (which we refer to sub-tables), each of which ( transaction can be defined as a triplet T, Δ j , ID j , ) contains a disjoint subset of the attributes of the original where, T – table name, Δ j – the sub-set of query table, except for the key attributes. Since a subset of attributes, ID j – the query record number. Assume that attributes is more frequently accessed by transactions than the subset of the rest attributes, so vertical these selects are independent between one and the partitioning can reduce the amount of data that needs to others, and the select of record number obeys uniform be scanned to answer the transactions. law. Let us have Q different transactions and there As indicated in [1], many data partitioning attribute usage is described in table 1: algorithms have been developed basing on statistical and pattern classification and analysis. They cluster data Trans\Attrs A1 A2 … AM using various criteria. The most common used criterion Tr1 u11 u12 … u1M is the square-error. The lack of these algorithms is that Tr2 u21 u22 … u2M they don’t estimate the “goodness” of partition schemes … … … … … in the relation with an index of system performance. In TrQ uQ1 uQ2 … uQM [2], authors develop the complex method, in which they first create a set of “candidate” partition schemes, and Table 1: attribute usage matrix. then analyze each of them via optimization unit of where uij equals either pi or 0, and pi is the execution database system to find the best scheme. The work is of probability of transaction Tri. Thus the following great interested, but it is too complicated. On the other expressions are the case: hand the performance of optimization unit strongly p1 + p2 + K + pQ = 1 , depends on the current tuning of database system. In this paper we will build the objective function 0 ≤ p i ≤ 1, 1 ≤ i ≤ Q . for vertical partitioning with the evaluative criterion: the Cache is divided into parts, each of which contains cache hit probability. As many algorithms in this area tuples of one table. Write to cache and read from it are executed per tuple. A tuple is loaded into cache if at Proceedings of the Spring Young Researcher's least one of its attribute is claimed. At first time cache Colloquium On Database and Information Systems is empty. In operation it is filled with tuples. For the SYRCoDIS, St.-Petersburg, Russia, 2008 current transaction if copy of the query tuple is in cache, then it is fetched from cache to answer the query. 4 Deriving of the objective function Otherwise it is fetched from disk and its copy is inserted into cache. In the first case we say that it was a In this section we will derive the objective function, cache hit, in the last case – a cache miss. After cache which estimate the cache hit probability for caching the having been filled, a new tuple will substitute one original table and its sub-tables. The cache system and among the present tuples. We will consider that the database system have been specified in section 2. The cache replacement strategy being randomized. Since the method implementing the vertical partitioning was period, while cache hasn’t been filled yet, is a described in section 3. transitional period, we can derive the objective 4.1 Preliminaries function, having skipped this period. The assumptions, are described in this section, has The following are used in derivation of the objective been made as a result of deductive analysis the literature function. [1], [2], [3], [4]. Note that these assumptions, except for N – the tuple count of table T as well as T1 and T2. the attribute usage matrix, are only for the purpose of M – the attribute count of tuples of table T. mathematical deriving and validity testing of the li – length of attribute i of table T for i = 1, 2, …, M. estimated formula, but not for implemented areas of the lT, lT1, lT2 – tuple length of table T, T1, T2. derived objective function. L, L1, L2 – size of part of cache memory, allocated for caching table T, T1, T2. L1, L2 – the parameters, 3 Method implementing vertical which satisfy the restriction (L1 + L2 ≤ L). partitioning K, K1, K2 – number of tuples of table T, T1, T2, which can exist in the corresponding part of cache In this and in the following sections we will consider the memory at the same time. vertical partitioning of table T into two sub-tables T1, T2. Operation [ ] – operation achieving the integer part Attribute sets of tables T, T1, T2 are S, S1, S2 of a fraction. correspondingly. These following expressions are the case: 4.2 Cache hit probability for caching the original S1 ∪ S 2 = S , table S1 ∩ S 2 = SK T , In this case any transaction queries a tuple of table T, S1 > SK T , which may be one among K tuples, copy of which has already existed in cache, or may be one among the rest S 2 ≥ SK T , (N – K) tuples, which have not any copies in cache. If where, SKT – the sub-set, containing all key attribute of the first occurrence is the case, finding of query tuple in table T. cache will successfully finish, i.e. it will be a cache hit. Then the table T is replaced with two sub-tables T1, Since the select of tuples obeys the uniform law, the T2. Using method simulating the vertical partitioning cache hit probability is estimated by expression: [2], we create regular tables, one for each sub-table. The K queries referencing the original table T, is rewritten to PT = (1) N execute against the sub-tables T1, T2. If assume that any query contains at least one of non-key attributes, then a ⎛ ⎡ ⎤⎞ ⎜ ⎢ ⎥⎟ ( ) query T, Δ j , ID j is replaced only in one of three ⎛ ⎡ L ⎤⎞ where, K = min⎜ N , ⎢ ⎥ ⎟ = min⎜ N , ⎢ ⎜ ⎟ L ⎥⎟ . ⎣ l ⎦ ⎜ ∑ l ⎟ ⎝ ⎠ ⎜ ⎢⎢ ∀r : A ∈S ⎥⎥ ⎟ T r variants: • 1-st variant: by one query, referencing sub- ⎝ ⎣ r ⎦⎠ ( ) table T1, T1 , Δ j , ID j , if all the query 4.3 Cache hit probability for caching the sub-tables attributes are in S , i.e. (Δ j ⊂ S1 ) ; 1 Let’s mark the occurrence, when the original query is • 2-nd variant: by one query, referencing sub- replaced in the first, the second, the third variant, the ( ) table T2, T2 , Δ j , ID j , if all the query event C1, C2, C3; the occurrence, when the finding of attributes are in S , i.e. (Δ j ⊂ S 2 ) ; query tuple in cache finishes successfully, the event B. 2 According to formula of complete probability, cache hit • 3-rd variant: by two queries probability is estimated by expression: ( ) T1 , Δ j ∩ S1 , ID j and 3 PT T = ∑ ( p(B / C i ) ⋅ p(C i )) (T2 , Δ j ∩ (S 2 \ SK T ), ID j ) simultaneously in 1 2 i =1 For i = 1: the rest cases, i.e. p (C1 ) = ∑ (( ) ) (( Δ j ∩ S1 ≠ ∅ ∧ Δ j ∩ S2 ⊄ SKT . ) ) ∀j :Δ j ⊂ S1 pj ; The cache allocated for caching the table T is now divided into two parts for caching its sub-tables. Sizes ( ) since the query T, Δ j , ID j is replaced by the of these parts are the parameters for partitioning ( ) query T1 , Δ j , ID j , then the cache accessing algorithm. for the original query is successful, if and only if the cache accessing for the substituting query allocated for its caching, the best scheme of partitions is is successful. Similarly to (1): defined as the result of the programming problem (*): K p (B / C 1 ) = 1 P (S1 , K1 , K 2 ) = a ⋅ K1 + b ⋅ K 2 + c ⋅ K1 ⋅ K 2 → max (4) N with With these restrictions: ⎛ ⎡ ⎤⎞ SK T ⊂ S1 , S1 ⊆ S , (5) ⎛ ⎡ L1 ⎤ ⎞ ⎜ ⎢ ⎥⎟ L K 1 ⋅ lT 1 + K 2 ⋅ lT 2 ≤ L , K 1 = min ⎜ N , ⎢ ⎥ ⎟ = min ⎜ N , ⎢ 1 ⎥⎟ . (6) ⎜ ⎟ ⎜ l ⎝ ⎣ T1 ⎦ ⎠ ∑ l r ⎟ ⎜ ⎢⎢ ∀r : A ∈S ⎥⎥ ⎟ K 1 ≥ 0, K 2 ≥ 0 , (7) ⎝ ⎣ r 1 ⎦⎠ For i = 2: in the same way we can achieve K 1 , K 2 are the integers (8) p (C 2 ) = ∑ p j ; Since the estimated formula is easy for ∀j :Δ j ⊂ S 2 computation, the programming problem (*) can be solved via full search algorithm for input data with K2 p (B / C 2 ) = small dimensionality. For large dimensionality it’s N necessary that the more effective algorithm have been with developed. ⎛ ⎡ ⎤⎞ ⎛ ⎡ L2 ⎤ ⎞ ⎜ ⎢ ⎟ ⎜ L2 ⎥ ⎟ 5 Simulation program and validity testing K 2 = min ⎜ N , ⎢ ⎥⎟⎟ = min N , ⎢ ⎥ . ⎜ ⎜ ⎢ ∑ lr ⎥ ⎟ ⎝ ⎣ lT 2 ⎦ ⎠ ⎜ ⎢ ∀r : A ∈S ⎥ ⎟ Validity testing of the derived estimated formula has ⎝ ⎣ r 2 ⎦⎠ been carried out via program simulation. The program For i = 3: has been written in Delphi. It uses the exhausted search algorithm to solve the programming problem (*). It p (C 3 ) = 1 − p (C1 ) − p(C 2 ) ; allows simulating the performance of cache system and since the query (T, Δ j , ID j ) is replaced by database system, which have been defined in section 2. query (T1 , Δ j ∩ S1 , ID j ) and query It also allows registering the cache hit rate. Using the program, we have carried out a number of experiments (T2 , Δ j ∩ (S 2 \ SK T ), ID j ) , then cache with randomized data input. For all that we randomized the value of M and Q in interval [5..10]. The relative accessing for the original query is successful, if error in all these experiments is less than 2.5%. This and only if the cache accessing for both of the confirms the high accuracy of derived formula. substituting queries are successful. For example let’s consider the table T with 6 K K p(B / C 3 ) = p(BT 1 ) ⋅ p (BT 2 ) = 1 ⋅ 2 attributes, where A1 is key attribute. Lengths of N N attributes are 30, 37, 51, 21, 46, 11 (bytes). The attribute K1 ⋅ K 2 usage matrix is defined in table 2. p (B / C 3 ) = . N2 Thus the evaluative formula is Trans A1 A2 A3 A4 A5 A6 \Attrs K1 K K ⋅K Tr1 .194 0 .194 .194 0 .194 PT T = ⋅ p(C1 ) + 2 ⋅ p(C 2) + 1 2 2 ⋅ p(C 3 ) (2) 1 2 N N N Tr2 .089 .089 0 0 0 .089 Tr3 .281 .281 0 0 .281 0 Let’s mark Tr4 .157 .157 0 0 .157 .157 p (C1 ) p(C2 ) p (C3 ) Tr5 .279 0 0 0 .279 .279 a= ; b= ; c= N N N2 Table 2: example of attribute usage matrix. Formula (2) is rewritten as The cache size is 2000 (bytes), and the tuple count is PT T = a ⋅ K1 + b ⋅ K 2 + c ⋅ K1 ⋅ K 2 (3) 200 (tuples). 12 For this example the best scheme of partitions is Notice: if S2 = SKR, then p(C2) = p(C3) = 0. In this S1 = { A1 , A2 , A5 , A6 }, S 2 = { A1 , A3 , A4 } , case the partition is not the case. So the cache memory and cache lengths for T1, T2 are allocated for table T is completely assigned to table T1. K 1 = 16, K 2 = 0 . As a result of (2) we achieve formula (1). So the case, With this scheme the cache hit probability is when the original table is not partitioned, is only a 0.0649, and it has increased to 1.27 (times) according to special case of its partitioning. non-partitioning case. 4.4 The objective function Simulation has been carried out with 10000 run- throughs, in each of which has been executed 5000 For a table T, a usage matrix of its attributes, lengths of queries. Result gives the average value of cache hit rate its attributes and limited to L (bytes) cache memory, equaling to 0.065. 6 Conclusions In this paper we have derived the objective function for vertical partitioning with a new estimated criterion: cache hit probability. We also have carried out validity testing of the achieved formula via program simulation. The simplicity of the achieved formula and its high accuracy confirm the availability of its using in database design and database reconstruction to achieve the significant enhancement of system performance. We are currently developing a heuristic algorithm for solving the programming problem (*). We will discuss it in future work. References [1] Sharma Chakravarthy, Jaykumar Muthuraj, Ravi Varadarajan, Shamkant B. Navathe. An Objective Function for Vertically Partitioning Relations in Distributed Databases and its Analysis. In Distributed and Parallel Databases 2(2): 183- 207(1994). [2] Sanjay Agrawal, Vivek Narasayya, Beverly Yang. Integrating Vertical and Horizontal Partitioning into Automated Physical Database Design. In SIGMOD 2004, June, 2004. [3] C. J. Date. An Introduction To Database Systems – Seventh edition. Addision-Wesley Longman, Inc, 2000. [4] William Page, David Austin, Willard Baird II, Nicholas Chase and others. Special Edition: Using Oracle8/8I. QUE Corporation, 1999.