New Objective Function for Vertical Partitioning
                         in Database System.

                                                 © Thanh Hung Ngo
                                          Don State Technical University
                                              nthungla@yahoo.com

                                      PhD supervisor: Michael V. Grankov


                       Abstract.                               we use an attribute usage matrix (AUM) as the input
                                                               data. We will consider that number of sub-relations
    In this paper we introduce the objective                   equaling to two, and both of them being located in the
    function for vertical partitioning in database             same site.
    systems. It has been built with the new
    evaluative criterion: cache hit probability.               2 Cache system and database system
    Testing the validity of the derived evaluative
    formula via program simulation shows its high              specification
    accuracy.                                                   Let’s consider a table T in a database system. Tuples of
    Index terms: vertical partitioning, objective              T have M attributes, and cache memory for its caching
    function, evaluation criterion, cache hit                  is limited to L bytes.
    probability.                                                    Assume that each transaction queries a subset of
                                                               attributes of a tuple by its record number in the table. It
                                                               means that a transaction is a sequence of three
1 Introduction                                                 operations: select table, select a subset of attributes and
                                                               select a tuple by record number in table. Then a
Vertical partitioning of a table splits it into two or more
tables (which we refer to sub-tables), each of which
                                                                                                            (
                                                               transaction can be defined as a triplet T, Δ j , ID j ,  )
contains a disjoint subset of the attributes of the original   where, T – table name, Δ j – the sub-set of query
table, except for the key attributes. Since a subset of
                                                               attributes, ID j – the query record number. Assume that
attributes is more frequently accessed by transactions
than the subset of the rest attributes, so vertical            these selects are independent between one and the
partitioning can reduce the amount of data that needs to       others, and the select of record number obeys uniform
be scanned to answer the transactions.                         law. Let us have Q different transactions and there
     As indicated in [1], many data partitioning               attribute usage is described in table 1:
algorithms have been developed basing on statistical
and pattern classification and analysis. They cluster data             Trans\Attrs     A1     A2      …        AM
using various criteria. The most common used criterion                     Tr1         u11    u12     …        u1M
is the square-error. The lack of these algorithms is that                  Tr2         u21    u22     …        u2M
they don’t estimate the “goodness” of partition schemes                    …           …      …       …         …
in the relation with an index of system performance. In                   TrQ          uQ1    uQ2     …        uQM
[2], authors develop the complex method, in which they
first create a set of “candidate” partition schemes, and                    Table 1: attribute usage matrix.
then analyze each of them via optimization unit of             where uij equals either pi or 0, and pi is the execution
database system to find the best scheme. The work is of        probability of transaction Tri. Thus the following
great interested, but it is too complicated. On the other      expressions are the case:
hand the performance of optimization unit strongly                               p1 + p2 + K + pQ = 1 ,
depends on the current tuning of database system.
     In this paper we will build the objective function                          0 ≤ p i ≤ 1, 1 ≤ i ≤ Q .
for vertical partitioning with the evaluative criterion: the        Cache is divided into parts, each of which contains
cache hit probability. As many algorithms in this area         tuples of one table. Write to cache and read from it are
                                                               executed per tuple. A tuple is loaded into cache if at
Proceedings of the Spring Young Researcher's                   least one of its attribute is claimed. At first time cache
Colloquium On Database and Information Systems                 is empty. In operation it is filled with tuples. For the
SYRCoDIS, St.-Petersburg, Russia, 2008                         current transaction if copy of the query tuple is in
cache, then it is fetched from cache to answer the query.                  4 Deriving of the objective function
Otherwise it is fetched from disk and its copy is
inserted into cache. In the first case we say that it was a                In this section we will derive the objective function,
cache hit, in the last case – a cache miss. After cache                    which estimate the cache hit probability for caching the
having been filled, a new tuple will substitute one                        original table and its sub-tables. The cache system and
among the present tuples. We will consider that the                        database system have been specified in section 2. The
cache replacement strategy being randomized. Since the                     method implementing the vertical partitioning was
period, while cache hasn’t been filled yet, is a                           described in section 3.
transitional period, we can derive the objective
                                                                           4.1 Preliminaries
function, having skipped this period.
     The assumptions, are described in this section, has                   The following are used in derivation of the objective
been made as a result of deductive analysis the literature                 function.
[1], [2], [3], [4]. Note that these assumptions, except for                     N – the tuple count of table T as well as T1 and T2.
the attribute usage matrix, are only for the purpose of                         M – the attribute count of tuples of table T.
mathematical deriving and validity testing of the                               li – length of attribute i of table T for i = 1, 2, …, M.
estimated formula, but not for implemented areas of the                         lT, lT1, lT2 – tuple length of table T, T1, T2.
derived objective function.                                                     L, L1, L2 – size of part of cache memory, allocated
                                                                           for caching table T, T1, T2. L1, L2 – the parameters,
3     Method                        implementing             vertical      which satisfy the restriction (L1 + L2 ≤ L).
partitioning                                                                    K, K1, K2 – number of tuples of table T, T1, T2,
                                                                           which can exist in the corresponding part of cache
In this and in the following sections we will consider the                 memory at the same time.
vertical partitioning of table T into two sub-tables T1, T2.                    Operation [ ] – operation achieving the integer part
Attribute sets of tables T, T1, T2 are S, S1, S2                           of a fraction.
correspondingly. These following expressions are the
case:                                                                      4.2 Cache hit probability for caching the original
                        S1 ∪ S 2 = S ,                                     table
                       S1 ∩ S 2 = SK T ,                                   In this case any transaction queries a tuple of table T,
                                    S1 > SK T ,                            which may be one among K tuples, copy of which has
                                                                           already existed in cache, or may be one among the rest
                                    S 2 ≥ SK T ,                           (N – K) tuples, which have not any copies in cache. If
where, SKT – the sub-set, containing all key attribute of                  the first occurrence is the case, finding of query tuple in
table T.                                                                   cache will successfully finish, i.e. it will be a cache hit.
     Then the table T is replaced with two sub-tables T1,                  Since the select of tuples obeys the uniform law, the
T2. Using method simulating the vertical partitioning                      cache hit probability is estimated by expression:
[2], we create regular tables, one for each sub-table. The                                                      K
queries referencing the original table T, is rewritten to                                                  PT =                         (1)
                                                                                                                N
execute against the sub-tables T1, T2. If assume that any
query contains at least one of non-key attributes, then a                                                          ⎛ ⎡              ⎤⎞
                                                                                                                   ⎜ ⎢              ⎥⎟
        (               )
query T, Δ j , ID j is replaced only in one of three
                                                                                                ⎛ ⎡ L ⎤⎞
                                                                                where, K = min⎜ N , ⎢ ⎥ ⎟ = min⎜ N , ⎢
                                                                                                ⎜             ⎟
                                                                                                                             L
                                                                                                                                    ⎥⎟ .
                                                                                                     ⎣ l    ⎦      ⎜        ∑   l    ⎟
                                                                                                ⎝             ⎠    ⎜ ⎢⎢ ∀r : A ∈S ⎥⎥ ⎟
                                                                                                         T                        r
variants:
     • 1-st variant: by one query, referencing sub-                                                                ⎝ ⎣        r     ⎦⎠
                            (          )
          table T1, T1 , Δ j , ID j , if all the query                     4.3 Cache hit probability for caching the sub-tables
            attributes are in S , i.e. (Δ j ⊂ S1 ) ;
                                       1                                   Let’s mark the occurrence, when the original query is
    •       2-nd variant: by one query, referencing sub-                   replaced in the first, the second, the third variant, the
                            (          )
            table T2, T2 , Δ j , ID j , if all the query                   event C1, C2, C3; the occurrence, when the finding of
            attributes are in S , i.e. (Δ j ⊂ S 2 ) ;
                                                                           query tuple in cache finishes successfully, the event B.
                                       2                                   According to formula of complete probability, cache hit
    •       3-rd       variant:       by               two       queries   probability is estimated by expression:
             (                    )
             T1 , Δ j ∩ S1 , ID j and                                                                  3
                                                                                             PT T = ∑ ( p(B / C i ) ⋅ p(C i ))
             (T2 , Δ j ∩ (S 2 \ SK T ), ID j ) simultaneously in                               1 2
                                                                                                      i =1
                                                                                For i = 1:
            the rest cases, i.e.
                                                                                                     p (C1 ) =       ∑
                ((              )      ) ((
                  Δ j ∩ S1 ≠ ∅ ∧ Δ j ∩ S2 ⊄ SKT .  )         )                                                   ∀j :Δ j ⊂ S1
                                                                                                                                pj ;

    The cache allocated for caching the table T is now
divided into two parts for caching its sub-tables. Sizes
                                                                                                             (                  )
                                                                                     since the query T, Δ j , ID j is replaced by the
of these parts are the parameters for partitioning                                             (                 )
                                                                                     query T1 , Δ j , ID j , then the cache accessing
algorithm.                                                                           for the original query is successful, if and only
         if the cache accessing for the substituting query                 allocated for its caching, the best scheme of partitions is
         is successful. Similarly to (1):                                  defined as the result of the programming problem (*):
                                       K
                         p (B / C 1 ) = 1                                    P (S1 , K1 , K 2 ) = a ⋅ K1 + b ⋅ K 2 + c ⋅ K1 ⋅ K 2 → max (4)
                                        N
         with                                                                  With these restrictions:
                                         ⎛ ⎡             ⎤⎞                                   SK T ⊂ S1 , S1 ⊆ S ,                      (5)
                    ⎛ ⎡ L1 ⎤ ⎞           ⎜ ⎢             ⎥⎟
                                                  L                                           K 1 ⋅ lT 1 + K 2 ⋅ lT 2 ≤ L ,
          K 1 = min ⎜ N , ⎢ ⎥ ⎟ = min ⎜ N , ⎢       1
                                                         ⎥⎟ .                                                                           (6)
                    ⎜            ⎟       ⎜
                           l
                    ⎝ ⎣ T1 ⎦ ⎠                     ∑  l r ⎟
                                         ⎜ ⎢⎢ ∀r : A ∈S ⎥⎥ ⎟                                    K 1 ≥ 0, K 2 ≥ 0 ,                      (7)
                                         ⎝ ⎣        r 1 ⎦⎠
    For i = 2: in the same way we can achieve                                                 K 1 , K 2 are the integers     (8)
                        p (C 2 ) = ∑ p j ;                                     Since the estimated formula is easy for
                                      ∀j :Δ j ⊂ S 2                        computation, the programming problem (*) can be
                                                                           solved via full search algorithm for input data with
                                                K2
                               p (B / C 2 ) =                              small dimensionality. For large dimensionality it’s
                                                N                          necessary that the more effective algorithm have been
            with                                                           developed.
                                      ⎛ ⎡           ⎤⎞
                   ⎛ ⎡ L2 ⎤ ⎞         ⎜ ⎢             ⎟
                                      ⎜       L2 ⎥ ⎟                       5 Simulation program and validity testing
         K 2 = min ⎜ N , ⎢  ⎥⎟⎟ = min   N , ⎢       ⎥   .
                   ⎜                  ⎜ ⎢ ∑ lr ⎥ ⎟
                   ⎝ ⎣ lT 2 ⎦ ⎠       ⎜ ⎢ ∀r : A ∈S ⎥ ⎟                    Validity testing of the derived estimated formula has
                                      ⎝ ⎣       r 2 ⎦⎠                     been carried out via program simulation. The program
    For i = 3:                                                             has been written in Delphi. It uses the exhausted search
                                                                           algorithm to solve the programming problem (*). It
                       p (C 3 ) = 1 − p (C1 ) − p(C 2 ) ;
                                                                           allows simulating the performance of cache system and
            since the query         (T, Δ j , ID j ) is replaced by        database system, which have been defined in section 2.
            query         (T1 , Δ j ∩ S1 , ID j ) and query                It also allows registering the cache hit rate. Using the
                                                                           program, we have carried out a number of experiments
             (T2 , Δ j ∩ (S 2 \ SK T ), ID j ) , then cache                with randomized data input. For all that we randomized
                                                                           the value of M and Q in interval [5..10]. The relative
        accessing for the original query is successful, if
                                                                           error in all these experiments is less than 2.5%. This
        and only if the cache accessing for both of the
                                                                           confirms the high accuracy of derived formula.
        substituting queries are successful.
                                                                                For example let’s consider the table T with 6
                                                  K K
            p(B / C 3 ) = p(BT 1 ) ⋅ p (BT 2 ) = 1 ⋅ 2                     attributes, where A1 is key attribute. Lengths of
                                                  N N                      attributes are 30, 37, 51, 21, 46, 11 (bytes). The attribute
                                      K1 ⋅ K 2                             usage matrix is defined in table 2.
                       p (B / C 3 ) =           .
                                        N2
    Thus the evaluative formula is                                          Trans      A1       A2        A3       A4         A5     A6
                                                                            \Attrs
               K1           K            K ⋅K                                Tr1       .194      0      .194      .194          0    .194
    PT T =        ⋅ p(C1 ) + 2 ⋅ p(C 2) + 1 2 2 ⋅ p(C 3 ) (2)
      1 2      N             N            N                                  Tr2       .089    .089       0         0           0    .089
                                                                             Tr3      .281     .281       0         0         .281     0
    Let’s mark                                                               Tr4       .157    .157       0         0         .157   .157
               p (C1 )        p(C2 )                       p (C3 )           Tr5       .279      0        0         0         .279   .279
              a=       ; b=          ;                c=
                 N              N                           N2                    Table 2: example of attribute usage matrix.
    Formula (2) is rewritten as                                            The cache size is 2000 (bytes), and the tuple count is
             PT T = a ⋅ K1 + b ⋅ K 2 + c ⋅ K1 ⋅ K 2                  (3)   200 (tuples).
               12
                                                                               For this example the best scheme of partitions is
     Notice: if S2 = SKR, then p(C2) = p(C3) = 0. In this                            S1 = { A1 , A2 , A5 , A6 }, S 2 = { A1 , A3 , A4 } ,
case the partition is not the case. So the cache memory                    and cache lengths for T1, T2 are
allocated for table T is completely assigned to table T1.                                       K 1 = 16, K 2 = 0 .
As a result of (2) we achieve formula (1). So the case,
                                                                               With this scheme the cache hit probability is
when the original table is not partitioned, is only a
                                                                           0.0649, and it has increased to 1.27 (times) according to
special case of its partitioning.
                                                                           non-partitioning case.
4.4 The objective function                                                     Simulation has been carried out with 10000 run-
                                                                           throughs, in each of which has been executed 5000
For a table T, a usage matrix of its attributes, lengths of                queries. Result gives the average value of cache hit rate
its attributes and limited to L (bytes) cache memory,                      equaling to 0.065.
6 Conclusions
In this paper we have derived the objective function for
vertical partitioning with a new estimated criterion:
cache hit probability. We also have carried out validity
testing of the achieved formula via program simulation.
The simplicity of the achieved formula and its high
accuracy confirm the availability of its using in database
design and database reconstruction to achieve the
significant enhancement of system performance.
     We are currently developing a heuristic algorithm
for solving the programming problem (*). We will
discuss it in future work.

References
[1] Sharma Chakravarthy, Jaykumar Muthuraj, Ravi
    Varadarajan, Shamkant B. Navathe. An Objective
    Function for Vertically Partitioning Relations in
    Distributed Databases and its Analysis. In
    Distributed and Parallel Databases 2(2): 183-
    207(1994).
[2] Sanjay Agrawal, Vivek Narasayya, Beverly Yang.
    Integrating Vertical and Horizontal Partitioning into
    Automated Physical Database Design. In SIGMOD
    2004, June, 2004.
[3] C. J. Date. An Introduction To Database Systems –
    Seventh edition. Addision-Wesley Longman, Inc,
    2000.
[4] William Page, David Austin, Willard Baird II,
    Nicholas Chase and others. Special Edition: Using
    Oracle8/8I. QUE Corporation, 1999.