<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Applying A Discrete Particle Swarm Optimization Algorithm to Database Vertical Partition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Bilal</forename><surname>Benmessahel</surname></persName>
							<email>bilal.benmessahel@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">département d&apos;informatique</orgName>
								<orgName type="institution">Université Ferhat ABBAS-SETIF</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mohamed</forename><surname>Touahria</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">département d&apos;informatique</orgName>
								<orgName type="institution">Université Ferhat ABBAS-SETIF</orgName>
							</affiliation>
						</author>
					<title level="a" type="main">Applying A Discrete Particle Swarm Optimization Algorithm to Database Vertical Partition</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3ED3A290270E95E9B4A3E2B9DB625A89</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Database vertical partition</term>
					<term>Particle swarm optimization</term>
					<term>RG String</term>
					<term>Genetic algorithms</term>
					<term>Optimization</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Vertical partition is an important technique in database design, used to enhance performance in database systems. Vertical fragmentation is a combinatorial optimization problem that is NP-hard in most cases. We propose an application and adaptation of an improved combinatorial particle swarm optimization (ICPSO) algorithm to the vertical fragmentation problem. The original CPSO algorithm [3] suffers from a major drawback: redundant encoding. This paper applies an improved version of CPSO that uses the restricted growth (RG) string [5] constraint to manipulate the particles, so that redundant particles are excluded during the PSO process. The effectiveness and efficiency of the improved CPSO algorithm are illustrated through several database design problems, ranging from 10 attributes/8 transactions to 50 attributes/50 transactions. In all cases, our design solutions match the global optimum solutions.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Vertical partition (also called vertical fragmentation) is the problem of clustering the attributes of a relation into fragments for subsequent allocation. The technique is used to minimize the execution time of the user applications that run on these fragments, and it provides an important tool for designing distributed database systems. Compared to other types of data fragmentation, vertical partition is more complicated than horizontal partition because of the larger number of possible alternatives <ref type="bibr" target="#b0">[1]</ref>. Vertical partition algorithms contain two essential parts: the optimization method and the objective function. Ozsu and Valduriez <ref type="bibr" target="#b0">[1]</ref> argue that finding the best partition scheme for a relation with m attributes by exhaustive search must compare at least the mth Bell number of possible fragments, which means that such an algorithm has a complexity of O(m^m). Thus, it is more feasible to look for heuristic methods to seek optimal solutions. On the other hand, database partition aims at enhancing transactional processing in the database; the objective function evaluates whether such a goal is achieved.</p><p>Most previous algorithms employ multiple iterations of binary partition to approximate m-way partition. <ref type="bibr">Navathe, Ceri, Wiederhold, and Dou (1984)</ref> propose the Recursive Binary Partition Algorithm (RBPA), which extends Hoffer and Severance's work by automating the selection process of vertical fragments; they propose some empirical objective functions. Cornell and Yu (1990) adopt the same approach but replace the empirical objective functions with one constructed on a model database. Chu and Ieong (1992) adopt transactions as units in their algorithm; however, it is still a binary partition approach.</p><p>Efforts have also been made to apply other optimization techniques to vertical partition. <ref type="bibr">Hammer and Niamir (1979)</ref> propose a hill-climbing method that alternately groups and regroups attributes and fragments to reach a suboptimal solution. <ref type="bibr" target="#b5">Song and Gorla (2000)</ref> solve the problem with a GA; however, each run of their GA only produces a binary partition, so the GA only provides the intermediate results in a recursive process. Our aim in this paper is to propose a pure PSO solution to vertical partition: by pure we mean that the direct result of the PSO execution is already an m-way partition. The work described in this paper considers the vertical partition problem and reports a combinatorial PSO application that eliminates the encoding redundancy by using a restricted growth (RG) string <ref type="bibr" target="#b2">[3]</ref> constraint in constructing particles. To evaluate its effectiveness and demonstrate its superiority, we compare the results of the improved CPSO with those of a traditional GA that uses random encoding and classical object-based GA operators, called SGA, and with another GA-based algorithm, the Group oriented Restricted Growth String GA (GRGS-GA), developed to solve the vertical partition problem by <ref type="bibr" target="#b1">Du et al. (2006)</ref> <ref type="bibr" target="#b1">[2]</ref>. The balance of the paper is structured as follows. Section 2 introduces the partition evaluator (PE) developed by Chakravarthy et al. <ref type="bibr" target="#b3">[4]</ref>; this evaluator is used as the fitness function for the proposed approach and for the two other approaches used in the experiments. Section 3 presents particle swarm optimization together with a brief overview of RG strings. Section 4 develops an improved particle swarm optimization for the vertical partition problem. Section 5 presents the application of the improved CPSO algorithm to the vertical partition problem, compares the proposed approach with the two other approaches, SGA and GRGS-GA, and demonstrates that the improved CPSO can effectively find optimal solutions even for large vertical partition problems. Section 6 gives a summary and conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Objective functions for vertical partition</head><p>The two kinds of objective functions used by partition algorithms are 1) model cost functions, based on transaction access analysis on a model DBMS, and 2) functions based on empirical assumptions. The former kind is specific to the underlying DBMS, while the latter is more general and intuitive <ref type="bibr" target="#b1">[2]</ref>.</p><p>In addition to the attribute usage matrix (AUM) used as input by both types of objective functions, the model cost function takes into account the specific access plan chosen by the query optimizer, e.g., the join method and the type of scan on the relation by each transaction type. Without this additional information, an empirical objective function only shows the trends in the cost that are affected by the partition process. However, it is useful for the logical design of a database, when information about physical parameters may not be available. Although less precise than model cost functions, empirical functions can be very effective in comparing the different optimization techniques used by algorithms. In this paper, we use an empirical objective function, a modified version of the partition evaluator proposed by <ref type="bibr" target="#b3">Chakravarthy et al. (1992)</ref>. This partition evaluator uses the square-error criterion commonly applied in clustering strategies; we thus name it the Square-Error Partition Evaluator (SEPE). The SEPE consists of two major cost factors: the irrelevant local attribute access cost and the relevant remote attribute access cost. They represent the additional cost required beyond the ideal minimum cost, where the ideal cost is the cost when transactions only access the attributes in a single fragment and encounter no irrelevant attributes in that fragment. Both costs are calculated as square-error terms, denoted E_M^2 and E_R^2, respectively. More details about SEPE, including the formula, can be found in <ref type="bibr" target="#b3">Chakravarthy et al. (1992)</ref>.</p></div>
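The two cost factors can be illustrated with a small sketch. This is not the published SEPE formula (which is in Chakravarthy et al. (1992)); the function name `schematic_pe`, the "home fragment" choice, and the plain squared counts are our own simplifications, shown only to make the irrelevant-local versus relevant-remote distinction concrete.

```python
def schematic_pe(aum, partition):
    """Schematic square-error style evaluator (illustration only, not SEPE).

    aum: list of transaction rows, each a list of access frequencies per attribute.
    partition: partition[j] = fragment label of attribute j.
    For each transaction we pick the fragment holding most of its relevant
    attributes ("home"), then accumulate the square of the number of
    irrelevant attributes in that fragment and the square of the number of
    relevant attributes left in other fragments.
    """
    fragments = {}
    for attr, frag in enumerate(partition):
        fragments.setdefault(frag, set()).add(attr)
    irrelevant_cost = relevant_remote_cost = 0
    for row in aum:
        used = {a for a, freq in enumerate(row) if freq > 0}
        home = max(fragments, key=lambda f: len(fragments[f] & used))
        irrelevant_cost += len(fragments[home] - used) ** 2
        relevant_remote_cost += len(used - fragments[home]) ** 2
    return irrelevant_cost + relevant_remote_cost
```

A partition whose fragments exactly match each transaction's attribute set scores 0, the ideal cost described above; lumping everything into one fragment is penalized by the irrelevant-attribute term.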
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Particle swarm optimization</head><p>PSO, introduced by Kennedy and Eberhart <ref type="bibr" target="#b7">[8]</ref>, is one of the most recent metaheuristics; it is inspired by the swarming behavior of animals and by human social behavior. Scientists found that the synchrony of animals' behavior is maintained through optimal distances between individual members and their neighbors; velocity therefore plays the important role of adjusting each member to the optimal distance. Furthermore, simulating the scenario in which birds search for food, scientists observed that in order to find food the individual members determine their velocities according to two factors: their own best previous experience and the best experience of all other members. This is similar to human decision making, where people consider their own best past experience and the best experience of the people around them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PSO algorithm</head><p>The general principles of the PSO algorithm are stated as follows. Similarly to an evolutionary computation technique, PSO maintains a population of particles, where each particle represents a potential solution to an optimization problem.</p><p>Let m be the size of the swarm. Each particle i can be represented as an object with several characteristics. Suppose that the search space is n-dimensional; then the ith particle can be represented by an n-dimensional position vector X_i = (x_i1, x_i2, ..., x_in) and a velocity V_i = (v_i1, v_i2, ..., v_in), where i = 1, 2, ..., m.</p><p>In PSO, particle i remembers the best position it has visited so far, referred to as P_i = (p_i1, p_i2, ..., p_in), and the best position of the best particle in the swarm, referred to as G = (g_1, g_2, ..., g_n). PSO is similar to an evolutionary computation algorithm: in each generation t, particle i adjusts its velocity v_ij and position x_ij for each dimension j by referring to, with random multipliers, the personal best position p_ij and the swarm's best position g_j, using Eqs. ( <ref type="formula">1</ref>) and ( <ref type="formula">2</ref>), as follows:</p><formula xml:id="formula_0">v_ij(t) = v_ij(t-1) + c1 r1 (p_ij - x_ij(t-1)) + c2 r2 (g_j - x_ij(t-1))   (1)
x_ij(t) = x_ij(t-1) + v_ij(t)   (2)</formula><p>where c1 and c2 are the acceleration constants and r1 and r2 are random real numbers drawn from [0, 1]. Thus the particle flies through potential solutions toward P_i and G while still exploring new areas. Such a stochastic mechanism may allow escape from local optima. Since there is no actual mechanism for controlling the velocity of a particle, it is necessary to impose a maximum value V_max on it: if the velocity exceeds this threshold, it is set equal to V_max, which controls the maximum travel distance at each iteration, to avoid a particle flying past good solutions. 
The PSO algorithm is terminated after a maximal number of generations, or when the best particle position of the entire swarm cannot be improved further after a sufficiently large number of generations.</p><p>The aforementioned velocity problem was addressed by incorporating a weight parameter on the previous velocity of the particle. Thus, in later versions of PSO, Eqs. (<ref type="formula">1</ref>) and (<ref type="formula">2</ref>) are changed into the following ones:</p><formula xml:id="formula_1">v_ij(t) = χ (w v_ij(t-1) + c1 r1 (p_ij - x_ij(t-1)) + c2 r2 (g_j - x_ij(t-1)))   (3)
x_ij(t) = x_ij(t-1) + v_ij(t)   (4)</formula><p>w is called the inertia weight and is employed to control the impact of the previous history of velocities on the current one. Accordingly, this parameter regulates the trade-off between the global and local exploration abilities of the swarm. A large inertia weight facilitates global exploration, while a small one tends to facilitate local exploration. A suitable value for the inertia weight usually provides balance between global and local exploration abilities and consequently reduces the number of iterations required to locate the optimum solution. χ is a constriction factor, which is used to limit the velocity.</p><p>The PSO algorithm has shown its robustness and efficacy for solving function-value optimization problems in real number spaces. Only a few studies have been conducted on extending PSO to combinatorial optimization problems.</p></div>
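As a minimal sketch of the inertia-weighted update of Eqs. (3) and (4) with the V_max clamp, for one particle (the function name `pso_step` and the default parameter values are our own choices, not from the paper):

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, vmax=4.0):
    """One inertia-weighted PSO update for a single particle.

    x, v, pbest, gbest are lists of floats of equal length; returns the
    new position and velocity. The velocity is clamped to [-vmax, vmax].
    """
    new_x, new_v = [], []
    for j in range(len(x)):
        r1, r2 = random.random(), random.random()
        vj = w * v[j] + c1 * r1 * (pbest[j] - x[j]) + c2 * r2 * (gbest[j] - x[j])
        vj = max(-vmax, min(vmax, vj))  # impose the V_max threshold
        new_v.append(vj)
        new_x.append(x[j] + vj)
    return new_x, new_v
```

When the particle already sits at both its personal and the swarm best, the random terms vanish and only the weighted previous velocity (clamped by V_max) moves it, which is exactly the exploration behavior described above.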
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">An improved CPSO for Database Vertical Partition</head><p>In this paper, we propose an improved version of the combinatorial CPSO algorithm <ref type="bibr" target="#b2">[3]</ref> aimed at solving the database vertical partition problem. The CPSO algorithm suffers from a major drawback: redundant encoding. We apply the restricted growth (RG) string <ref type="bibr" target="#b4">[5]</ref> constraint to manipulate the particles so that redundant particles are excluded during the PSO process. The following section presents the basics of RG strings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Basics of RG strings</head><p>The Restricted Growth (RG) string encoding represents a grouping solution as an array of integers, denoted a[n], where n is the number of attributes in the relation. The elements in the array may take integer values from 1 to n. Meanwhile, as constituents of an RG string, they must satisfy Definition 1, given next, together with two supporting definitions. Definition 1. An RG string r is a sequence of integers represented as an array, which satisfies the following constraint:</p><formula xml:id="formula_2">r_1 = 1,  and  r_(i+1) ≤ 1 + max(r_1, ..., r_i)  for i = 1, ..., n-1</formula><p>For example, {1 1 2 3 1 1 2 4} is an RG string, but {4 4 2 3 4 4 2 1} is not, although they map to the same solution in a random string encoding scheme.</p><p>Definition 2. The degree of RG string r is the largest value in r, denoted d(r).</p><p>For example, consider r = {11123221}; then d(r) = 3.</p><p>Definition 3. The ith prefix of RG string r, denoted r^(i), is the substring that includes the first i values of r.</p><p>For example, consider r = {11123221}; r^(4) = {1112}.</p></div>
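The constraint and the two supporting definitions can be checked mechanically. The following helpers are a direct transcription (the names `is_rg_string`, `degree`, and `prefix` are illustrative, not from the paper):

```python
def is_rg_string(r):
    """Restricted-growth check: r[0] == 1 and each later value exceeds
    the running maximum of the preceding values by at most 1."""
    if not r or r[0] != 1:
        return False
    running_max = 1
    for v in r[1:]:
        if v > running_max + 1:
            return False
        running_max = max(running_max, v)
    return True

def degree(r):
    """Degree d(r): the largest value in the RG string (Definition 2)."""
    return max(r)

def prefix(r, i):
    """The ith prefix: the first i values of r (Definition 3)."""
    return r[:i]
```

The two example strings from Definition 1 behave as stated: the first passes the check, the second fails it even though both describe the same grouping.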
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Definition of a particle</head><p>Denote by Y_i = (y_i1, y_i2, ..., y_in) the n-dimensional state vector associated with the solution X_i = (x_i1, x_i2, ..., x_in); each y_ij takes a value in {-1, 0, 1} according to the state of the solution of the ith particle at iteration t.</p><p>Y_i is a dummy variable used to permit the transition from the combinatorial state to the continuous state and vice versa.</p><formula xml:id="formula_3">y_ij = 1 if x_ij = g_j;  y_ij = -1 if x_ij = p_ij;  y_ij = -1 or 1 (at random) if x_ij = g_j = p_ij;  y_ij = 0 otherwise   (5)</formula></div>
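A sketch of the state assignment of Eq. (5) follows (the function name `particle_state` is ours; the tie between personal and swarm best is broken at random, as the equation specifies):

```python
import random

def particle_state(x, pbest, gbest):
    """Dummy state vector Y of Eq. (5): 1 where the position matches the
    swarm best, -1 where it matches the personal best, a random choice in
    {-1, 1} where it matches both, and 0 elsewhere."""
    y = []
    for xj, pj, gj in zip(x, pbest, gbest):
        if xj == gj == pj:
            y.append(random.choice([-1, 1]))
        elif xj == gj:
            y.append(1)
        elif xj == pj:
            y.append(-1)
        else:
            y.append(0)
    return y
```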
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Velocity</head><p>Let d1 = -1 - y_ij(t-1) be the distance between x_ij(t-1) and the best solution obtained by the ith particle, and let d2 = 1 - y_ij(t-1) be the distance between the current solution x_ij(t-1) and the best solution obtained in the swarm.</p><p>The update equation for the velocity term used in CPSO is then:</p><formula xml:id="formula_4">v_ij(t) = w v_ij(t-1) + r1 c1 d1 + r2 c2 d2   (6)
v_ij(t) = w v_ij(t-1) + r1 c1 (-1 - y_ij(t-1)) + r2 c2 (1 - y_ij(t-1))   (7)</formula><p>With this function, the change of the velocity v_ij depends on the value of y_ij(t-1). If x_ij = g_j, then y_ij = 1; d2 becomes "0" and d1 becomes "-2", forcing the velocity to change in the negative sense.</p><p>If x_ij = p_ij, then y_ij = -1; d2 becomes "2" and d1 becomes "0", forcing the velocity to change in the positive sense.</p><p>In the case where x_ij ≠ g_j and x_ij ≠ p_ij, y_ij is "0", d2 equals "1" and d1 equals "-1", and the parameters r1, r2, c1 and c2 determine the sense of the change of the velocity.</p><p>In the case where x_ij = g_j and x_ij = p_ij, y_ij takes a random value in {-1, 1}, forcing the velocity to change in the sense opposite to the sign of y_ij.</p></div>
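The velocity rule of Eqs. (6) and (7) can be sketched as follows (illustrative only; the fixed r1, r2 values and the default parameters are our own, used here to make the sign behavior easy to check):

```python
def cpso_velocity(v_prev, y_prev, w=0.7, c1=1.0, c2=1.0, r1=0.5, r2=0.5):
    """Velocity update of Eqs. (6)-(7): d1 = -1 - y and d2 = 1 - y are the
    distances to the personal-best state (-1) and the swarm-best state (1)."""
    out = []
    for vj, yj in zip(v_prev, y_prev):
        d1 = -1 - yj  # pull toward the personal best
        d2 = 1 - yj   # pull toward the swarm best
        out.append(w * vj + r1 * c1 * d1 + r2 * c2 * d2)
    return out
```

With symmetric coefficients this reproduces the three cases above: y = 1 pushes the velocity negative, y = -1 pushes it positive, and y = 0 leaves the sign to the random multipliers.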
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Construction of a particle solution</head><p>The update of the solution is computed through an intermediate value λ_ij:</p><formula xml:id="formula_5">λ_ij(t) = y_ij(t-1) + v_ij(t)   (8)</formula><p>The value of y_ij is adjusted according to the following function:</p><formula xml:id="formula_6">y_ij(t) = 1 if λ_ij(t) &gt; α;  y_ij(t) = -1 if λ_ij(t) &lt; -α;  y_ij(t) = 0 otherwise   (9)</formula><p>The new solution is:</p><formula xml:id="formula_7">x_ij(t) = g_j if y_ij(t) = 1;  x_ij(t) = p_ij if y_ij(t) = -1;  x_ij(t) = a random value otherwise   (10)</formula><p>The earlier choice of assigning a random value in {1, -1} to y_ij in the case of equality x_ij = p_ij = g_j ensures that the variable y_ij can take the value 0, permitting a change in the value of the variable x_ij. We define a parameter α for tuning intensification against diversification. For a small value of α, x_ij takes one of the two values p_ij or g_j (intensification). In the opposite case, we force the algorithm to assign a null value to y_ij, inducing for x_ij a value different from p_ij and g_j (diversification). The parameters c1 and c2 weight the importance of the solutions p_ij and g_j in the generation of the new solution X_i; they also play a role in the intensification of the search.</p></div>
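Eqs. (8)-(10) can be sketched together (the name `new_position` and the default `alpha` are our own choices; the random branch stands in for the random assignment of Eq. (10), drawn here from the fragment labels 1..k):

```python
import random

def new_position(y_prev, v, pbest, gbest, k, alpha=0.5):
    """Construct a new solution: lambda = y + v (Eq. (8)) is thresholded by
    alpha into a new state y (Eq. (9)), which then selects the swarm best
    (y = 1), the personal best (y = -1), or a random fragment label
    otherwise (Eq. (10)). k is the number of fragments."""
    x = []
    for yj, vj, pj, gj in zip(y_prev, v, pbest, gbest):
        lam = yj + vj                      # Eq. (8)
        if lam > alpha:                    # Eq. (9)
            y_new = 1
        elif lam < -alpha:
            y_new = -1
        else:
            y_new = 0
        if y_new == 1:                     # Eq. (10)
            x.append(gj)
        elif y_new == -1:
            x.append(pj)
        else:
            x.append(random.randint(1, k))
    return x
```

A larger `alpha` widens the middle band of Eq. (9), so more dimensions fall into the random branch: this is the diversification effect described in the text.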
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Solution representation</head><p>In this subsection, we describe the formulation of the improved CPSO algorithm for the database vertical partition problem. One of the most important issues when designing the PSO algorithm lies in its solution representation. We set up an n-dimensional search space for a relation of n attributes. Each dimension represents an attribute, and a particle X_i = (x_i1, x_i2, ..., x_in) corresponds to the assignment of the n attributes, such that x_ij ∈ {1, 2, 3, ..., k}, where k is the number of fragments. The scheme of Fig. <ref type="figure" target="#fig_0">1</ref> shows an example of such a representation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Initial population</head><p>A swarm of particles is constructed based on RG strings. In the generated particles, the RG constraint is enforced from the beginning: no rectification process is needed after all positions in a particle are randomly created, because each position is drawn as a random integer between 1 and the highest admissible degree for that position, complying with the RG string constraint. In the initial population, the first element is set to 1, and the upper bound for each later element increases gradually; it rarely reaches k-1, where k is the number of fragments anticipated by the user.</p></div>
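A sketch of this RG-compliant initialization (assuming, as the text says, that each position's upper bound is one more than the running maximum, capped by the anticipated number of fragments k; the function name is ours):

```python
import random

def random_rg_particle(n, k):
    """Draw a random particle satisfying the RG constraint by construction:
    the first element is 1 and each later element is a random integer
    between 1 and min(1 + current maximum, k), so no rectification pass
    is needed after initialization."""
    r = [1]
    for _ in range(1, n):
        upper = min(max(r) + 1, k)
        r.append(random.randint(1, upper))
    return r
```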
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7">Creating a new solution</head><p>To create a new solution, we first use Eq. (5) to obtain the state vector y_i(t-1). The velocity vector V_i is computed with Eq. (6), λ_ij is calculated using Eq. (8), y_ij(t) is determined using Eq. (9), and the new solution vector X_i is determined using Eq. (10). However, the new solution can break the RG constraint; for this purpose we designed a rectifier. The rectifier is a key component of the proposed approach, ensuring that each particle is an RG string. While the initialization of particles enforces the RG property, the operation of creating new solutions may change the constitution of a particle in a way that violates the RG string constraint. To handle such cases, we introduce a rectifying function that guarantees each particle to be an RG string: for particles that violate the constraint, the rectifier simply scans through the particle and converts it into an RG string by adjusting the values at its positions.</p></div>
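One standard way to implement such a rectifier (a sketch; the paper does not spell out its exact procedure) is to relabel fragment identifiers in order of first appearance, which maps any grouping string to its canonical RG form without changing the partition it encodes:

```python
def rectify(p):
    """Convert an arbitrary grouping string into the canonical RG string
    for the same partition: fragment labels are renumbered in order of
    first appearance, so equivalent encodings collapse to one particle."""
    relabel, out, next_label = {}, [], 1
    for v in p:
        if v not in relabel:
            relabel[v] = next_label
            next_label += 1
        out.append(relabel[v])
    return out
```

On the example pair from Section 4.1, the non-RG string {4 4 2 3 4 4 2 1} is mapped to the RG string {1 1 2 3 1 1 2 4}, and a string that is already an RG string is left unchanged.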
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Implementation and experimental results</head><p>All algorithms have been implemented in Java. All experiments with the improved CPSO, GRGS-GA <ref type="bibr" target="#b1">[2]</ref>, and SGA were run under Windows XP on a desktop PC with an Intel Pentium 4 3.6 GHz processor. GRGS-GA (Group oriented Restricted Growth String) is a GA-based approach proposed by Jun Du et al. <ref type="bibr" target="#b1">[2]</ref> for database vertical partition. SGA is a simple GA that uses a random encoding scheme and classical genetic operators.</p><p>We divided the experimental section into two phases: a test phase and a comparison phase. In the test phase, we tried the improved CPSO on two cases: the first is an attribute usage matrix (AUM) of 10 attributes and 8 transactions, and the second is an AUM of 20 attributes and 15 transactions.</p><p>In the comparison phase, we compared the improved CPSO with the two other GA-based algorithms, GRGS-GA and SGA, on two larger cases generated with a pseudo-random AUM generator designed to produce large AUMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Case 1: 10-attribute example</head><p>In this case, we use an attribute usage matrix (AUM) with 10 attributes and 8 transactions. This AUM has already been used by other researchers, as described in Cornell and Yu (1990), Navathe et al. (1984), Song and Gorla (2000), and Muthuraj et al. (1993). Muthuraj et al. found that for this AUM the best PE value is 5820, which gives a fragmentation into 3 fragments: {1 5 7} {2 3 8 9} {4 6 10}. The improved CPSO finds the best fragmentation in the 4th iteration, as illustrated in figure <ref type="figure">2</ref>. The algorithm is executed 10 times; Figure <ref type="figure" target="#fig_1">3</ref> shows the optimal cost found in each trial. The above-mentioned 3-fragment partition is evaluated to have a PE value of 5820, so we consider a trial a success if its final partition has a PE value less than or equal to 5820. The success rate of the improved CPSO is 100%. Another interesting statistic is that the average number of iterations needed to reach the optimal solution is 6.5. The improved CPSO thus performs well in terms of fitness and convergence speed. The AUM for this case (Table 1) is:</p><formula xml:id="formula_8">AUM:  1  2  3  4  5  6  7  8  9 10
T1:  25  0  0  0 25  0 25  0  0  0
T2:   0 50 50  0  0  0  0 50 50  0
T3:   0  0  0 25  0 25  0  0  0 25
T4:   0 35  0  0  0  0 35 35  0  0
T5:  25 25 25  0 25  0 25 25 25  0
T6:  25  0  0  0 25  0  0  0  0  0
T7:   0  0 25  0  0  0  0  0 25  0
T8:   0  0 15 15  0 15  0  0 15</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Case 2: 20-attribute example</head><p>In this case, we use an AUM of 20 attributes and 15 transactions. The improved CPSO algorithm is executed 100 times. Figure <ref type="figure">5</ref> shows the optimal cost found in each trial. The best 4-fragment partition is evaluated to have a PE value of 4644, so we consider a trial a success if its final partition has a PE value less than or equal to 4644. The success rate of the improved CPSO is 100%. Another interesting statistic is that the average number of generations needed to reach the optimal solution is 30.4. Again, the improved CPSO performs well in terms of fitness and convergence speed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Case 3: 20-attribute pseudo-random AUM</head><p>In this case, we compare the improved CPSO algorithm with the two other GA-based algorithms, GRGS-GA and SGA, on a pseudo-random attribute usage matrix of 20 attributes and 20 transactions. This AUM is generated using a pseudo-random AUM generator designed to produce large usage matrices. The number of fragments anticipated is 20. The value of the best fitness in this case is 0. The average numbers of generations needed to reach the optimal solution are 30.4, 45.6, and 296.2 for the improved CPSO, GRGS-GA, and SGA, respectively. Apparently, SGA performs worst among the three in terms of fitness and convergence speed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Case 4: 50-attribute example</head><p>In this case, we apply the improved CPSO described in Section 4 to a large usage matrix. This AUM has 50 attributes and 50 transactions, and the number of fragments anticipated is 50. The value of the best fitness is 0. Figure <ref type="figure">7</ref> shows that the improved CPSO reaches the best fragmentation in 193 generations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Results analysis</head><p>As the number of attributes grows, the improved CPSO becomes harder to converge because of the increased complexity of the partition problem. Nevertheless, every improved CPSO trial finds the optimal partition known when the usage matrix is generated, and the convergence speed of the improved CPSO is well ahead of the two other GA-based algorithms. So the advantage of using the improved CPSO over the other two GAs is apparent. Figure <ref type="figure">6</ref> shows the PE values for the 20-attribute AUM. Again from this graph, we conclude that the improved CPSO is the best among the three and that GRGS-GA is better than SGA in terms of convergence speed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>Vertical database partition is a significant problem for database transaction performance. In this article, we proposed a PSO-based solution. In particular, this solution features two new attempts: first, an RG string constraint is applied to overcome the redundant encoding of previous GAs for the partition problem and similar ones; second, a comparison is used to evaluate the performance of the improved CPSO against two other GA-based algorithms, GRGS-GA <ref type="bibr" target="#b1">[2]</ref> and a simple GA called SGA, in terms of convergence speed and best fragmentation results. The success of using the improved CPSO to solve the vertical partition problem suggests that it may be used to solve other clustering or grouping problems.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. An example of solution representation with RG constraint.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. Optimal solutions found in 10 trials case of 10 attributes</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 4. Fig. 5.</head><label>45</label><figDesc>Fig. 4. The trends of PE values of best particles in each iteration, case of 20 attributes</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 6. Fig. 7.</head><label>67</label><figDesc>Fig. 6. The trends of PE values of best particles for Improved CPSO, GRGS-GA and SGA on the 20-attribute pseudo AUM</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Table 1. 10-attribute Matrix. Fig. 2. PE values of best particles in each iteration</figDesc><table /></figure>
		</body>
		<back>


			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Principles of distributed database systems</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Ozsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Valduriez</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>Prentice Hall</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Genetic algorithms based approach to database vertical partition</title>
		<author>
			<persName><forename type="first">Jun</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Reda</forename><surname>Alhajj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ken</forename><surname>Barker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Intelligent Information Systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="167" to="183" />
			<date type="published" when="2006-03">March 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Combinatorial particle swarm optimization (CPSO) for partitional clustering problem</title>
		<author>
			<persName><forename type="first">B</forename><surname>Jarboui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cheikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Siarry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rebai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Mathematics and Computation</title>
		<imprint>
			<biblScope unit="volume">192</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="337" to="345" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">A formal approach to the vertical partition problem in distributed database design</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakravarthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Muthuraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Varadarjan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Navathe</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1992">1992</date>
			<pubPlace>Gainesville, Florida</pubPlace>
		</imprint>
		<respStmt>
			<orgName>CIS Department, University of Florida,</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Simple combinatorial gray codes constructed by reversing sublists</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ruskey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Algorithms and Computation</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1993">1993</date>
			<biblScope unit="volume">762</biblScope>
			<biblScope unit="page" from="201" to="208" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A genetic algorithm for vertical fragmentation and access path selection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gorla</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Computer Journal</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="81" to="93" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Vertical partition for database design: A graphical algorithm</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Navathe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="440" to="450" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Particle swarm optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kennedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Eberhart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE International Conference on Neural Networks</title>
				<meeting>the IEEE International Conference on Neural Networks</meeting>
		<imprint>
			<date type="published" when="1995">1995</date>
			<biblScope unit="volume">IV</biblScope>
			<biblScope unit="page" from="1942" to="1948" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
