Mining High Quality Association Rules Using Genetic Algorithms

Peter P. Wakabi-Waiswa and Venansius Baryamureeba
Faculty of Computing & Information Technology, Makerere University, Kampala, Uganda
Email: pwakabi@hotmail.com, barya@cit.mak.ac.ug

Abstract

The association rule mining (ARM) problem is a structured mechanism for unearthing hidden facts in large data sets and for drawing inferences on how a subset of items influences the presence of another subset. ARM is computationally very expensive because the number of rules grows exponentially as the number of items in the database increases, and this growth is exacerbated further when the data dimensions increase. The problem becomes more complex still when different rule quality metrics have to be taken into account. In this paper we propose a genetic algorithm (GA) that generates high quality association rules with five rule quality metrics. We study the performance of the algorithm, and the experimental results show that it produces high quality rules in good computational times.

Keywords: confidence; support; interestingness; lift; J-Measure; genetic algorithms

1 Introduction

Since the association rule mining problem was proposed by (Agrawal, Imielinski, & Swami 1993), several algorithms have been developed for it. In the most commonly used approach, the rule generation process is split into two separate steps. The first step applies a minimum support threshold to find all frequent itemsets in the database, and the second applies a minimum confidence constraint to the frequent itemsets to form the rules. The first step is computationally very expensive because finding all frequent itemsets involves searching over all possible itemsets, and the set of all possible itemsets grows exponentially with the number of items in the database. This exponential growth profoundly affects the performance of association rule mining algorithms (Agrawal, Imielinski, & Swami 1993).

Beyond these difficulties, most existing algorithms have been designed to produce simple, easy to understand association rules, and they measure the quality of the generated rules by considering only one or two evaluation criteria, most commonly the confidence factor or predictive accuracy (Dehuri et al. 2008). Such algorithms provide useful tools for descriptive data mining, but there are several other measures of rule quality, such as comprehensibility, J-measure, surprise, gain, chi-squared value, gini, entropy gain, Laplace, lift and conviction, that can be used to evaluate the rules (Carvalho, Freitas, & Ebecken 2005), (Freitas 1999).

Quite a number of research works have been carried out in this arena, but the results indicate that more innovative approaches are needed, with a view to finding algorithms that can handle multiple and growing sets of rule quality metrics while also improving algorithm efficiency (Hruschka et al. 1999), (Kotsiantis & Kanellopoulos 2006).

In this paper we propose a Multi-Objective Genetic Algorithm for Mining Association Rules (MOGAMAR), which generates association rules scored on five rule quality metrics: confidence, support, interestingness, lift and J-Measure. This permits the user to evaluate association rules on these different quality metrics in a single algorithm run. The main motivation for using a genetic algorithm (GA) is that a GA performs a global search and copes better with attribute interaction than the greedy rule induction algorithms often used in data mining tasks (Freitas 2007). Genetic algorithms are also robust, with little likelihood of getting stuck in local optima, and they are highly parallel in nature, making them good candidates for distributed implementations (Liu & Kwok 2000a).

The rest of this paper is organized as follows. We present a detailed description of the proposed algorithm in Section 2. In Section 3 we evaluate the performance of the algorithm, and in Section 4 we conclude the paper.

2 The Multi-Objective Genetic Algorithm for Mining Association Rules

In this work we use the underlying structure of the object-oriented genetic algorithm proposed in (Davis 1991), with modifications to the representation of the individuals. In the rest of this section we give a detailed description of the proposed algorithm.

2.1 Representation of the Rules

We adopted a modified Michigan approach proposed in (Ghosh & Nath 2004), in which the encoding/decoding scheme associates two bits with each attribute in the database. If these two bits are 00, the attribute that follows them appears in the antecedent part of the rule; if they are 11, the attribute appears in the consequent part. The other two combinations, 01 and 10, indicate the absence of the attribute from either part. As an example, suppose there is a rule ACF → BE; it is represented as shown in Figure 1.

00 A 11 B 00 C 01 D 11 E 00 F

Figure 1: Modified Michigan Rule Representation

Following this approach the algorithm can handle variable-length rules with good storage efficiency, adding an overhead of only 2k bits, where k is the number of attributes in the database.
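As an illustration of the two-bit scheme just described, the following sketch (ours, not the authors' implementation) decodes a sequence of (bit-pair, attribute) entries back into antecedent and consequent sets; the pair-list representation is an assumption made for readability:

```python
# Illustrative sketch of the two-bit Michigan encoding of Section 2.1:
# 00 places the following attribute in the antecedent, 11 in the
# consequent, and 01/10 mean the attribute is absent from the rule.

ANTECEDENT, CONSEQUENT = "00", "11"

def decode_rule(encoding):
    """Split a list of (bits, attribute) pairs into antecedent and consequent."""
    antecedent, consequent = [], []
    for bits, attribute in encoding:
        if bits == ANTECEDENT:
            antecedent.append(attribute)
        elif bits == CONSEQUENT:
            consequent.append(attribute)
        # "01" or "10": the attribute does not appear in the rule
    return antecedent, consequent

# The rule ACF -> BE over attributes A..F (cf. Figure 1):
rule = [("00", "A"), ("11", "B"), ("00", "C"),
        ("01", "D"), ("11", "E"), ("00", "F")]
print(decode_rule(rule))  # (['A', 'C', 'F'], ['B', 'E'])
```

Note that a genotype of 2k bit-pairs over the k database attributes suffices to express any rule, which is the storage overhead mentioned above.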
The downside of this approach is that it is not well suited to handling continuous-valued attributes. For real-valued attributes we have therefore incorporated the discretization method proposed in (Kwedlo & Kretowski 1999). The combination of these two approaches enabled us to represent the association rules uniformly using the Michigan approach. The decoding is performed as follows:

DV = mnv + (mxv - mnv) * [ Σ_{i=1}^{n} 2^(i-1) * bit_i ] / (2^n - 1)    (1)

where DV is the decoded value; 1 ≤ i ≤ n, with n the number of bits used for the encoding; mnv and mxv are the minimum and maximum values of the attribute, respectively; and bit_i is the value of the bit in position i.

It is important to note that, with the rule encoding used in this algorithm, the consequent part of a rule is not important at the start of a run, because the consequent should not be randomly determined. The consequent part is determined when the fitness of the rule is computed.
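Equation (1) can be checked with a small numeric sketch (our reading of the formula, not the authors' code): an all-zeros bit string decodes to the attribute minimum and an all-ones string to the maximum, with intermediate bit patterns spread linearly in between.

```python
# Sketch of equation (1): decode an n-bit string into a real value
# in [mnv, mxv], where bits[0] holds the bit for position i = 1.

def decode_value(bits, mnv, mxv):
    """DV = mnv + (mxv - mnv) * (sum_i 2^(i-1) * bit_i) / (2^n - 1)."""
    n = len(bits)
    weighted = sum(2 ** (i - 1) * b for i, b in enumerate(bits, start=1))
    return mnv + (mxv - mnv) * weighted / (2 ** n - 1)

print(decode_value([0, 0, 0, 0], 10.0, 20.0))  # 10.0 (all zeros -> mnv)
print(decode_value([1, 1, 1, 1], 10.0, 20.0))  # 20.0 (all ones  -> mxv)
```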
2.2 Initialization

We avoid generating the initial population purely at random because a random population may contain rules that cover no training instance and therefore have very low fitness. Furthermore, a population of rules that are guaranteed to cover at least one training instance can lead to over-fitting the data. It was shown in (Surry & Radcliffe 1996) that non-random initialization can improve the quality of the solution and drastically reduce the runtime. We therefore designed an initialization method in which a training instance is chosen to act as a "seed" for rule generation, as proposed in (Freitas 2002). In our initialization approach, a seed should be a data instance lying in the middle of a cluster of examples with a common consequent. The training examples are stored in an array, and for each example we iteratively calculate the ratio ρ of the examples that share its consequent to those that negate it (same-consequent/opposite-consequent), as in (2):

ρ = Count(Same_consequent) / Count(Opposite_consequent)    (2)

where Count(Same_consequent) denotes the number of same-consequent examples and Count(Opposite_consequent) the number of opposite-consequent examples. The training example with the highest ρ is selected as the seed.

Fitness Evaluation  MOGAMAR performs the fitness evaluation of the generated rules using a set of five complementary metrics: confidence, support, interestingness, lift and J-Measure. These metrics are combined into an objective fitness function with user-defined weights. The support σ(X) of an itemset X is defined as the proportion of transactions in the data set that contain the itemset. The confidence factor, or predictive accuracy, of a rule is the conditional probability of the consequent given the antecedent, calculated as in (3):

confidence = σ(X ∪ Y) / σ(X)    (3)

We adopt the interestingness calculation proposed in (Freitas 1998). The algorithm first calculates the information gain InfoGain(A_i) of each attribute A_i. The interestingness of the rule antecedent, RAI, is then calculated by an information-theoretical measure as in (4):

RAI = 1 - [ ( Σ_{i=1}^{n} InfoGain(A_i) ) / n ] / log2(|G_k|)    (4)

The degree of interestingness of the rule consequent, CAI, is given by (5):

CAI = (1 - Pr(G_kl))^(1/β)    (5)

where Pr(G_kl) is the prior probability of the goal attribute value G_kl, β is a user-specified parameter, and the exponent 1/β reduces the influence of the rule-consequent interestingness on the value of the fitness function. The interestingness of the rule is given by (6):

interestingness = (RAI + CAI) / 2    (6)

The J-Measure expresses how dissimilar the a priori and a posteriori beliefs about a rule are; useful rules imply a high degree of dissimilarity. In rule inference we are interested in the distribution of the rule "implication" variable Y, with its two events y and the complement ȳ. The purpose is to measure the difference between the a priori distribution f(y), i.e. f(Y = y) and f(Y ≠ y), and the a posteriori distribution f(y | x). The J-Measure is calculated as (7):

JM = f(x) * [ f(y|x) * ln( f(y|x) / f(y) ) + (1 - f(y|x)) * ln( (1 - f(y|x)) / (1 - f(y)) ) ]    (7)

Lift is the ratio of the observed support to that expected if X and Y were statistically independent, and is defined as (8):

lift(X → Y) = σ(X ∪ Y) / (σ(X) * σ(Y))    (8)

Finally, the fitness function is calculated as the weighted arithmetic average of confidence, support, interestingness, lift and J-Measure. The fitness function f(x) is given by (9):

f(x) = (ws*S + wc*C + wi*I + wl*L + wJ*JM) / (ws + wc + wi + wl + wJ)    (9)

where S denotes support, C confidence, I interestingness, L lift, JM the J-Measure, and ws, wc, wi, wl, wJ their respective user-defined weights. The chromosomes are then ranked according to their fitness.

2.3 Reproduction

The reproduction mechanism involves rule selection followed by the application of the crossover and mutation operators. The rule selection method used by this algorithm follows the "universal suffrage" approach proposed in (Giordana et al. 1997). With this approach each association rule is represented by a single individual, and the individuals to be mated are elected by the training data instances: each instance votes, in a stochastic fitness-based way, for one rule that covers it. As an example, assume we have a set R of rules (chromosomes) that cover a given instance i, i.e. rules whose antecedent and consequent are both satisfied by i. The instance i then votes for one of the rules in R using a roulette-wheel selection scheme: each rule r in R is assigned a roulette-wheel slot whose size is proportional to the fitness of r divided by the sum of the fitnesses of all rules in R. The better the fitness of a rule r, the higher its probability of being selected over the other rules covering instance i. If no rule covers i, the algorithm automatically creates a new rule using the seeding operator employed at the population initialization stage.

Since only the rules that cover a given instance compete with each other, this results in a form of niching. Niching is a mechanism through which evolutionary algorithms form and maintain subpopulations, or niches. It fosters the evolution of several different rules, each covering a different part of the data being mined, and helps prevent convergence of the population to a single rule, resulting in the discovery of a set of rules rather than a single one. The actual reproduction takes place by performing multi-point crossover and mutation on the new individuals.

The Crossover Operator  We modified the standard crossover operator either to generalize a rule if it is too specific or to specialize it if it is too general. A rule is considered too specific if it covers too few data instances, i.e. when too few data instances satisfy both its antecedent and its consequent; in contrast, a rule is considered too general when it covers too many data instances. We use the bitwise OR and the bitwise AND to implement generalization and specialization, respectively. Both are applied to the antecedent part of the rule, since this determines the consequent part (Giordana et al. 1997).

The generalization/specialization crossover procedure first constructs an index I = {i_1, ..., i_n} of pointers to the positions in the chromosome (bit-string) where the corresponding bits in the two parents have different values. Then, for every element i_k ∈ I, the following procedure is repeated: if the rule is too specific, the algorithm replaces the value of the bit b(i_k) with the logical OR of the corresponding bits in the parents (generalization); otherwise, if the rule is too general, it replaces the value of the bit b(i_k) with the logical AND of the corresponding bits in the parents (specialization). The crossover operator is equipped with a mechanism for detecting and eliminating invalid genotypes.

The Mutation Operator  The mutation operator helps maintain diversity within the population and prevents premature convergence to local optima (Liu & Kwok 2000b). It needs to be designed in such a way that the population does not become dominated by a single highly fit individual. Our approach is to use an adaptive mutation probability whose value is varied as the population becomes dominated by an individual. We adopted the non-uniform mutation operator proposed in (Michalewicz 1999), which adapts to the environment by varying as the fitness of the individuals in the population changes, and we modified it so that it can generalize and/or specialize a rule condition. The mutation operator randomly selects a condition from the rule. If the condition involves a nominal attribute, its value is randomly changed to another value; if the attribute is continuous, mutation randomly changes the interval values of the condition. The specialization mutation operator works on a randomly selected condition in the rule; if the condition involves a continuous attribute, specialization shrinks the interval. The mutation operator used here ensures that high mutation probabilities do not cause the loss of the fittest individual of a generation. Furthermore, the mutation operator can cause the undesired effect of changing a rule's consequent if it is applied before the consequent is generated, but in our case the consequent is generated after the mutation operator has been applied.

2.4 Replacement

We use an elitist individual replacement approach that ensures that fitter genotypes are always introduced into the population.

Uniqueness testing  The application of the genetic operators on the parent population may produce genotypes identical to ones already in the population. The algorithm therefore first checks that new offspring do not duplicate any existing member of the population. Uniqueness testing imposes a computational overhead, but this is compensated by a reduction in the number of genotype evaluations required, because the check for duplicates can be performed before fitness evaluation; the resulting saving in fitness evaluations significantly increases efficiency. Furthermore, a replacement strategy with genotype uniqueness enforced preserves genetic diversity within the population (Lima et al. 2008). Genetic diversity is significant because crossover-type operators depend on recombining genetically dissimilar genotypes to make fitness gains. Uniqueness of the genotypes also permits the algorithm to find not only the single best individual but the n best genotypes in a single run, where n is the population size. Finding the best n individuals imposes some computational load, but this is compensated by the generation of n high-fitness individuals in a single program run.

Selection Fitness Calculation  Using rank-based linear normalization, we convert the fitness values into selection-fitness values: a linear increment is used to set the selection-fitness values of the genotypes based on their rank. The main reason for using rank-based linear normalization is that it provides explicit control over the selection pressure applied to the population and reduces the likelihood that the population will converge prematurely on a sub-optimal solution (Metzler 2005). This helps avoid problems caused by super-individuals dominating the selection process.

2.5 Criteria for Termination

The algorithm terminates when the degeneracy condition is met, i.e. when the best and worst performing chromosomes in the population differ by less than 0.1%. It also terminates when the total number of generations specified by the user is reached.
3 Algorithm Performance

In this section we evaluate the performance of MOGAMAR and compare it to that of the Data Mining by Evolutionary Learning (DMEL) algorithm (Au, Chan, & Yao 2003). DMEL is known to be one of the best performing evolutionary algorithms for association rule mining (Reynolds & de la Iglesia 2008). The performance characteristics studied are the quality of the solutions found and CPU utilization.

3.1 Datasets

To evaluate our algorithm we used the Adult, Connect-4, Chess, Diabetes, DNA and Mushroom datasets from the UCI repository of machine learning databases (Frank & Asuncion 2010). We also used the Motor Vehicle Licensing System (MVLS) database from the Uganda Revenue Authority (URA), which we processed for use in the experiments. The MVLS database contains data pertaining to motor vehicle licensing fees, vehicle ownership and registration, transfer of ownership, accidents, usage, country of origin, and date or year of manufacture. The MVLS has over 4 million records, but we randomly selected 441,327 transactions for these experiments. A summary of the datasets is given in Table 1.

Dataset     No. of Instances   Attributes
Adult            48,842            14
Chess             3,196            36
Connect-4        67,557            42
Diabetes            768            20
DNA               3,190            62
Mushroom          8,124            22
MVLS            441,327            32

Table 1: Summary of the Datasets

3.2 Relative Performance of the Algorithms

The summary of the results of our experiments with fixed parameters is given in Table 2. The results include the average rule quality metrics and CPU utilization.

                 MOGAMAR                DMEL
Dataset     Average    Time       Average    Time
            Quality    (Secs)     Quality    (Secs)
Adult        89.45      1.35       89.7       1.34
Connect-4    87.3      12          88.2      19
Chess        97.35      0.01       95.6       0.015
Diabetes     87.75      0.015      79.8       0.017
DNA          96.0       1.23       95.4       1.21
Mushroom     89.4       0.03       88.5       0.04
MVLS         82.4     900          83.6     903

Table 2: Quality of rules and run times

From Table 2, we observe that:
1. The discovered rules have a very high average quality value for all datasets, implying that both algorithms perform well.
2. The algorithms are generally fast for the smaller datasets, with average CPU times increasing for the larger ones.
3. The algorithms generally produce poorer quality rules for large datasets.
Our overall observation is that the quality of the generated rules decreases as the dataset dimension increases.

It can also be observed that the algorithms consumed quite a bit of time on the Connect-4 and MVLS datasets. These two datasets have a much larger number of transactions and attributes than the rest. We further observed that the rules with high confidence within these datasets have very low support. For instance, the rules that have a 'tie' as their consequent and a confidence of 100% in the Connect-4 dataset have a support of only 14 records. This profoundly affects the time it takes to process the records; when we set a low minimum support, the algorithm response time improves greatly.

3.3 Sensitivity to Dataset Size

In Section 3.2 we used fixed dataset sizes to assess the performance of the algorithms. With a fixed dataset size it may not be possible to gauge how well an algorithm performs as the data grow, so we now study the performance of the algorithm with increasing dataset sizes. Figure 2 summarizes the trends in the performance of the algorithms with varying dataset sizes.

Figure 2: Relative Performance of the algorithms

From Figure 2 we observe that an increase in dataset size leads to an increase in the average response time of the algorithms. Furthermore, when the data are increased two-fold, there is a multiplier effect on the response time. This implies that the performance of an ARM algorithm depends strongly on the size of the data being mined.

3.4 Sensitivity to Genetic Operators and Parameters

The experiments in Section 3.2 were carried out with specific parameter settings. Variations in the parameter values can lead to either an improvement or a deterioration in the performance of the algorithms. In this subsection we therefore study the effect of the crossover and mutation operators on the performance of the algorithms in more depth.

Effect of Crossover Operator  We studied the performance of MOGAMAR and DMEL with different values of the crossover rate. We observed that the overall performance of both algorithms under different crossover rates differs only slightly from that obtained with a fixed crossover rate. This implies that the crossover rate does not have a very profound effect on the performance of the algorithm.

Effect of Mutation Operator  Figure 3 shows the performance of the algorithms with varying mutation rates.

Figure 3: Relative Performance of MOGAMAR and DMEL for different mutation rates

It can be seen that performance was drastically affected by increasing mutation rates, implying that the mutation operator has a high impact on the performance of genetic algorithms. As the mutation rate increased, CPU utilization increased drastically, showing a big reduction in algorithm efficiency, and the disparity in the characteristics of the two algorithms became more evident. The main cause of this phenomenon is that the mutation operator reintroduces useful genotypes into the population from a diverse pool of parents. This increases the diversity of the population, because each structure in the new population undergoes a random change with a probability equal to the mutation rate; it also prevents members of the population from converging to a single very fit individual and as such increases the required number of evaluations. This implies that the mutation operator needs to be investigated more thoroughly to establish the most suitable rates.

4 Conclusion

In this paper we have proposed a new association rule mining algorithm, MOGAMAR. The approach incorporates a novel population initialization technique that ensures the production of high quality individuals; specifically designed breeding operators that eliminate defective genotypes; an adaptive mutation probability that maintains the genetic diversity of the population; and uniqueness testing. The performance of MOGAMAR has been compared with that of DMEL, an efficient ARM algorithm, on a number of benchmark datasets, with experimental results showing that our proposed approach yields equally good performance with consistently high quality rules. MOGAMAR provides the user with rules scored on five interestingness metrics, and this set can easily be extended by modifying the fitness function. We studied the effect of parameters on the performance of the algorithm, specifically the dataset size and the crossover and mutation rates. We observed that the algorithm performs poorly for large dataset sizes, and the poor performance becomes more evident as the datasets grow; this has been identified as an area requiring more research effort. It was further observed that the crossover rate does not have a big impact on performance while the mutation rate does, indicating that methods are needed for finding mutation rates that favour the performance of the algorithm. This is another area we shall be researching.
We have used a weighted fitness function for finding the best rules, but this approach may not be the best when there are several solution-quality criteria to be evaluated, because such criteria may be non-commensurate or conflicting, especially when they evaluate different aspects of a candidate solution. In our future work we shall therefore look into the possibility of modeling the ARM problem differently.

References

Agrawal, R.; Imielinski, T.; and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., 26–28.

Au, W.; Chan, K.; and Yao, X. 2003. A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7(6).

Carvalho, D. R.; Freitas, A. A.; and Ebecken, N. F. F. 2005. Evaluating the correlation between objective rule interestingness measures and real human interest. In PKDD, 453–461.

Davis, L. 1991. Handbook of Genetic Algorithms. VNR Computer Library, Van Nostrand Reinhold, New York.

Dehuri, S.; Patnaik, S.; Ghosh, A.; and Mall, R. 2008. Application of elitist multi-objective genetic algorithm in classification rule generation. Applied Soft Computing Journal 8:477–487.

Frank, A., and Asuncion, A. 2010. UCI machine learning repository.

Freitas, A. A. 1998. On objective measures of rule surprisingness. In Principles of Data Mining and Knowledge Discovery, 2nd European Symposium, PKDD'98, Nantes, France, 1083–1090.

Freitas, A. A. 1999. On rule interestingness measures. Knowledge-Based Systems 12:309–315.

Freitas, A. A. 2002. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag, Berlin Heidelberg.

Freitas, A. A. 2007. A review of evolutionary algorithms for data mining. Soft Computing, Knowledge Discovery and Data Mining.

Ghosh, A., and Nath, B. 2004. Multi-objective rule mining using genetic algorithms. Information Sciences 163:123–133.

Giordana, A.; Anglano, C.; Bello, G. L.; and Saitta, L. 1997. A network genetic algorithm for concept learning. In Proceedings of the Sixth International Conference on Genetic Algorithms, 436–443. Morgan Kaufmann.

Hruschka, E. R.; Campello, R. J.; Freitas, A. A.; and de Carvalho, A. C. P. L. F. 1999. Survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man, and Cybernetics 12:309–315.

Kotsiantis, S., and Kanellopoulos, D. 2006. Association rules mining: A recent overview. GESTS International Transactions on Computer Science and Engineering 32(1):71–82.

Kwedlo, W., and Kretowski, M. 1999. An evolutionary algorithm using multivariate discretization for decision rule induction. In Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '99).

Lima, C. F.; Pelikan, M.; Goldberg, D.; Lobo, F. G.; Sastry, K.; and Hauschild, M. 2008. Influence of selection and replacement strategies on linkage learning in BOA. In 2007 IEEE Congress on Evolutionary Computation (CEC 2007), 1083–1090.

Liu, J., and Kwok, J. 2000a. An extended genetic rule induction algorithm. In Proceedings of the 2000 Congress on Evolutionary Computation.

Liu, J., and Kwok, J. 2000b. An extended genetic rule induction algorithm. In Proceedings of the 2000 Congress on Evolutionary Computation, volume 1, 458–463.

Metzler, D. 2005. Direct maximization of rank-based metrics. Technical report, University of Massachusetts, Amherst.

Michalewicz, Z. 1999. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, New York.

Reynolds, A., and de la Iglesia, B. 2008. A multi-objective GRASP for partial classification. Soft Computing 13.

Surry, P., and Radcliffe, N. 1996. Inoculation to initialize evolutionary search. In Fogarty, T. C., ed., Evolutionary Computing: AISB Workshop, Springer, Brighton, U.K.