Comparison of Heuristics for Optimization of Association Rules

Fawaz Alsolami^1, Talha Amin^1, Mikhail Moshkov^1, and Beata Zielosko^2

^1 Computer, Electrical and Mathematical Sciences and Engineering Division,
King Abdullah University of Science and Technology,
Thuwal 23955-6900, Saudi Arabia
{fawaz.alsolami, talha.amin, mikhail.moshkov}@kaust.edu.sa
^2 Institute of Computer Science, University of Silesia,
39, Bedzinska St., 41-200 Sosnowiec, Poland
beata.zielosko@us.edu.pl

Abstract. In this paper, five greedy heuristics for the construction of association rules are compared from the point of view of the length and coverage of the constructed rules. The obtained rules are also compared with optimal ones constructed by dynamic programming algorithms. The average relative difference between the length of rules constructed by the best heuristic and the minimum length of rules is at most 4%. The situation with coverage is similar.

Key words: greedy heuristics, association rules, decision rules, dynamic programming, rough sets

1 Introduction

Association rule mining is one of the important fields of data mining and knowledge discovery. It aims to extract interesting correlations, associations, or frequent patterns among sets of items in a data set.

There are many algorithms for the construction of association rules. One of the most popular is the Apriori algorithm based on frequent itemsets [1]. Over the years, many new algorithms have been designed based on, e.g., a hash-based technique [15], partitioning of the data [18], and others [7, 10, 19].

The most popular measures for mining association rules are support and confidence [9]; in this paper, however, length and coverage are considered as rule evaluation measures. The choice of length is connected with the Minimum Description Length principle [17]: shorter rules are easier for experts to understand and interpret. The search for rules with large coverage allows us to discover major patterns in the data, which is important from the point of view of knowledge representation.

In this paper, greedy algorithms for the construction of association rules are studied, since the problems of constructing rules with minimum length or maximum coverage are NP-hard [6, 12, 14]. Most approaches, with the exception of brute force, the Apriori algorithm, and extensions of dynamic programming, cannot guarantee the construction of optimal rules (i.e., rules with minimum length or maximum coverage). In [12], it was shown, based on results of U. Feige [8], that, under reasonable assumptions about the class NP, a certain greedy algorithm is close to the best polynomial approximate algorithms for minimization of association rule length. We do not know of similar results for coverage.

Applications of rough set theory to the construction of rules for knowledge representation or classification tasks are usually connected with the use of a decision table [16] as the form of input data representation. In such a table, one attribute is distinguished as a decision attribute and corresponds to a rule's consequence. In recent years, however, the associative mechanism of rule construction, where every attribute can occur in the premises or the consequence of a rule, has become popular. Association rules can be defined in many ways. In this paper, a special kind of association rules is studied: rules that correspond to decision rules. A similar approach was considered in [12, 13], where a greedy algorithm for minimization of the length of association rules was investigated.
In this paper, we consider five greedy heuristics for the construction of association rules and compare them from the point of view of the length and coverage of the constructed rules. We also compare the obtained rules with optimal ones constructed by dynamic programming algorithms. We show that the average relative difference between the length of rules constructed by the best heuristic and the minimum length of rules is at most 4%. The situation with coverage is similar.

The paper consists of five sections. Section 2 contains main notions. In Sect. 3, we discuss the five greedy heuristics. Section 4 contains experimental results for decision tables from the UCI Machine Learning Repository, and Sect. 5 contains short conclusions.

2 Main Notions

An information system I is a rectangular table with n+1 columns labeled with attributes f_1, ..., f_{n+1}. Rows of this table are filled with nonnegative integers which are interpreted as values of attributes. An association rule for I is a rule of the kind

(f_{i_1} = a_1) ∧ ... ∧ (f_{i_m} = a_m) → f_j = a,

where f_j ∈ {f_1, ..., f_{n+1}}, f_{i_1}, ..., f_{i_m} ∈ {f_1, ..., f_{n+1}} \ {f_j}, and a, a_1, ..., a_m are nonnegative integers.

The notion of an association rule for I is based on the notions of a decision table and a decision rule. We consider two kinds of decision tables: with many-valued decisions and with single-valued decisions.

A decision table with many-valued decisions T is a rectangular table with n columns labeled with (conditional) attributes f_1, ..., f_n. Rows of this table are pairwise different and are filled with nonnegative integers which are interpreted as values of conditional attributes. Each row r is labeled with a finite nonempty set D(r) of nonnegative integers which are interpreted as decisions (values of a decision attribute). For a given row r of T, it is necessary to find a decision from the set D(r).

A decision table with single-valued decisions T is a rectangular table with n columns labeled with (conditional) attributes f_1, ..., f_n. Rows of this table are pairwise different and are filled with nonnegative integers which are interpreted as values of conditional attributes. Each row r is labeled with a nonnegative integer d(r) which is interpreted as a decision (value of a decision attribute). For a given row r of T, it is necessary to find the decision d(r). Decision tables with single-valued decisions can be considered as a special kind of decision tables with many-valued decisions in which D(r) = {d(r)} for each row r.

For each attribute f_i ∈ {f_1, ..., f_{n+1}}, the information system I is transformed into a table I_{f_i}. The column f_i is removed from I, and a table with n columns labeled with attributes f_1, ..., f_{i-1}, f_{i+1}, ..., f_{n+1} is obtained. Values of the attribute f_i are attached to the rows of the obtained table I_{f_i} as decisions. The table I_{f_i} can contain equal rows. We transform this table into two decision tables: one with many-valued and one with single-valued decisions. A decision table I_{f_i}^{m-v} with many-valued decisions is obtained from the table I_{f_i} by replacing each group of equal rows with a single row from the group, labeled with the set of decisions attached to all rows from the group. A decision table I_{f_i}^{s-v} with single-valued decisions is obtained from the table I_{f_i} by replacing each group of equal rows with a single row from the group, labeled with the most common decision for this group. The set {I_{f_1}^{m-v}, ..., I_{f_{n+1}}^{m-v}} of decision tables with many-valued decisions obtained from the information system I is denoted by Φ^{m-v}(I). We denote by Φ^{s-v}(I) the set {I_{f_1}^{s-v}, ..., I_{f_{n+1}}^{s-v}} of decision tables with single-valued decisions obtained from the information system I.
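This transformation is straightforward to implement. The following is a minimal sketch in Python (our own illustration: the function name decision_tables and the chosen data representation are assumptions, not part of any published implementation):

from collections import Counter

def decision_tables(rows, j):
    # rows: list of tuples over f_1, ..., f_{n+1} (nonnegative integers);
    # j: 0-based index of the attribute f_{j+1} chosen as the decision attribute.
    # Group the rows of I_{f_{j+1}} (column j removed) by their conditional
    # part and collect the removed values as decisions.
    groups = {}
    for row in rows:
        cond = row[:j] + row[j + 1:]
        groups.setdefault(cond, []).append(row[j])
    # Many-valued table: each group of equal rows becomes a single row
    # labeled with the set of all decisions occurring in the group.
    many_valued = {cond: set(ds) for cond, ds in groups.items()}
    # Single-valued table: the same row is labeled with the most common
    # decision of the group instead.
    single_valued = {cond: Counter(ds).most_common(1)[0][0]
                     for cond, ds in groups.items()}
    return many_valued, single_valued

Applying decision_tables to all n+1 choices of j would yield the sets Φ^{m-v}(I) and Φ^{s-v}(I).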
Since decision tables with single-valued decisions are a special case of decision tables with many-valued decisions, we formulate the notion of a decision rule for tables with many-valued decisions.

Let T ∈ Φ^{m-v}(I). For simplicity, let T = I_{f_{n+1}}^{m-v}. The attribute f_{n+1} will be considered as the decision attribute of the table T. We denote by N(T) the number of rows in the table T. For a decision a, we denote by N(T, a) the number of rows r of T such that a ∈ D(r), and we set M(T, a) = N(T) − N(T, a). A decision a is a common decision of T if a ∈ D(r) for each row r of T. We denote by E(T) the set of conditional attributes of T which are not constant on T. A table obtained from T by removing some rows is called a subtable of T. We denote by T(f_{i_1}, a_1), ..., (f_{i_m}, a_m) the subtable of T which consists of the rows that, at the intersection with the columns f_{i_1}, ..., f_{i_m}, have the values a_1, ..., a_m.

The expression

(f_{i_1} = a_1) ∧ ... ∧ (f_{i_m} = a_m) → f_{n+1} = a

is called a decision rule over T if f_{i_1}, ..., f_{i_m} ∈ {f_1, ..., f_n}, a_1, ..., a_m are values of the corresponding attributes, and a is a decision. We associate with the considered rule the subtable T′ = T(f_{i_1}, a_1), ..., (f_{i_m}, a_m) of the table T. This rule is called realizable for a row r of T if r belongs to T′. This rule is called true for T if a is a common decision of T′. We say that the considered rule is a rule for T and r if this rule is true for T and realizable for r. The number m is called the length of the rule. The coverage of the rule is the number of rows r of T′ for which a ∈ D(r). If the considered rule is a rule for T and r, then its coverage is equal to N(T′).

Decision rules which are true for decision tables from Φ^{m-v}(I) can be considered as association rules (in the many-valued decision model) that are true for the information system I. Decision rules which are true for decision tables from Φ^{s-v}(I) can be considered as association rules (in the single-valued decision model) that are true for the information system I.

3 Greedy Heuristics

We describe the work of the five greedy heuristics on the example of the table T = I_{f_{n+1}}^{m-v}. Let r = (b_1, ..., b_n) be a row of T and a be a decision from D(r). A heuristic H constructs a decision rule for T and r. It starts with a rule whose left-hand side is empty,

→ f_{n+1} = a,

and then sequentially adds conditions to the left-hand side of this rule. Suppose that, during the work of the heuristic H, we have already constructed the rule

(f_{i_1} = b_{i_1}) ∧ ... ∧ (f_{i_m} = b_{i_m}) → f_{n+1} = a.

We associate with this rule the subtable T′ = T(f_{i_1}, b_{i_1}), ..., (f_{i_m}, b_{i_m}) of the table T. If a is a common decision of T′, then the work of H is finished and the constructed rule is returned. Otherwise, we should select a new attribute f_{i_{m+1}} and construct a new rule

(f_{i_1} = b_{i_1}) ∧ ... ∧ (f_{i_m} = b_{i_m}) ∧ (f_{i_{m+1}} = b_{i_{m+1}}) → f_{n+1} = a.

Denote T′′ = T′(f_{i_{m+1}}, b_{i_{m+1}}),

M(f_{i_{m+1}}, r, a) = M(T′′, a) = N(T′′) − N(T′′, a),
RM(f_{i_{m+1}}, r, a) = (N(T′′) − N(T′′, a)) / N(T′′),
α(f_{i_{m+1}}, r, a) = N(T′, a) − N(T′′, a), and
β(f_{i_{m+1}}, r, a) = M(T′, a) − M(T′′, a).
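These quantities are easy to compute. A minimal sketch (assuming, as in the sketch above, that a table is stored as a list of pairs (row, D(r)); the helper names are ours, not the authors'):

def subtable(table, conditions):
    # table: list of (row, decisions) pairs, where decisions is the set D(r);
    # conditions: list of (attribute index, value) pairs.
    return [(r, D) for (r, D) in table
            if all(r[i] == v for (i, v) in conditions)]

def counts(table, a):
    # Returns N(T), N(T, a), and M(T, a) = N(T) - N(T, a).
    n = len(table)
    n_a = sum(1 for (_, D) in table if a in D)
    return n, n_a, n - n_a

def candidate_values(t1, t2, a):
    # t1 = T', t2 = T'' = T'(f, b); the row r belongs to T'', so N(T'') >= 1.
    _, n1_a, m1 = counts(t1, a)
    n2, n2_a, m2 = counts(t2, a)
    return (m2,           # M(f, r, a)  = M(T'', a)
            m2 / n2,      # RM(f, r, a) = M(T'', a) / N(T'')
            n1_a - n2_a,  # alpha(f, r, a) = N(T', a) - N(T'', a)
            m1 - m2)      # beta(f, r, a)  = M(T', a) - M(T'', a)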
We now describe how the five greedy heuristics select the attribute f_{i_{m+1}}.

Heuristic "M" selects an attribute f_{i_{m+1}} ∈ E(T′) which minimizes the value M(f_{i_{m+1}}, r, a).

Heuristic "RM" selects an attribute f_{i_{m+1}} ∈ E(T′) which minimizes the value RM(f_{i_{m+1}}, r, a).

Heuristic "maxCov" selects an attribute f_{i_{m+1}} ∈ E(T′) which minimizes the value α(f_{i_{m+1}}, r, a) subject to β(f_{i_{m+1}}, r, a) > 0.

Heuristic "poly" selects an attribute f_{i_{m+1}} ∈ E(T′) which maximizes the value β(f_{i_{m+1}}, r, a) / (α(f_{i_{m+1}}, r, a) + 1).

Heuristic "log" selects an attribute f_{i_{m+1}} ∈ E(T′) which maximizes the value β(f_{i_{m+1}}, r, a) / log_2(α(f_{i_{m+1}}, r, a) + 2).

Let H be one of the considered heuristics. For a row r of the table T, we apply H to the row r and each decision a ∈ D(r). As a result, we obtain |D(r)| rules. Depending on our aim, we choose among these rules either a rule with minimum length or a rule with maximum coverage.
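A compact sketch of the complete greedy procedure, reusing the helpers above (again our own illustration under the stated assumptions, not the authors' implementation):

from math import log2

def greedy_rule(table, r, a, heuristic):
    # Grow a rule for row r and decision a, adding one condition per step,
    # until a becomes a common decision of the associated subtable T'.
    conditions, current, used = [], table, set()
    while any(a not in D for (_, D) in current):
        best, best_key = None, None
        for i in range(len(r)):
            # consider only unused attributes from E(T') (not constant on T')
            if i in used or len({row[i] for (row, _) in current}) < 2:
                continue
            narrowed = subtable(current, [(i, r[i])])
            M_val, RM_val, alpha, beta = candidate_values(current, narrowed, a)
            if heuristic == 'M':
                key = M_val                    # minimize M
            elif heuristic == 'RM':
                key = RM_val                   # minimize RM
            elif heuristic == 'maxCov':
                if beta <= 0:
                    continue
                key = alpha                    # minimize alpha, given beta > 0
            elif heuristic == 'poly':
                key = -beta / (alpha + 1)      # maximize beta / (alpha + 1)
            else:                              # 'log'
                key = -beta / log2(alpha + 2)  # maximize beta / log2(alpha + 2)
            if best_key is None or key < best_key:
                best, best_key = i, key
        # rows of T are pairwise different, so a suitable attribute exists
        assert best is not None
        conditions.append((best, r[best]))
        used.add(best)
        current = subtable(current, [(best, r[best])])
    return conditions, len(conditions), counts(current, a)[1]  # rule, length, coverage

For a row r, calling greedy_rule for each a ∈ D(r) and keeping the rule with minimum length (or maximum coverage) implements the selection described above.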
4 Experimental Results

Experiments were performed on data sets from the UCI Machine Learning Repository [5] using the software system Dagger [2]. Some decision tables contain conditional attributes that take a unique value for each row; such attributes were removed. Some tables contained equal rows with, possibly, different decisions; in this case, each group of identical rows was replaced with a single row from the group labeled with the most common decision for this group. Missing values were replaced with the most common value of the corresponding attribute. The 12 prepared data sets were considered as information systems (see Table 1, which contains some information about each of these information systems).

Table 1. Data sets considered as information systems

Data set          Rows  Attr
Adult-stretch       16     5
Balance-scale      625     5
Breast-cancer      266    10
Cars              1728     7
Hayes-roth-data     69     5
Lenses              24     5
Monks-1-test       432     7
Monks-3-test       432     7
Shuttle-landing     15     7
Teeth               23     9
Tic-tac-toe        958    10
Zoo-data            59    17

For each information system I, we construct the set Φ^{m-v}(I) of decision tables with many-valued decisions and the set Φ^{s-v}(I) of decision tables with single-valued decisions. For each row r of each table T ∈ Φ^{m-v}(I), we apply to this row each of the five greedy heuristics as described at the end of the previous section. We rank the five heuristics for the row r relative to the length and coverage of the constructed rules and find, for each heuristic, the average ranks relative to length and coverage among all rows of all tables from Φ^{m-v}(I). After that, we take the mean of the average ranks over all 12 information systems and obtain overall ranks. The results can be found in Table 2. The best three heuristics for length are M, log, and RM. The best three heuristics for coverage are poly, log, and RM. We study decision tables with single-valued decisions in the same way (see Table 2). Here the best three heuristics for length are M, RM, and log, and the best three heuristics for coverage are poly, log, and RM.

Table 2. Overall ranks for the heuristics

                                  poly  log   maxCov  M     RM
Single-valued decisions Length    3.38  2.25  5.00    2.17  2.21
                        Coverage  1.67  1.83  4.00    4.21  3.29
Many-valued decisions   Length    3.33  2.33  5.00    1.79  2.54
                        Coverage  1.67  1.83  3.67    4.21  3.62

For each heuristic and each row r of each table T ∈ Φ^{m-v}(I), we compare the length of the rule constructed by the heuristic for r (denoted length_greedy) with the minimum length of a rule (denoted length_min) and calculate the relative difference (length_greedy − length_min) / length_min (we assume that 0/0 = 0). The minimum length of a rule can be found by dynamic programming algorithms (see [3, 4, 20, 21] for decision tables with single-valued decisions and [11] for decision tables with many-valued decisions). We then find the average relative difference among all rows of all tables from Φ^{m-v}(I), and the overall average relative difference for all 12 information systems. The results can be found in Table 3. The best three heuristics for length are M (2% difference), RM (4%), and log (13%).

A similar study was done for coverage and decision tables with many-valued decisions. Here the relative difference is given by (coverage_max − coverage_greedy) / coverage_max, where coverage_greedy is the coverage of the rule constructed by the greedy heuristic and coverage_max is the maximum coverage of a rule, calculated by a dynamic programming algorithm. The best three heuristics for coverage are poly (4% difference), log (8%), and maxCov (14%).

We study decision tables with single-valued decisions in the same way (see the results in Table 3). The best three heuristics for length are RM (4% difference), M (5%), and log (14%). The best three heuristics for coverage are poly (4% difference), log (8%), and maxCov (15%).

Table 3. Overall average relative differences for the heuristics

                                  poly  log   maxCov  M     RM
Single-valued decisions Length    0.27  0.14  0.84    0.05  0.04
                        Coverage  0.04  0.08  0.15    0.24  0.21
Many-valued decisions   Length    0.29  0.13  0.83    0.02  0.04
                        Coverage  0.04  0.08  0.14    0.23  0.20

From these results it follows that, for length minimization, we should use the heuristic M and, possibly, the heuristic RM. For coverage maximization, we should use the heuristic poly.

5 Conclusions

We compared five heuristics for the construction of association rules in the frameworks of both the many-valued and the single-valued decision approaches. We showed that the average relative difference between the coverage of rules constructed by the best heuristic and the maximum coverage of rules is at most 4%. The situation with length is similar. In the future, we plan to use the best heuristic for coverage in algorithms that construct relatively small systems of rules covering almost all objects in information systems.

Acknowledgements

Research reported in this publication was supported by the King Abdullah University of Science and Technology (KAUST). The authors wish to express their gratitude to the anonymous reviewers for their useful comments.

References

1. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: SIGMOD '93, pp. 207–216. ACM (1993)
2. Alkhalid, A., Amin, T., Chikalov, I., Hussain, S., Moshkov, M., Zielosko, B.: Dagger: A tool for analysis and optimization of decision trees and rules. In: Computational Informatics, Social Factors and New Information Technologies: Hypermedia Perspectives and Avant-Garde Experiences in the Era of Communicability Expansion, pp. 29–39. Blue Herons (2011)
3. Amin, T., Chikalov, I., Moshkov, M., Zielosko, B.: Dynamic programming approach for partial decision rule optimization. Fundam. Inform. 119(3-4), 233–248 (2012)
4. Amin, T., Chikalov, I., Moshkov, M., Zielosko, B.: Dynamic programming approach to optimization of approximate decision rules. Inf. Sci. 221, 403–418 (2013)
5. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/ (2007)
6. Bonates, T., Hammer, P.L., Kogan, A.: Maximum patterns in datasets. Discrete Applied Mathematics 156(6), 846–861 (2008)
7. Borgelt, C.: Simple algorithms for frequent item set mining. In: Koronacki, J., Raś, Z.W., Wierzchoń, S.T., Kacprzyk, J. (eds.) Advances in Machine Learning II, Studies in Computational Intelligence, vol. 263, pp. 351–369. Springer Berlin Heidelberg (2010)
8. Feige, U.: A threshold of ln n for approximating set cover. J. ACM 45(4), 634–652 (1998)
9. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2000)
10. Herawan, T., Deris, M.M.: A soft set approach for association rules mining. Knowledge-Based Systems 24(1), 186–195 (2011)
11. Moshkov, M., Zielosko, B.: Combinatorial Machine Learning – A Rough Set Approach, Studies in Computational Intelligence, vol. 360. Springer (2011)
12. Moshkov, M.J., Piliszczuk, M., Zielosko, B.: Greedy algorithm for construction of partial association rules. Fundam. Inform. 92(3), 259–277 (2009)
13. Moshkov, M.J., Piliszczuk, M., Zielosko, B.: On construction of partial association rules. In: Wen, P., Li, Y., Polkowski, L., Yao, Y., Tsumoto, S., Wang, G. (eds.) RSKT, LNCS, vol. 5589, pp. 176–183. Springer (2009)
14. Nguyen, H.S., Ślȩzak, D.: Approximate reducts and association rules – correspondence and complexity results. In: Zhong, N., Skowron, A., Ohsuga, S. (eds.) RSFDGrC, LNCS, vol. 1711, pp. 137–145. Springer (1999)
15. Park, J.S., Chen, M.S., Yu, P.S.: An effective hash based algorithm for mining association rules. In: Carey, M.J., Schneider, D.A. (eds.) SIGMOD Conference, pp. 175–186. ACM Press (1995)
16. Pawlak, Z., Skowron, A.: Rudiments of rough sets. Inf. Sci. 177(1), 3–27 (2007)
17. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
18. Savasere, A., Omiecinski, E., Navathe, S.B.: An efficient algorithm for mining association rules in large databases. In: Dayal, U., Gray, P.M.D., Nishio, S. (eds.) VLDB, pp. 432–444. Morgan Kaufmann (1995)
19. Wieczorek, A., Słowiński, R.: Generating a set of association and decision rules with statistically representative support and anti-support. Information Sciences 277, 56–70 (2014)
20. Zielosko, B.: Sequential optimization of γ-decision rules. In: Ganzha, M., Maciaszek, L.A., Paprzycki, M. (eds.) FedCSIS, pp. 339–346 (2012)
21. Zielosko, B., Chikalov, I., Moshkov, M., Amin, T.: Optimization of decision rules based on dynamic programming approach. In: Faucher, C., Jain, L.C. (eds.) Innovations in Intelligent Machines (4), Studies in Computational Intelligence, vol. 514, pp. 369–392. Springer (2014)