Association Rule Mining Methods as Means of Forming the System of Life Quality Indicators Luidmila P. Bilgaeva1 , Erzhena Ts. Sadykova2 , Gregory V. Badmaev1 1 East Siberia State University of Technology and Managemen, Ulan-Ude, Russia, http://www.esstu.ru 2 Baikal Institute of Nature Management SB RAS, Ulan-Ude, Russia, http://www.binm.ru Abstract. The paper is devoted to comparing of the association rules mining methods and choosing the best method for the formation of the indicators system that affects the quality of life. Three methods are con- sidered: AprioriTiD, FPG and ABBM. For each method, such metrics are calculated as support, confidence, lift and running time, the values of which allow you to discover the best method. It results in the extraction of useful association rules showing how life quality indicators are related to each other. Later on it can be used to solve the problems of analyzing and forecasting. Keywords: data mining, association rule mining, support, confidence, lift, frequent itemsets, truncation of candidates, algorithm running time, life quality indicators 1 Introduction Association rules mining is one of the modern technology tasks for Data Mining, which involves finding the patterns between some related events, the definition of interrelated objects and their location in the state space [13]. Initially, the methods of association rules mining were created to assess the consumer basket [2]. At present, the algorithms for mining of association rules are used to solve various problems associated with the identification of different regularities, for example, in medical diagnostics [5], pharmacology [11], studying of forest fires [8], etc. In this paper, we propose to perform a comparative analysis of the association rules mining methods for the formation of a system of indicators characterizing the life quality of the population. To assess the life quality the number of indica- tors, such as socio-economic and environmental ones, are used. They allow you to determine the degree of satisfaction of personal material and social needs. In its turn, each indicator is characterized by a variety of factors affecting them. The association rules mining will allow forming a small number of indicators and factors, which has the greatest impact on the life quality definition. Further on, this approach can be used to solve problems of analyzing and forecasting of socio-economic processes. As the initial data we use the classification of socio-economic and environ- mental indicators proposed in [12]. 2 Valid method choice To select the best method for association rules mining, the authors developed certain criteria and, in accordance with them, analyzed a certain number of methods. The results are given in Table 1. Table 1. Comparative analysis of association rules mining methods Criteria No Methods Implemen- Small Small ID transac- Candidates’ tation number of running tions truncation simplicity candidates time application possibility 1 AIS + – – – – 2 SETM + – – + – 3 Apriori + + + – + 4 AprioriSome – + + – + 5 AprioriTID + + + + + 6 FPG – + + – + 7 ABBM – + + + + The table shows that the AIS and SETM methods generate a large number of candidates, but cannot truncate them. As a result, the running time of these algorithms increases significantly, which makes them less popular [7]. The Apriori, AprioriTID, AprioriSome, FPG and ABBM methods do not have this disadvantage because they generate a smaller number of candidates and are able to truncate candidates, which provides fewer loads on RAM. The analysis shows that the Apriori and AprioriTid methods proposed in [2,3] are easy to implement, because to store intermediate results, a data structure called a table is used. The AprioriTid algorithm, unlike the Apriori algorithm, uses the transaction identifier (TID) to search for frequent itemsets, which sig- nificantly reduces the algorithm running time. The FPG method uses a data structure called an FP-tree. It allows you to discover a frequent itemsets without candidate itemsets generation. Frequent itemsets are extracted directly from the FP-tree [10]. The FPG method is com- plicated in implementation but it gives a gain in the algorithm running time. The ABBM method is based on a binary matrix [6, 9]. It is complicated to implement but it is more efficient in the algorithm running time than the FPG method. This is due to the matrix columns pruning, where there are candidates of frequent itemsets having support values less than the minimum support. In [4], we analyzed the AprioriTid method for the formation of the indicators system that affects life quality. In this paper, having compared the AprioriTid, FPG and ABBM methods to each other on such criteria as algorithm running time, useful and confident association rules mining, we propose to apply the ABBM method to solve this problem. 3 Basic metrics of association rules mining There are many techniques which allow us solving the problem of association rules mining. They have the same mathematical approach but are different in their implementation methods. Let us consider the basic theoretical principles of these methods [14]. The association rule of context K is an expression of the form A → B, where A, B ⊆ M . The context K is a tuple (G, M, I), where G is a set of objects, M is a set of features, but I ⊆ G × M . When association rules are searched, special metrics are used: Support, Confidence, Lift. Association rule A → B Support is a quantity defined by the formula: |(A ∪ B)0 | Support(A → B) = (1) |G| The Support value indicates which part of the G objects contains A ∪ B. The Confidence of the association rules is defined by the formula: |(A ∪ B)0 | Confidence(A → B) = (2) |A0 | The Confidence value shows, which part of the objects that contain A, also contains A ∪ B. The following quantity is called the association rule utility (Lift): |(A ∪ B)0 | Lift(A → B) = (3) |A0 | × |B 0 | In other words, the utility is the ratio of Confidence(A → B) to the Support(B). The Lift value indicates how useful the rule is. If the found utility value is more than 1, then the rule is considered to be useful. The task of association rules mining is to find all association rules of the context for which the support and confidence values exceed the certain set values min_support and min_confidence, correspondingly. First, the transaction database is scanned where transactions consisting of various elements are stored. Then we carry out a search for frequent itemsets from 1-itemsets, 2-itemsets, 3-itemsets, etc. For this, such data structures, as a table, tree or binary matrix are used, depending on the method used. The search of frequent itemsets is limited to the minimum support value of min_support, which is set by the user [2]. The association rule mining is performed in frequent itemsets and is limited to the value of the minimum con- fidence of min_confidence and utility (Lift). Usually the minimum confidence is set by every user. 4 Software of association rules mining 4.1 Software architecture development To solve the problem of association rules mining, the special software was devel- oped. Its architecture is shown in Fig. 1. Fig. 1. Architecture of association rules mining software The software consists of three modules: a user interface, computational mod- ule, database controller. The user interface module is a set of dialog boxes that serves to provide interaction of the software modules and the user. When the software is started, the main window and the window for displaying the service information are opened. The main window consists of a control panel and a results panel. Using the control panel, the entire system of association rules mining is managed. The output pane consists of two windows: “Useful Rules”, “Description”. The “Useful Rules” window displays all the useful and trustworthy rules. If you choose one of the rules there, then you can see the names of the indicators included in the associative rule and the values of its metrics, such as support, confidence and lift, in the “Description” window. It should be noted that in the associative rule, all the indicators are presented in an encoded form, so the “Description” window allows you to analyze the indicators that are part of the rule. The database controller is used to access the database. The database is repre- sented as xml format files and consists of four tables: experiments, transactions, attributes (items), results. Input to the database is performed by importing some data from an Excel file. A transaction is a set of attributes that characterize a subject area under analysis, for example, “Lifetime”. Each attribute is encoded with a unique integer value which identifies the former in a database. The computing module is a software implementation of the AprioriTid, FPG, ABM algorithms for the association rules mining. To start calculations, you need to select one of the algorithms in the main program frame and specify its settings: minimum support (minsup), minimum confidence (minconf), utility (lift). The program allows you saving the results in a database. 4.2 Association rules mining algorithms In this paper, FPG and ABM the algorithms are presented. The AprioriTid algorithm was considered in [5]. FPG Algorithm The basis of the Frequent Pattern-Growth method is the transformation of the transaction database into a tree structure, called the tree of frequent datasets (Frequent-Pattern Tree or FP-tree). The block diagram of the FPG algorithm is shown in Fig. 2. The transaction database is scanned first, and frequent itemsets are selected. These are the items, the number of which is bigger and equal to the minimum support value in various transactions. 1. These frequent itemsets are sorted in descending order of their support val- ues, for example, (c, 5), (b, 4), (a, 3). 2. After sorting, the root node of the FP tree is created, which is denoted as ROOT. The tree construction begins with the enumeration of transactions. The tree is being constructed according the following rule. If for the next transaction tree element there is a node which name coincides with the ele- ment name, then the element does not create a new node, and the index of the corresponding node in the tree is increased by 1. Otherwise, a new node for this item is created and an index 1 is assigned to it. 3. Then you need to look through all the items in the loop and find all the paths in the tree that lead to the current item nodes. For each path, you have to calculate how many times the current item takes place in it and put it in a 2-tuple (set, index). 4. Next, remove the item itself (i.e. the set suffix) from the paths leading to it (i.e. the dial prefix). 5. Count how many times each item appears in the paths prefixes got in the previous step, and sort them in descending order of these values, getting a new set of transactions. It is based on the construction of a new FP-tree which is called a conditional FP-tree, since it is associated with only one object. 6. In this FP-tree, you need to find all the items (nodes) for which support (the number of occurrences in the tree) is equal to the minimum support or more. If the item occurs two or more times, then its indices (the occurrence frequencies in the conditional basis) are summed. Fig. 2. Block diagram of the FPG algorithm 7. Starting from the tree root, you are to record the paths that lead to each node for which the support or index is bigger or equal to the minimum support. Then you return the item (the template suffix) removed earlier and count the index or support obtained as a result. This result will be frequent itemsets. ABBM Algorithm An algorithm block diagram based on a binary matrix is shown in Fig. 3. Fig. 3. Block diagram of the ABBM algorithm The method essence is to mine multi-level association rules using the hierar- chies’ concept. The algorithm begins with the taxonomy coding. Then the first level of the H = 1 hierarchy is determined. Next, the transaction database is be- ing converted to a binary matrix, and the minimum support value for this level is set. After that, we look for frequent 1-itemsets which are candidates for the rules. For each of them, support is calculated (i.e. the number of their repetitions in all the transactions involved in the experiment). Next, the matrix columns are clipped, items support of which are less than the minimum support. Then 2-itemsets, 3-itemsets, . . . , i-itemsets are generated in a loop by means of conjunction, where 2 ≤ i ≤ k and k ≤ Hmax . For each of the following hierarchy level, the above stated steps (starting with step 4) are repeated. Once the maximum hierarchy level is reached, the algorithm is completed. Association rules are generated from the found of frequent itemsets. 5 The experiments results To implement the association rules mining for indicators that affect the life quality, the system of indicators proposed by the authors in [12] was used. It was comprised of 46 indicators and attributes, including 8 indicators and 38 attributes which affect one or another indicator in different combinations. Five experiments were performed for which two indicators and seventeen attributes were selected. First, all the indicators and attributes of the system were encoded, then 30 transactions with the number of items from five to eleven were gener- ated from the codes. For example, in the fifth experiment, the first transaction contains nine elements and has the following form: 91, 92, 93, 94, 95, 96, 97, 98, 99. All experiments were carried out on a personal computer with the follow- ing characteristics: processor – AMD FX-6350 3.90GHz, 6 core; RAM – 8GB; Video card – NVIDIA GeForce GTX580 3GB; External memory – SSD Kingston 120GB; Operating system – Windows 10. The running time of the association rules mining algorithms was measured in ticks, where 1 tick is equal 1/10000000 seconds according to [1]. To determine the running time of the AprioriTid, FPG and ABBM algo- rithms, the association rules mining was carried out within the following param- eters: min_support = 3; min_confidence = 0.8; lift > 1. The graph in Fig. 4 shows how the algorithm running time and the number of transactions are correlated. The graph demonstrates that with 30 transactions the ABBM algorithm is executed 12 times faster than the AprioriTid algorithm and 1.6 times faster than the FPG algorithm. Within the same parameters, a graph of relation between the rules number and the transactions number was constructed. It is presented in Fig. 5. The graph shows that the number of rules generated with the AprioriTid method is 2.2 times more than that with the FPG and ABBM ones. It is due to the antimonotonicity property of the AprioriTid method. When you add new items to a transaction, all frequent transaction itemsets are saved. As a result, the number of new frequent itemsets increases thus the number of generated rules increases too. For the next experiment, we set the min_support parameter equal to 10, the min_confidence and lift parameters remained the same as in the previous two experiments. The number of transactions was 20. It was done to generate a small number of rules. Fig. 4. Graph of relation between the algorithm running time and the number of transactions Fig. 5. Graph of relation between the number of rules and the number of transactions The experiment resulted in obtaining five valid and useful association rules with the AprioriTid and ABBM methods and six rules by the FPG method. Since the association rule is an implication, we united them by means of conjunction. Table 2 shows the results of the rule generation. Table 2. Association Rules Method Association Rules AprioriTid (92 → 98) ∧ (94 → 98) ∧ (98 → 100) ∧ (98 → 102) ∧ (97 → 98) FPG (93 → 94)∧(94 → 97)∧(94 → 102) ∧(98 → 94)∧(98 → 97)∧(98 → 102) ABBM (92 → 98) ∧ (94 → 98) ∧ (98 → 100) ∧ (98 → 102) ∧ (97 → 98) The rules (92 → 98) ∧ (94 → 98) ∧ (98 → 100) ∧ (98 → 102) ∧ (97 → 98), obtained by the ABBM method, mean: (92 and 94 and 97) affect 98, and 98 in its turn affects (100 and 102). The table analysis shows that the rules obtained with the AprioriTid and ABBM methods are the same and include six attributes: 92, 94, 97, 98, 100, 102, while the rules generated with the FPG method differ from them and include only five attributes: 93, 94, 97, 98, 102. They have four attributes in common: 94, 97, 98, 102. Since the two methods include the same attributes, they can be totally included in the life quality indicators system to further solve the problems of analyzing and forecasting. These indicators and attributes are as follows: 92 – Fiscal capacity per capita; 93 – Retail turnover per capita; 94 – Volume of paid services provided per capita; 97 – The growth rate of the minimum living wage; 98 – GRP per capita; 100 – The proportion of the population whose incomes are below the minimum living wage; 102 –Employment level. 6 Conclusion Computational experiments were carried out with the developed software. They enabled us to obtain valid and useful association rules for the population life quality indicators. The correlation between the algorithms running time and the number of transactions was revealed. It was found that the ABBM method was the most efficient. The FPG method is 1.6 times slower than the ABBM method. The correlation between the number of rules and the number of transac- tions was obtained. It demonstrates that the more transactions are present, the more rules are generated. As for the AprioriTid method it generates 2.1 times more rules than the ABBM and FPG methods because of its antimonotonicity property. This fact negatively influences on the AprioriTid algorithm running time. The experiments results demonstrated that application of association rules mining methods will allow identifying the most significant indicators in any subject area. Further on we can use these indicators to solve various problems of analyzing and forecasting. References 1. Electronic library of reference materials of Microsoft for the C# language. https://docs.microsoft.com/ru-ru/dotnet/csharp/language-reference 2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I.: Fast discovery of association rules. In: Advances in Knowledge Discovery and Data Mining, pp. 307–328. American Association for Artificial Intelligence Menlo Park, CA, USA (1996) 3. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Proceedings of the Eleventh International Conference on Data Engineering. pp. 3–14. ICDE ’95, IEEE Computer Society, Washington, DC, USA (1995) 4. Bilgaeva, L., Shirapov, D., Badmaev, G.: Formation of life quality indicators system through search algorithm of association rules. Proceedings of the Work-shop on CMDM 2016. Aachen, Germany, CEUR-WS 1726, 13–22 (2016) 5. Billig, V., Tsaregorodcev, N., Ivanova, O.: Building association rules in medical di- agnosis. Programmnye produkty i sistemy: International journal 2, 146–157 (2016) 6. Chinta Someswara Rao, Ravi Babu, D., Shiva Shankar, R., Pradeep Kumar, V., Rajanikanth, J., Chandra Sekhar, C.: Mining association rules based on boolean algorithm – a study in large databases. International Journal of Machine Learning and Computing 3(4), 347–351 (2013) 7. Houtsma, M., Swami, A.: Set-oriented mining for association rules in relational databases. In: Proceedings of the Eleventh International Conference on Data En- gineering. pp. 25–33. ICDE ’95, IEEE Computer Society, Washington, DC, USA (1995) 8. Kostenchuk, M., Drozhzhin, N., Belousov, R.: Search of patterns in estimation of forest fire situation by weather conditions. Scientific and educational problems of civil protection 2, 61–66 (2014) 9. Liu, H., Wang, B.: An association rule mining algorithm based on a boolean matrix. Data Science Journal 6, 559–565 (2007) 10. Oreshkov, V.: FPG – an alternative search algorithm for association rules. https://basegroup.ru/community/articles/fpg (2014) 11. Pivovarova, N., Vidunova, S.: Data mining in the pharmaceutical business. Naukovedenie 8(6) (2016), http://naukovedenie.ru/PDF/166TVN616.pdf 12. Saktoev, V., Sadykova, E.: Sustainable Development of Regional Economic Systems with Environmental Regulations. ZAO “Economy”, Moscow, Russia (2011) 13. Vercellis, C.: Business Intelligence: Data Mining and Optimization for Decision Making. Willey (2009) 14. Zayko, T.A., Oleinik, A.A., Subbotin, S.A.: Association rules in data mining. Vest- nik NTU “KPI” 39(1012), 82–95 (2013)