=Paper=
{{Paper
|id=None
|storemode=property
|title=Formal Frame for Data Mining with Association Rules - a Tool for Workflow Planning
|pdfUrl=https://ceur-ws.org/Vol-950/planlearn2012_submission_2.pdf
|volume=Vol-950
}}
==Formal Frame for Data Mining with Association Rules - a Tool for Workflow Planning==
Formal Frame for Data Mining with Association Rules – a Tool for Workflow Planning Jan Rauch and Milan Šimůnek 1 1 INTRODUCTION ≈. The rule ϕ ≈ ψ is true in a data matrix M if F≈ (a, b, c, d) = 1 where ha, b, c, di = 4f t(ϕ,ψ, M), otherwise it is false in M. The goal of this extended abstract is to contribute to the forum for Expression A1 (1, 2, 3) ∨ A2 (4, 6) ⇒p,B A3 (8, 9) ∧ A4 (1) is an research on construction of data mining workflows. We briefly intro- example of association rule, ⇒p,B is a 4ft-quantifier of founded im- duce a formal framework called FOFRADAR (FOrmal FRAmework a plication. It is F⇒p,B (a, b, c, d) = 1 if and only if a+b ≥ p∧a ≥ B for Data mining with Association Rules) and then we outline how [2]. There are various additional 4ft-quantifiers defined in [2, 4]. it can be used to control a workflow of data mining with associa- A deduction rule ϕϕ≈ψ 0 ≈ψ 0 is correct if the following is true for each tion rules. We consider this relevant to associative classifiers that use data matrix M: if ϕ ≈ ψ is true in M then also ϕ0 ≈ ψ 0 is true in association rule mining in the training phase [3]. M. There are reasonable criteria making possible to decide if ϕϕ≈ψ 0 ≈ψ 0 We deal with association rules ϕ ≈ ψ where ϕ and ψ are general is a correct deduction rule [4]. Boolean attributes derived from columns of analyzed data matrices. FOFRADAR consists of a logical calculus LC of association rules Symbol ≈ is called 4ft-quantifier and it stands for a condition con- and of several mutually related languages and procedures used to cerning a contingency table of ϕ and ψ [6]. Such rules are more gen- formalize both items of domain knowledge and important steps in eral than rules introduced in [1]. We consider data mining process as the data mining process. They are shortly introduced below, relations described by the well known CRISP-DM methodology. of some of them to the CRISP-DM are sketched in Fig. 1. The FOFRADAR is introduced in [5]. Its goal is to formally de- scribe a data mining process such that domain knowledge can be used both in formulation of reasonable analytical questions and in inter- pretation of resulting set of association rules. No similar approach to dealing with domain knowledge in data mining is known to the au- thors. An application of the FOFRADAR in data mining workflows is outlined here for the first time. 2 FOFRADAR FOFRADAR is based on a logical calculus LC of association rules. Formulas of LC correspond to the association rules ϕ ≈ φ [4]. Such rules are evaluated in data matrices rows of which correspond to ob- served objects o1 , . . . , on and columns correspond to observed at- tributes A1 , . . . , AK . We assume that Ai has a finite number ti ≥ 2 of possible values 1, . . . , ti (i.e. categories) and Ai (oj ) is a value of Ai in row oj for i = 1, . . . , K and j = 1, . . . , n. Boolean attributes ϕ, φ are derived from basic Boolean attributes i.e expressions Ai (α) where α ⊂ {1, . . . , ti }. A basic Boolean attribute Ai (α) is true in a row oj of a given data matrix M if Ai (oj ) ∈ α, otherwise it is false. Thus, we do not deal only Figure 1. FOFRADAR framework and CRISP-DM methodology with Boolean attributes - conjunctions of attribute-value pairs Ai (a) where a ∈ {1, . . . , ti } but we use general Boolean attributes derived by connectives ∧, ∨, ¬ from columns of a given data matrix. The 4ft-table 4f t(ϕ,ψ, M) of ϕ and ψ in a data matrix M is a Language LDK of domain knowledge – formulas of LDK corre- quadruple ha, b, c, di where a is the number of rows of M satisfying spond to items of domain knowledge. A formula A1 ↑↑ A11 mean- both ϕ and ψ, b is the number of rows satisfying ϕ and not satisfying ing that if A1 increases then A11 increases too is an example. We ψ, c is the number of rows not satisfying ϕ and satisfying ψ and d consider formulas of LDK as results of business understanding. is the number of rows satisfying neither ϕ nor ψ. A {0, 1}-valued Language LDt of data knowledge – its formulas can be consid- associated function F≈ (a, b, c, d) is defined for each 4ft-quantifier ered as results of data understanding. An example is information that 1 University of Economics, Prague, Czech Republic, email: rauch@vse.cz 90 per cent of observed patients are men. and simunek@vse.cz Language LAQ of analytical questions – the expression [M : A1 , . . . , A10 ≈? A11 , . . . , A20 ; 6→ A1 ↑↑ A11 ] is an example 1 1 Formulate_Analytical_Question of a formula of LAQ . It corresponds to a question Q1 : In data 2 Define_Set_of_Relevant_Rules matrix M, are there any relations between combinations of values 3 2 Apply ASSOC of attributes A1 , . . . , A10 and combinations of values of attributes 4 Apply CONCL A11 , . . . , A20 which are not consequences of A1 ↑↑ A11 ? 5 IF Continue_ASSOC THEN Language LRAR of sets of relevant association rules – each for- 6 BEGIN mula Φ of LRAR defines a set S(Φ) of rules relevant to a given 7 Modify Set_of_Relevant_Rules analytical question. The set S(Φ) relevant to Q1 can consist of rules 8 GOTO 2 ϕ ⇒0.9,100 ψ where ϕ is a conjunction of some of basic Boolean 9 END attributes A1 (α1 ), . . . , A10 (α10 ), similarly for ψ and A11 , . . . , A20 . 10 IF Continue_Analysis THEN GOTO 1 Here α1 can be e.g. any interval of maximal 3 consecutive categories, 11 STOP similarly for additional basic Boolean attributes. Procedure ASSOC – its input consists of a formula Φ of LRAR and of an analyzed data matrix M. Output of the ASSOC procedure Figure 2. Association rule data mining workflow based on FOFRADAR is a set T rue(S(Φ), M) of all rules ϕ ≈ ψ belonging to S(Φ) which are true in M. The procedure 4ft-Miner [6] is an implementation of ASSOC. It deals with a very sophisticated language LRAR . rules ϕ ≈ ψ which belong to S(Φ) and which are true in M. Procedure Cons – this procedure maps a formula Ω of LDK to a A next step is interpretation of the set T rue(S(Φ), M). This is re- set Cons(Ω, ≈) of association rules ϕ ≈ ψ which can be considered alized by the procedure CON CL, see row 4 in Fig. 2. Consequences as consequences of Ω. The set Cons(A1 ↑↑ A11 , ⇒p,B ) is a set of ω⇒p,B τ of particular items of domain knowledge are used which means that all rules ϕ ⇒p,B φ for which ϕ⇒p,B φ is a correct deduction rule and the procedure Cons is applied and several formulas of the language ω ⇒p,B τ is an atomic consequence of Cons(A1 ↑↑ A11 ). Rules LConcl are produced by the procedure CON CL. A1 (low) ⇒p,B A11 (low)) and A1 (high) ⇒p,B A11 (high) are ex- One of formulas produced by CON CL is a simple Boolean vari- amples of atomic consequences of A1 ↑↑ A11 , low and high are able Continue ASSOC. If its value is setup as true, then the set suitable subsets of categories of A1 and A11 . Some additional rules S(Φ) of relevant association rules which have to be verified is mod- can also be considered as belonging to Cons(A1 ↑↑ A11 , ⇒p,B ), ified and the process of solution of the analytical question Q con- see [5] for details. tinues, ses rows 5 – 9 in Fig. 2. The modification of the set S(Φ) Language LConcl – formulas of this language correspond to con- is done by the procedure M odif ySet of Relevant Rules which clusions which can be made on the basis of the set T rue(S(Φ), M) uses experience from applications of the 4ft-Miner procedure. produced by the ASSOC procedure. Two examples of such conclu- If the value of Continue ASSOC is setup to f alse, then the sions follow. (1): All rules in T rue(S(Φ), M) can be considered process of solution of the particular analytical question is terminated. as consequences of known items of domain knowledge A1 ↑↑ A11 Then a procedure Continue Analysis is used to decide if an addi- or A2 ↑↑ A19 . (2): Lot of rules from T rue(S(Φ), M) can be con- tional analytical question will be generated and solved. sidered as consequences of yet unknown item of domain knowledge There are first experiences with ”manually driven” processes cor- A9 ↑↑ A17 . responding to procedures used in Fig. 2. The most important is ex- There are additional procedures belonging to FOFRADAR, they perience with process corresponding to the CON CL procedure, see transform formulas of a particular language to formulas of another [7]. We believe to get enough experience to run the whole data work- language of FOFRADAR [5]. flow process automatically as outlined in Fig. 2. 3 FOFRADAR and Workflow of Data Mining ACKNOWLEDGEMENTS To keep things simple and the explanation concise we assume that The work described here has been supported by Grant No. the analyzed data matrix M is given as a result of necessary trans- 201/08/0802 of the Czech Science Foundation. formations. In addition, we assume that a set DK of formulas of the language LDK and a set DtK of formulas of the language LDt are given. A workflow of data mining with association rules can be then REFERENCES described according to Fig. 2. [1] R. Aggraval et al., Fast Discovery of Association Rules, 307–328, Ad- The first row in Fig. 2 means that an analytical question Q which vances in Knowledge Discovery and Data Mining, AAAI Press, Menlo can be solved by the procedure ASSOC is formulated using set DK Park, 1996. [2] P. Hájek and T. Havránek, Mechanising Hypothesis Formation - Mathe- of formulas of the language LDK . The set DtK of formulas of the matical Foundations for a General Theory, Springer, Berlin, 1978 language LDK can also be used to formulate reasonable analytical [3] M. Jalali-Heravi and R. Zaiane, A study on interestingness measures for questions. associative classifiers, 1039–1046, Proceedings of the 2010 ACM Sym- A solution of Q starts with a definition of a set S(Φ) of relevant posium on Applied Computing, ACM New York, 2010 association rules which have to be verified in M, see row 2 in Fig. [4] J. Rauch: Logic of association rules, Applied Intelligence, 22, 9–28, 2005 [5] J. Rauch, Consideration on a Formal Frame for Data Mining, 562–569, 2. The set S(Φ) is given by a formula Φ of language LRAR . The Proceedings of Granular Computing 2011, IEEE Computer Society, Pis- formula Φ is a result of application of a procedure transforming for- cataway, 2011. mulas of LAQ to formulas of LRAR . [6] J. Rauch and M. Šimůnek, An Alternative Approach to Mining Associa- Then, the procedure ASSOC is applied, see row 3 in Fig. 2. Ex- tion Rules, 219–238, Data Mining: Foundations, Methods, and Applica- tions, Springer-Verlag, Berlin, 2005. perience with the procedure 4ft-Miner, which is an enhanced imple- [7] J. Rauch and M. Šimůnek, Applying Domain Knowledge in Association- mentation of the ASSOC procedure, are given in [6, 7]. The applica- Rules Mining Process - First Experience, 113–122, Foundations of In- tion of ASSOC results into a set T rue(S(Φ), M) of all association telligent Systems, Springer-Verlag, Berlin, 2011.