INTRODUCTION

Formal Frame for Data Mining with Association Rules - a Tool for Workflow Planning

Jan Rauch

rauch@vse.cz

Milan Sˇ im u˚nek

and The goal of this extended abstract is to contribute to the forum for research on construction of data mining workflows. We briefly introduce a formal framework called FOFRADAR (FOrmal FRAmework for Data mining with Association Rules) and then we outline how it can be used to control a workflow of data mining with association rules. We consider this relevant to associative classifiers that use association rule mining in the training phase [3]. We deal with association rules ' where ' and are general Boolean attributes derived from columns of analyzed data matrices. Symbol is called 4ft-quantifier and it stands for a condition concerning a contingency table of ' and [6]. Such rules are more general than rules introduced in [1]. We consider data mining process as described by the well known CRISP-DM methodology. The FOFRADAR is introduced in [5]. Its goal is to formally describe a data mining process such that domain knowledge can be used both in formulation of reasonable analytical questions and in interpretation of resulting set of association rules. No similar approach to dealing with domain knowledge in data mining is known to the authors. An application of the FOFRADAR in data mining workflows is outlined here for the first time.

INTRODUCTION FOFRADAR FOFRADAR is based on a logical calculus LC of association rules.

Formulas of LC correspond to the association rules ' [ 4 ]. Such rules are evaluated in data matrices rows of which correspond to observed objects o1; : : : ; on and columns correspond to observed attributes A1; : : : ; AK . We assume that Ai has a finite number ti 2 of possible values 1; : : : ; ti (i.e. categories) and Ai(oj ) is a value of Ai in row oj for i = 1; : : : ; K and j = 1; : : : ; n.

Boolean attributes ', are derived from basic Boolean attributes i.e expressions Ai( ) where f1; : : : ; tig. A basic Boolean attribute Ai( ) is true in a row oj of a given data matrix M if Ai(oj ) 2 , otherwise it is false. Thus, we do not deal only with Boolean attributes - conjunctions of attribute-value pairs Ai(a) where a 2 f1; : : : ; tig but we use general Boolean attributes derived by connectives ^; _; : from columns of a given data matrix.

The 4ft-table 4f t(', , M) of ' and in a data matrix M is a quadruple ha; b; c; di where a is the number of rows of M satisfying both ' and , b is the number of rows satisfying ' and not satisfying , c is the number of rows not satisfying ' and satisfying and d is the number of rows satisfying neither ' nor . A f0; 1g-valued associated function F (a; b; c; d) is defined for each 4ft-quantifier . The rule ' is true in a data matrix M if F (a; b; c; d) = 1 where ha; b; c; di = 4f t(', , M), otherwise it is false in M.

Expression A1(1; 2; 3) _ A2(4; 6) )p;B A3(8; 9) ^ A4(1) is an example of association rule, )p;B is a 4ft-quantifier of founded implication. It is F)p;B (a; b; c; d) = 1 if and only if a+ab p ^ a B [ 2 ]. There are various additional 4ft-quantifiers defined in [ 2, 4 ].

A deduction rule ''0 0 is correct if the following is true for each data matrix M: if ' is true in M then also '0 0 is true in M. There are reasonable criteria making possible to decide if ''0 0 is a correct deduction rule [ 4 ].

FOFRADAR consists of a logical calculus LC of association rules and of several mutually related languages and procedures used to formalize both items of domain knowledge and important steps in the data mining process. They are shortly introduced below, relations of some of them to the CRISP-DM are sketched in Fig. 1. Language LDK of domain knowledge – formulas of LDK correspond to items of domain knowledge. A formula A1 "" A11 meaning that if A1 increases then A11 increases too is an example. We consider formulas of LDK as results of business understanding.

Language LDt of data knowledge – its formulas can be considered as results of data understanding. An example is information that 90 per cent of observed patients are men.

Language LAQ of analytical questions – the expression

[M : A1; : : : ; A10 ? A11; : : : ; A20; 6! A1 "" A11] is an example of a formula of LAQ. It corresponds to a question Q1: In data matrix M, are there any relations between combinations of values of attributes A1; : : : ; A10 and combinations of values of attributes A11; : : : ; A20 which are not consequences of A1 "" A11?

Language LRAR of sets of relevant association rules – each for

mula of LRAR defines a set S( ) of rules relevant to a given analytical question. The set S( ) relevant to Q1 can consist of rules ' )0:9;100 where ' is a conjunction of some of basic Boolean attributes A1( 1); : : : ; A10( 10), similarly for and A11; : : : ; A20. Here 1 can be e.g. any interval of maximal 3 consecutive categories, similarly for additional basic Boolean attributes.

Procedure ASSOC – its input consists of a formula of LRAR and of an analyzed data matrix M. Output of the ASSOC procedure is a set T rue(S( ); M) of all rules ' belonging to S( ) which are true in M. The procedure 4ft-Miner [ 6 ] is an implementation of ASSOC. It deals with a very sophisticated language LRAR.

Procedure Cons – this procedure maps a formula of LDK to a set Cons( ; ) of association rules ' which can be considered as consequences of . The set Cons(A1 "" A11; )p;B ) is a set of all rules ' )p;B for which '!))pp;;BB is a correct deduction rule and ! )p;B is an atomic consequence of Cons(A1 "" A11). Rules A1(low) )p;B A11(low)) and A1(high) )p;B A11(high) are examples of atomic consequences of A1 "" A11, low and high are suitable subsets of categories of A1 and A11. Some additional rules can also be considered as belonging to Cons(A1 "" A11; )p;B ), see [ 5 ] for details.

Language LConcl – formulas of this language correspond to conclusions which can be made on the basis of the set T rue(S( ); M) produced by the ASSOC procedure. Two examples of such conclusions follow. (1): All rules in T rue(S( ); M) can be considered as consequences of known items of domain knowledge A1 "" A11 or A2 "" A19. (2): Lot of rules from T rue(S( ); M) can be considered as consequences of yet unknown item of domain knowledge A9 "" A17.

There are additional procedures belonging to FOFRADAR, they transform formulas of a particular language to formulas of another language of FOFRADAR [ 5 ]. 3

FOFRADAR and Workflow of Data Mining

To keep things simple and the explanation concise we assume that the analyzed data matrix M is given as a result of necessary transformations. In addition, we assume that a set DK of formulas of the language LDK and a set DtK of formulas of the language LDt are given. A workflow of data mining with association rules can be then described according to Fig. 2.

The first row in Fig. 2 means that an analytical question Q which can be solved by the procedure ASSOC is formulated using set DK of formulas of the language LDK . The set DtK of formulas of the language LDK can also be used to formulate reasonable analytical questions.

A solution of Q starts with a definition of a set S( ) of relevant association rules which have to be verified in M, see row 2 in Fig. 2. The set S( ) is given by a formula of language LRAR. The formula is a result of application of a procedure transforming formulas of LAQ to formulas of LRAR.

Then, the procedure ASSOC is applied, see row 3 in Fig. 2. Experience with the procedure 4ft-Miner, which is an enhanced implementation of the ASSOC procedure, are given in [ 6, 7 ]. The application of ASSOC results into a set T rue(S( ); M) of all association 1 Formulate_Analytical_Question

Define_Set_of_Relevant_Rules 2 Apply ASSOC

Apply CONCL IF Continue_ASSOC THEN

BEGIN Modify Set_of_Relevant_Rules GOTO 2

END IF Continue_Analysis THEN GOTO 1

STOP rules ' which belong to S( ) and which are true in M.

A next step is interpretation of the set T rue(S( ); M). This is realized by the procedure CON CL, see row 4 in Fig. 2. Consequences of particular items of domain knowledge are used which means that the procedure Cons is applied and several formulas of the language LConcl are produced by the procedure CON CL.

One of formulas produced by CON CL is a simple Boolean variable Continue ASSOC. If its value is setup as true, then the set S( ) of relevant association rules which have to be verified is modified and the process of solution of the analytical question Q continues, ses rows 5 – 9 in Fig. 2. The modification of the set S( ) is done by the procedure M odif ySet of Relevant Rules which uses experience from applications of the 4ft-Miner procedure.

If the value of Continue ASSOC is setup to f alse, then the process of solution of the particular analytical question is terminated. Then a procedure Continue Analysis is used to decide if an additional analytical question will be generated and solved.

There are first experiences with ”manually driven” processes corresponding to procedures used in Fig. 2. The most important is experience with process corresponding to the CON CL procedure, see [ 7 ]. We believe to get enough experience to run the whole data workflow process automatically as outlined in Fig. 2.

ACKNOWLEDGEMENTS

The work described here has been supported by Grant No. 201/08/0802 of the Czech Science Foundation.

[1]

Aggraval et al., Fast Discovery of Association Rules , 307 - 328 , Advances in Knowledge Discovery and Data Mining , AAAI Press, Menlo Park, 1996 .

[2]

Ha ´jek and T . Havra´nek, Mechanising Hypothesis Formation - Mathematical Foundations for a General Theory , Springer, Berlin, 1978

[3]

Jalali-Heravi and

Zaiane , A study on interestingness measures for associative classifiers , 1039 - 1046 , Proceedings of the 2010 ACM Symposium on Applied Computing , ACM New York, 2010

[4]

Rauch : Logic of association rules, Applied Intelligence , 22 , 9 - 28 , 2005

[5]

Rauch , Consideration on a Formal Frame for Data Mining , 562 - 569 , Proceedings of Granular Computing 2011 , IEEE Computer Society, Piscataway, 2011 .

[6]

Rauch and M. Sˇ imu˚nek, An Alternative Approach to Mining Association Rules, 219 - 238 , Data Mining: Foundations, Methods, and Applications, Springer-Verlag, Berlin, 2005 .

[7]

Rauch and M. Sˇ imu˚nek, Applying Domain Knowledge in AssociationRules Mining Process - First Experience , 113 - 122 , Foundations of Intelligent Systems, Springer-Verlag, Berlin, 2011 .