=Paper= {{Paper |id=None |storemode=property |title=Formal Frame for Data Mining with Association Rules - a Tool for Workflow Planning |pdfUrl=https://ceur-ws.org/Vol-950/planlearn2012_submission_2.pdf |volume=Vol-950 }} ==Formal Frame for Data Mining with Association Rules - a Tool for Workflow Planning== https://ceur-ws.org/Vol-950/planlearn2012_submission_2.pdf
     Formal Frame for Data Mining with Association Rules
               – a Tool for Workflow Planning
                                                       Jan Rauch and Milan Šimůnek 1


1     INTRODUCTION                                                              ≈. The rule ϕ ≈ ψ is true in a data matrix M if F≈ (a, b, c, d) = 1
                                                                                where ha, b, c, di = 4f t(ϕ,ψ, M), otherwise it is false in M.
The goal of this extended abstract is to contribute to the forum for               Expression A1 (1, 2, 3) ∨ A2 (4, 6) ⇒p,B A3 (8, 9) ∧ A4 (1) is an
research on construction of data mining workflows. We briefly intro-            example of association rule, ⇒p,B is a 4ft-quantifier of founded im-
duce a formal framework called FOFRADAR (FOrmal FRAmework                                                                               a
                                                                                plication. It is F⇒p,B (a, b, c, d) = 1 if and only if a+b  ≥ p∧a ≥ B
for Data mining with Association Rules) and then we outline how                 [2]. There are various additional 4ft-quantifiers defined in [2, 4].
it can be used to control a workflow of data mining with associa-                  A deduction rule ϕϕ≈ψ
                                                                                                       0 ≈ψ 0 is correct if the following is true for each
tion rules. We consider this relevant to associative classifiers that use
                                                                                data matrix M: if ϕ ≈ ψ is true in M then also ϕ0 ≈ ψ 0 is true in
association rule mining in the training phase [3].
                                                                                M. There are reasonable criteria making possible to decide if ϕϕ≈ψ    0 ≈ψ 0
   We deal with association rules ϕ ≈ ψ where ϕ and ψ are general
                                                                                is a correct deduction rule [4].
Boolean attributes derived from columns of analyzed data matrices.
                                                                                   FOFRADAR consists of a logical calculus LC of association rules
Symbol ≈ is called 4ft-quantifier and it stands for a condition con-
                                                                                and of several mutually related languages and procedures used to
cerning a contingency table of ϕ and ψ [6]. Such rules are more gen-
                                                                                formalize both items of domain knowledge and important steps in
eral than rules introduced in [1]. We consider data mining process as
                                                                                the data mining process. They are shortly introduced below, relations
described by the well known CRISP-DM methodology.
                                                                                of some of them to the CRISP-DM are sketched in Fig. 1.
   The FOFRADAR is introduced in [5]. Its goal is to formally de-
scribe a data mining process such that domain knowledge can be used
both in formulation of reasonable analytical questions and in inter-
pretation of resulting set of association rules. No similar approach to
dealing with domain knowledge in data mining is known to the au-
thors. An application of the FOFRADAR in data mining workflows
is outlined here for the first time.


2     FOFRADAR
FOFRADAR is based on a logical calculus LC of association rules.
Formulas of LC correspond to the association rules ϕ ≈ φ [4]. Such
rules are evaluated in data matrices rows of which correspond to ob-
served objects o1 , . . . , on and columns correspond to observed at-
tributes A1 , . . . , AK . We assume that Ai has a finite number ti ≥ 2
of possible values 1, . . . , ti (i.e. categories) and Ai (oj ) is a value of
Ai in row oj for i = 1, . . . , K and j = 1, . . . , n.
   Boolean attributes ϕ, φ are derived from basic Boolean attributes
i.e expressions Ai (α) where α ⊂ {1, . . . , ti }. A basic Boolean
attribute Ai (α) is true in a row oj of a given data matrix M if
Ai (oj ) ∈ α, otherwise it is false. Thus, we do not deal only
                                                                                    Figure 1. FOFRADAR framework and CRISP-DM methodology
with Boolean attributes - conjunctions of attribute-value pairs Ai (a)
where a ∈ {1, . . . , ti } but we use general Boolean attributes derived
by connectives ∧, ∨, ¬ from columns of a given data matrix.
   The 4ft-table 4f t(ϕ,ψ, M) of ϕ and ψ in a data matrix M is a
                                                                                   Language LDK of domain knowledge – formulas of LDK corre-
quadruple ha, b, c, di where a is the number of rows of M satisfying
                                                                                spond to items of domain knowledge. A formula A1 ↑↑ A11 mean-
both ϕ and ψ, b is the number of rows satisfying ϕ and not satisfying
                                                                                ing that if A1 increases then A11 increases too is an example. We
ψ, c is the number of rows not satisfying ϕ and satisfying ψ and d
                                                                                consider formulas of LDK as results of business understanding.
is the number of rows satisfying neither ϕ nor ψ. A {0, 1}-valued
                                                                                   Language LDt of data knowledge – its formulas can be consid-
associated function F≈ (a, b, c, d) is defined for each 4ft-quantifier
                                                                                ered as results of data understanding. An example is information that
1 University of Economics, Prague, Czech Republic, email: rauch@vse.cz          90 per cent of observed patients are men.
    and simunek@vse.cz                                                             Language LAQ of analytical questions – the expression
[M : A1 , . . . , A10 ≈? A11 , . . . , A20 ; 6→ A1 ↑↑ A11 ] is an example           1      1 Formulate_Analytical_Question
of a formula of LAQ . It corresponds to a question Q1 : In data                     2        Define_Set_of_Relevant_Rules
matrix M, are there any relations between combinations of values                    3      2 Apply ASSOC
of attributes A1 , . . . , A10 and combinations of values of attributes             4        Apply CONCL
A11 , . . . , A20 which are not consequences of A1 ↑↑ A11 ?                         5        IF Continue_ASSOC THEN
   Language LRAR of sets of relevant association rules – each for-                  6           BEGIN
mula Φ of LRAR defines a set S(Φ) of rules relevant to a given                      7           Modify Set_of_Relevant_Rules
analytical question. The set S(Φ) relevant to Q1 can consist of rules               8           GOTO 2
ϕ ⇒0.9,100 ψ where ϕ is a conjunction of some of basic Boolean                      9           END
attributes A1 (α1 ), . . . , A10 (α10 ), similarly for ψ and A11 , . . . , A20 .   10        IF Continue_Analysis THEN GOTO 1
Here α1 can be e.g. any interval of maximal 3 consecutive categories,              11        STOP
similarly for additional basic Boolean attributes.
   Procedure ASSOC – its input consists of a formula Φ of LRAR
and of an analyzed data matrix M. Output of the ASSOC procedure                     Figure 2. Association rule data mining workflow based on FOFRADAR
is a set T rue(S(Φ), M) of all rules ϕ ≈ ψ belonging to S(Φ) which
are true in M. The procedure 4ft-Miner [6] is an implementation of
ASSOC. It deals with a very sophisticated language LRAR .
                                                                                   rules ϕ ≈ ψ which belong to S(Φ) and which are true in M.
   Procedure Cons – this procedure maps a formula Ω of LDK to a
                                                                                      A next step is interpretation of the set T rue(S(Φ), M). This is re-
set Cons(Ω, ≈) of association rules ϕ ≈ ψ which can be considered
                                                                                   alized by the procedure CON CL, see row 4 in Fig. 2. Consequences
as consequences of Ω. The set Cons(A1 ↑↑ A11 , ⇒p,B ) is a set of
                                   ω⇒p,B τ                                         of particular items of domain knowledge are used which means that
all rules ϕ ⇒p,B φ for which ϕ⇒p,B         φ
                                              is a correct deduction rule and
                                                                                   the procedure Cons is applied and several formulas of the language
ω ⇒p,B τ is an atomic consequence of Cons(A1 ↑↑ A11 ). Rules                       LConcl are produced by the procedure CON CL.
A1 (low) ⇒p,B A11 (low)) and A1 (high) ⇒p,B A11 (high) are ex-                        One of formulas produced by CON CL is a simple Boolean vari-
amples of atomic consequences of A1 ↑↑ A11 , low and high are                      able Continue ASSOC. If its value is setup as true, then the set
suitable subsets of categories of A1 and A11 . Some additional rules               S(Φ) of relevant association rules which have to be verified is mod-
can also be considered as belonging to Cons(A1 ↑↑ A11 , ⇒p,B ),                    ified and the process of solution of the analytical question Q con-
see [5] for details.                                                               tinues, ses rows 5 – 9 in Fig. 2. The modification of the set S(Φ)
   Language LConcl – formulas of this language correspond to con-                  is done by the procedure M odif ySet of Relevant Rules which
clusions which can be made on the basis of the set T rue(S(Φ), M)                  uses experience from applications of the 4ft-Miner procedure.
produced by the ASSOC procedure. Two examples of such conclu-                         If the value of Continue ASSOC is setup to f alse, then the
sions follow. (1): All rules in T rue(S(Φ), M) can be considered                   process of solution of the particular analytical question is terminated.
as consequences of known items of domain knowledge A1 ↑↑ A11                       Then a procedure Continue Analysis is used to decide if an addi-
or A2 ↑↑ A19 . (2): Lot of rules from T rue(S(Φ), M) can be con-                   tional analytical question will be generated and solved.
sidered as consequences of yet unknown item of domain knowledge                       There are first experiences with ”manually driven” processes cor-
A9 ↑↑ A17 .                                                                        responding to procedures used in Fig. 2. The most important is ex-
   There are additional procedures belonging to FOFRADAR, they                     perience with process corresponding to the CON CL procedure, see
transform formulas of a particular language to formulas of another                 [7]. We believe to get enough experience to run the whole data work-
language of FOFRADAR [5].                                                          flow process automatically as outlined in Fig. 2.

3    FOFRADAR and Workflow of Data Mining                                          ACKNOWLEDGEMENTS
To keep things simple and the explanation concise we assume that                   The work described here has been supported by Grant No.
the analyzed data matrix M is given as a result of necessary trans-                201/08/0802 of the Czech Science Foundation.
formations. In addition, we assume that a set DK of formulas of the
language LDK and a set DtK of formulas of the language LDt are
given. A workflow of data mining with association rules can be then                REFERENCES
described according to Fig. 2.                                                     [1] R. Aggraval et al., Fast Discovery of Association Rules, 307–328, Ad-
   The first row in Fig. 2 means that an analytical question Q which                   vances in Knowledge Discovery and Data Mining, AAAI Press, Menlo
can be solved by the procedure ASSOC is formulated using set DK                        Park, 1996.
                                                                                   [2] P. Hájek and T. Havránek, Mechanising Hypothesis Formation - Mathe-
of formulas of the language LDK . The set DtK of formulas of the
                                                                                       matical Foundations for a General Theory, Springer, Berlin, 1978
language LDK can also be used to formulate reasonable analytical                   [3] M. Jalali-Heravi and R. Zaiane, A study on interestingness measures for
questions.                                                                             associative classifiers, 1039–1046, Proceedings of the 2010 ACM Sym-
   A solution of Q starts with a definition of a set S(Φ) of relevant                  posium on Applied Computing, ACM New York, 2010
association rules which have to be verified in M, see row 2 in Fig.                [4] J. Rauch: Logic of association rules, Applied Intelligence, 22, 9–28, 2005
                                                                                   [5] J. Rauch, Consideration on a Formal Frame for Data Mining, 562–569,
2. The set S(Φ) is given by a formula Φ of language LRAR . The                         Proceedings of Granular Computing 2011, IEEE Computer Society, Pis-
formula Φ is a result of application of a procedure transforming for-                  cataway, 2011.
mulas of LAQ to formulas of LRAR .                                                 [6] J. Rauch and M. Šimůnek, An Alternative Approach to Mining Associa-
   Then, the procedure ASSOC is applied, see row 3 in Fig. 2. Ex-                      tion Rules, 219–238, Data Mining: Foundations, Methods, and Applica-
                                                                                       tions, Springer-Verlag, Berlin, 2005.
perience with the procedure 4ft-Miner, which is an enhanced imple-
                                                                                   [7] J. Rauch and M. Šimůnek, Applying Domain Knowledge in Association-
mentation of the ASSOC procedure, are given in [6, 7]. The applica-                    Rules Mining Process - First Experience, 113–122, Foundations of In-
tion of ASSOC results into a set T rue(S(Φ), M) of all association                     telligent Systems, Springer-Verlag, Berlin, 2011.