<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Formal Frame for Data Mining with Association Rules - a Tool for Workflow Planning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Rauch</string-name>
          <email>rauch@vse.cz</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Sˇ im u˚nek</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>and The goal of this extended abstract is to contribute to the forum for research on construction of data mining workflows. We briefly introduce a formal framework called FOFRADAR (FOrmal FRAmework for Data mining with Association Rules) and then we outline how it can be used to control a workflow of data mining with association rules. We consider this relevant to associative classifiers that use association rule mining in the training phase [3]. We deal with association rules ' where ' and are general Boolean attributes derived from columns of analyzed data matrices. Symbol is called 4ft-quantifier and it stands for a condition concerning a contingency table of ' and [6]. Such rules are more general than rules introduced in [1]. We consider data mining process as described by the well known CRISP-DM methodology. The FOFRADAR is introduced in [5]. Its goal is to formally describe a data mining process such that domain knowledge can be used both in formulation of reasonable analytical questions and in interpretation of resulting set of association rules. No similar approach to dealing with domain knowledge in data mining is known to the authors. An application of the FOFRADAR in data mining workflows is outlined here for the first time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
    </sec>
    <sec id="sec-2">
      <title>FOFRADAR</title>
    </sec>
    <sec id="sec-3">
      <title>FOFRADAR is based on a logical calculus LC of association rules.</title>
      <p>
        Formulas of LC correspond to the association rules ' [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such
rules are evaluated in data matrices rows of which correspond to
observed objects o1; : : : ; on and columns correspond to observed
attributes A1; : : : ; AK . We assume that Ai has a finite number ti 2
of possible values 1; : : : ; ti (i.e. categories) and Ai(oj ) is a value of
Ai in row oj for i = 1; : : : ; K and j = 1; : : : ; n.
      </p>
      <p>Boolean attributes ', are derived from basic Boolean attributes
i.e expressions Ai( ) where f1; : : : ; tig. A basic Boolean
attribute Ai( ) is true in a row oj of a given data matrix M if
Ai(oj ) 2 , otherwise it is false. Thus, we do not deal only
with Boolean attributes - conjunctions of attribute-value pairs Ai(a)
where a 2 f1; : : : ; tig but we use general Boolean attributes derived
by connectives ^; _; : from columns of a given data matrix.</p>
      <p>The 4ft-table 4f t(', , M) of ' and in a data matrix M is a
quadruple ha; b; c; di where a is the number of rows of M satisfying
both ' and , b is the number of rows satisfying ' and not satisfying
, c is the number of rows not satisfying ' and satisfying and d
is the number of rows satisfying neither ' nor . A f0; 1g-valued
associated function F (a; b; c; d) is defined for each 4ft-quantifier
. The rule ' is true in a data matrix M if F (a; b; c; d) = 1
where ha; b; c; di = 4f t(', , M), otherwise it is false in M.</p>
      <p>
        Expression A1(1; 2; 3) _ A2(4; 6) )p;B A3(8; 9) ^ A4(1) is an
example of association rule, )p;B is a 4ft-quantifier of founded
implication. It is F)p;B (a; b; c; d) = 1 if and only if a+ab p ^ a B
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. There are various additional 4ft-quantifiers defined in [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ].
      </p>
      <p>
        A deduction rule ''0 0 is correct if the following is true for each
data matrix M: if ' is true in M then also '0 0 is true in
M. There are reasonable criteria making possible to decide if ''0 0
is a correct deduction rule [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>FOFRADAR consists of a logical calculus LC of association rules
and of several mutually related languages and procedures used to
formalize both items of domain knowledge and important steps in
the data mining process. They are shortly introduced below, relations
of some of them to the CRISP-DM are sketched in Fig. 1.
Language LDK of domain knowledge – formulas of LDK
correspond to items of domain knowledge. A formula A1 "" A11
meaning that if A1 increases then A11 increases too is an example. We
consider formulas of LDK as results of business understanding.</p>
      <p>Language LDt of data knowledge – its formulas can be
considered as results of data understanding. An example is information that
90 per cent of observed patients are men.</p>
    </sec>
    <sec id="sec-4">
      <title>Language LAQ of analytical questions – the expression</title>
      <p>[M : A1; : : : ; A10 ? A11; : : : ; A20; 6! A1 "" A11] is an example
of a formula of LAQ. It corresponds to a question Q1: In data
matrix M, are there any relations between combinations of values
of attributes A1; : : : ; A10 and combinations of values of attributes
A11; : : : ; A20 which are not consequences of A1 "" A11?</p>
    </sec>
    <sec id="sec-5">
      <title>Language LRAR of sets of relevant association rules – each for</title>
      <p>mula of LRAR defines a set S( ) of rules relevant to a given
analytical question. The set S( ) relevant to Q1 can consist of rules
' )0:9;100 where ' is a conjunction of some of basic Boolean
attributes A1( 1); : : : ; A10( 10), similarly for and A11; : : : ; A20.
Here 1 can be e.g. any interval of maximal 3 consecutive categories,
similarly for additional basic Boolean attributes.</p>
      <p>
        Procedure ASSOC – its input consists of a formula of LRAR
and of an analyzed data matrix M. Output of the ASSOC procedure
is a set T rue(S( ); M) of all rules ' belonging to S( ) which
are true in M. The procedure 4ft-Miner [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is an implementation of
ASSOC. It deals with a very sophisticated language LRAR.
      </p>
      <p>
        Procedure Cons – this procedure maps a formula of LDK to a
set Cons( ; ) of association rules ' which can be considered
as consequences of . The set Cons(A1 "" A11; )p;B ) is a set of
all rules ' )p;B for which '!))pp;;BB is a correct deduction rule and
! )p;B is an atomic consequence of Cons(A1 "" A11). Rules
A1(low) )p;B A11(low)) and A1(high) )p;B A11(high) are
examples of atomic consequences of A1 "" A11, low and high are
suitable subsets of categories of A1 and A11. Some additional rules
can also be considered as belonging to Cons(A1 "" A11; )p;B ),
see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for details.
      </p>
      <p>Language LConcl – formulas of this language correspond to
conclusions which can be made on the basis of the set T rue(S( ); M)
produced by the ASSOC procedure. Two examples of such
conclusions follow. (1): All rules in T rue(S( ); M) can be considered
as consequences of known items of domain knowledge A1 "" A11
or A2 "" A19. (2): Lot of rules from T rue(S( ); M) can be
considered as consequences of yet unknown item of domain knowledge
A9 "" A17.</p>
      <p>
        There are additional procedures belonging to FOFRADAR, they
transform formulas of a particular language to formulas of another
language of FOFRADAR [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
3
      </p>
    </sec>
    <sec id="sec-6">
      <title>FOFRADAR and Workflow of Data Mining</title>
      <p>To keep things simple and the explanation concise we assume that
the analyzed data matrix M is given as a result of necessary
transformations. In addition, we assume that a set DK of formulas of the
language LDK and a set DtK of formulas of the language LDt are
given. A workflow of data mining with association rules can be then
described according to Fig. 2.</p>
      <p>The first row in Fig. 2 means that an analytical question Q which
can be solved by the procedure ASSOC is formulated using set DK
of formulas of the language LDK . The set DtK of formulas of the
language LDK can also be used to formulate reasonable analytical
questions.</p>
      <p>A solution of Q starts with a definition of a set S( ) of relevant
association rules which have to be verified in M, see row 2 in Fig.
2. The set S( ) is given by a formula of language LRAR. The
formula is a result of application of a procedure transforming
formulas of LAQ to formulas of LRAR.</p>
      <p>
        Then, the procedure ASSOC is applied, see row 3 in Fig. 2.
Experience with the procedure 4ft-Miner, which is an enhanced
implementation of the ASSOC procedure, are given in [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The
application of ASSOC results into a set T rue(S( ); M) of all association
1 Formulate_Analytical_Question
      </p>
      <p>Define_Set_of_Relevant_Rules
2 Apply ASSOC</p>
      <p>Apply CONCL
IF Continue_ASSOC THEN</p>
      <p>BEGIN
Modify Set_of_Relevant_Rules
GOTO 2</p>
      <p>END
IF Continue_Analysis THEN GOTO 1</p>
      <p>STOP
rules ' which belong to S( ) and which are true in M.</p>
      <p>A next step is interpretation of the set T rue(S( ); M). This is
realized by the procedure CON CL, see row 4 in Fig. 2. Consequences
of particular items of domain knowledge are used which means that
the procedure Cons is applied and several formulas of the language
LConcl are produced by the procedure CON CL.</p>
      <p>One of formulas produced by CON CL is a simple Boolean
variable Continue ASSOC. If its value is setup as true, then the set
S( ) of relevant association rules which have to be verified is
modified and the process of solution of the analytical question Q
continues, ses rows 5 – 9 in Fig. 2. The modification of the set S( )
is done by the procedure M odif ySet of Relevant Rules which
uses experience from applications of the 4ft-Miner procedure.</p>
      <p>If the value of Continue ASSOC is setup to f alse, then the
process of solution of the particular analytical question is terminated.
Then a procedure Continue Analysis is used to decide if an
additional analytical question will be generated and solved.</p>
      <p>
        There are first experiences with ”manually driven” processes
corresponding to procedures used in Fig. 2. The most important is
experience with process corresponding to the CON CL procedure, see
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We believe to get enough experience to run the whole data
workflow process automatically as outlined in Fig. 2.
      </p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGEMENTS</title>
      <p>The work described here has been supported by Grant No.
201/08/0802 of the Czech Science Foundation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aggraval</surname>
          </string-name>
          et al.,
          <source>Fast Discovery of Association Rules</source>
          ,
          <fpage>307</fpage>
          -
          <lpage>328</lpage>
          ,
          <article-title>Advances in Knowledge Discovery and Data Mining</article-title>
          , AAAI Press, Menlo Park,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ha</surname>
          </string-name>
          <article-title>´jek and T</article-title>
          . Havra´nek,
          <source>Mechanising Hypothesis Formation - Mathematical Foundations for a General Theory</source>
          , Springer, Berlin,
          <year>1978</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jalali-Heravi</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zaiane</surname>
          </string-name>
          ,
          <article-title>A study on interestingness measures for associative classifiers</article-title>
          ,
          <fpage>1039</fpage>
          -
          <lpage>1046</lpage>
          ,
          <source>Proceedings of the 2010 ACM Symposium on Applied Computing</source>
          , ACM New York,
          <year>2010</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rauch</surname>
          </string-name>
          : Logic of association rules,
          <source>Applied Intelligence</source>
          ,
          <volume>22</volume>
          ,
          <fpage>9</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rauch</surname>
          </string-name>
          ,
          <source>Consideration on a Formal Frame for Data Mining</source>
          ,
          <fpage>562</fpage>
          -
          <lpage>569</lpage>
          ,
          <source>Proceedings of Granular Computing</source>
          <year>2011</year>
          , IEEE Computer Society, Piscataway,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rauch</surname>
          </string-name>
          and
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Sˇ imu˚nek, An Alternative Approach</article-title>
          to Mining Association Rules,
          <fpage>219</fpage>
          -
          <lpage>238</lpage>
          , Data Mining: Foundations, Methods, and Applications, Springer-Verlag, Berlin,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rauch</surname>
          </string-name>
          and
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Sˇ imu˚nek, Applying Domain Knowledge in AssociationRules Mining Process</article-title>
          - First
          <string-name>
            <surname>Experience</surname>
          </string-name>
          ,
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          , Foundations of Intelligent Systems, Springer-Verlag, Berlin,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>