=Paper=
{{Paper
|id=None
|storemode=property
|title=Modifying Logic of Discovery for Dealing with Domain Knowledge in
Data Mining
|pdfUrl=https://ceur-ws.org/Vol-672/paper16.pdf
|volume=Vol-672
|dblpUrl=https://dblp.org/rec/conf/cla/Rauch10
}}
==Modifying Logic of Discovery for Dealing with Domain Knowledge in
Data Mining==
Modifying Logic of Discovery for Dealing
with Domain Knowledge in Data Mining
Jan Rauch
Faculty of Informatics and Statistics, University of Economics, Prague ⋆
nám W. Churchilla 4, 130 67 Prague, Czech Republic rauch@vse.cz
Abstract. Logic of discovery was developed in 1970’s as an answer to
questions ”Can computers formulate and justify scientific hypotheses?”
and ”Can they comprehend empirical data and process it rationally, us-
ing the apparatus of modern mathematical logic and statistics to try
to produce a rational image of the observed empirical world?”. Logic
of discovery is based on observational and theoretical languages and on
inductive inference corresponding to statistical approaches. Formulas of
observational language concern analyzed observational data and formu-
las of theoretical language concern suitable state dependent structures.
The goal of the paper is to discuss a possibility to adapt the logic of
discovery to data mining.
1 Introduction
Logic of discovery is developed in the book [2] which starts with questions:(Q1 )
– Can computers formulate and justify scientific hypotheses? (Q2 ) – Can they
comprehend empirical data and process it rationally, using the apparatus of mod-
ern mathematical logic and statistics to try to produce a rational image of the
observed empirical world? Answers are based on a scheme of inductive inference:
theoretical assumptions, observational statement (s)
.
theoretical statement
This schema means that having accepted theoretical assumptions and having ver-
ified observational statements concerning analyzed data, we accept a theoretical
statement. The schema leads to additional questions L0 - L4: (L0) In what lan-
guages does one formulate observational and theoretical statements? (L1) What
are rational inductive inference rules bridging the gap between observational and
theoretical sentences? (L2) Are there rational methods for deciding whether a
theoretical statement is justified? (L3) What are the conditions for a theoretical
statement or a set of theoretical statements to be of interest with respect to the
task of scientific cognition? (L4) Are there methods for suggesting such a set of
statements, which is as interesting (important) as possible?
⋆
This paper was prepared with the support of Institutional funds for support of a
long-term development of science and research at the Faculty of Informatics and
Statistics of University of Economics, Prague.
176 Jan Rauch
Answering questions (LO) – (L2) leads to logic of induction, answers to ques-
tions (L3) and (L4) lead to logic of suggestions. Answers to questions (LO) –
(L4) constitute a logic of discovery. Very detailed answers to questions L0 - L4
are given in the book [2] and the logic of discovery is developed. Observational
and theoretical calculi are developed as languages for observational and theoret-
ical statements respectively. Principles of logic of discovery are briefly outlined
in section 2.
Various observational calculi are defined in [2]. The most studied calculi are
monadic observational predicate calculi. They can be understood as a modifica-
tion of classical predicate calculi – only finite models are allowed and generalized
quantifiers are added. Finite models correspond to observational data we ana-
lyze and generalized quantifiers make possible to express important statements
on observational data. Monadic observational predicate calculi were modified
and significantly simplified in [5] such that they can be understood as a logic
of association rules. Association rules - formulas of these calculi are more gen-
eral than association rules defined in [1]. These generalized association rules are
produced by the procedure 4ft-Miner [7].
Application of domain knowledge in data mining is introduced among 10 chal-
lenging problems of data mining [4], see also http://www.cs.uvm.edu/~icdm/.
Domain knowledge is also referred as background knowledge, world knowledge,
or business knowledge. Results presented in [5] make possible to deal with domain
knowledge in the process of data mining. A way of filtering out consequences of
domain knowledge from results of the 4ft-Miner procedure is outlined in [6]. It
is based on application of logic of association rules [5]. The goal of this paper is
to present (in a given scope) a theoretical elaboration of this approach.
We are going to modify logic of discovery developed in [2]. A modified the-
oretical language is intended to express items of domain knowledge intuitively
understandable to domain experts without experience in data mining. Formu-
las of observational language correspond to patterns produced by data mining
procedures. There is a new way of a correspondence between theoretical and
observational languages.
A set of atomic consequences is assigned to each item of domain knowledge
(i.e. to a formula of the theoretical language). The atomic consequences are
simple formulas of observational calculus such that the assignment can be done
by a domain expert. Deduction rules among observational formulas are then
used to spread the consequences of items of domain knowledge among additional
formulas of observational calculus.
Let us emphasize that the theoretical language expressing domain knowledge
totally differs from the observational language of patterns produced by data
mining procedures. The correspondence between these languages is ensured by
the atomic consequences defined by the domain expert and by deduction rules
in the observational calculus. First we outline general features of this approach
and then we elaborate it for association rules. We call resulting modification of
logic of discovery as logic of mining of association rules. No analogous approach
concerning association rules is known to the author.
Modifying Logic of Discovery for Dealing with Domain Knowledge in ... 177
Logic of discovery is sketched in section 2. Principles of its modification
are discussed in section 3. Modified observational and theoretical languages are
introduced in sections 4 and 5. System 4ft-Discoverer integrating both introduced
theoretical principles and software procedures for dealing with domain knowledge
is discussed in section 6.
2 Logic of Discovery
Semantic system S = hSent, M, V, V ali is determined by a non-empty set
Sent of sentences, a non-empty set M of models, a non-empty set V of ab-
stract values and an evaluating function V al : (Sent × M) → V [2]. If ϕ ∈
Sent and M ∈ M then V al(ϕ, M ) is the value of ϕ in M . Semantic system
S = hSent, M, V, V ali is observational if Sent, M, V are recursive sets and
V al is a partial recursive function [2]. Observational semantic system S O =
hSentO , MO , V O , V alO i corresponding to analyzed data and theoretical seman-
tic system S T = hSentT , U T , V T , V alT i corresponding to the whole set of ob-
jects, we are interested in are developed in [2].
Observational predicate calculus is a result of modifications of predicate cal-
culi - only finite models are allowed and generalized quantifiers are added [2],
see introduction. System of closed formulas of such calculus is an observational
semantic system. Observational predicate calculus with formulas correspond-
ing to association rules was developed in [2]. Question of rationality of induc-
tive inference rule is very important. It is studied in [2] using statistical ap-
proaches. It leads to observational predicate calculi with generalized quantifiers
corresponding to statistical hypothesis tests and to theoretical semantic system
S T = hSentT , U T , V T , V alT i with state dependent structures. However, the more
detailed description is out of the scope of this paper.
Using induction rules based on statistical methods usually means there is 1:1
correspondence between observational and theoretical statements. Thus a task of
a suggestion of interesting theoretical statements can be converted to a task of a
suggestion of interesting observational statements. The GUHA method is defined
in [2] to solve this task. The method is carried out using GUHA procedures. A
GUHA procedure is a computer program the input of which consists of analyzed
data and a set of parameters defining the large set of relevant observational
patterns. Its output is a set of all prime patterns. A pattern is prime if it is
true in the analyzed data, and if it does not logically follow from another output
simples pattern.
The most used GUHA procedure is the ASSOC procedure. It deals with
enhanced association rules. It was several times implemented and many times
applied, see e.g. [3]. One of its implementations is the procedure 4ft-Miner, see
introduction and section 6.1. Implementations of the ASSOC procedure use the-
oretical results (namely deduction rules) concerning observational predicate cal-
culi [2]. Logic of association rules involves additional both theoretically interest-
ing and practically important results [5, 6]. Meaning of the GUHA method and
thus also meaning of logic of discovery for data mining is summarized in [3].
178 Jan Rauch
3 Modifying Logic of Discovery
3.1 Starting points
Our starting points are: (1) Data we are dealing with do not satisfy requirements
for application of statistical approaches. An example of such data is data about
patients of a particular hospital. (2) Objects described by our data belong to a
broader set of objects. An example of such a broader set is a set of residents in
a region to which the hospital belongs. (3) We have various items of knowledge
related to a particular data set. An example is identification of a particular
device used to measure each of the observed patients. (4) We have various items
of knowledge related to the broader set of objects that are not directly recorded
for each object. An example is information on specific vaccination applied in
a region in question. (5) We have various items of general knowledge about
attributes of objects described in our data. An example is a commonly accepted
fact that if weight increases, then blood pressure increases too. (6) We have data
mining procedures, which are able to produce a lot of strong patterns valid in
given data. Examples are the apriori algorithm [1] and the procedure 4ft-Miner,
see section 6.1. Both produce association rules.
(7) Most of the patterns produced by the data mining procedure are un-
interesting because of they are consequences of the above mentioned items of
knowledge. (8) There are groups of patterns hidden in patterns produced by
the data mining procedure such that each of the groups can be considered as a
consequence of a yet not known item of knowledge.
Our goal is to modify logic of discovery such that we will be able to: (I)
use items of knowledge mentioned in points 3 - 5; (II) filter out consequences
of the above introduced items of knowledge from results of mining procedures,
see point 7; (III) recognize a group of patterns, which can be considered as a
consequence of a (yet not known) item of knowledge, see point 8.
We assume to use GUHA procedures as data mining procedures together
with results on a related observational logical semantic system. To achieve re-
quirements (I)–(III) we are going to: (A) Enhance observational semantic system
by features making possible to capture items of knowledge related to particular
data sets, see (3) above. (B) Enhance theoretical semantic system by features
making possible to capture both items of general knowledge and items of knowl-
edge related to the broader set, see (4) and (5) above. (C) Enhance theoretical
semantic system by function Cons assigning to each item I of knowledge accord-
ing to (B) sets of formulas Cons(I) of a corresponding observational system (i.e.
a set of patterns produced by a GUHA data mining procedure). Set Cons(I) is
assumed to be a set of atomic consequences of I. (D) To develop a new analytical
procedure G-FILTER for each GUHA procedure. G-FILTER will filter out all
consequences of given items of knowledge from output of the GUHA procedure.
This way requirement II) will be achieved. (E) To develop a new analytical pro-
cedure G-SYNT for each GUHA procedure. G-SYNT will recognize groups of
patterns, which can be considered as a consequence of a (yet not known) items
of knowledge. This way requirement III) will be achieved.
Modifying Logic of Discovery for Dealing with Domain Knowledge in ... 179
We present principles of modification of logic of discovery using results related
to logic of association rules [5]. We deal with data matrices introduced in section
3.2. The principles of modification of observational and theoretical systems are
outlined in sections 3.3 and 3.4. The procedures G-FILTER and G-SYNT are
presented by description of procedures 4ft-Filter and 4ft-Synt, which are related
to the procedure 4ft-Miner, see section 6.2.
3.2 Data Matrices
We consider data matrices with values – natural numbers only. The natural num-
bers represent categories, i.e. possible values of observed attributes A1 , . . . , AK .
Columns of the data matrix correspond to attributes. Rows correspond to ob-
served objects, e.g. patients. An example of the data matrix is in the left part
of figure 1.
object A1 . . . AK object f1 ... fK
o1 1 ... 6 o1 f1 (o1 ) ... fK (o1 )
.. .. . . .. .. .. .. ..
. . . . . . . .
on 1 ... 1 on f1 (on ) ... fK (on )
Data matrix - informal view Data matrix M = hM, f1 , . . . , fK i
Fig. 1. Data matrix
There is only the finite number of categories for each attribute. Let us assume
that the number of categories in a column is t and that the categories are natural
numbers 1, . . . , t. All values in the data matrix are then described by the numbers
of categories for each column. The whole information on the number of columns
and categories in the data matrix is then given by type of data matrix: A type
of data matrix is a K-tuple T = ht1 , . . . , tK i where ti ≥ 2 are natural numbers
for i = 1, . . . , K.
We use a more formal definition of a data matrix with the number of columns
and the numbers of possible values in particular columns given by the type T =
ht1 , . . . , tK i: A data matrix of the type T is a K + 1-tuple M = hM, f1 , . . . , fK i,
where M is a non-empty finite set and fi is the unary function from M to
{1, . . . , ti } for i = 1, . . . , K. Set M is a set of rows of data matrix M. Set M
is called a domain of data matrix M. We write M = Dom(M). An example
of data matrix M = hM, f1 , . . . , fK i is in the right part of figure 1. We assume
that M = {o1 , . . . , on }.
3.3 Modifying Observational Semantic System
Observational semantic system S T = hMT , LC , V alC , LMT , V alMT i is used instead
of S O = hSentO , MO , V O , V alO i introduced in section 2. Set MT of all data
180 Jan Rauch
matrices M of type T is used instead of MO and two languages – LC and LMT
are used instead of SentO . We always use set {0,1} (i.e. {false, true}) as a set of
possible values of formulas of our languages instead of V O .
LC is a language of a logical calculus formulas of which correspond to patterns
produced by a data mining procedure (i.e. GUHA procedure in our case). There
is an evaluation function V alC : (LC × MT ) → {0, 1}. If ϕ ∈ LC and M ∈ MT
then V alC (ϕ, M) is the value of ϕ in M. If it is V alC (ϕ, M) = 1 then ϕ is true
in M, otherwise ϕ is false in M.
LMT is a language intended to express items of knowledge related to particular
data matrices, see point (3) in section 3.1. It is a set Θ = {θ1 , . . . θR } of formulas
corresponding to features of M, each of them can be true or false. There is an
evaluation function V alMT : (Θ × MT ) → {0, 1}. If it is θ ∈ Θ and M ∈ MT then
V alMT (θ, M) is the value of feature θ for M. If it is V alMT (θ, M) = 1 then M
has feature θ, otherwise M has not feature θ.
3.4 Modifying Theoretical Semantic System
Theoretical semantic system U T = hM , LTM , Consi is used instead of S T =
hSentT , UtV , V T , V alT i introduced in section 2. We assume that theoretical sys-
tem U T = hM , LTM , Consi is related to observational semantic system S T =
S
hMT , LC , V alC , LMT , V alMT i introduced above. M = {Dom(M) | M ∈ MT } is
a union of domains of all data matrices M ∈ MT . Language LTM is intended to
express items of knowledge introduced in points (4) and (5) in section 3.1. There
are lot of such items of knowledge [6, 8], several examples are in section 5.
We use function Cons : (LTM × MT ) → P(LC ). This function assigns to each
couple hI, Mi a set Cons(I, M) of formulas of language LC . Here I ∈ LTM is an
item of knowledge (i.e. a formula of language LTM ) and M ∈ MT is a data matrix
of type T of related observational semantic system S T . The set Cons(I, M) is
considered as a set of all atomic consequences of item I of knowledge in data
matrix M, see point (C) in section 3.1.
4 Observational Semantic System of Association Rules
T
Observational semantic system SAR = hMT , LTAR , V alAR
T
, LMT , V alMT i of type
T = ht1 , . . . , tK i concerning association rules is outlined in this section. Asso-
ciation rules of type T are couples of Boolean attributes created from columns
of data matrices M ∈ MT . Language LTAR of association rules is introduced in
section 4.1, evaluation function V alAR T
in section 4.2. Language LTAR , set MT of
T
data matrices and evaluation function V alAR constitute a logical calculus with
important deduction rules, see section 4.3. All definitions are given informally.
We will not discuss here details of language LMT and related evaluation func-
tion V alMT . They are intended to express characteristics of particular data matri-
ces. Examples: data matrix M1 concerns only pathological patients, data matrix
M2 concerns patients from mountain region, etc. More detailed description is
out of scope of this paper, additional research is assumed.
Modifying Logic of Discovery for Dealing with Domain Knowledge in ... 181
4.1 Language LT
AR of Association Rules
An association rule is expression ϕ ≈ ψ where ϕ and ψ are Boolean attributes
derived from columns of the analyzed data matrix and ≈ is a 4ft-quantifier [5].
Boolean attribute ϕ is called antecedent and ψ is called succedent. 4ft-quantifier
defines relation of ϕ and ψ by associated function F≈ of ≈, see below.
Basic Boolean attributes are created first. The basic Boolean attribute is an
expression A(α) where α ⊂ {a1 , . . . at } and {a1 , . . . at } is the set of all categories
of the attribute A. Here α is a coefficient of A(α). The basic Boolean attribute
A(α) is true in row o of M if it is A(o) ∈ α where A(o) is the value of the
attribute A in row o. Boolean attributes ϕ and ψ are derived from basic Boolean
attributes using connectives ∨, ∧ and ¬ in the usual way. Examples of Boolean
attributes are in figure 2.
M A1 . . . AK A1 (1) AK (2, 6) A1 (1) ∧ AK (2, 6)
o1 1 . . . 6 1 1 1 M ψ ¬ψ
.. .. . . .. .. .. .. ϕ a b
. . . . . . . ¬ϕ a b
on 1 . . . 1 1 0 0
Data matrix and examples of Boolean attributes 4ft(ϕ, ψ, M)
Fig. 2. Derived Boolean attributes and 4ft-table 4ft(ϕ, ψ, M)
An example of association rule is expression A1 (1) ∧ A2 (4, 5) ≈ AK (2, 6).
Note that ϕ and ψ in ϕ ≈ ψ have no common attributes.
T
4.2 Evaluation Function V alAR
Association rule ϕ ≈ ψ can be true or false in a given data matrix M ∈ MT . Rule
ϕ ≈ ψ is verified on the basis of a four-fold table 4f t(ϕ,ψ, M) of ϕ and ψ in M,
see figure 2. Here a is the number of the objects (i.e. the rows of M) satisfying
both ϕ and ψ, b is the number of the objects satisfying ϕ and not satisfying ψ,
and similarly for c and d, see figure 2. Four-fold table 4f t(ϕ,ψ, M) is written as
ha, b, c, di and called 4ft-table.
T
Evaluation function V alAR assigns a value 0 or 1 to each couple hϕ ≈ ψ, Mi
where ϕ ≈ ψ is the association rule and M ∈ MT . If V alAR T
(ϕ ≈ ψ, M) = 1
T
then we say that rule ϕ ≈ ψ is true in M and if V alAR (ϕ ≈ ψ, M) = 0 then we
T
say that rule ϕ ≈ ψ is false in M. V alAR (ϕ ≈ ψ, M) is defined using 4ft-table
4ft(ϕ, ψ, M) of ϕ and ψ in M and associated function F≈ of ≈.
Associated function F≈ of 4ft quantifier ≈ is a {0, 1} - valued function de-
fined for all quadruples ha, b, c, di of natural numbers. Value V al(ϕ ≈ ψ, M)
of association rule ϕ ≈ ψ in data matrix M ∈ MT is defined such that
V al(ϕ ≈ ψ, M) = F≈ (a, b, c, d) where ha, b, c, di = 4f t(ϕ, ψ, M). Examples of
4ft-quantifiers and their associated functions are in table 1 where 0 < p ≤ 1 and
0 < α < 0.5 are real numbers, B > 0 is an integer number.
182 Jan Rauch
4ft-quantifier Associated function F≈ (a, b, c, d)
Name Symbol ≈ F≈ (a, b, c, d) = 1 iff
a
Founded implication ⇒p,B ≥p∧a≥B
Pr `r´a+b
i
Lower critical implication ⇒!p,α,B i=a i (1 − p)r−i ≤ α ∧ a ≥ B
a+d
Founded equivalence ≡p,B a+b+c+d
≥p∧a≥B
Pmin(r,k) (ki)(n−k
r−i )
Fisher ≈α,B ≤α∧a≥B
i=a (nr )
Above average dependence ∼+
q,B
a
a+b
a+c
≥ (1 + q) a+b+c+d ∧ a ≥ B
Table 1. Examples of 4ft-quantifiers
4.3 Deduction Rules in Logical Calculus of Association Rules
Language LTAR , set of data matrices MT and evaluation function V alAR T
con-
stitute a logical calculus of association rules [5]. There are both theoretically
interesting and practically useful results concerning logical calculi of associa-
tion rules. Most of them are related to classes of 4ft-quantifiers [5]. An exam-
ple of a class of 4ft-quantifiers is the class of implicational 4ft-quantifiers. It is
defined such that 4ft-quantifier ≈ is implicational if it satisfies the condition:
if F≈ (a, b, c, d) = 1 ∧ a′ ≥ a ∧ b′ ≤ b then also F≈ (a′ , b′ , c′ , d′ ) = 1. Both 4ft-
quantifiers ⇒p,B and ⇒!p,α,B (see Table 1) are implicational.
Criteria of soundness of deduction rules ϕϕ≈ψ′ ≈ψ ′ where both ϕ ≈ ψ and ϕ ≈ ψ
′ ′
are association rules were found [5]. Criteria are related to important classes of
association rules. We outline such criterion for the class of interesting impli-
cational quantifiers. All practically important implicational 4ft-quantifiers are
interesting implicational quantifiers.
If ⇒∗ is an interesting implicational quantifier then there are formulas ω1A ,
ω1B , ω2 of
∗
propositional calculus created from ϕ, ψ, ϕ′ , ψ ′ so that the deduction
rule ϕϕ⇒ ψ
′ ⇒∗ ψ ′ is sound if and only if at least one of the following conditions (1),
(2) are satisfied: (1) – both ω1A and ω1B are tautologies, (2) – ω2 is a tautology.
Similar theorems are proved for additional important classes of 4ft-quantifiers
[5]. These results are crucial, see section 6.2.
5 Theoretical Semantic System of Association Rules
T
Theoretical semantic system UAR = hM , LTM , ConsTAR i related to observational
T T T T
semantic system SAR = hM , LAR , V alAR , LMT , V alMT i of type T = ht1 , . . . , tK i
concerning generalized association rules is sketched in this section.
Set M and language LTM are introduced in section 3.4. Language LTM is
intended to express items of knowledge introduced in points (4) and (5) in section
3.1. We give four important examples of formulas of language LTM . Then we
outline how function ConsTAR is constructed for one of these examples.
The examples are: A ↑↑ B, A ↑↓ B, A →+ ω, and ω1 →+ ω2 . Here A is one
of attributes A1 , . . . , AK of language LTAR , the same is true for B. In addition,
Modifying Logic of Discovery for Dealing with Domain Knowledge in ... 183
ω, ω1 , ω2 are Boolean attributes of LTAR and ω does not contain attribute A.
Intuitive meaning of particular formulas:
– A ↑↑ B means if A increases then B increases too
– A ↑↓ B means if A increases then B decreases
– A →+ ω means if A increases then relative frequency of ω increases
– ω1 →+ ω2 means if ω1 is satisfied then relative frequency of ω2 increases.
We show how function ConsTAR creates a set ConsTAR (A ↑↑ B, M) of association
rules – formulas of language LTAR which can be considered as a set of all atomic
consequences of A ↑↑ B in data matrix M. Function ConsTAR can be seen as a
family of functions ConsT≈ where ≈ is a 4ft-quantifier of language LTAR .
Function ConsT≈ creates a set ConsT≈ (A ↑↑ B, M) of association rules –
formulas of language LTAR such that this set can be considered as a set of all
atomic consequences of A S ↑↑ B of the form ρ ≈ σ in data matrix M. Then it is
ConsTAR (A ↑↑ B, M) = {ConsT≈ (A ↑↑ B, M) | ≈ belongs to LTAR } .
We outline function ConsT⇒p,B for 4ft-quantifier ⇒p,B of founded implication
(see Table 1) and item A ↑↑ B of domain knowledge. Functions ConsT≈ for
additional 4ft-quantifiers and formulas of LTM are defined similarly [6].
We assume that attribute A has categories 1, . . . , u and attribute B has
categories 1, . . . , v. Our task is to define a set of rules ρ ⇒p,B σ which can be
naturally considered as a set of all the consequences of item A ↑↑ B and which
are as simple as possible. In this case, we consider the simplest rules to be in the
form A(α) ⇒p,B B(β) where α ⊂ {1, . . . , u} and β ⊂ {1, . . . , v}.
Rule A(low) ⇒p,B B(low) stating that ”if A is low then B is low” can be
understood as a natural consequence of A ↑↑ B. The only problem is to define the
coefficients α and β that can be understood as ”low”. This can be done so that
we choose natural Alow , 1 < Alow < u and natural Blow , 1 < Blow < v and then
we consider α as ”low” iff α ⊂ {1, . . . , Alow } and β as ”low” iff β ⊂ {1, . . . , Blow },
see also section 6.1.
Also rule A(high) ⇒p,B B(high) stating that ”if A is high then B is high” can
be understood as a natural consequence of A ↑↑ B. The coefficients α and β can
be defined as ”high” in the following way. We choose natural Ahigh , 1 < Alow <
Ahigh < u and natural Bhigh , 1 < Blow < Bhigh < v and then we consider α as
”high” iff α ⊂ {Ahigh , . . . , v} and β as ”high” iff β ⊂ {Bhigh , . . . , v}.
It remains to define values of parameters p and B of ⇒p,B . A possibility is to
n
allow each p ≥ 0.9 and B ≥ 20 where n is the number of rows of data matrix M.
However, boundaries of p and B as well as values Alow , Ahigh , Blow , Bhigh should
be determined by a domain expert. The set of rules A(low) ⇒p,B B(low) and
A(high) ⇒p,B B(high) satisfying the above given conditions can be considered
as ConsT⇒p,B (A ↑↑ B, M) – a set of atomic consequences of A ↑↑ B of the form
ρ ⇒p,B σ in M.
Set ConsT⇒p,B (A ↑↑ B, M) can be defined in a more precise way by adding
rules A(medium) ⇒p,B B(medium) with a suitable definition of ”medium”. Rules
A(low, medium) ⇒p,B B(medium), A(low, medium) ⇒p,B B(medium, high), and
A(medium) ⇒p,B B(medium, high) can also be added.
184 Jan Rauch
Note: there is a natural requirement on the reasonable consistency of set
ConsTAR (A ↑↑ B, M) of atomic consequences of the A ↑↑ B i.e. there cannot
be two atomic consequences ρ1 ≈ σ1 and ρ2 ≈ σ2 that contradict each other. A
detailed discussion of this topic is, however, without the scope of this paper.
6 4ft-Discoverer
T
We have defined observational system SAR = hMT , LTAR , V alAR
T
, LMT , V alMT i and
T T T
related theoretical system UAR = hM , LM , ConsAR i. The goal of this section is
to discuss possibilities of implementation of a theoretical framework for dealing
with domain knowledge involved in these systems.
We use the GUHA procedure 4ft-Miner mining for association rules - couples
of Boolean attributes created from columns of data matrices M ∈ MT [7]. The
4ft-Miner procedure has very fine tools to define a set of association rules to
be generated and verified. It deals, among other, with basic Boolean attributes
A(α), B(β), A(low), B(low) etc. Main features of 4ft-Miner procedure are out-
lined in section 6.1.
Intention to develop two additional analytical procedures G-FILTER and G-
SYNT related to each GUHA procedure is announced in points (D) and (E) in
section 3.1. We present principles of procedures 4ft-Filter and 4ft-Synt related
to 4ft-Miner procedure. The 4ft-Filter procedure is intended to filter out conse-
quences of a given item of domain knowledge from the output of 4ft-Miner. Item
of domain knowledge is expressed by a formula of LMT . The 4ft-Synt procedure
is intended to recognize groups of patterns which can be considered as a conse-
quence of a yet not known item of knowledge. Principles of both procedures are
introduced in section 6.2.
T T
Semantic systems SAR and UAR together with the procedures 4ft-Miner,
4ft-Filter, and 4ft-Synt constitute a framework for a process of data mining
for association rules based on domain knowledge. We call this framework 4ft-
Discoverer, i.e. 4f tD. It is
T T
4f tD = hSAR , UAR , 4ft-Miner, 4ft-Filter, 4ft-Synt i .
6.1 4ft-Miner
The 4ft-Miner procedure mines for association rules ϕ ≈ ψ, see section 4. In-
put parameters define 4ft-quantifier ≈, set of relevant antecedents Φ and set of
relevant succedents Ψ ; we assume ϕ ∈ Φ and ψ ∈ Ψ .
Each antecedent is a conjunction τ1 ∧. . .∧τm of partial antecedents τ1 , . . . , τm .
Each partial antecedent is either a conjunction λ1 ∧ . . . ∧ λq or a disjunction
λ1 ∨ . . . ∨ λq of literals λ1 , . . . , λq . Each literal is a basic Boolean attribute A(α)
or its negation ¬A(α). Definition of a set of relevant antecedents Φ consists of
definitions of relevant partial antecedents Φ1 , . . . , Φm . Conjunction τ1 ∧ . . . ∧ τm
is a relevant antecedent if τ1 ∈ Φ1 , . . . , τm ∈ Φm .
Definition of a relevant partial antecedent is given by a list A′1 , . . . , A′u of
attributes, by a minimal and maximal number of literals in particular partial
Modifying Logic of Discovery for Dealing with Domain Knowledge in ... 185
antecedents and by a type of partial antecedent, i.e. conjunctions or disjunctions.
In addition, for each attribute A′ a set of relevant basic Boolean attributes which
are automatically generated is defined. There are various detailed possibilities
how to define all relevant basic Boolean attributes A′ (α) [7]. We outline only
one of them. We use attribute A with categories 1, 2, 3, 4, 5. Option intervals
of length 2-3 gives basic Boolean attributes A(1, 2), A(2, 3), A(3, 4), A(4, 5),
A(1, 2, 3), A(2, 3, 4), A(3, 4, 5). This way we can get basic Boolean attributes
A(low), A(high), B(low), B(high), see section 5.
Set Ψ of relevant succedents is defined analogously. The 4ft-Miner procedure
does not use well known a-priori algorithm [1]. Its implementation is based on
representation of analyzed data by suitable strings of bits [7]. Its performance
is good enough to solve a lot of practically important tasks. A detailed study of
its time and space complexity is in [7].
6.2 4ft-Filter and 4ft-Synt
The 4ft-Filter procedure is intended to filter out consequences of a given item of
domain knowledge from association rules produced by the 4ft-Miner procedure.
Item of domain knowledge is represented by a formula I ∈ LTM , see section 3.4.
Function Is4f tConsequence(I, ϕ ≈ ψ, M) defined for all formulas I ∈ LTM ,
association rules ϕ ≈ ψ ∈ LTAR and data matrices M ∈ MT can be used to
realize the 4ft-Filter procedure. It is Is4f tConsequence(I, ϕ ≈ ψ, M) = 1
if the rule ϕ ≈ ψ can be considered as a consequence of I, otherwise it is
Is4f tConsequence(I, ϕ ≈ ψ, M) = 0.
Value Is4f tConsequence(I, ϕ ≈ ψ, M) is computed using function ConsTAR ,
see section 5 and using deduction rules ϕϕ≈ψ
′ ≈ψ ′ , see section 4.3. There are criteria
of correctness of rules ϕϕ≈ψ
′ ≈ψ ′ for each 4ft-quantifier ≈ of 4ft-Miner procedure
[5, 7]. Function ConsTAR is defined for all I ∈ LTM , and M ∈ MT such that
ConsTAR (I, M) = Λ and Λ is a set of all association rules ρ ≈ σ which can be
considered as atomic consequences of I in M.
Value Is4f tConsequence(I, ϕ ≈ ψ, M) is computed in two steps. In the
first step, we compute set Λ = ConsTAR (I, M). In the second step, we test
ρ≈σ
correctness of ϕ≈ψ for each ρ ≈ σ ∈ Λ. If there is such a correct rule, then
ϕ ≈ ψ is considered as a consequence of I in M and Is4f tConsequence(I, ϕ ≈
ψ, M) = 1. Otherwise Is4f tConsequence(I, ϕ ≈ ψ, M) = 0.
Function Is4f tConsequence(I, ϕ ≈ ψ, M) can also be used to realize the
procedure 4ft-Synt which recognizes group of rules ϕ ≈ ψ, which can be con-
sidered as consequences of (yet unknown) items of knowledge. We assume that
each even yet unknown item of knowledge is represented by a formula of lan-
guage LTM . The procedure 4ft-Synt can be then realized such that we choose
formula ω ∈ LTM and using function Is4f tConsequence(ω, ϕ ≈ ψ, M) we pick
up all consequences of ω from output of 4ft-Miner procedure. However, we have
somehow to limit set of tested formulas ω ∈ LTM . A more detailed study of this
problem is out of the scope of this paper.
186 Jan Rauch
7 Conclusions
The goal of this paper was to elaborate theoretically an approach for dealing with
domain knowledge in mining association rules. It was done by modifications of
logic of discovery developed in [2]. General requirements for such modifications
were discussed in section 3.1.
T T
Then a framework 4f tD = hSAR , UAR , 4ft-Miner, 4ft-Filter, 4ft-Synt i
for dealing with domain knowledge when mining association rules is described.
Association rules are understood as interesting couples of Boolean attributes
derived from columns of the analyzed data matrix. The Boolean attributes are
derived from basic Boolean attributes by connectives ∧, ∨, ¬, see section 4.1. The
general form of the basic Boolean attributes is A(α). Here α is automatically
generated subset of categories of A. It makes possible to deal with notions like
A(low) and B(high). Implemented procedure 4ft-Miner produces such association
rules [7].
The presented approach relates these rules to items of domain knowledge like
A ↑↑ B concerning non-Boolean attributes, see section 5. The procedures 4ft-
Filter and 4ft-Synt are suggestedto deal with such items of domain knowledge
when interpreting results of 4ft-Miner. They are being implemented.
No similar approach concerning association rules is known to the author.
However, a comparison of the presented approach with ways of dealing with
domain knowledge in additional data mining areas is still a challenge and a
subject of further work. It requires both theoretical study and experiments with
4ft-Discoverer after its implementation.
References
1. Aggraval R. et al (1996) Fast Discovery of Association Rules. In: Fayyad U.M. et
al (eds) Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo
Park
2. Hájek P., Havránek T. (1978) Mechanizing Hypothesis Formation(Mathematical
Foundations for a General Theory), Springer–Verlag 1978
3. Hájek P., Holeňa M., Rauch J. (2010) The GUHA method and its meaning for data
mining. Journal of Computer and System Science, 76, pp. 34 – 48.
4. Yang Q., Wu X. (2006) 10 Challenging Problems in Data Mining Research, Inter-
national Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006,
pp. 597 – 604.
5. Rauch J. (2005) Logic of Association Rules. Applied Intelligence 22, pp. 9 – 28.
6. Rauch J. (2009) Considerations on Logical Calculi for Dealing with Knowledge in
Data Mining. In Ras Z. W., Dardzinska A. (Eds.): Advances in Data Management.
Springer, 2009, pp. 177 – 202
7. Rauch J., Šimůnek M (2005) An Alternative Approach to Mining Association Rules.
In: Lin T. Y. et al. (eds) Data Mining: Foundations, Methods, and Applications,
Springer-Verlag, pp. 219 – 238
8. Rauch J., Šimůnek M.(2009) Dealing with Background Knowledge in the SEWE-
BAR Project. In: Berendt B. et al.: Knowledge Discovery Enhanced with Semantic
and Social Information. Berlin, Springer-Verlag, 2009, pp. 89 – 106