<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Modifying Logic of Discovery for Dealing with Domain Knowledge in Data Mining</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Jan</forename><surname>Rauch</surname></persName>
							<email>rauch@vse.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Informatics and Statistics</orgName>
								<orgName type="institution">University of Economics</orgName>
								<address>
									<addrLine>nám. W. Churchilla 4</addrLine>
									<postCode>130 67</postCode>
									<settlement>Prague</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Modifying Logic of Discovery for Dealing with Domain Knowledge in Data Mining</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A5911CF255EF886FD0ADB06E74CB37FB</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Logic of discovery was developed in the 1970s as an answer to the questions "Can computers formulate and justify scientific hypotheses?" and "Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world?". Logic of discovery is based on observational and theoretical languages and on inductive inference corresponding to statistical approaches. Formulas of the observational language concern the analyzed observational data, and formulas of the theoretical language concern suitable state-dependent structures. The goal of the paper is to discuss a possibility of adapting the logic of discovery to data mining.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Logic of discovery is developed in the book <ref type="bibr" target="#b1">[2]</ref>, which starts with the questions: (Q1) Can computers formulate and justify scientific hypotheses? (Q2) Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world? The answers are based on a scheme of inductive inference: theoretical assumptions, observational statement(s) / theoretical statement.</p><p>This scheme means that having accepted theoretical assumptions and having verified observational statements concerning the analyzed data, we accept a theoretical statement. The scheme leads to additional questions (L0)–(L4): (L0) In what languages does one formulate observational and theoretical statements? (L1) What are rational inductive inference rules bridging the gap between observational and theoretical sentences? (L2) Are there rational methods for deciding whether a theoretical statement is justified? (L3) What are the conditions for a theoretical statement or a set of theoretical statements to be of interest with respect to the task of scientific cognition? (L4) Are there methods for suggesting such a set of statements, which is as interesting (important) as possible?</p><p>Answering questions (L0)–(L2) leads to a logic of induction; answers to questions (L3) and (L4) lead to a logic of suggestions. Answers to questions (L0)–(L4) constitute a logic of discovery. Very detailed answers to questions (L0)–(L4) are given in the book <ref type="bibr" target="#b1">[2]</ref>, where the logic of discovery is developed. Observational and theoretical calculi are developed as languages for observational and theoretical statements respectively. Principles of logic of discovery are briefly outlined in section 2.</p><p>Various observational calculi are defined in <ref type="bibr" target="#b1">[2]</ref>. 
The most studied calculi are monadic observational predicate calculi. They can be understood as a modification of classical predicate calculi: only finite models are allowed and generalized quantifiers are added. Finite models correspond to the observational data we analyze, and generalized quantifiers make it possible to express important statements about observational data. Monadic observational predicate calculi were modified and significantly simplified in <ref type="bibr" target="#b4">[5]</ref> such that they can be understood as a logic of association rules. Association rules (formulas of these calculi) are more general than the association rules defined in <ref type="bibr" target="#b0">[1]</ref>. These generalized association rules are produced by the procedure 4ft-Miner <ref type="bibr" target="#b6">[7]</ref>.</p><p>Application of domain knowledge in data mining is listed among the 10 challenging problems of data mining <ref type="bibr" target="#b3">[4]</ref>, see also http://www.cs.uvm.edu/~icdm/. Domain knowledge is also referred to as background knowledge, world knowledge, or business knowledge. Results presented in <ref type="bibr" target="#b4">[5]</ref> make it possible to deal with domain knowledge in the process of data mining. A way of filtering out consequences of domain knowledge from results of the 4ft-Miner procedure is outlined in <ref type="bibr" target="#b5">[6]</ref>. It is based on the application of the logic of association rules <ref type="bibr" target="#b4">[5]</ref>. The goal of this paper is to present (in a given scope) a theoretical elaboration of this approach.</p><p>We are going to modify the logic of discovery developed in <ref type="bibr" target="#b1">[2]</ref>. The modified theoretical language is intended to express items of domain knowledge intuitively understandable to domain experts without experience in data mining. Formulas of the observational language correspond to patterns produced by data mining procedures. 
There is a new way of establishing a correspondence between the theoretical and observational languages.</p><p>A set of atomic consequences is assigned to each item of domain knowledge (i.e. to a formula of the theoretical language). The atomic consequences are simple formulas of the observational calculus, such that the assignment can be done by a domain expert. Deduction rules among observational formulas are then used to spread the consequences of items of domain knowledge to additional formulas of the observational calculus.</p><p>Let us emphasize that the theoretical language expressing domain knowledge totally differs from the observational language of patterns produced by data mining procedures. The correspondence between these languages is ensured by the atomic consequences defined by the domain expert and by deduction rules in the observational calculus. First we outline general features of this approach and then we elaborate it for association rules. We call the resulting modification of logic of discovery the logic of mining of association rules. No analogous approach concerning association rules is known to the author. Logic of discovery is sketched in section 2. Principles of its modification are discussed in section 3. The modified observational and theoretical languages are introduced in sections 4 and 5. The system 4ft-Discoverer, integrating both the introduced theoretical principles and software procedures for dealing with domain knowledge, is discussed in section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Logic of Discovery</head><p>A semantic system S = ⟨Sent, M, V, V al⟩ is determined by a non-empty set Sent of sentences, a non-empty set M of models, a non-empty set V of abstract values, and an evaluating function</p><formula xml:id="formula_0">V al : (Sent × M) → V</formula><p><ref type="bibr" target="#b1">[2]</ref>. If ϕ ∈ Sent and M ∈ M then V al(ϕ, M) is the value of ϕ in M. A semantic system S = ⟨Sent, M, V, V al⟩ is observational if Sent, M, V are recursive sets and V al is a partial recursive function <ref type="bibr" target="#b1">[2]</ref>. An observational semantic system S O = ⟨Sent O , M O , V O , V al O ⟩ corresponding to the analyzed data and a theoretical semantic system S T = ⟨Sent T , U T , V T , V al T ⟩ corresponding to the whole set of objects we are interested in are developed in <ref type="bibr" target="#b1">[2]</ref>.</p><p>An observational predicate calculus is a result of modifications of predicate calculi: only finite models are allowed and generalized quantifiers are added <ref type="bibr" target="#b1">[2]</ref>, see the introduction. The system of closed formulas of such a calculus is an observational semantic system. An observational predicate calculus with formulas corresponding to association rules was developed in <ref type="bibr" target="#b1">[2]</ref>. The question of the rationality of inductive inference rules is very important. It is studied in <ref type="bibr" target="#b1">[2]</ref> using statistical approaches. It leads to observational predicate calculi with generalized quantifiers corresponding to statistical hypothesis tests and to a theoretical semantic system S T = ⟨Sent T , U T , V T , V al T ⟩ with state-dependent structures. However, a more detailed description is out of the scope of this paper.</p><p>Using induction rules based on statistical methods usually means that there is a 1:1 correspondence between observational and theoretical statements. 
Thus the task of suggesting interesting theoretical statements can be converted to the task of suggesting interesting observational statements. The GUHA method is defined in <ref type="bibr" target="#b1">[2]</ref> to solve this task. The method is carried out by GUHA procedures. A GUHA procedure is a computer program the input of which consists of analyzed data and a set of parameters defining a large set of relevant observational patterns. Its output is the set of all prime patterns. A pattern is prime if it is true in the analyzed data and if it does not logically follow from another, simpler output pattern.</p><p>The most used GUHA procedure is the ASSOC procedure. It deals with enhanced association rules. It has been implemented several times and applied many times, see e.g. <ref type="bibr" target="#b2">[3]</ref>. One of its implementations is the procedure 4ft-Miner, see the introduction and section 6.1. Implementations of the ASSOC procedure use theoretical results (namely deduction rules) concerning observational predicate calculi <ref type="bibr" target="#b1">[2]</ref>. The logic of association rules involves additional results that are both theoretically interesting and practically important <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. The significance of the GUHA method, and thus also of the logic of discovery, for data mining is summarized in <ref type="bibr" target="#b2">[3]</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Modifying Logic of Discovery</head></div>
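As an illustration of the semantic-system definition above, here is a minimal Python sketch of a toy observational instance. The function name `val`, the dict-based models, and the example sentences are my own assumptions, not taken from the paper:

```python
# A semantic system S = <Sent, M, V, Val>: sentences, models, abstract
# values, and an evaluating function Val : (Sent x M) -> V.
# Toy observational instance: sentences are attribute names, models are
# finite data rows, and the values are {0, 1}.

def val(sentence, model):
    """Val(phi, M): the value of sentence phi in model M."""
    return 1 if model.get(sentence) == 1 else 0

row = {"fever": 1, "cough": 0}             # one finite model
print(val("fever", row), val("cough", row))  # 1 0
```

Here Sent, M, V are trivially recursive and `val` is total, so the toy system is observational in the sense defined above.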
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Starting points</head><p>Our starting points are: (1) The data we are dealing with do not satisfy the requirements for the application of statistical approaches. An example of such data is data about patients of a particular hospital. (2) The objects described by our data belong to a broader set of objects. An example of such a broader set is the set of residents in the region to which the hospital belongs. (3) We have various items of knowledge related to a particular data set. An example is the identification of a particular device used to measure each of the observed patients. (4) We have various items of knowledge related to the broader set of objects that are not directly recorded for each object. An example is information on a specific vaccination applied in the region in question. (5) We have various items of general knowledge about attributes of the objects described in our data. An example is the commonly accepted fact that if weight increases, then blood pressure increases too. (6) We have data mining procedures which are able to produce a lot of strong patterns valid in given data. Examples are the apriori algorithm <ref type="bibr" target="#b0">[1]</ref> and the procedure 4ft-Miner, see section 6.1. Both produce association rules.</p><p>(7) Most of the patterns produced by a data mining procedure are uninteresting because they are consequences of the above mentioned items of knowledge. 
(8) There are groups of patterns hidden in the patterns produced by the data mining procedure such that each of the groups can be considered as a consequence of a yet unknown item of knowledge.</p><p>Our goal is to modify the logic of discovery such that we will be able to: (I) use the items of knowledge mentioned in points (3)–(5); (II) filter out consequences of the above introduced items of knowledge from results of mining procedures, see point (7); (III) recognize a group of patterns which can be considered as a consequence of a (yet unknown) item of knowledge, see point (8).</p><p>We assume the use of GUHA procedures as data mining procedures, together with results on a related observational logical semantic system. To achieve requirements (I)–(III) we are going to: (A) Enhance the observational semantic system with features making it possible to capture items of knowledge related to particular data sets, see (3) above. (B) Enhance the theoretical semantic system with features making it possible to capture both items of general knowledge and items of knowledge related to the broader set, see (4) and (5) above. (C) Enhance the theoretical semantic system with a function Cons assigning to each item I of knowledge according to (B) a set of formulas Cons(I) of a corresponding observational system (i.e. a set of patterns produced by a GUHA data mining procedure). The set Cons(I) is assumed to be a set of atomic consequences of I. (D) Develop a new analytical procedure G-FILTER for each GUHA procedure. G-FILTER will filter out all consequences of given items of knowledge from the output of the GUHA procedure. This way requirement (II) will be achieved. (E) Develop a new analytical procedure G-SYNT for each GUHA procedure. G-SYNT will recognize groups of patterns which can be considered as consequences of (yet unknown) items of knowledge. 
This way requirement (III) will be achieved.</p><p>We present the principles of the modification of logic of discovery using results related to the logic of association rules <ref type="bibr" target="#b4">[5]</ref>. We deal with data matrices introduced in section 3.2. The principles of the modification of the observational and theoretical systems are outlined in sections 3.3 and 3.4. The procedures G-FILTER and G-SYNT are presented by a description of the procedures 4ft-Filter and 4ft-Synt, which are related to the procedure 4ft-Miner, see section 6.2.</p></div>
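The intent of G-FILTER described in point (D) can be sketched as follows. This is an illustrative Python sketch under my own simplifying assumptions: the names `g_filter`, `cons`, and `follows_from` are hypothetical, and the deduction-rule check is abstracted into a predicate supplied by the caller:

```python
def g_filter(rules, knowledge_items, cons, follows_from):
    """Keep only rules that do not follow from any known item of knowledge.

    rules           -- output of a GUHA procedure (e.g. 4ft-Miner)
    knowledge_items -- items of domain knowledge (theoretical formulas)
    cons            -- cons(item) -> set of atomic consequences of the item
    follows_from    -- follows_from(rule, atomic) -> True if a sound
                       deduction rule derives `rule` from `atomic`
    """
    atomic = {a for item in knowledge_items for a in cons(item)}
    return [r for r in rules
            if not any(follows_from(r, a) for a in atomic)]

# Toy usage: rules are strings, one item of knowledge with one atomic
# consequence, and "follows from" is plain equality.
out = g_filter(["A=>B", "C=>D"], ["K1"],
               cons=lambda item: {"A=>B"},
               follows_from=lambda r, a: r == a)
print(out)  # ['C=>D']
```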
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Data Matrices</head><p>We consider data matrices whose values are natural numbers only.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1. Data matrix</head><p>There is only a finite number of categories for each attribute. Let us assume that the number of categories in a column is t and that the categories are the natural numbers 1, . . . , t. All values in the data matrix are then described by the numbers of categories for each column. The whole information on the number of columns and the numbers of categories in the data matrix is then given by the type of the data matrix: a type of data matrix is a K-tuple T = ⟨t 1 , . . . , t K ⟩ where t i ≥ 2 are natural numbers for i = 1, . . . , K.</p><p>We use a more formal definition of a data matrix with the number of columns and the numbers of possible values in particular columns given by the type T = ⟨t 1 , . . . , t K ⟩: a data matrix of the type</p><formula xml:id="formula_1">T is a (K + 1)-tuple M = ⟨M, f 1 , . . . , f K ⟩,</formula><p>where M is a non-empty finite set and f i is a unary function from M to {1, . . . , t i } for i = 1, . . . , K. The set M is the set of rows of data matrix M and is called the domain of data matrix M. We write M = Dom(M). An example of a data matrix M = ⟨M, f 1 , . . . , f K ⟩ is in the right part of figure <ref type="figure">1</ref>. We assume that M = {o 1 , . . . , o n }.</p></div>
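The formal definition above can be sketched in code. This is an illustrative Python sketch; the class name `DataMatrix` and the tuple-based row representation are my assumptions:

```python
# A data matrix of type T = <t_1, ..., t_K> is M = <M, f_1, ..., f_K>,
# where f_i maps each row o to a category in {1, ..., t_i}.
# Rows are stored as K-tuples of category numbers.

class DataMatrix:
    def __init__(self, T, rows):
        self.T = T                 # type: numbers of categories per column
        for row in rows:           # every value must be a valid category
            assert len(row) == len(T)
            assert all(1 <= v <= t for v, t in zip(row, T))
        self.rows = rows           # Dom(M), the set of rows

    def f(self, i, row):
        """f_i(o): the category of attribute i in row o (i is 1-based)."""
        return row[i - 1]

M = DataMatrix(T=(3, 2), rows=[(1, 2), (3, 1), (2, 2)])
print(M.f(1, M.rows[1]))  # 3
```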
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Modifying Observational Semantic System</head><p>An observational semantic system</p><formula xml:id="formula_2">S T = ⟨M T , L C , V al C , L M T , V al M T ⟩ is used instead of S O = ⟨Sent O , M O , V O , V al O ⟩ introduced in section 2.</formula><p>The set M T of data matrices of type T is used instead of the set M O of models. L C is a language of a logical calculus, formulas of which correspond to patterns produced by a data mining procedure (i.e. a GUHA procedure in our case). There is an evaluation function</p><formula xml:id="formula_3">V al C : (L C × M T ) → {0, 1}. If ϕ ∈ L C and M ∈ M T then V al C (ϕ, M) is the value of ϕ in M. If V al C (ϕ, M) = 1 then ϕ is true in M, otherwise ϕ is false in M.</formula><p>L M T is a language intended to express items of knowledge related to particular data matrices, see point (3) in section 3.1. It is a set Θ = {θ 1 , . . . , θ R } of formulas corresponding to features of M, each of which can be true or false. There is an evaluation function</p><formula xml:id="formula_4">V al M T : (Θ × M T ) → {0, 1}. If θ ∈ Θ and M ∈ M T then V al M T (θ, M) is the value of feature θ for M. If V al M T (θ, M) = 1 then M has feature θ, otherwise M does not have feature θ.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Modifying Theoretical Semantic System</head><formula xml:id="formula_5">A theoretical semantic system U T = ⟨M , L T M , Cons⟩ is used instead of S T = ⟨Sent T , U T , V T , V al T ⟩ introduced in section 2.</formula><p>We assume that the theoretical system</p><formula xml:id="formula_6">U T = ⟨M , L T M , Cons⟩ is related to the observational semantic system S T = ⟨M T , L C , V al C , L M T , V al M T ⟩ introduced above. M = ∪{Dom(M) | M ∈ M T } is the union of the domains of all data matrices M ∈ M T .</formula><p>Language L T M is intended to express items of knowledge introduced in points (4) and (5) in section 3.1. There are a lot of such items of knowledge <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b7">8]</ref>; several examples are in section 5.</p><p>We use a function Cons : (L T M × M T ) → P(L C ). This function assigns to each couple ⟨I, M⟩ a set Cons(I, M) of formulas of language L C . Here I ∈ L T M is an item of knowledge (i.e. a formula of language L T M ) and M ∈ M T is a data matrix of type T of the related observational semantic system S T . The set Cons(I, M) is considered as the set of all atomic consequences of item I of knowledge in data matrix M, see point (C) in section 3.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Observational Semantic System of Association Rules</head><p>The observational semantic system S T AR = ⟨M T , L T AR , V al T AR , L M T , V al M T ⟩ of type T = ⟨t 1 , . . . , t K ⟩ concerning association rules is outlined in this section. Association rules of type T are couples of Boolean attributes created from columns of data matrices M ∈ M T . The language L T AR of association rules is introduced in section 4.1, the evaluation function V al T AR in section 4.2. Language L T AR , the set M T of data matrices and the evaluation function V al T AR constitute a logical calculus with important deduction rules, see section 4.3. All definitions are given informally.</p><p>We will not discuss here the details of language L M T and the related evaluation function V al M T . They are intended to express characteristics of particular data matrices. Examples: data matrix M 1 concerns only pathological patients, data matrix M 2 concerns patients from a mountain region, etc. A more detailed description is out of the scope of this paper; additional research is assumed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Language L T AR of Association Rules</head><p>An association rule is an expression ϕ ≈ ψ where ϕ and ψ are Boolean attributes derived from columns of the analyzed data matrix and ≈ is a 4ft-quantifier <ref type="bibr" target="#b4">[5]</ref>. The Boolean attribute ϕ is called the antecedent and ψ is called the succedent. The 4ft-quantifier ≈ defines a relation between ϕ and ψ by the associated function F ≈ of ≈, see below.</p><p>Basic Boolean attributes are created first. A basic Boolean attribute is an expression A(α) where α ⊂ {a 1 , . . . , a t } and {a 1 , . . . , a t } is the set of all categories of the attribute A. Here α is called the coefficient of A(α). The basic Boolean attribute</p><formula xml:id="formula_7">A(α) is true in row o of M if A(o) ∈ α, where A(o) is the value of attribute A in row o.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>An example of an association rule is the expression</p><formula xml:id="formula_8">A 1 (1) ∧ A 2 (4, 5) ≈ A K (2, 6)</formula><p>Note that ϕ and ψ in ϕ ≈ ψ have no common attributes.</p></div>
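The evaluation of basic Boolean attributes just described can be sketched as follows. This is illustrative Python; the helper `basic` and the dict-based rows are my assumptions:

```python
# A basic Boolean attribute A(alpha) is true in row o iff A(o) is in alpha.

def basic(attr, alpha):
    """Return the basic Boolean attribute A(alpha) as a predicate on rows."""
    return lambda row: row[attr] in alpha

# The antecedent and succedent of the example rule A1(1) ∧ A2(4,5) ≈ AK(2,6):
row = {"A1": 1, "A2": 4, "AK": 6}
antecedent = lambda r: basic("A1", {1})(r) and basic("A2", {4, 5})(r)
succedent = basic("AK", {2, 6})
print(antecedent(row), succedent(row))  # True True
```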
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation Function V al T AR</head><p>An association rule ϕ ≈ ψ can be true or false in a given data matrix M ∈ M T . The rule ϕ ≈ ψ is verified on the basis of the four-fold table 4f t(ϕ, ψ, M) of ϕ and ψ in M, see figure <ref type="figure">2</ref>. Here a is the number of objects (i.e. rows of M) satisfying both ϕ and ψ, b is the number of objects satisfying ϕ and not satisfying ψ, and similarly for c and d. The four-fold table 4f t(ϕ, ψ, M) is written as ⟨a, b, c, d⟩ and called a 4ft-table.</p><p>The evaluation function V al T AR assigns a value 0 or 1 to each couple ⟨ϕ ≈ ψ, M⟩ where ϕ ≈ ψ is an association rule and M ∈ M T .</p><formula xml:id="formula_9">If V al T AR (ϕ ≈ ψ, M) = 1 then we say that rule ϕ ≈ ψ is true in M, and if V al T AR (ϕ ≈ ψ, M) = 0 then we say that rule ϕ ≈ ψ is false in M. V al T AR (ϕ ≈ ψ, M) is defined using the 4ft-table 4f t(ϕ, ψ, M) of ϕ and ψ in M and the associated function F ≈ of ≈. The associated function F ≈ of a 4ft-quantifier ≈ is a {0, 1}-valued function defined for all quadruples ⟨a, b, c, d⟩ of natural numbers. The value V al T AR (ϕ ≈ ψ, M) of association rule ϕ ≈ ψ in data matrix M ∈ M T is defined such that V al T AR (ϕ ≈ ψ, M) = F ≈ (a, b, c, d) where ⟨a, b, c, d⟩ = 4f t(ϕ, ψ, M).</formula><p>Examples of 4ft-quantifiers and their associated functions are in table <ref type="table">1</ref>, where 0 &lt; p ≤ 1 and 0 &lt; α &lt; 0.5 are real numbers and B &gt; 0 is an integer.</p><formula xml:id="formula_10">Founded implication ⇒ p,B : a/(a+b) ≥ p ∧ a ≥ B. Lower critical implication ⇒ ! p,α,B : Σ i=a..a+b (a+b choose i) p^i (1−p)^(a+b−i) ≤ α ∧ a ≥ B. Above average dependence ∼ + q,B : a/(a+b) ≥ (1+q) · (a+c)/(a+b+c+d) ∧ a ≥ B. Table 1. Examples of 4ft-quantifiers</formula></div>
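The 4ft-table and the founded-implication and above-average quantifiers can be sketched as follows. This is illustrative Python; the function names `fourft`, `founded_implication`, and `above_average` are mine:

```python
def fourft(phi, psi, rows):
    """4ft(phi, psi, M): the four-fold table <a, b, c, d>."""
    a = sum(1 for o in rows if phi(o) and psi(o))
    b = sum(1 for o in rows if phi(o) and not psi(o))
    c = sum(1 for o in rows if not phi(o) and psi(o))
    d = sum(1 for o in rows if not phi(o) and not psi(o))
    return a, b, c, d

def founded_implication(p, B):
    """F of =>_{p,B}: a/(a+b) >= p and a >= B."""
    return lambda a, b, c, d: int(a >= B and a >= p * (a + b))

def above_average(q, B):
    """F of ~+_{q,B}: a/(a+b) >= (1+q)(a+c)/(a+b+c+d) and a >= B
    (written without division to avoid a zero denominator)."""
    return lambda a, b, c, d: int(
        a >= B and a * (a + b + c + d) >= (1 + q) * (a + c) * (a + b))

# Toy data matrix: rows are pairs (A, B) of category values.
rows = [(1, 1), (1, 1), (1, 0), (0, 1)]
t = fourft(lambda o: o[0] == 1, lambda o: o[1] == 1, rows)
print(t, founded_implication(0.5, 2)(*t))  # (2, 1, 1, 0) 1
```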
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Deduction Rules in Logical Calculus of Association Rules</head><p>Language L T AR , the set of data matrices M T and the evaluation function V al T AR constitute a logical calculus of association rules <ref type="bibr" target="#b4">[5]</ref>. There are both theoretically interesting and practically useful results concerning logical calculi of association rules. Most of them are related to classes of 4ft-quantifiers <ref type="bibr" target="#b4">[5]</ref>. An example of a class of 4ft-quantifiers is the class of implicational 4ft-quantifiers: a 4ft-quantifier ≈ is implicational if it satisfies the condition</p><formula xml:id="formula_11">if F ≈ (a, b, c, d) = 1 ∧ a ′ ≥ a ∧ b ′ ≤ b then also F ≈ (a ′ , b ′ , c ′ , d ′ ) = 1.</formula><p>Both 4ft-quantifiers ⇒ p,B and ⇒ ! p,α,B (see table <ref type="table">1</ref>) are implicational. Criteria of soundness of deduction rules ϕ≈ψ / ϕ ′ ≈ψ ′ , where both ϕ ≈ ψ and ϕ ′ ≈ ψ ′ are association rules, were found in <ref type="bibr" target="#b4">[5]</ref>. The criteria are related to important classes of association rules. We outline such a criterion for the class of interesting implicational quantifiers. All practically important implicational 4ft-quantifiers are interesting implicational quantifiers.</p><p>If ⇒ * is an interesting implicational quantifier then there are formulas ω 1A , ω 1B , ω 2 of propositional calculus, created from ϕ, ψ, ϕ ′ , ψ ′ , such that the deduction rule ϕ⇒ * ψ / ϕ ′ ⇒ * ψ ′ is sound if and only if at least one of the following conditions is satisfied: (1) both ω 1A and ω 1B are tautologies; (2) ω 2 is a tautology.</p><p>Similar theorems are proved for additional important classes of 4ft-quantifiers <ref type="bibr" target="#b4">[5]</ref>. These results are crucial, see section 6.2.</p></div>
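The implicational condition can be checked by brute force for the founded-implication quantifier on a small grid of 4ft-tables. This is an illustrative check of the definition, not a proof; the function name and parameter values are my assumptions:

```python
# Implicational condition: if F(a,b,c,d) = 1 and a' >= a, b' <= b,
# then F(a',b',c',d') = 1. Checked for =>_{p,B} with p=0.9, B=3
# on all tables with a, b < 8 (c, d do not matter for this quantifier).

def F(a, b, c, d, p=0.9, B=3):
    return int(a >= B and a >= p * (a + b))

N = 8
ok = all(
    F(a2, b2, c, d) == 1
    for a in range(N) for b in range(N) for c in range(2) for d in range(2)
    if F(a, b, c, d) == 1
    for a2 in range(a, N) for b2 in range(0, b + 1)
)
print(ok)  # True
```

The check passes because increasing a and decreasing b can only raise the confidence a/(a+b) and the support a.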
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Theoretical Semantic System of Association Rules</head><formula xml:id="formula_12">The theoretical semantic system U AR = ⟨M , L T M , Cons T AR ⟩ related to the observational semantic system S T AR = ⟨M T , L T AR , V al T AR , L M T , V al M T ⟩ of type T = ⟨t 1 , . . . , t K ⟩ concerning generalized association rules is sketched in this section.</formula><p>The set M and the language L T M are introduced in section 3.4. Language L T M is intended to express items of knowledge introduced in points (4) and (5) in section 3.1. We give four important examples of formulas of language L T M . Then we outline how the function Cons T AR is constructed for one of these examples. The examples are: A ↑↑ B, A ↑↓ B, A → + ω, and ω 1 → + ω 2 . Here A is one of the attributes A 1 , . . . , A K of language L T AR , and the same is true for B. In addition, ω, ω 1 , ω 2 are Boolean attributes of L T AR and ω does not contain attribute A. The intuitive meaning of the particular formulas:</p><formula xml:id="formula_13">-A ↑↑ B means if A increases then B increases too -A ↑↓ B means if A increases then B decreases -A → + ω means if A increases then the relative frequency of ω increases -ω 1 → + ω 2 means if ω 1 is satisfied then the relative frequency of ω 2 increases.</formula><p>We show how the function Cons T AR creates a set Cons T AR (A ↑↑ B, M) of association rules, i.e. formulas of language L T AR , which can be considered as the set of all atomic consequences of A ↑↑ B in data matrix M. Function Cons T AR can be seen as a family of functions Cons T ≈ where ≈ is a 4ft-quantifier of language L T AR . Function Cons T ≈ creates a set Cons T ≈ (A ↑↑ B, M) of association rules, i.e. formulas of language L T AR , such that this set can be considered as the set of all atomic consequences of A ↑↑ B of the form ρ ≈ σ in data matrix M. 
Then</p><formula xml:id="formula_14">Cons T AR (A ↑↑ B, M) = ∪{Cons T ≈ (A ↑↑ B, M) | ≈ belongs to L T AR } .</formula><p>We outline the function Cons T ⇒p,B for the 4ft-quantifier ⇒ p,B of founded implication (see table <ref type="table">1</ref>) and the item A ↑↑ B of domain knowledge. Functions Cons T ≈ for additional 4ft-quantifiers and formulas of L T M are defined similarly <ref type="bibr" target="#b5">[6]</ref>. We assume that attribute A has categories 1, . . . , u and attribute B has categories 1, . . . , v. Our task is to define a set of rules ρ ⇒ p,B σ which can be naturally considered as the set of all consequences of the item A ↑↑ B and which are as simple as possible. In this case, we consider the simplest rules to be of the form A(α) ⇒ p,B B(β) where α ⊂ {1, . . . , u} and β ⊂ {1, . . . , v}.</p><p>The rule A(low) ⇒ p,B B(low), stating that "if A is low then B is low", can be understood as a natural consequence of A ↑↑ B. The only problem is to define the coefficients α and β that can be understood as "low". This can be done so that we choose a natural A low , 1 &lt; A low &lt; u, and a natural B low , 1 &lt; B low &lt; v, and then we consider α as "low" iff α ⊂ {1, . . . , A low } and β as "low" iff β ⊂ {1, . . . , B low }, see also section 6.1.</p><p>Also the rule A(high) ⇒ p,B B(high), stating that "if A is high then B is high", can be understood as a natural consequence of A ↑↑ B. The coefficients α and β can be defined as "high" in the following way. We choose a natural A high , 1 &lt; A low &lt; A high &lt; u, and a natural B high , 1 &lt; B low &lt; B high &lt; v, and then we consider α as "high" iff α ⊂ {A high , . . . , u} and β as "high" iff β ⊂ {B high , . . . , v}.</p><p>It remains to define the values of the parameters p and B of ⇒ p,B . A possibility is to allow each p ≥ 0.9 and B ≥ n/20 where n is the number of rows of data matrix M. However, the boundaries of p and B, as well as the values A low , A high , B low , B high , should be determined by a domain expert. 
The set of rules A(low) ⇒ p,B B(low) and A(high) ⇒ p,B B(high) satisfying the above given conditions can be considered as Cons T ⇒p,B (A ↑↑ B, M), the set of atomic consequences of A ↑↑ B of the form ρ ⇒ p,B σ in M.</p><p>The set Cons T ⇒p,B (A ↑↑ B, M) can be defined in a more precise way by adding rules A(medium) ⇒ p,B B(medium) with a suitable definition of "medium". Rules A(low, medium) ⇒ p,B B(medium), A(low, medium) ⇒ p,B B(medium, high), and A(medium) ⇒ p,B B(medium, high) can also be added.</p><p>Note that there is a natural requirement of reasonable consistency on the set Cons T AR (A ↑↑ B, M) of atomic consequences of A ↑↑ B, i.e. there cannot be two atomic consequences ρ 1 ≈ σ 1 and ρ 2 ≈ σ 2 that contradict each other. A detailed discussion of this topic is, however, beyond the scope of this paper.</p></div>
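The construction of the atomic consequences of A ↑↑ B described above can be sketched as follows. This is illustrative Python; the function name `cons_fi` is hypothetical, and the thresholds p and B of ⇒ p,B are omitted since they are left to the domain expert:

```python
from itertools import combinations

def subsets_nonempty(items):
    """All non-empty subsets of a collection, as sets."""
    items = list(items)
    return [set(s) for r in range(1, len(items) + 1)
            for s in combinations(items, r)]

def cons_fi(u, v, A_low, B_low, A_high, B_high):
    """Atomic consequences of A inc-inc B for founded implication:
    coefficient pairs (alpha, beta) of rules A(alpha) => B(beta),
    with alpha/beta taken entirely from the 'low' or 'high' categories."""
    low = [(a, b)
           for a in subsets_nonempty(range(1, A_low + 1))
           for b in subsets_nonempty(range(1, B_low + 1))]
    high = [(a, b)
            for a in subsets_nonempty(range(A_high, u + 1))
            for b in subsets_nonempty(range(B_high, v + 1))]
    return low + high

# Attributes with 5 categories each; 1-2 is "low", 4-5 is "high".
rules = cons_fi(u=5, v=5, A_low=2, B_low=2, A_high=4, B_high=4)
print(len(rules))  # 18  (3 x 3 "low" pairs + 3 x 3 "high" pairs)
```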
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">4ft-Discoverer</head><p>We have defined the observational system S T AR = ⟨M T , L T AR , V al T AR , L M T , V al M T ⟩ and the related theoretical system U T AR = ⟨M , L T M , Cons T AR ⟩. The goal of this section is to discuss possibilities of implementation of the theoretical framework for dealing with domain knowledge involved in these systems.</p><p>We use the GUHA procedure 4ft-Miner mining for association rules, i.e. couples of Boolean attributes created from columns of data matrices M ∈ M T <ref type="bibr" target="#b6">[7]</ref>. The 4ft-Miner procedure has very fine tools to define the set of association rules to be generated and verified. It deals, among others, with basic Boolean attributes A(α), B(β), A(low), B(low) etc. The main features of the 4ft-Miner procedure are outlined in section 6.1.</p><p>The intention to develop two additional analytical procedures G-FILTER and G-SYNT related to each GUHA procedure is announced in points (D) and (E) in section 3.1. We present principles of the procedures 4ft-Filter and 4ft-Synt related to the 4ft-Miner procedure. The 4ft-Filter procedure is intended to filter out consequences of a given item of domain knowledge from the output of 4ft-Miner. An item of domain knowledge is expressed by a formula of L T M . The 4ft-Synt procedure is intended to recognize groups of patterns which can be considered as a consequence of a yet unknown item of knowledge. Principles of both procedures are introduced in section 6.2.</p><p>The semantic systems S T AR and U T AR together with the procedures 4ft-Miner, 4ft-Filter, and 4ft-Synt constitute a framework for a process of data mining for association rules based on domain knowledge. We call this framework 4ft-Discoverer, i.e. 4f tD. It is 4f tD = ⟨S T AR , U T AR , 4ft-Miner, 4ft-Filter, 4ft-Synt⟩.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">4ft-Miner</head><p>The 4ft-Miner procedure mines for association rules ϕ ≈ ψ, see section 4. Input parameters define the 4ft-quantifier ≈, the set of relevant antecedents Φ and the set of relevant succedents Ψ ; we assume ϕ ∈ Φ and ψ ∈ Ψ . Each antecedent is a conjunction τ 1 ∧. . .∧τ m of partial antecedents τ 1 , . . . , τ m . Each partial antecedent is either a conjunction λ 1 ∧ . . . ∧ λ q or a disjunction λ 1 ∨ . . . ∨ λ q of literals λ 1 , . . . , λ q . Each literal is a basic Boolean attribute A(α) or its negation ¬A(α). The definition of a set of relevant antecedents Φ consists of definitions of relevant partial antecedents Φ</p><formula xml:id="formula_15">1 , . . . , Φ m . Conjunction τ 1 ∧ . . . ∧ τ m is a relevant antecedent if τ 1 ∈ Φ 1 , . . . , τ m ∈ Φ m .</formula><p>The definition of a relevant partial antecedent is given by a list A ′ 1 , . . . , A ′ u of attributes, by a minimal and maximal number of literals in particular partial antecedents, and by the type of partial antecedent, i.e. conjunction or disjunction. In addition, for each attribute A ′ a set of relevant basic Boolean attributes which are automatically generated is defined. There are various detailed possibilities of how to define all relevant basic Boolean attributes A ′ (α) <ref type="bibr" target="#b6">[7]</ref>. We outline only one of them. We use an attribute A with categories 1, 2, 3, 4, 5. The option intervals of length 2-3 gives basic Boolean attributes A(1, 2), A(2, 3), A(3, 4), A(4, 5), A(1, 2, 3), A(2, 3, 4), A(3, 4, 5). This way we can get the basic Boolean attributes A(low), A(high), B(low), B(high), see section 5.</p><p>The set Ψ of relevant succedents is defined analogously. The 4ft-Miner procedure does not use the well-known apriori algorithm <ref type="bibr" target="#b0">[1]</ref>. 
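The "intervals of length 2-3" option described above can be sketched as a simple enumeration of consecutive runs of categories. The function name `interval_attributes` is hypothetical; only the enumerated output is taken from the text.

```python
# Sketch of 4ft-Miner's "intervals of length min_len-max_len" option for
# generating basic Boolean attributes A(alpha): alpha ranges over all
# intervals of consecutive categories of the given lengths.

def interval_attributes(categories, min_len, max_len):
    """Return each interval of consecutive categories of length
    min_len..max_len as a tuple of categories."""
    cats = sorted(categories)
    out = []
    for length in range(min_len, max_len + 1):
        for start in range(len(cats) - length + 1):
            out.append(tuple(cats[start:start + length]))
    return out

# For A with categories 1..5, intervals of length 2-3 give
# A(1,2), A(2,3), A(3,4), A(4,5), A(1,2,3), A(2,3,4), A(3,4,5):
print(interval_attributes([1, 2, 3, 4, 5], 2, 3))
```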
Its implementation is based on a representation of the analyzed data by suitable strings of bits <ref type="bibr" target="#b6">[7]</ref>. Its performance is good enough to solve many practically important tasks. A detailed study of its time and space complexity is given in <ref type="bibr" target="#b6">[7]</ref>.</p></div>
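The bit-string idea can be illustrated by a minimal sketch, under the assumption that each Boolean attribute column is packed into a Python integer used as a bit string (one bit per row of the data matrix); the names `fourft_table` and `founded_implication` are illustrative, not 4ft-Miner's API.

```python
# Sketch of the bit-string representation: conjunction of Boolean
# attributes is bitwise AND, and the frequencies a, b, c, d of the
# 4ft-table 4ft(phi, psi, M) are bit counts.

def fourft_table(phi_bits, psi_bits, n_rows):
    """Return (a, b, c, d) for attributes phi, psi given as Python ints
    used as bit strings over n_rows rows."""
    mask = (1 << n_rows) - 1          # restrict negation to n_rows bits
    popcount = lambda x: bin(x).count("1")
    a = popcount(phi_bits & psi_bits)
    b = popcount(phi_bits & ~psi_bits & mask)
    c = popcount(~phi_bits & psi_bits & mask)
    d = popcount(~phi_bits & ~psi_bits & mask)
    return a, b, c, d

def founded_implication(a, b, p, base):
    """Verify phi =>_{p,Base} psi from the 4ft-table frequencies."""
    return a >= base and a / (a + b) >= p
```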
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">4ft-Filter and 4ft-Synt</head><p>The 4ft-Filter procedure is intended to filter out consequences of a given item of domain knowledge from the association rules produced by the 4ft-Miner procedure. An item of domain knowledge is represented by a formula I ∈ L T M , see section 3.4. The function Is4f tConsequence(I, ϕ ≈ ψ, M), defined for all formulas I ∈ L T M , association rules ϕ ≈ ψ ∈ L T AR and data matrices M ∈ M T , can be used to realize the 4ft-Filter procedure. It is Is4f tConsequence(I, ϕ ≈ ψ, M) = 1 if the rule ϕ ≈ ψ can be considered as a consequence of I, otherwise it is Is4f tConsequence(I, ϕ ≈ ψ, M) = 0.</p><p>The value Is4f tConsequence(I, ϕ ≈ ψ, M) is computed using the function Cons T AR , see section 5, and using deduction rules ϕ≈ψ ϕ ′ ≈ψ ′ , see section 4.3. There are criteria of correctness of the rules ϕ≈ψ ϕ ′ ≈ψ ′ for each 4ft-quantifier ≈ of the 4ft-Miner procedure <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b6">7]</ref>. The function Cons T AR is defined for all I ∈ L T M and M ∈ M T such that Cons T AR (I, M) = Λ, where Λ is the set of all association rules ρ ≈ σ which can be considered as atomic consequences of I in M.</p><p>The value Is4f tConsequence(I, ϕ ≈ ψ, M) is computed in two steps. In the first step, we compute the set Λ = Cons T AR (I, M). In the second step, we test the correctness of ρ≈σ ϕ≈ψ for each ρ ≈ σ ∈ Λ. If there is such a correct rule, then ϕ ≈ ψ is considered as a consequence of I in M and Is4f tConsequence(I, ϕ ≈ ψ, M) = 1. Otherwise Is4f tConsequence(I, ϕ ≈ ψ, M) = 0. The function Is4f tConsequence(I, ϕ ≈ ψ, M) can also be used to realize the procedure 4ft-Synt, which recognizes groups of rules ϕ ≈ ψ that can be considered as consequences of (yet unknown) items of knowledge. We assume that each item of knowledge, even a yet unknown one, is represented by a formula of the language L T M . 
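The two-step computation can be sketched under strong simplifying assumptions: rules over a single pair of attributes A, B are represented as pairs of category sets, and as the deduction-rule test we use only one rule that is sound for founded implication ⇒ p,B, namely weakening the succedent (A(α) ⇒ B(β) derives A(α) ⇒ B(β′) whenever β′ ⊇ β, since this cannot decrease a or the confidence a/(a+b)). The full correctness criteria of [5,7] cover far more deduction rules; the function name `is_4ft_consequence` is illustrative.

```python
# Simplified sketch of Is4ftConsequence for =>_{p,B}. A rule
# A(alpha) => B(beta) is a pair (alpha, beta) of frozensets.
# Step 1 (computing Lambda = Cons(I, M)) is assumed done; cons_set is Lambda.
# Step 2 tests, for each atomic consequence, one sound deduction rule:
# succedent weakening (same antecedent, superset succedent).

def is_4ft_consequence(cons_set, rule):
    """Return True iff some atomic consequence (a0, b0) in cons_set
    derives rule = (alpha, beta) by succedent weakening."""
    alpha, beta = rule
    return any(alpha == a0 and beta >= b0 for (a0, b0) in cons_set)
```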
The 4ft-Synt procedure can then be realized such that we choose a formula ω ∈ L T M and, using the function Is4f tConsequence(ω, ϕ ≈ ψ, M), pick up all consequences of ω from the output of the 4ft-Miner procedure. However, we somehow have to limit the set of tested formulas ω ∈ L T M . A more detailed study of this problem is beyond the scope of this paper.</p></div>
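The grouping step of 4ft-Synt can be sketched generically over any consequence predicate. The function `synt_groups` and the dictionary of candidate formulas are illustrative assumptions; how the candidate set is limited is left open, as in the text.

```python
# Sketch of 4ft-Synt: for each candidate formula omega (from a somehow
# limited candidate set), collect the rules in the 4ft-Miner output that
# can be considered consequences of omega; empty groups are dropped.

def synt_groups(candidates, output_rules, is_consequence):
    """candidates maps a formula's name to its representation; the
    predicate is_consequence(item, rule) plays the role of
    Is4ftConsequence(omega, phi ~ psi, M)."""
    groups = {}
    for name, item in candidates.items():
        hits = [r for r in output_rules if is_consequence(item, r)]
        if hits:
            groups[name] = hits
    return groups
```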
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions</head><p>The goal of this paper was to elaborate theoretically an approach for dealing with domain knowledge in mining association rules. This was done by modifying the logic of discovery developed in <ref type="bibr" target="#b1">[2]</ref>. General requirements for such modifications were discussed in section 3.1.</p><p>Then the framework 4f tD = S T AR , U T AR , 4ft-Miner, 4ft-Filter, 4ft-Synt for dealing with domain knowledge when mining association rules was described. Association rules are understood as interesting couples of Boolean attributes derived from columns of the analyzed data matrix. The Boolean attributes are derived from basic Boolean attributes by the connectives ∧, ∨, ¬, see section 4.1. The general form of the basic Boolean attributes is A(α). Here α is an automatically generated subset of categories of A. This makes it possible to deal with notions like A(low) and B(high). The implemented procedure 4ft-Miner produces such association rules <ref type="bibr" target="#b6">[7]</ref>.</p><p>The presented approach relates these rules to items of domain knowledge like A ↑↑ B concerning non-Boolean attributes, see section 5. The procedures 4ft-Filter and 4ft-Synt are suggested to deal with such items of domain knowledge when interpreting the results of 4ft-Miner. They are being implemented.</p><p>No similar approach concerning association rules is known to the author. However, a comparison of the presented approach with ways of dealing with domain knowledge in other data mining areas is still a challenge and a subject of further work. It requires both theoretical study and experiments with 4ft-Discoverer after its implementation.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>all data matrices M of type T are used instead of M O and two languages - L C and L M T - are used instead of Sent O . We always use the set {0,1} (i.e. 
{false, true}) as the set of possible values of formulas of our languages instead of V O .</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Examples of 4ft-quantifiers (fragment): a critical implication quantifier defined by Σ (r over i) p i (1 − p) r−i ≤ α ∧ a ≥ B; founded equivalence ≡ p,B defined by (a+d)/(a+b+c+d) ≥ p ∧ a ≥ B; Fisher's quantifier ≈ α,B defined by a condition on Σ min(r,k) i=a</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>The natural numbers represent categories, i.e. possible values of observed attributes A 1 , . . . , A K . Columns of the data matrix correspond to attributes. Rows correspond to observed objects, e.g. patients. An example of the data matrix is in the left part of figure 1.</figDesc><table><row><cell cols="4">object A1 . . . AK</cell><cell>object</cell><cell>f1</cell><cell>. . .</cell><cell>fK</cell></row><row><cell>o1</cell><cell cols="3">1 . . . 6</cell><cell>o1</cell><cell cols="2">f1(o1) . . .</cell><cell>fK (o1)</cell></row><row><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell></row><row><cell>on</cell><cell cols="2">1 . . .</cell><cell>1</cell><cell>on</cell><cell>f1(on)</cell><cell>. . .</cell><cell>fK (on)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>is the value of the attribute A in row o. Boolean attributes ϕ and ψ are derived from basic Boolean attributes using connectives ∨, ∧ and ¬ in the usual way. Examples of Boolean attributes are in figure 2.</figDesc><table><row><cell>M</cell><cell>A1 . . . AK</cell><cell>A1(1)</cell><cell>AK (2, 6)</cell><cell>A1(1) ∧ AK (2, 6)</cell></row><row><cell>o1</cell><cell>1 . . . 6</cell><cell>1</cell><cell>1</cell><cell>1</cell></row><row><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell><cell>. . .</cell></row><row><cell>on</cell><cell>1 . . . 1</cell><cell>1</cell><cell>0</cell><cell>0</cell></row><row><cell cols="5">Data matrix and examples of Boolean attributes</cell></row><row><cell>M</cell><cell>ψ</cell><cell>¬ψ</cell><cell></cell><cell></cell></row><row><cell>ϕ</cell><cell>a</cell><cell>b</cell><cell></cell><cell></cell></row><row><cell>¬ϕ</cell><cell>c</cell><cell>d</cell><cell></cell><cell></cell></row><row><cell cols="5">4ft-table 4ft(ϕ, ψ, M)</cell></row></table><note>Fig. 2. Derived Boolean attributes and 4ft-table 4ft(ϕ, ψ, M)</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Fast Discovery of Association Rules</title>
		<author>
	<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Knowledge Discovery and Data Mining</title>
				<editor>
			<persName><forename type="first">U</forename><forename type="middle">M</forename><surname>Fayyad</surname></persName>
		</editor>
		<meeting><address><addrLine>Menlo Park</addrLine></address></meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Mechanizing Hypothesis Formation (Mathematical Foundations for a General Theory)</title>
		<author>
			<persName><forename type="first">P</forename><surname>Hájek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Havránek</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1978">1978</date>
			<publisher>Springer-Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The GUHA method and its meaning for data mining</title>
		<author>
			<persName><forename type="first">P</forename><surname>Hájek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Holeňa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rauch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Computer and System Sciences</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="page" from="34" to="48" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">10 Challenging Problems in Data Mining Research</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Information Technology &amp; Decision Making</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="597" to="604" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Logic of Association Rules</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rauch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="9" to="28" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Considerations on Logical Calculi for Dealing with Knowledge in Data Mining</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rauch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Data Management</title>
				<editor>
			<persName><forename type="first">Z</forename><forename type="middle">W</forename><surname>Ras</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Dardzinska</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="177" to="202" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An Alternative Approach to Mining Association Rules</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rauch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šimůnek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data Mining: Foundations, Methods, and Applications</title>
				<editor>
			<persName><forename type="first">T</forename><forename type="middle">Y</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="219" to="238" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Dealing with Background Knowledge in the SEWEBAR Project</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rauch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Šimůnek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Knowledge Discovery Enhanced with Semantic and Social Information</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Berendt</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="89" to="106" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
