<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Extending Bayesian Classifier with Ontological Attributes</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computing Science, Poznan University of Technology</institution>
          ,
          <addr-line>ul. Piotrowo 2, 60-965 Poznan</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The goal of inductive learning for classification is to form generalizations from a set of training examples such that the classification accuracy on previously unobserved examples is maximized. Given a specific learning algorithm, it is obvious that its classification accuracy depends on the quality of the training data. In learning from examples, noise is anything which obscures correlations between attributes and the class [1]. There are many possible solutions to deal with the existence of noise. Data cleaning, or the detection and elimination of noisy examples, constitutes the first approach. Due to the risk of data cleaning, when noisy examples are retained while good examples are removed, efforts have been made to construct noise-tolerant classifiers. Although these two approaches seem very different, both try to somehow 'clean' the noisy training data. In this paper, we propose an approach to 'admit and utilize' noisy data by enabling the modeling of different levels of knowledge granularity in both training and testing examples. The proposed knowledge representation uses hierarchies of sets of attribute values, derived from subsumption hierarchies of concepts from an ontology represented in description logic. The main contributions of the paper are: (i) we propose a novel extension of the naïve Bayesian classifier by hierarchical, ontology-based attributes (ontological attributes), (ii) we propose an inference scheme that handles ontological attributes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There are three major sources of noise: (i) insufficiency of the description for
attributes or the class (or both), (ii) corruption of attribute values in the
training examples, (iii) erroneous classification of training examples [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The second
and third sources of noise can lead to so-called attribute-noise and class-noise,
respectively. Attribute-noise is represented by: (i) erroneous attribute values, (ii)
missing or "don't know" attribute values, (iii) incomplete attributes or "don't
care" values. Class-noise is represented by: (i) contradictory examples, or
(ii) misclassification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the first major source of noise, although not
easily quantifiable, is important. This insufficiency of the description can lead to
both erroneous attribute values and erroneous classification. Let us call the
resulting noise description-noise. Following, for example, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the main reason for
description-noise may be a language used to represent attribute values that
is not expressive enough to model different levels of knowledge granularity. In
such a case, erroneous or missing attribute values may be introduced by users of
a system who are required to provide very specific values, while the level of their
knowledge of the domain is too general to precisely describe the observation by
the appropriate value of an attribute. Even if the person is an expert in the
domain, erroneous or missing attribute values can be observed as a consequence
of a lack of time, or of other resources, to make detailed observations (i.e., a more
complete description). However, if the language enabled modeling different levels of
knowledge granularity (very precise or more general descriptions), we would be
able to decrease the level of this description-noise.
      </p>
      <p>
        In order to model different levels of knowledge granularity, each testing and
training example would be described by a set of values for any attribute. These
sets of values should reflect the domain knowledge and cannot be constructed
arbitrarily. Let us notice that in some domains hierarchical or taxonomical
relationships between sets of values, represented by so-called concepts, may be
observed, and this knowledge could be exploited. Such knowledge is currently
often available in the form of ontologies. The most widely used language for
representing ontologies, suitable in particular for modeling taxonomical knowledge,
is the Web Ontology Language (OWL, www.w3.org/TR/owl-features/). The
theoretical counterpart of OWL, from which its semantics is drawn, is constituted
by a family of languages called description logics (DLs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A description logic knowledge base, KB, is typically
divided into an intensional part (the terminological one, a TBox) and an
extensional part (the assertional one, an ABox).
      </p>
    </sec>
    <sec id="sec-2">
      <title>An Ontological Attribute</title>
      <p>Given is an attribute A and the set V = {V1, V2, ..., Vn}, where n &gt; 1, of
nominal values of this attribute. Let us assume that given is a TBox which specifies
domain knowledge relevant to a given classification task. In particular, it
expresses a multilevel subsumption ("is-a") hierarchy of concepts. Each concept is
described by a subset of the set V for every attribute A. Then we can formulate
a definition of an ontological attribute as follows.</p>
      <p>Ontological attribute. An ontological attribute A is defined by a tuple ⟨H, V⟩,
where:
- by H is denoted a multilevel subsumption hierarchy of concepts, derived
from a DL knowledge base. This hierarchy of concepts consists of the set of
nodes N^H = {root} ∪ N^C ∪ N^T: it defines a root-node, denoted by root,
a set N^C of complex-nodes, and a set N^T of terminal-nodes;
- by V is denoted a finite set V = {V1, V2, ..., Vn}, where n &gt; 1, of nominal
values of A;
- each node Nk ∈ N^T ∪ N^C represents a subset of the set V, denoted as
val(Nk); the root-node represents the set V.</p>
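      <p>To make the definition concrete, the following minimal Python sketch (our own illustration; the class and method names are not from the paper) models a node of the hierarchy H: each node stores the subset val(Nk) of V that it represents and its children, from which the descendant set de(Nk) is derived.</p>
      <preformat>
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of the hierarchy H of an ontological attribute A = ⟨H, V⟩."""
    name: str                      # e.g. "root", "N1", ..., "N7"
    values: frozenset              # val(Nk), a subset of V
    children: list = field(default_factory=list)

    def add_child(self, child):
        # per the definition, val(child) must be a subset of val(parent)
        assert child.values.issubset(self.values)
        self.children.append(child)
        return child

    def descendants(self):
        # de(Nk): the children of Nk together with their descendants
        for c in self.children:
            yield c
            yield from c.descendants()
      </preformat>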
      <p>To model actual training examples, an ABox would be used.</p>
      <sec id="sec-2-1">
        <title>Using Ontological Attributes in the Naïve Bayesian Classifier</title>
        <p>In order to apply the proposed ontological attributes in the naïve Bayesian
classifier, we further specify the general definition of an ontological attribute given
in the previous section. Please note that, by making the assumptions presented
in the following paragraphs, we implicitly switch from the usual open-world
assumption, used to reason with a DL knowledge base to produce a concept
hierarchy, to the closed-world assumption, which is more appropriate for inference
with the naïve Bayesian classifier. In particular, we will assume that a hierarchy of
concepts represents a hierarchical partitioning of the set V of attribute
values, such that each concept corresponds to a non-empty subset of V.</p>
        <p>Properties of nodes. Each complex-node represents a concept from the KB,
described by a proper, non-empty subset of V. Each terminal-node represents a
concept from the KB, described by a unique value Vi from the set V.</p>
        <p>Relations between nodes. For a given ontological attribute A, the hierarchy H is a
tree, i.e., each node Nk ∈ N^C ∪ N^T has exactly one parent, denoted as pa(Nk),
such that val(Nk) ⊆ val(pa(Nk)). Moreover, each node Nk ∈ {root} ∪ N^C
specifies a set ch(Nk) of its children. To model different levels of knowledge
granularity, we assume that for each node Nk ∈ {root} ∪ N^C the sets represented
by its children are pairwise disjoint and val(Nk) is the union of these sets. Finally,
for each node Nk ∈ {root} ∪ N^C we define the set de(Nk) of descendants of this
node as its children together with the descendants of its children.</p>
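        <p>Under these assumptions, a quick well-formedness check is possible: a hypothetical helper (ours, not part of the paper) can verify that, for every non-terminal node, the value sets of its children are pairwise disjoint and their union equals val(Nk).</p>
        <preformat>
def is_well_formed(node):
    """Check the partitioning property assumed for the hierarchy H."""
    if not node.children:
        return True
    union = set()
    for c in node.children:
        if union.intersection(c.values):   # children must be pairwise disjoint
            return False
        union.update(c.values)
    # val(Nk) must be exactly the union of its children's value sets
    return union == node.values and all(is_well_formed(c) for c in node.children)
        </preformat>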
        <p>The role of complex-nodes. In the setting of learning with description-noise, each
training and testing example can, in general, be described by a set Zl of values for
each attribute A, where Zl ⊆ V. We can divide training examples into noise-free
examples (|Zl| = 1) and noisy examples (|Zl| &gt; 1). In order to represent noisy
(training and testing) examples, the ontological attribute A uses complex-nodes.
We will call such a hierarchy a complex-hierarchy.</p>
        <p>Algorithm 1 (Populating a complex-hierarchy). For each ontological attribute A
we proceed as follows.
We associate each training example t, described by a set Zl of values of A and
a class label Cj (t : A = Zl ∧ C = Cj), with a node Nk. When |Zl| = 1, Zl
is associated with a terminal-node Nk such that Zl = val(Nk). Otherwise, we
associate the training example with a complex-node Nk such that Zl ⊆ val(Nk),
at the lowest possible level of the complex-hierarchy; a sketch is given below.</p>
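        <p>A sketch of Algorithm 1, under the stated assumptions, follows (the function names and the counting structure are our own). Because the children of every node are pairwise disjoint, the lowest node Nk with Zl ⊆ val(Nk) is unique and can be found by descending from the root.</p>
        <preformat>
from collections import defaultdict

def associate(root, Z):
    """Return the lowest node Nk of the complex-hierarchy with Z ⊆ val(Nk)."""
    node = root
    while True:
        # at most one child can contain Z, since siblings are disjoint
        nxt = next((c for c in node.children if Z.issubset(c.values)), None)
        if nxt is None:
            return node
        node = nxt

def populate(root, training_examples):
    """training_examples: iterable of (Zl, Cj) pairs, each Zl a frozenset."""
    counts = defaultdict(lambda: defaultdict(int))  # node name, then class, then count
    for Z, c in training_examples:
        counts[associate(root, Z).name][c] += 1
    return counts
        </preformat>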
        <p>Fig. 1. The complex-hierarchy for the example: the root-node represents V; terminal-nodes N1, ..., N5 represent the singletons {V1}, ..., {V5}; complex-node N6 represents Z1 = {V3, V4, V5} and has children N3 and N7; complex-node N7 represents Z2 = {V4, V5} and has children N4 and N5.</p>
        <p>Example. Given is an attribute A such that V = {V1, V2, V3, V4, V5}, and given is
a class variable C taking values from the set {C1, C2}. Let us
assume that the description-noise is modeled by the sets Z1 = {V3, V4, V5} and
Z2 = {V4, V5}. Let us assume a sample scenario in which the single values of the
attribute A are determined by conducting three medical tests. The first test is
able to partition the set V into the following disjoint subsets: {V1}, {V2} and
Z1 = {V3, V4, V5}. If the result of the first test is Z1, then in some cases a
second test is conducted, which partitions the set Z1 into the following disjoint
subsets: {V3} and Z2 = {V4, V5}. Only in critical cases is the last test
conducted, which can partition the set Z2 into the disjoint subsets {V4} and {V5}.
Following this domain knowledge, we have introduced two complex-nodes N6 and
N7, such that they represent the sets Z1 and Z2, respectively. Terminal-nodes
N1, N2, N3, N4, N5 represent single values from the set V. The root-node
represents the set V. The resulting complex-hierarchy is presented in Figure 1.
We can approximate the required probability distribution for a noisy testing
example described by a set Zl = val(Nk), following principles of probability
theory, by collecting frequencies of training examples T described by sets Zm ⊆
Zl, as follows:</p>
        <p>P(Zl | Cj) = ( Σ_{Zm ⊆ Zl} |T : A = Zm ∧ C = Cj| ) / |T : C = Cj|   (1)</p>
        <p>Let us recall that a set Zl is assigned to the node Nk such that Zl =
val(Nk). The key property of an ontological attribute A is that for the node Nk
all its children are pairwise disjoint. Therefore, all training examples described
by sets Zm ⊆ Zl are represented by the node Nk or its descendants, and the
probability distribution for a noisy testing example described by a set Zl can be
defined as follows:
P(Zl | Cj) = ( |T : A = val(Nk) ∧ C = Cj| + Σ_{Nd ∈ de(Nk)} |T : A = val(Nd) ∧ C = Cj| ) / |T : C = Cj|   (2)</p>
        <p>In this way we are able to classify a new noisy example using other, less
noisy and noise-free training examples. For example, we can classify a testing
example described by the set Z1 and associated with the node N6 using all training
examples described by subsets of the set Z1. These training examples would
be associated with the complex-node N6 or its descendants.</p>
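        <p>For the running example, the estimate of Eq. (2) can be sketched as follows, reusing the illustrative Node, associate and populate helpers above; class_totals is assumed to hold |T : C = Cj| per class.</p>
        <preformat>
# Build the complex-hierarchy of Figure 1.
V = frozenset({"V1", "V2", "V3", "V4", "V5"})
root = Node("root", V)
root.add_child(Node("N1", frozenset({"V1"})))
root.add_child(Node("N2", frozenset({"V2"})))
n6 = root.add_child(Node("N6", frozenset({"V3", "V4", "V5"})))  # Z1
n6.add_child(Node("N3", frozenset({"V3"})))
n7 = n6.add_child(Node("N7", frozenset({"V4", "V5"})))          # Z2
n7.add_child(Node("N4", frozenset({"V4"})))
n7.add_child(Node("N5", frozenset({"V5"})))

def p_z_given_c(root, Z, c, counts, class_totals):
    """Estimate P(Zl | Cj) as in Eq. (2): counts at Nk plus counts at de(Nk)."""
    node = associate(root, Z)       # the node Nk with Zl = val(Nk)
    total = counts[node.name][c]
    for d in node.descendants():
        total += counts[d.name][c]
    return total / class_totals[c]
        </preformat>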
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>
        The topic of learning with ontologies is relatively new, and so far there are few
approaches in this line of research; for the classification task see, for example, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
A simple use of an ontology (Attribute Value Taxonomies) in the naïve Bayesian
classifier (AVT-NBL) is presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To the best of our knowledge, this is the only existing approach
for learning the naïve Bayesian classifier
from noisy (partially specified) data. Both in our approach and in AVT-NBL,
noisy (partially specified) data is represented using hierarchical structures, and
similar aggregation procedures are used. Let us notice, however, that AVT-NBL
requires a static, predefined taxonomy of attribute values. In our approach, the
hierarchy of sets of attribute values can be constructed dynamically, driven by the
observations and the hypotheses to prove. Moreover, our aggregation procedure
allows us to construct the complex-hierarchy from all possible subsets of attribute
values. In this way we would be able to model any noisy training or testing example
in order to achieve the highest classification accuracy, which is not possible
using an Attribute Value Taxonomy. Due to space limitations, this
generalization is not discussed in the paper. Let us point out that AVT-NBL
uses a propagation procedure that does not follow principles of probability
theory. Moreover, to the best of our knowledge, AVT-NBL does not classify noisy
instances, which is the main goal of our approach.
      </p>
      <p>In the future, we will concentrate on the problem of the optimality of the
complex-hierarchy derived from domain knowledge in the form of subsumption
hierarchies of concepts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. <string-name><surname>Hickey</surname>, <given-names>R.J.</given-names></string-name>: <article-title>Noise Modelling and Evaluating Learning from Examples</article-title>. <source>Artif. Intell.</source> <volume>82</volume>(<issue>1-2</issue>) (<year>1996</year>) <fpage>157</fpage>-<lpage>179</lpage></mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. <string-name><surname>Zhu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Wu</surname>, <given-names>X.</given-names></string-name>: <article-title>Class Noise vs. Attribute Noise: A Quantitative Study</article-title>. <source>Artif. Intell. Rev.</source> <volume>22</volume>(<issue>3</issue>) (<year>2004</year>) <fpage>177</fpage>-<lpage>210</lpage></mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. <string-name><surname>Clark</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Niblett</surname>, <given-names>T.</given-names></string-name>: <article-title>Induction in Noisy Domains</article-title>. In: <source>EWSL</source> (<year>1987</year>) <fpage>11</fpage>-<lpage>30</lpage></mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. <string-name><surname>Baader</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Calvanese</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>McGuinness</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Nardi</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Patel-Schneider</surname>, <given-names>P.</given-names></string-name>, eds.: <source>The Description Logic Handbook</source>. Cambridge University Press (<year>2003</year>)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. <string-name><surname>d'Amato</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Fanizzi</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Esposito</surname>, <given-names>F.</given-names></string-name>: <article-title>Distance-Based Classification in OWL Ontologies</article-title>. In <string-name><surname>Lovrek</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Howlett</surname>, <given-names>R.J.</given-names></string-name>, <string-name><surname>Jain</surname>, <given-names>L.C.</given-names></string-name>, eds.: <source>KES (2)</source>. Volume <volume>5178</volume> of Lecture Notes in Computer Science, Springer (<year>2008</year>) <fpage>656</fpage>-<lpage>661</lpage></mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. <string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Honavar</surname>, <given-names>V.</given-names></string-name>: <article-title>AVT-NBL: An Algorithm for Learning Compact and Accurate Naïve Bayes Classifiers from Attribute Value Taxonomies and Data</article-title>. In: <source>ICDM '04: Proceedings of the Fourth IEEE International Conference on Data Mining</source>, Washington, DC, USA, IEEE Computer Society (<year>2004</year>) <fpage>289</fpage>-<lpage>296</lpage></mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>