Learning Terminological Naïve Bayesian Classifiers Under Different Assumptions on Missing Knowledge

Pasquale Minervini, Claudia d'Amato, and Nicola Fanizzi
LACAM – Dipartimento di Informatica – Università degli Studi di Bari "Aldo Moro"
via E. Orabona, 4 - 70125 Bari - Italia
pasquale.minervini@uniba.it, {claudia.damato, fanizzi}@di.uniba.it

Abstract. Knowledge available through Semantic Web standards can easily be missing, generally because of the adoption of the Open World Assumption (i.e. the truth value of an assertion is not necessarily known). However, the rich relational structure that characterizes ontologies can be exploited for handling such missing knowledge in an explicit way. We present a Statistical Relational Learning system designed for learning terminological naïve Bayesian classifiers, which estimate the probability that a generic individual belongs to the target concept given its membership to a set of Description Logic concepts. During the learning process, we consistently handle the lack of knowledge that may be introduced by the adoption of the Open World Assumption, depending on the varying nature of the missing knowledge itself.

1 Introduction

On the Semantic Web (SW) [2], difficulties arise when trying to model real-world domains using purely logical formalisms, since real-world knowledge generally involves some degree of uncertainty or imprecision. In recognition of the need to soundly represent uncertain knowledge, the World Wide Web Consortium (W3C) created, in 2007, the Uncertainty Reasoning for the World Wide Web Incubator Group¹ (URW3-XG), with the aim of identifying the requirements for representing and reasoning with uncertain knowledge in Web-based information.
Several approaches to representation and inference with knowledge enriched with probabilistic information have been proposed: some extend knowledge representation formalisms actually used in the SW; others rely on probabilistic enrichments of Description Logics or logic programming formalisms.

Motivation

The main problem in applying these approaches in real settings is that they almost always assume the availability of probabilistic information. However, except in rare cases, this information is hardly known in advance. A method that, by exploiting the information available in the data, i.e. an already designed and populated ontology, is able to capture the necessary probabilistic information would therefore be of great help.

Also, when relying on SW knowledge bases for reasoning under the Open World Assumption (OWA) (e.g. when OWL is considered as a syntactic variant of some Description Logic [1]), it is not always possible to know the truth value of an assertion: under the OWA, a statement is true or false only if its truth value can be formally derived. As a consequence, there can be cases (e.g. determining whether an individual is a member of a given concept) in which the truth value cannot be determined: it can be derived neither that the individual is an instance of the considered concept nor that it is an instance of the negated concept. This is opposed to the Closed World Assumption (CWA), employed by a multitude of first-order logic fragments and in the database setting, where every statement that cannot be proved to be true is assumed to be false.

¹ http://www.w3.org/2005/Incubator/urw3/

Related Work

Within the SW, Machine Learning (ML) is going to play a relevant role in the analysis of distributed data sources described using SW standards [24], with the aim of discovering new and refining existing knowledge.
A collection of ML approaches oriented to the SW have already been proposed in the literature, ranging from propositional and single-relational (e.g. SPARQL-ML [14], or approaches based on low-rank matrix approximation techniques such as [24, 25]) to multi-relational (e.g. distance-based [6, 9] or kernel-based [10, 3]). In the class of multi-relational learning methods, Statistical Relational Learning (SRL) [13] seems particularly appealing, being designed for learning in domains with both a complex relational and a rich probabilistic structure; the URW3-XG provided in [16] a large group of situations in which knowledge on the SW needs to represent uncertainty, ranging from recommendation and extraction/annotation to belief fusion/opinion pooling and healthcare/life sciences. There have already been some proposals regarding the adaptation and application of SRL systems to the SW: e.g. [7] proposes to employ Markov Logic Networks [21] for first-order probabilistic inference and learning within the SW, and [18] proposes to learn first-order probabilistic theories in a probabilistic extension of the ALC Description Logic named CRALC.

However, such ML techniques make strong assumptions about the nature of the missing knowledge (e.g. both matrix completion methods and the technique proposed in [18] inherently assume data is Missing at Random [23], while Markov Logic Networks resort to the Closed World Assumption during learning). Learning from incomplete knowledge bases by adopting methods not coherent with the nature of the missing knowledge itself (e.g. expecting it to be Missing at Random while it is Informatively Missing) can lead to misleading results with respect to the real model followed by the data [22].

We realised an SRL system for incrementally inducing a terminological naïve Bayesian classifier, i.e.
a naïve Bayesian network modelling the conditional dependencies between a learned set of (complex) Description Logic concepts and a target atomic concept the system aims to learn. Our system is focused on the SW, being able to learn classifiers with a structure that is both logically and statistically rich, and to deal with the missing knowledge resulting from the adoption of the OWA with methods that are consistent with the assumed nature of the missing knowledge (i.e. Missing Completely at Random, Missing at Random or Informatively Missing).

In the rest of this paper, we will first describe Bayesian networks (and some extensions we will employ to deal with some potentially problematic cases); then we will describe our probabilistic-logical model, terminological Bayesian classifiers, and the problem of learning them from a set of training individuals and a Description Logic knowledge base. In the last part, we will describe our learning algorithm, and its adaptations for learning under different assumptions on the ignorance model.

2 Bayesian Networks and Robust Bayesian Estimation

Graphical models (GMs) [19] are a popular framework to compactly describe the joint probability distribution for a set of random variables, by representing the underlying structure through a series of modular factors. Depending on the underlying semantics, GMs can be grouped into two main classes: directed graphical models, which are based on directed graphs, and undirected graphical models, which are based on undirected graphs.

A Bayesian network (BN) is a directed GM which represents the conditional dependencies in a set of random variables by using a directed acyclic graph (DAG) G augmented with a set of conditional probability distributions Θ_G associated with G's vertices. In such a graph, each vertex corresponds to a random variable X_i (e.g. an observable quantity, a set of unknown parameters, etc.)
and each edge indicates a direct influence relation between the two random variables; this makes it possible to define conditional independence relations among the variables, each of which is independent of any of its non-descendants, given the value of its parent variables.

A BN stipulates a set of conditional independence assumptions, also called local Markov assumptions, over its set of random variables: each vertex X_i in the DAG is conditionally independent of any subset S ⊆ Nd(X_i) of vertices that are not descendants of X_i, given a joint state of its parents:

∀X_i : Pr(X_i | S, parents(X_i)) = Pr(X_i | parents(X_i));

where the function parents(X_i) returns the parent vertices of X_i in the DAG representing the BN. The conditional independence assumption makes it possible to represent the joint probability distribution Pr(X_1, ..., X_n) defined by a BN over a set of random variables {X_1, ..., X_n} as a product of the individual probability distributions, conditional on their parent variables:

Pr(X_1, ..., X_n) = ∏_{i=1}^{n} Pr(X_i | parents(X_i));

As a result, it is possible to define Pr(X_1, ..., X_n) by only specifying, for each vertex X_i in the graph, the conditional probability distribution Pr(X_i | parents(X_i)).

Given a BN specifying a joint probability distribution over a set of variables, it is possible to evaluate inference queries by marginalization, e.g. calculating the posterior probability distribution for a set of query variables given some observed event (i.e. an assignment of values to the set of evidence variables). Exact inference for general BNs is an NP-hard problem, but algorithms exist to perform inference efficiently in restricted classes of networks, such as variable elimination, which has linear complexity in the number of vertices if the BN is a singly connected network [15]. Approximate inference methods also exist in the literature, such as Monte Carlo algorithms, which provide approximate answers whose accuracy depends on the number of samples generated.
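The factorization and the marginalization query above can be sketched in a few lines of Python. This is a toy illustration with invented numbers; the CPD layout and variable names are ours, not from the paper:

```python
from itertools import product

# Toy BN over binary variables: class C with two children F1, F2.
# CPDs are keyed by the tuple of parent values (numbers are made up).
parents = {"C": (), "F1": ("C",), "F2": ("C",)}
cpd = {
    "C":  {(): {True: 0.3, False: 0.7}},
    "F1": {(True,): {True: 0.8, False: 0.2}, (False,): {True: 0.1, False: 0.9}},
    "F2": {(True,): {True: 0.6, False: 0.4}, (False,): {True: 0.5, False: 0.5}},
}

def joint(assignment):
    """Pr(X_1, ..., X_n) as the product of the Pr(X_i | parents(X_i))."""
    p = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p *= cpd[var][key][assignment[var]]
    return p

def posterior(query_var, value, evidence):
    """Pr(query_var = value | evidence) by brute-force marginalization
    over the hidden variables (feasible only for tiny networks)."""
    def unnorm(v):
        fixed = dict(evidence, **{query_var: v})
        hidden = [x for x in parents if x not in fixed]
        return sum(
            joint(dict(fixed, **dict(zip(hidden, vals))))
            for vals in product([True, False], repeat=len(hidden))
        )
    num = unnorm(value)
    return num / (num + unnorm(not value))
```

Brute-force enumeration is exponential in the number of hidden variables; variable elimination avoids this blow-up on singly connected networks, as noted above.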
Other methods in this family, such as belief propagation or variational methods, approximate sums of random variables through their means [15].

However, finding an optimal structure for a BN may not be trivial: the number of possible DAG structures is super-exponential (O(2^{f(n)}), with f(n) = n^{1+ε}, ε > 0) in the number of vertices (r_4 = 543, r_8 ≈ 7.8 × 10^{11}, r_{12} ≈ 5.2 × 10^{26}), making it impractical, in many cases, to perform an exhaustive search through the space of possible structures. Therefore, in our approach, we tried to find an acceptable trade-off between efficiency and expressiveness, so as to make our method suitable for a context like the SW: we decided to focus on a particular subclass of Bayesian networks, i.e. naïve Bayesian networks, modelling the dependencies between a set of random variables F = {F_1, ..., F_n}, also called features, and a random variable C, also called the class, so that each pair of features is independent given the class, i.e. ∀F_i, F_j ∈ F : i ≠ j ⇒ (F_i ⊥⊥ F_j | C). This kind of model is especially interesting since it has proved effective even in contexts in which the underlying independence assumptions are violated [8], sometimes outperforming more recent approaches [4].

However, defining a BN requires a number of precise probability assessments which, as we will see, will not always be possible to obtain. A generalisation of naïve Bayesian networks to probability intervals is the robust Bayesian estimator (RBE) [20]: each conditional probability in the network is a probability interval characterised by its lower and upper bounds, defined respectively as \underline{Pr}(A) = min_{Pr ∈ P} Pr(A) and \overline{Pr}(A) = max_{Pr ∈ P} Pr(A).
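As a rough sketch of how such bounds arise and how interval-valued posteriors can still yield class labels, the following Python fragment (our own simplification, not the actual RBE estimator of [20]) derives an interval for a Bernoulli parameter by filling the missing cases in the two extreme ways, and labels an individual by stochastic dominance when the two posterior intervals are disjoint, falling back to interval midpoints otherwise:

```python
def interval_estimate(n_pos, n_neg, n_missing):
    """Interval estimate of Pr(A) from counts with missing cases:
    the lower bound counts every missing case against A, the upper
    bound counts every missing case in favour of A."""
    total = n_pos + n_neg + n_missing
    lower = n_pos / total
    upper = (n_pos + n_missing) / total
    return lower, upper

def dominance_label(pos_interval, neg_interval):
    """Assign 'C' / 'not C' from two posterior probability intervals."""
    lo_p, hi_p = pos_interval
    lo_n, hi_n = neg_interval
    if lo_p > hi_n:            # intervals disjoint: stochastic dominance
        return "C"
    if lo_n > hi_p:
        return "not C"
    # overlapping intervals: compare midpoints (weak dominance)
    return "C" if (lo_p + hi_p) / 2 > (lo_n + hi_n) / 2 else "not C"
```

The width of the returned interval grows with `n_missing`, mirroring the observation (made later for RBE) that interval width is proportional to the amount of missing knowledge.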
The main problem with this approach is assigning class labels after having calculated the posterior probability intervals: if the two resulting intervals do not overlap, it is possible to apply the so-called stochastic dominance criterion, which assigns a generic individual a to a target concept C iff \underline{Pr}(C(a)) > \overline{Pr}(¬C(a)). If the intervals overlap, to avoid undecidability, it is still possible to use a weaker criterion, called the weak dominance criterion [20], which collapses each probability interval into the single probability value given by its middle point (which indeed makes some assumptions on the distribution of the missing values). A similar approach, founded on imprecise probability theory, is presented in [5], which proposes using a credal network (structurally similar to a BN, but where the conditional probability densities belong to convex sets of mass functions) to represent uncertainty about the network parameters.

3 Terminological Naïve Bayesian Classifiers

The learning problem we intend to focus on consists in learning a terminological naïve Bayesian classifier N_K; this is defined as a naïve BN modelling the dependency relations between a set of Description Logic (DL) concepts (also referred to as feature concepts) and a target atomic concept C, given a set of training individuals. Feature concepts may possibly be complex, and the training individuals are distinguished into positive, negative and neutral ones, belonging respectively to the target concept C, to ¬C, or having an unknown membership with respect to C.

A DL Knowledge Base (KB) K is typically constituted by (at least) two main components, a TBox T and an ABox A:

– TBox – which introduces the terminology of an application domain, in terms of axioms describing concept hierarchies;
– ABox – which contains assertions (ground axioms) about named individuals in terms of this terminology.
A terminological Bayesian classifier can be defined as follows:

Definition 1 (Terminological Bayesian Classifier). A terminological Bayesian classifier N_K, with respect to a DL KB K, is defined as a pair ⟨G, Θ_G⟩, where:

– G = ⟨V, E⟩ is a directed acyclic graph, in which:
  • V = {F_1, ..., F_n, C} is a set of vertices, each F_i representing a DL (possibly complex) concept defined over K, and C representing a target atomic concept;
  • E ⊆ V × V is a set of edges, modelling the dependence relations between the elements of V;
– Θ_G is a set of conditional probability distributions (CPDs), one for each V ∈ V, representing the conditional probability of the feature concept given the state of its parents in the graph;

in which the membership probability of a generic individual a to the target concept C (or ¬C) is estimated using BN inference techniques, given the membership of a to the concepts in V.

In particular, a terminological naïve Bayesian classifier is characterised by the following structure: E = {⟨C, F_i⟩ | i ∈ {1, ..., n}} (i.e. each feature concept is independent of the other feature concepts, given the value of the target atomic concept).

Example 1 (Example of a Terminological Naïve Bayesian Classifier). Given a set of DL feature concepts F = {Female, HasChild := ∃hasChild.Person}² and a target concept Father, a terminological naïve Bayesian classifier expressing the target concept in terms of the feature concepts is the following: the class node Father has two child nodes, Female (with CPDs Pr(Female | Father), Pr(Female | ¬Father)) and HasChild := ∃hasChild.Person (with CPDs Pr(HasChild | Father), Pr(HasChild | ¬Father)).

Let K be a DL KB and a a generic individual such that K ⊨ HasChild(a), and suppose the membership of a to the concept Female is not known, i.e. K ⊭ Female(a) and K ⊭ ¬Female(a).
It is possible to infer, through the given network, the probability that the individual a is a member of the target atomic concept Father:

Pr(Father(a)) = Pr(Father) · Pr(HasChild | Father) / Σ_{Father' ∈ {Father, ¬Father}} Pr(Father') · Pr(HasChild | Father');

² In examples, variable names are used instead of complex feature concepts for brevity.

In the following we define the problem of learning a terminological Bayesian classifier N_K given a DL KB K and the training individuals Ind_C(A):

Definition 2 (Terminological Bayesian Classifier Learning Problem). Our terminological naïve Bayesian classifier learning problem consists in finding a network N*_K that maximizes the quality of the network with respect to the training instances and a specific scoring function; formally:

Given
– a target concept C we aim to learn;
– a DL KB K = ⟨T, A⟩, where the ABox A contains membership assertions about individuals and C, while the TBox T does not contain axioms involving C;
– the disjoint sets of positive, negative and neutral examples for C, denoted with Ind⁺_C(A), Ind⁻_C(A) and Ind⁰_C(A), such that:
  • ∀a ∈ Ind⁺_C(A) : C(a) ∈ A,
  • ∀a ∈ Ind⁻_C(A) : ¬C(a) ∈ A,
  • ∀a ∈ Ind⁰_C(A) : C(a) ∉ A ∧ ¬C(a) ∉ A;
– a scoring function specifying the quality of an induced terminological Bayesian classifier N_K with respect to the samples in Ind_C(A) = ⋃_{v ∈ {+,−,0}} Ind^v_C(A) and a scoring criterion;

Find a network N*_K maximizing the scoring function with respect to the samples:

N*_K ← arg max_{N_K} score(N_K, Ind_C(A)).

Our search space for finding the optimal network N*_K may be too large to explore exhaustively; therefore our learning algorithm, outlined in Alg. 1, works by greedily searching the space of features (i.e. complex DL concepts) for the ones that maximize the score of the induced network, with respect to a scoring function, and incrementally building the resulting network.
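Note that Female does not appear in the formula above: in a naïve Bayesian classifier, a feature whose membership is unknown simply drops out of the product, since its CPD sums to one given the class. A minimal Python sketch of this computation (the probability values are hypothetical; only the structure comes from the example):

```python
def nb_posterior(prior_c, cpt, observed):
    """Posterior Pr(C | observed memberships) in a naive Bayesian classifier.
    cpt[f] = (Pr(f | C), Pr(f | not C)); `observed` maps feature -> True/False.
    Features absent from `observed` (unknown membership) are left out of the
    product, which marginalizes them away."""
    num_c, num_nc = prior_c, 1.0 - prior_c
    for f, value in observed.items():
        p_c, p_nc = cpt[f]
        num_c  *= p_c  if value else 1.0 - p_c
        num_nc *= p_nc if value else 1.0 - p_nc
    return num_c / (num_c + num_nc)

# Hypothetical parameters for the Father example: Pr(Father) = 0.4,
# Pr(HasChild | Father) = 0.9, Pr(HasChild | not Father) = 0.2, etc.
cpt = {"HasChild": (0.9, 0.2), "Female": (0.3, 0.5)}
p = nb_posterior(0.4, cpt, {"HasChild": True})   # Female unknown, dropped
```

With these made-up numbers, `p = 0.4 · 0.9 / (0.4 · 0.9 + 0.6 · 0.2) = 0.75`.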
While the features are added one by one, the search in the space of complex DL concepts is performed through a beam search, employing the closure ρ↓^{cl} of the downward refinement operator ρ↓ described in [17]. For each new complex concept being evaluated, the algorithm creates a new set of concepts V', finds the optimal structure E' (under a given set of constraints; in the case of terminological naïve Bayesian classifiers, the structure is already defined) and the corresponding maximum likelihood parameters Θ'_G (which may vary depending on the assumptions on the nature of the ignorance model), and then scores the new network with respect to a scoring criterion.

Different Assumptions on the Ignorance Model

Let K = ⟨T, A⟩ be a DL KB; under the OWA, it is not always possible to know whether a generic DL assertion α is entailed by K or not (i.e. there may be cases in which K ⊭ α and K ⊭ ¬α). This allows us to characterize our lack of knowledge about concept-memberships through the probability distribution of the ignorance model [23]. Given a generic concept C, a generic individual a and a DL KB K*, let I be an ignorance model through which we extract a fragment of K*, I(K*) = K (so that, for every α, K ⊨ α ⇒ K* ⊨ α, while K* ⊨ α ⇏ K ⊨ α). Let N_K denote a probabilistic model that, from a DL KB K, calculates the probability that the concept-membership relation between C and a is unknown.
We can say that the ignorance model underlying the concept-membership relation between a and C in K (with respect to a, K* and the aforementioned probabilistic model) is:

– MCAR (Missing Completely at Random) – when the probability for such a concept-membership to be missing is independent of the knowledge on a available in K*:
  Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K*) = Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a));
– MAR (Missing at Random) – when the probability for such a concept-membership to be missing depends only on the knowledge on a available in K:
  Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K*) = Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K);
– NMAR (Not Missing at Random, also referred to as IM, Informatively Missing) – when the probability for such a concept-membership to be missing depends on the knowledge on a available in K*:
  Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K*) ≠ Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K).

Algorithm 1 Algorithm for Learning Terminological Bayesian Classifiers
Require: DL KB K = ⟨T, A⟩; concept Start; beam width w; search depth d; maximum concept description length maxLen; positive, negative and neutral training individuals Ind_C(A);
Ensure: N_K = ⟨G, Θ_G⟩, G = ⟨V ← {C}, E ← ∅⟩;
1: repeat
2:   Best ← ∅; Beam ← {Start}; NewBeam ← ∅;
3:   repeat
4:     for c ∈ Beam do
5:       for c' ∈ {ρ↓^{cl}(c) | |c'| ≤ min(|c| + d, maxLen)} do
6:         N'_K ← optimalNetwork(V ∪ {c'}, Ind_C(A));
7:         s' ← score(N'_K, Ind_C(A));
8:         NewBeam ← NewBeam ∪ {⟨N'_K, s'⟩};
9:       end for
10:     end for
11:     Best ← arg max_{⟨N'_K, s'⟩} {s' : ⟨N'_K, s'⟩ ∈ NewBeam ∪ {Best}};
12:     Beam ← selectFrom(NewBeam, w); NewBeam ← ∅;
13:   until stopping criterion on Beam;
14:   N_K ← N'_K : ⟨N'_K, s'⟩ = Best;
15: until stopping criterion on N_K;

Specifically, in our algorithm, the outer loop (lines 1–15) greedily searches for a new (complex) concept definition whose addition increases the network's quality on the given sample instances (determined by a scoring function score).
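The inner loop of Algorithm 1 can be pictured with a generic beam-search skeleton in Python. Here `refine` and `score` are hypothetical stand-ins for ρ↓^{cl} and the scoring function, and concepts are plain strings rather than DL descriptions:

```python
def beam_search(start, refine, score, width=3, depth=4, max_len=6):
    """Beam search over a refinement space (sketch of Alg. 1's inner loop).
    At each level, all admissible refinements of the beam are scored and
    only the `width` best are kept for further refinement."""
    best, best_score = start, score(start)
    beam = [start]
    for _ in range(depth):
        candidates = []
        for c in beam:
            for c2 in refine(c):
                if len(c2) <= max_len:         # length bound, as in Alg. 1
                    candidates.append((score(c2), c2))
        if not candidates:                      # stopping criterion on the beam
            break
        candidates.sort(key=lambda p: p[0], reverse=True)
        if candidates[0][0] > best_score:       # track the overall best
            best_score, best = candidates[0]
        beam = [c for _, c in candidates[:width]]
    return best, best_score

# Toy refinement space: strings over {A, B}, scored by occurrences of "AB".
best, s = beam_search("", lambda c: [c + "A", c + "B"],
                      lambda c: c.count("AB"), width=2, depth=4, max_len=6)
```

In the actual system, scoring a candidate is far more expensive than here, since each evaluation requires fitting and scoring a whole network via `optimalNetwork` and `score`.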
The search through the space of concept definitions is performed in the inner loop (lines 3–13) through a beam search: starting from an initial concept Start, for each refinement level, all refinements up to a given length are memorized in a priority queue NewBeam (sorted according to the score associated to the network obtained by adding them to the set of feature concepts), from which only the w with the highest score are selected, by the selection function selectFrom, to be refined in the next iteration.

The functions optimalNetwork and score are used, respectively, to find the optimal Bayesian network structure over the nodes in the network (possibly under a set of constraints, as in the naïve Bayes case or some of its extensions) and to score a classifier (so as to compare its effectiveness with others). However, those two functions are sensitive to the assumptions made about the ignorance model.

When the assumed ignorance model is MCAR, we are allowed to use an approach called available case analysis [15], in which we build an unbiased estimator of the network parameters based only on the available knowledge. A scoring function we realised for this case is the network's log-likelihood on the training data, calculated only on positive and negative training individuals, ignoring the individuals whose concept-membership relation with the target concept C is unknown, and defined as:

L(N_K | Ind_C(A)) = log Pr(N_K) + Σ_{a ∈ Ind⁺_C(A)} log Pr(C(a) | N_K) + Σ_{a ∈ Ind⁻_C(A)} log Pr(¬C(a) | N_K);

Another approach we implemented consists in ranking both positive and negative training individuals a according to Pr(C(a) | N_K), and then calculating the area under the Precision-Recall curve using different acceptance thresholds.

Under the naïve Bayes assumption, there is no need to perform a search for finding the optimal network structure, since the structure is already fixed (each node except the target concept node has only one parent, i.e.
the target concept node); otherwise, finding a network structure that is optimal under some criterion (e.g. the BIC score [15]) may require an exhaustive search in the space of possible structures. However, for tree-augmented naïve Bayesian networks (which allow for a tree structure among the feature nodes), it is possible to efficiently compute the optimal structure employing the method in [12], making them appealing for real-life applications requiring efficiency and scalability.

In the MAR case, possible solutions for learning models accounting for missing knowledge are the Expectation-Maximization (EM) algorithm, MCMC sampling and gradient ascent methods [15]. We use EM to learn terminological naïve Bayesian classifiers from MAR data. In our approach, outlined in Alg. 2, we first heuristically estimate the network's parameters by only using the available data; then, in order to find the maximum likelihood parameters with respect to both observed and missing knowledge, we consider each individual whose membership to a particular concept description D is not known as several fractional individuals belonging, with different weights (corresponding to the posterior probability of their class membership), to both the components D and ¬D.
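The fractional-individual idea can be illustrated with a self-contained EM sketch for a one-feature classifier (an assumed setup for illustration, not the system's actual implementation): individuals with unknown membership to the target concept are split into two weighted fractional individuals in the E-step, and the weighted counts re-estimate the parameters in the M-step.

```python
def em_naive_bayes(pos, neg, unlabeled, n_iter=50):
    """EM for a one-feature naive Bayes model. `pos`/`neg` hold boolean
    feature values of individuals known to belong to C / not-C;
    `unlabeled` holds feature values of individuals whose membership
    to C is unknown (assumed MAR)."""
    # Initial parameters from available data only (with tiny smoothing).
    prior  = (len(pos) + 1) / (len(pos) + len(neg) + 2)
    p_f_c  = (sum(pos) + 1) / (len(pos) + 2)
    p_f_nc = (sum(neg) + 1) / (len(neg) + 2)
    for _ in range(n_iter):
        # E-step: posterior membership weight for each unlabeled individual.
        weights = []
        for f in unlabeled:
            lc  = prior * (p_f_c if f else 1 - p_f_c)
            lnc = (1 - prior) * (p_f_nc if f else 1 - p_f_nc)
            weights.append(lc / (lc + lnc))
        # M-step: re-estimate parameters from expected (fractional) counts.
        n_c  = len(pos) + sum(weights)
        n_nc = len(neg) + sum(1 - w for w in weights)
        prior  = n_c / (n_c + n_nc)
        p_f_c  = (sum(pos) + sum(w for f, w in zip(unlabeled, weights) if f)) / n_c
        p_f_nc = (sum(neg) + sum(1 - w for f, w in zip(unlabeled, weights) if f)) / n_nc
    return prior, p_f_c, p_f_nc
```

In the terminological setting, the likelihoods in the E-step are obtained by inference through the network, using the known concept-memberships as evidence.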
Formally, the EM algorithm for parameter learning explores the space of possible parameters through an iterative hill-climbing search, converging to a (local) maximum likelihood estimate of the unknown parameters, where the (log-)likelihood (which we also use as a scoring criterion) is defined as follows:

L(N_K | Ind_C(A)) = log Pr(N_K) + Σ_{a ∈ Ind⁰_C(A)} Σ_{C' ∈ {C, ¬C}} Pr(C' | N_K) · log Pr(C'(a) | N_K) + Σ_{a ∈ Ind⁺_C(A)} log Pr(C(a) | N_K) + Σ_{a ∈ Ind⁻_C(A)} log Pr(¬C(a) | N_K);

At each iteration, the EM algorithm applies the following two steps:

– Expectation step – using the available data and the current network parameters, calculate a distribution over the possible completions of the missing knowledge;
– Maximization step – considering each possible completion as a fully available data case (weighted by its probability), calculate the next parameters using (weighted) frequency counting.

In our use of the EM algorithm, the E-step calculates the concept-membership posterior probability (by inference through the network) of each individual whose concept-membership relation is unknown, thus completing the data through so-called expected counts. Then, the M-step calculates a new estimate of the network's conditional probability distributions by using the expected counts, maximizing the log-likelihood of both available and missing data with respect to a network N_K.

Regarding the search for optimal structures for networks with fewer restrictions on their structure (such as tree-augmented naïve BNs or unrestricted BNs) from MAR data, it is possible to employ the Structural EM (SEM) algorithm [11]. In SEM, the maximization step is performed both in the space of structures G and in the space of parameters Θ_G, by first searching for a better structure and then for the best parameters associated to the given structure; it can be proven that, if the search procedure finds a structure that is better than the one used in the previous iteration with respect to e.g.
the BIC score, then the Structural EM algorithm will monotonically improve the score.

When knowledge is NMAR, it is generally possible to extend the probabilistic model to produce one in which the MAR assumption holds; e.g. if a feature concept F_i follows a NMAR ignorance model, with respect to a generic individual a and a DL KB K, we can consider its observability as an additional variable (e.g. Y_i = 0 iff K ⊭ F_i(a) ∧ K ⊭ ¬F_i(a), Y_i = 1 otherwise) in our probabilistic model, so that F_i's ignorance model satisfies the MAR assumption (since its missingness depends on an always observable variable).

An alternative solution is to resort to robust Bayesian estimation (RBE) [20], to learn conditional probability distributions without making any sort of assumption about the nature of the missing data. RBE finds probability intervals instead of single probability values, obtained by taking into account all the possible fillings of the missing knowledge; the width of the inferred intervals is therefore directly proportional to the quantity of missing knowledge considered during the learning process. To score each newly induced network, we employ the framework proposed in [26] to compare credal networks, while

Algorithm 2 Outline of our implementation of the EM algorithm for parameter learning in a terminological Bayesian classifier, assuming the underlying ignorance model is MAR.
function ExpectedCounts(N_K, Ind_C(A))
1: N_K = ⟨G, Θ_G⟩, G = ⟨V, E⟩;
2: for X_i ∈ V do
3:   for ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i)) do
4:     {n̄(x_i, π_{x_i}) contains the expected count for (X_i = x_i, parents(X_i) = π_{x_i})}
5:     n̄(x_i, π_{x_i}) ← 0;
6:   end for
7: end for
8: for a ∈ Ind_C(A) do
9:   for X_i ∈ V do
10:    for ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i)) do
11:      {Each expected count n̄(x_i, π_{x_i}) is obtained by summing out the probability assignments to the concept memberships (X_i = x_i, parents(X_i) = π_{x_i}) for each individual, calculated using the background knowledge K and, if they are only partially known, by inference through the network N_K}
12:      n̄(x_i, π_{x_i}) ← n̄(x_i, π_{x_i}) + Pr(x_i, π_{x_i} | N_K);
13:    end for
14:  end for
15: end for
16: return {n̄(x_i, π_{x_i}) | X_i ∈ V, ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i))};

function ExpectationMaximization(N_K⁰, Ind_C(A))
1: {The network is first initialized with arbitrary heuristic parameters Θ_G⁰}
2: N_K⁰ = ⟨G, Θ_G⁰⟩, G = ⟨V, E⟩;
3: t ← 0;
4: repeat
5:   {n̄(x_i, π_{x_i})} ← ExpectedCounts(N_K^t, Ind_C(A));
6:   for X_i ∈ V do
7:     for ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i)) do
8:       θ_G^{t+1}(x_i, π_{x_i}) ← n̄(x_i, π_{x_i}) / Σ_{x'_i ∈ vals(X_i)} n̄(x'_i, π_{x_i});
9:     end for
10:  end for
11:  t ← t + 1;
12:  N_K^t = ⟨G, Θ_G^t⟩;
13:  {The EM loop ends when the improvement of the network's log-likelihood falls below a given threshold τ}
14: until L(N_K^t | Ind_C(A)) − L(N_K^{t−1} | Ind_C(A)) ≤ τ;
15: return N_K^t;

we have not yet implemented a method to search for structures other than naïve Bayesian ones.

Example 2 (Terminological Naïve Bayesian Classifier using Robust Bayesian Estimation). The following is a terminological naïve Bayesian classifier using robust Bayesian estimation for inferring posterior probability intervals in the presence of NMAR knowledge. In this network, the conditional probability tables associated to each node contain probability intervals instead of single probability values, each defined by its lower and upper bound.
The network has class node Fa := Father with child nodes Fe := Female (with interval CPDs [\underline{Pr}(Female | Father), \overline{Pr}(Female | Father)] and [\underline{Pr}(Female | ¬Father), \overline{Pr}(Female | ¬Father)]) and HC := ∃hasChild.Person (with interval CPDs [\underline{Pr}(HasChild | Father), \overline{Pr}(HasChild | Father)] and [\underline{Pr}(HasChild | ¬Father), \overline{Pr}(HasChild | ¬Father)]).

Inference, using such a network, can be performed as follows: given a generic individual a such that K ⊨ HC(a), the posterior probability that a is a member of Fa is represented by the probability interval [\underline{Pr}(Fa | HC), \overline{Pr}(Fa | HC)], where:

\underline{Pr}(Fa(a)) = \underline{Pr}(Fa | HC) = \underline{Pr}(HC | Fa) · \underline{Pr}(Fa) / (\underline{Pr}(HC | Fa) · \underline{Pr}(Fa) + \overline{Pr}(HC | ¬Fa) · \overline{Pr}(¬Fa));

\overline{Pr}(Fa(a)) = \overline{Pr}(Fa | HC) = \overline{Pr}(HC | Fa) · \overline{Pr}(Fa) / (\overline{Pr}(HC | Fa) · \overline{Pr}(Fa) + \underline{Pr}(HC | ¬Fa) · \underline{Pr}(¬Fa));

4 Conclusions and Future Work

We presented a Statistical Relational Learning method designed for learning terminological naïve Bayesian classifiers, a ML method based on the naïve Bayes assumption for estimating the probability that a generic individual belongs to a certain target concept, given its membership relation to an induced set of complex Description Logic concepts. We gave a characterisation of the lack of knowledge that may be introduced by the OWA depending on the underlying ignorance model, and handled such missing knowledge under different assumptions on the nature of the missing knowledge itself (i.e. Missing Completely at Random, Missing at Random or Informatively Missing). In the future, we aim at estimating computationally the ignorance model followed by each feature, at developing new methods to exploit the potential information contained in the knowledge's missingness, and at evaluating our methods' effectiveness on real-world ontologies.

References

[1] OWL 2 Web Ontology Language Direct Semantics (October 2009), http://www.w3.org/TR/owl2-direct-semantics/
[2] Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (May 2001)
[3] Bicer, V., et al.: Relational kernel machines for learning from graph-structured rdf data. In: Antoniou, G., et al.
(eds.) ESWC (1). LNCS, vol. 6643, pp. 47–62. Springer (2011)
[4] Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: ICML 2006. pp. 161–168. ACM, New York, NY, USA (2006)
[5] Corani, G., Zaffalon, M.: Naive credal classifier 2: an extension of naive bayes for delivering robust classifications. In: DMIN. pp. 84–90 (2008)
[6] d'Amato, C., Fanizzi, N., Esposito, F.: Query answering and ontology population: an inductive approach. In: ESWC 2008. pp. 288–302. Springer (2008)
[7] Domingos, P., et al.: Uncertainty reasoning for the semantic web i, chap. Just Add Weights: Markov Logic for the Semantic Web, pp. 1–25. Springer (2008)
[8] Domingos, P., Pazzani, M.J.: On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2-3), 103–130 (1997)
[9] Fanizzi, N., et al.: Reduce: A reduced coulomb energy network method for approximate classification. In: Aroyo, L., et al. (eds.) ESWC. pp. 323–337. Springer (2009)
[10] Fanizzi, N., d'Amato, C., Esposito, F.: Learning with kernels in description logics. In: ILP 2008. pp. 210–225. Springer (2008)
[11] Friedman, N.: The Bayesian structural EM algorithm. In: UAI 1998. pp. 129–138. Morgan Kaufmann Publishers Inc., San Francisco, CA (1998)
[12] Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Machine Learning 29, 131–163 (1997)
[13] Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press (2007)
[14] Kiefer, C., et al.: Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods. In: ESWC 2008. LNCS, vol. 5021, pp. 478–492. Springer (2008)
[15] Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
[16] Laskey, K.J., Laskey, K.B.: Uncertainty reasoning for the world wide web: Report on the urw3-xg incubator group. In: Bobillo, F., et al. (eds.)
URSW. CEUR Workshop Proceedings, vol. 423. CEUR-WS.org (2008)
[17] Lehmann, J., Hitzler, P.: Concept learning in description logics using refinement operators. Mach. Learn. 78, 203–250 (2010)
[18] Luna, J.E.O., Cozman, F.G.: An algorithm for learning with probabilistic description logics. In: Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.) URSW. CEUR Workshop Proceedings, vol. 527, pp. 63–74. CEUR-WS.org (2009)
[19] Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1988)
[20] Ramoni, M., Sebastiani, P.: Robust learning with missing data. Mach. Learn. 45, 147–170 (October 2001)
[21] Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62, 107–136 (February 2006)
[22] Rodrigues de Morais, S., Aussem, A.: Exploiting data missingness in bayesian network modeling. In: IDA 2009. pp. 35–46. Springer (2009)
[23] Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
[24] Tresp, V., et al.: Uncertainty reasoning for the semantic web i, chap. Towards Machine Learning on the Semantic Web, pp. 282–314. Springer (2008)
[25] Tresp, V., Huang, Y., Bundschus, M., Rettinger, A.: Materializing and querying learned knowledge. In: IRMLeS 2009 (2009)
[26] Zaffalon, M., Corani, G., Mauá, D.: Utility-based accuracy measures to empirically evaluate credal classifiers. In: ISIPTA 2011. pp. 401–410. Innsbruck (2011)