Learning Terminological Naïve Bayesian Classifiers Under Different Assumptions on Missing Knowledge

Pasquale Minervini, Claudia d'Amato, and Nicola Fanizzi
LACAM – Dipartimento di Informatica – Università degli Studi di Bari "Aldo Moro"
via E. Orabona, 4 - 70125 Bari - Italia
pasquale.minervini@uniba.it, {claudia.damato, fanizzi}@di.uniba.it

Abstract. Knowledge available through Semantic Web standards can easily be missing, generally because of the adoption of the Open World Assumption (i.e. the truth value of an assertion is not necessarily known). However, the rich relational structure that characterizes ontologies can be exploited for handling such missing knowledge in an explicit way. We present a Statistical Relational Learning system designed for learning terminological naïve Bayesian classifiers, which estimate the probability that a generic individual belongs to the target concept given its membership to a set of Description Logic concepts. During the learning process, we consistently handle the lack of knowledge that may be introduced by the adoption of the Open World Assumption, depending on the varying nature of the missing knowledge itself.

1 Introduction

On the Semantic Web (SW) [2], difficulties arise when trying to model real-world domains using purely logical formalisms, since real-world knowledge generally involves some degree of uncertainty or imprecision. In recognition of the need to soundly represent uncertain knowledge, the World Wide Web Consortium (W3C) created, in 2007, the Uncertainty Reasoning for the World Wide Web Incubator Group¹ (URW3-XG), with the aim of identifying the requirements for representing and reasoning with uncertain knowledge in Web-based information.
Several approaches to representation and inference with knowledge enriched with probabilistic information have been proposed: some extend knowledge representation formalisms actually used in the SW; others rely on probabilistic enrichments of Description Logics or logic programming formalisms.

Motivation

The main problem in applying these approaches in real settings is that they almost always assume the availability of probabilistic information. However, except in rare cases, this information is hardly known in advance. A method that, by exploiting the information available in the data, i.e. an already designed and populated ontology, is able to capture the necessary probabilistic information would therefore be of great help.

Also, when relying on SW knowledge bases for reasoning under the Open World Assumption (OWA) (e.g. when OWL is considered as a syntactic variant of some Description Logic [1]), it is not always possible to know the truth value of an assertion: under the OWA, a statement is true or false only if its truth value can be formally derived. As a consequence, there can be cases (e.g. determining whether an individual is a member of a given concept) in which the truth value cannot be determined: it can be derived neither that the individual is an instance of the considered concept nor that it is an instance of the negated concept. This is opposed to the Closed World Assumption (CWA), employed by a multitude of first-order logic fragments and in the database setting, where every statement that cannot be proved to be true is assumed to be false.

¹ http://www.w3.org/2005/Incubator/urw3/

Related Work

Within the SW, Machine Learning (ML) is going to play a relevant role in the analysis of distributed data sources described using SW standards [24], with the aim of discovering new and refining existing knowledge.
A collection of ML approaches oriented to the SW have already been proposed in the literature, ranging from propositional and single-relational (e.g. SPARQL-ML [14], or approaches based on low-rank matrix approximation techniques such as [24, 25]) to multi-relational (e.g. distance-based [6, 9] or kernel-based [10, 3]). In the class of multi-relational learning methods, Statistical Relational Learning (SRL) [13] seems particularly appealing, being designed for learning in domains with both a complex relational and a rich probabilistic structure; the URW3-XG provided in [16] a large group of situations in which knowledge on the SW needs to represent uncertainty, ranging from recommendation and extraction/annotation to belief fusion/opinion pooling and healthcare/life sciences. There have already been some proposals regarding the adaptation and application of SRL systems to the SW: e.g. [7] proposes to employ Markov Logic Networks [21] for first-order probabilistic inference and learning within the SW, and [18] proposes to learn first-order probabilistic theories in a probabilistic extension of the ALC Description Logic named CRALC.

However, such ML techniques make strong assumptions about the nature of the missing knowledge (e.g. both matrix completion methods and the technique proposed in [18] inherently assume data is Missing at Random [23], while Markov Logic Networks resort to the Closed World Assumption during learning). Learning from incomplete knowledge bases by adopting methods not coherent with the nature of the missing knowledge itself (e.g. expecting it to be Missing at Random while it is Informatively Missing) can lead to misleading results with respect to the real model followed by the data [22].

We realised an SRL system for incrementally inducing a terminological naïve Bayesian classifier, i.e.
a naïve Bayesian network modelling the conditional dependencies between a learned set of (complex) Description Logic concepts and a target atomic concept the system aims to learn. Our system is focused on the SW, being able to learn classifiers with a structure that is both logically and statistically rich, and to deal with the missing knowledge resulting from the adoption of the OWA with methods that are consistent with the assumed nature of the missing knowledge (i.e. Missing Completely at Random, Missing at Random or Informatively Missing).

In the rest of this paper, we will first describe Bayesian networks (and some extensions we will employ to deal with some potentially problematic cases); then we will describe our probabilistic-logical model, terminological Bayesian classifiers, and the problem of learning them from a set of training individuals and a Description Logic knowledge base. In the last part, we will describe our learning algorithm, and its adaptations for learning under different assumptions on the ignorance model.

2 Bayesian Networks and Robust Bayesian Estimation

Graphical models (GMs) [19] are a popular framework to compactly describe the joint probability distribution for a set of random variables, by representing the underlying structure through a series of modular factors. Depending on the underlying semantics, GMs can be grouped into two main classes: directed graphical models, which are based on directed graphs, and undirected graphical models, which are based on undirected graphs.

A Bayesian network (BN) is a directed GM which represents the conditional dependencies in a set of random variables by using a directed acyclic graph (DAG) G augmented with a set of conditional probability distributions Θ_G associated with G's vertices. In such a graph, each vertex corresponds to a random variable X_i (e.g. an observable quantity, a set of unknown parameters, etc.)
and each edge indicates a direct influence relation between the two random variables; this makes it possible to define conditional independence relations among the variables, each of which is independent of any of its non-descendants, given the value of its parent variables.

A BN stipulates a set of conditional independence assumptions, also called local Markov assumptions, over its set of random variables: each vertex X_i in the DAG is conditionally independent of any subset S ⊆ Nd(X_i) of vertices that are not descendants of X_i, given a joint state of its parents:

∀X_i : Pr(X_i | S, parents(X_i)) = Pr(X_i | parents(X_i));

where the function parents(X_i) returns the parent vertices of X_i in the DAG representing the BN. The conditional independence assumption makes it possible to represent the joint probability distribution Pr(X_1, ..., X_n) defined by a BN over a set of random variables {X_1, ..., X_n} as a product of the individual probability distributions, conditional on their parent variables:

Pr(X_1, ..., X_n) = ∏_{i=1}^{n} Pr(X_i | parents(X_i));

As a result, it is possible to define Pr(X_1, ..., X_n) by only specifying, for each vertex X_i in the graph, the conditional probability distribution Pr(X_i | parents(X_i)).

Given a BN specifying a joint probability distribution over a set of variables, it is possible to evaluate inference queries by marginalization, e.g. calculating the posterior probability distribution for a set of query variables given some observed event (i.e. an assignment of values to the set of evidence variables). Exact inference for general BNs is an NP-hard problem, but algorithms exist to perform inference efficiently in restricted classes of networks, such as variable elimination, which has linear complexity in the number of vertices if the BN is a singly connected network [15]. Approximate inference methods also exist in the literature, such as Monte Carlo algorithms, which provide approximate answers whose accuracy depends on the number of samples generated.
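The factorization and the marginalization query above can be sketched in a few lines of Python. This is a toy illustration with invented numbers; the CPD layout and variable names are ours, not from the paper:

```python
from itertools import product

# Toy BN over binary variables: class C with two children F1, F2.
# CPDs are keyed by the tuple of parent values (numbers are made up).
parents = {"C": (), "F1": ("C",), "F2": ("C",)}
cpd = {
    "C":  {(): {True: 0.3, False: 0.7}},
    "F1": {(True,): {True: 0.8, False: 0.2}, (False,): {True: 0.1, False: 0.9}},
    "F2": {(True,): {True: 0.6, False: 0.4}, (False,): {True: 0.5, False: 0.5}},
}

def joint(assignment):
    """Pr(X_1, ..., X_n) as the product of the Pr(X_i | parents(X_i))."""
    p = 1.0
    for var, pars in parents.items():
        key = tuple(assignment[q] for q in pars)
        p *= cpd[var][key][assignment[var]]
    return p

def posterior(query_var, value, evidence):
    """Pr(query_var = value | evidence) by brute-force marginalization
    over the hidden variables (feasible only for tiny networks)."""
    def unnorm(v):
        fixed = dict(evidence, **{query_var: v})
        hidden = [x for x in parents if x not in fixed]
        return sum(
            joint(dict(fixed, **dict(zip(hidden, vals))))
            for vals in product([True, False], repeat=len(hidden))
        )
    num = unnorm(value)
    return num / (num + unnorm(not value))
```

Brute-force enumeration is exponential in the number of hidden variables; variable elimination avoids this blow-up on singly connected networks, as noted above.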
Other methods in this family, such as belief propagation or variational methods, approximate sums of random variables through their means [15].

However, finding an optimal structure for a BN may not be trivial: the number of possible DAG structures is super-exponential (O(2^{f(n)}), with f(n) = n^{1+ε}, ε > 0) in the number of vertices (r_4 = 543, r_8 ≈ 7.8 × 10^{11}, r_{12} ≈ 5.2 × 10^{26}), making it impractical, in many cases, to perform an exhaustive search through the space of possible structures. Therefore, in our approach, we tried to find an acceptable trade-off between efficiency and expressiveness, so as to make our method suitable for a context like the SW: we decided to focus on a particular subclass of Bayesian networks, i.e. naïve Bayesian networks, modelling the dependencies between a set of random variables F = {F_1, ..., F_n}, also called features, and a random variable C, also called the class, so that each pair of features is independent given the class, i.e. ∀F_i, F_j ∈ F : i ≠ j ⇒ (F_i ⊥⊥ F_j | C). This kind of model is especially interesting since it has proved effective even in contexts in which the underlying independence assumptions are violated [8], sometimes outperforming more recent approaches [4].

However, defining a BN requires a number of precise probability assessments which, as we will see, will not always be possible to obtain. A generalisation of naïve Bayesian networks to probability intervals is the robust Bayesian estimator (RBE) [20]: each conditional probability in the network is a probability interval characterised by its lower and upper bounds, defined respectively as \underline{Pr}(A) = min_{Pr ∈ P} Pr(A) and \overline{Pr}(A) = max_{Pr ∈ P} Pr(A).
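As a rough sketch of how such bounds arise and how interval-valued posteriors can still yield class labels, the following Python fragment (our own simplification, not the actual RBE estimator of [20]) derives an interval for a Bernoulli parameter by filling the missing cases in the two extreme ways, and labels an individual by stochastic dominance when the two posterior intervals are disjoint, falling back to interval midpoints otherwise:

```python
def interval_estimate(n_pos, n_neg, n_missing):
    """Interval estimate of Pr(A) from counts with missing cases:
    the lower bound counts every missing case against A, the upper
    bound counts every missing case in favour of A."""
    total = n_pos + n_neg + n_missing
    lower = n_pos / total
    upper = (n_pos + n_missing) / total
    return lower, upper

def dominance_label(pos_interval, neg_interval):
    """Assign 'C' / 'not C' from two posterior probability intervals."""
    lo_p, hi_p = pos_interval
    lo_n, hi_n = neg_interval
    if lo_p > hi_n:            # intervals disjoint: stochastic dominance
        return "C"
    if lo_n > hi_p:
        return "not C"
    # overlapping intervals: compare midpoints (weak dominance)
    return "C" if (lo_p + hi_p) / 2 > (lo_n + hi_n) / 2 else "not C"
```

The width of the returned interval grows with `n_missing`, mirroring the observation (made later for RBE) that interval width is proportional to the amount of missing knowledge.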
The main problem with this approach is assigning class labels after having calculated the posterior probability intervals: if the two resulting intervals do not overlap, it is possible to apply the so-called stochastic dominance criterion, which assigns a generic individual a to a target concept C iff \underline{Pr}(C(a)) > \overline{Pr}(¬C(a)). If the intervals overlap, to avoid undecidability, it is still possible to use a weaker criterion, called the weak dominance criterion [20], which collapses each probability interval into the single probability value given by its middle point (which indeed makes some assumptions on the distribution of the missing values). A similar approach, founded on imprecise probability theory, is presented in [5], which proposes using a credal network (structurally similar to a BN, but where the conditional probability densities belong to convex sets of mass functions) to represent uncertainty about the network parameters.

3 Terminological Naïve Bayesian Classifiers

The learning problem we intend to focus on consists in learning a terminological naïve Bayesian classifier N_K; this is defined as a naïve BN modelling the dependency relations between a set of Description Logic (DL) concepts (also referred to as feature concepts) and a target atomic concept C, given a set of training individuals. Feature concepts may possibly be complex, and the training individuals are distinguished into positive, negative and neutral ones, belonging respectively to the target concept C, to ¬C, or having an unknown membership with respect to C.

A DL Knowledge Base (KB) K is typically constituted by (at least) two main components, a TBox T and an ABox A:

– TBox – which introduces the terminology of an application domain, in terms of axioms describing concept hierarchies;
– ABox – which contains assertions (ground axioms) about named individuals in terms of this terminology.
A terminological Bayesian classifier can be defined as follows:

Definition 1 (Terminological Bayesian Classifier). A terminological Bayesian classifier N_K, with respect to a DL KB K, is defined as a pair ⟨G, Θ_G⟩, where:

– G = ⟨V, E⟩ is a directed acyclic graph, in which:
  • V = {F_1, ..., F_n, C} is a set of vertices, each F_i representing a DL (possibly complex) concept defined over K, and C representing a target atomic concept;
  • E ⊆ V × V is a set of edges, modelling the dependence relations between the elements of V;
– Θ_G is a set of conditional probability distributions (CPDs), one for each V ∈ V, representing the conditional probability of the feature concept given the state of its parents in the graph;

in which the membership probability of a generic individual a to the target concept C (or ¬C) is estimated using BN inference techniques, given the membership of a to the concepts in V.

In particular, a terminological naïve Bayesian classifier is characterised by the following structure: E = {⟨C, F_i⟩ | i ∈ {1, ..., n}} (i.e. each feature concept is independent of the other feature concepts, given the value of the target atomic concept).

Example 1 (Example of a Terminological Naïve Bayesian Classifier). Given a set of DL feature concepts F = {Female, HasChild := ∃hasChild.Person}² and a target concept Father, a terminological naïve Bayesian classifier expressing the target concept in terms of the feature concepts is the following: the class node Father has two child nodes, Female (with CPDs Pr(Female | Father), Pr(Female | ¬Father)) and HasChild := ∃hasChild.Person (with CPDs Pr(HasChild | Father), Pr(HasChild | ¬Father)).

Let K be a DL KB and a a generic individual such that K ⊨ HasChild(a), and suppose the membership of a to the concept Female is not known, i.e. K ⊭ Female(a) and K ⊭ ¬Female(a).
It is possible to infer, through the given network, the probability that the individual a is a member of the target atomic concept Father:

Pr(Father(a)) = Pr(Father) · Pr(HasChild | Father) / Σ_{Father' ∈ {Father, ¬Father}} Pr(Father') · Pr(HasChild | Father');

² In examples, variable names are used instead of complex feature concepts for brevity.

In the following we define the problem of learning a terminological Bayesian classifier N_K given a DL KB K and the training individuals Ind_C(A):

Definition 2 (Terminological Bayesian Classifier Learning Problem). Our terminological naïve Bayesian classifier learning problem consists in finding a network N*_K that maximizes the quality of the network with respect to the training instances and a specific scoring function; formally:

Given
– a target concept C we aim to learn;
– a DL KB K = ⟨T, A⟩, where the ABox A contains membership assertions about individuals and C, while the TBox T does not contain axioms involving C;
– the disjoint sets of positive, negative and neutral examples for C, denoted with Ind⁺_C(A), Ind⁻_C(A) and Ind⁰_C(A), such that:
  • ∀a ∈ Ind⁺_C(A) : C(a) ∈ A,
  • ∀a ∈ Ind⁻_C(A) : ¬C(a) ∈ A,
  • ∀a ∈ Ind⁰_C(A) : C(a) ∉ A ∧ ¬C(a) ∉ A;
– a scoring function specifying the quality of an induced terminological Bayesian classifier N_K with respect to the samples in Ind_C(A) = ⋃_{v ∈ {+,−,0}} Ind^v_C(A) and a scoring criterion;

Find a network N*_K maximizing the scoring function with respect to the samples:

N*_K ← arg max_{N_K} score(N_K, Ind_C(A)).

Our search space for finding the optimal network N*_K may be too large to explore exhaustively; therefore our learning algorithm, outlined in Alg. 1, works by greedily searching the space of features (i.e. complex DL concepts) for the ones that maximize the score of the induced network, with respect to a scoring function, and incrementally building the resulting network.
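Note that Female does not appear in the formula above: in a naïve Bayesian classifier, a feature whose membership is unknown simply drops out of the product, since its CPD sums to one given the class. A minimal Python sketch of this computation (the probability values are hypothetical; only the structure comes from the example):

```python
def nb_posterior(prior_c, cpt, observed):
    """Posterior Pr(C | observed memberships) in a naive Bayesian classifier.
    cpt[f] = (Pr(f | C), Pr(f | not C)); `observed` maps feature -> True/False.
    Features absent from `observed` (unknown membership) are left out of the
    product, which marginalizes them away."""
    num_c, num_nc = prior_c, 1.0 - prior_c
    for f, value in observed.items():
        p_c, p_nc = cpt[f]
        num_c  *= p_c  if value else 1.0 - p_c
        num_nc *= p_nc if value else 1.0 - p_nc
    return num_c / (num_c + num_nc)

# Hypothetical parameters for the Father example: Pr(Father) = 0.4,
# Pr(HasChild | Father) = 0.9, Pr(HasChild | not Father) = 0.2, etc.
cpt = {"HasChild": (0.9, 0.2), "Female": (0.3, 0.5)}
p = nb_posterior(0.4, cpt, {"HasChild": True})   # Female unknown, dropped
```

With these made-up numbers, `p = 0.4 · 0.9 / (0.4 · 0.9 + 0.6 · 0.2) = 0.75`.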
While the features are added one by one, the search in the space of complex DL concepts is performed through a beam search, employing the closure ρ↓^{cl} of the downward refinement operator ρ↓ described in [17]. For each new complex concept being evaluated, the algorithm creates a new set of concepts V', finds the optimal structure E' (under a given set of constraints; in the case of terminological naïve Bayesian classifiers, the structure is already defined) and the corresponding maximum likelihood parameters Θ'_G (which may vary depending on the assumptions on the nature of the ignorance model), and then scores the new network with respect to a scoring criterion.

Different Assumptions on the Ignorance Model

Let K = ⟨T, A⟩ be a DL KB; under the OWA, it is not always possible to know whether a generic DL assertion α is entailed by K or not (i.e. there may be cases in which K ⊭ α and K ⊭ ¬α). This allows us to characterize our lack of knowledge about concept-memberships through the probability distribution of the ignorance model [23]. Given a generic concept C, a generic individual a and a DL KB K*, let I be an ignorance model through which we extract a fragment of K*, I(K*) = K (so that, for every α, K ⊨ α ⇒ K* ⊨ α, while K* ⊨ α ⇏ K ⊨ α). Let N_K denote a probabilistic model that, from a DL KB K, calculates the probability that the concept-membership relation between C and a is unknown.
We can say that the ignorance model underlying the concept-membership relation between a and C in K (with respect to a, K* and the aforementioned probabilistic model) is:

– MCAR (Missing Completely at Random) – when the probability for such a concept-membership to be missing is independent of the knowledge on a available in K*:
  Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K*) = Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a));
– MAR (Missing at Random) – when the probability for such a concept-membership to be missing depends only on the knowledge on a available in K:
  Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K*) = Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K);
– NMAR (Not Missing at Random, also referred to as IM, Informatively Missing) – when the probability for such a concept-membership to be missing depends on the knowledge on a available in K*:
  Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K*) ≠ Pr(K ⊭ C(a) ∧ K ⊭ ¬C(a) | K).

Algorithm 1 Algorithm for Learning Terminological Bayesian Classifiers
Require: DL KB K = ⟨T, A⟩; concept Start; beam width w; search depth d; maximum concept description length maxLen; positive, negative and neutral training individuals Ind_C(A);
Ensure: N_K = ⟨G, Θ_G⟩, G = ⟨V ← {C}, E ← ∅⟩;
1: repeat
2:   Best ← ∅; Beam ← {Start}; NewBeam ← ∅;
3:   repeat
4:     for c ∈ Beam do
5:       for c' ∈ {ρ↓^{cl}(c) | |c'| ≤ min(|c| + d, maxLen)} do
6:         N'_K ← optimalNetwork(V ∪ {c'}, Ind_C(A));
7:         s' ← score(N'_K, Ind_C(A));
8:         NewBeam ← NewBeam ∪ {⟨N'_K, s'⟩};
9:       end for
10:     end for
11:     Best ← arg max_{⟨N'_K, s'⟩} {s' : ⟨N'_K, s'⟩ ∈ NewBeam ∪ {Best}};
12:     Beam ← selectFrom(NewBeam, w); NewBeam ← ∅;
13:   until stopping criterion on Beam;
14:   N_K ← N'_K : ⟨N'_K, s'⟩ = Best;
15: until stopping criterion on N_K;

Specifically, in our algorithm, the outer loop (lines 1–15) greedily searches for a new (complex) concept definition whose addition increases the network's quality on the given sample instances (determined by a scoring function score).
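The inner loop of Algorithm 1 can be pictured with a generic beam-search skeleton in Python. Here `refine` and `score` are hypothetical stand-ins for ρ↓^{cl} and the scoring function, and concepts are plain strings rather than DL descriptions:

```python
def beam_search(start, refine, score, width=3, depth=4, max_len=6):
    """Beam search over a refinement space (sketch of Alg. 1's inner loop).
    At each level, all admissible refinements of the beam are scored and
    only the `width` best are kept for further refinement."""
    best, best_score = start, score(start)
    beam = [start]
    for _ in range(depth):
        candidates = []
        for c in beam:
            for c2 in refine(c):
                if len(c2) <= max_len:         # length bound, as in Alg. 1
                    candidates.append((score(c2), c2))
        if not candidates:                      # stopping criterion on the beam
            break
        candidates.sort(key=lambda p: p[0], reverse=True)
        if candidates[0][0] > best_score:       # track the overall best
            best_score, best = candidates[0]
        beam = [c for _, c in candidates[:width]]
    return best, best_score

# Toy refinement space: strings over {A, B}, scored by occurrences of "AB".
best, s = beam_search("", lambda c: [c + "A", c + "B"],
                      lambda c: c.count("AB"), width=2, depth=4, max_len=6)
```

In the actual system, scoring a candidate is far more expensive than here, since each evaluation requires fitting and scoring a whole network via `optimalNetwork` and `score`.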
The search through the space of concept definitions is performed in the inner loop (lines 3–13) through a beam search: starting from an initial concept Start, for each refinement level, all refinements up to a given length are memorized in a priority queue NewBeam (sorted according to the score associated to the network obtained by adding them to the set of feature concepts), from which only the w with the highest score are selected, by the selection function selectFrom, to be refined in the next iteration.

The functions optimalNetwork and score are used, respectively, to find the optimal Bayesian network structure over the nodes in the network (possibly under a set of constraints, as in the naïve Bayes case or some of its extensions) and to score a classifier (so as to compare its effectiveness with others). However, those two functions are sensitive to the assumptions made about the ignorance model.

When the assumed ignorance model is MCAR, we are allowed to use an approach called available case analysis [15], in which we build an unbiased estimator of the network parameters based only on the available knowledge. A scoring function we realised for this case is the network's log-likelihood on the training data, calculated only on positive and negative training individuals, ignoring the individuals whose concept-membership relation with the target concept C is unknown, and defined as:

L(N_K | Ind_C(A)) = log Pr(N_K) + Σ_{a ∈ Ind⁺_C(A)} log Pr(C(a) | N_K) + Σ_{a ∈ Ind⁻_C(A)} log Pr(¬C(a) | N_K);

Another approach we implemented consists in ranking both positive and negative training individuals a according to Pr(C(a) | N_K), and then calculating the area under the Precision-Recall curve using different acceptance thresholds.

Under the naïve Bayes assumption, there is no need to perform a search for finding the optimal network structure, since the structure is already fixed (each node except the target concept node has only one parent, i.e.
the target concept node); otherwise, finding a network structure that is optimal under some criterion (e.g. the BIC score [15]) may require an exhaustive search in the space of possible structures. However, for tree-augmented naïve Bayesian networks (which allow for a tree structure among the feature nodes), it is possible to efficiently compute the optimal structure employing the method in [12], making them appealing for real-life applications requiring efficiency and scalability.

In the MAR case, possible solutions for learning models accounting for missing knowledge are the Expectation-Maximization (EM) algorithm, MCMC sampling and gradient ascent methods [15]. We use EM to learn terminological naïve Bayesian classifiers from MAR data. In our approach, outlined in Alg. 2, we first heuristically estimate the network's parameters by only using the available data; then, in order to find the maximum likelihood parameters with respect to both observed and missing knowledge, we consider each individual whose membership to a particular concept description D is not known as several fractional individuals belonging, with different weights (corresponding to the posterior probability of their class membership), to both the components D and ¬D.
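The fractional-individual idea can be illustrated with a self-contained EM sketch for a one-feature classifier (an assumed setup for illustration, not the system's actual implementation): individuals with unknown membership to the target concept are split into two weighted fractional individuals in the E-step, and the weighted counts re-estimate the parameters in the M-step.

```python
def em_naive_bayes(pos, neg, unlabeled, n_iter=50):
    """EM for a one-feature naive Bayes model. `pos`/`neg` hold boolean
    feature values of individuals known to belong to C / not-C;
    `unlabeled` holds feature values of individuals whose membership
    to C is unknown (assumed MAR)."""
    # Initial parameters from available data only (with tiny smoothing).
    prior  = (len(pos) + 1) / (len(pos) + len(neg) + 2)
    p_f_c  = (sum(pos) + 1) / (len(pos) + 2)
    p_f_nc = (sum(neg) + 1) / (len(neg) + 2)
    for _ in range(n_iter):
        # E-step: posterior membership weight for each unlabeled individual.
        weights = []
        for f in unlabeled:
            lc  = prior * (p_f_c if f else 1 - p_f_c)
            lnc = (1 - prior) * (p_f_nc if f else 1 - p_f_nc)
            weights.append(lc / (lc + lnc))
        # M-step: re-estimate parameters from expected (fractional) counts.
        n_c  = len(pos) + sum(weights)
        n_nc = len(neg) + sum(1 - w for w in weights)
        prior  = n_c / (n_c + n_nc)
        p_f_c  = (sum(pos) + sum(w for f, w in zip(unlabeled, weights) if f)) / n_c
        p_f_nc = (sum(neg) + sum(1 - w for f, w in zip(unlabeled, weights) if f)) / n_nc
    return prior, p_f_c, p_f_nc
```

In the terminological setting, the likelihoods in the E-step are obtained by inference through the network, using the known concept-memberships as evidence.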
Formally, the EM algorithm for parameter learning explores the space of possible parameters through an iterative hill-climbing search, converging to a (local) maximum likelihood estimate of the unknown parameters, where the (log-)likelihood (which we also use as a scoring criterion) is defined as follows:

L(N_K | Ind_C(A)) = log Pr(N_K) + Σ_{a ∈ Ind⁰_C(A)} Σ_{C' ∈ {C, ¬C}} Pr(C' | N_K) · log Pr(C'(a) | N_K) + Σ_{a ∈ Ind⁺_C(A)} log Pr(C(a) | N_K) + Σ_{a ∈ Ind⁻_C(A)} log Pr(¬C(a) | N_K);

At each iteration, the EM algorithm applies the following two steps:

– Expectation step – using the available data and the current network parameters, calculate a distribution over the possible completions of the missing knowledge;
– Maximization step – considering each possible completion as a fully available data case (weighted by its probability), calculate the next parameters using (weighted) frequency counting.

In our use of the EM algorithm, the E-step calculates the concept-membership posterior probability (by inference through the network) of each individual whose concept-membership relation is unknown, thus completing the data through so-called expected counts. Then, the M-step calculates a new estimate of the network's conditional probability distributions by using the expected counts, maximizing the log-likelihood of both available and missing data with respect to a network N_K.

Regarding the search for optimal structures for networks with fewer restrictions on their structure (such as tree-augmented naïve BNs or unrestricted BNs) from MAR data, it is possible to employ the Structural EM (SEM) algorithm [11]. In SEM, the maximization step is performed both in the space of structures G and in the space of parameters Θ_G, by first searching for a better structure and then for the best parameters associated to the given structure; it can be proven that, if the search procedure finds a structure that is better than the one used in the previous iteration with respect to e.g.
the BIC score, then the Structural EM algorithm will monotonically improve the score.

When knowledge is NMAR, it is generally possible to extend the probabilistic model to produce one in which the MAR assumption holds; e.g. if a feature concept F_i follows a NMAR ignorance model, with respect to a generic individual a and a DL KB K, we can consider its observability as an additional variable (e.g. Y_i = 0 iff K ⊭ F_i(a) ∧ K ⊭ ¬F_i(a), Y_i = 1 otherwise) in our probabilistic model, so that F_i's ignorance model satisfies the MAR assumption (since its missingness depends on an always observable variable).

An alternative solution is to resort to robust Bayesian estimation (RBE) [20], to learn conditional probability distributions without making any sort of assumption about the nature of the missing data. RBE finds probability intervals instead of single probability values, obtained by taking into account all the possible fillings of the missing knowledge; the width of the inferred intervals is therefore directly proportional to the quantity of missing knowledge considered during the learning process. To score each newly induced network, we employ the framework proposed in [26] to compare credal networks, while

Algorithm 2 Outline of our implementation of the EM algorithm for parameter learning in a terminological Bayesian classifier, assuming the underlying ignorance model is MAR.
function ExpectedCounts(N_K, Ind_C(A))
1: N_K = ⟨G, Θ_G⟩, G = ⟨V, E⟩;
2: for X_i ∈ V do
3:   for ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i)) do
4:     {n̄(x_i, π_{x_i}) contains the expected count for (X_i = x_i, parents(X_i) = π_{x_i})}
5:     n̄(x_i, π_{x_i}) ← 0;
6:   end for
7: end for
8: for a ∈ Ind_C(A) do
9:   for X_i ∈ V do
10:    for ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i)) do
11:      {Each expected count n̄(x_i, π_{x_i}) is obtained by summing out the probability assignments to the concept memberships (X_i = x_i, parents(X_i) = π_{x_i}) for each individual, calculated using the background knowledge K and, if they are only partially known, by inference through the network N_K}
12:      n̄(x_i, π_{x_i}) ← n̄(x_i, π_{x_i}) + Pr(x_i, π_{x_i} | N_K);
13:    end for
14:  end for
15: end for
16: return {n̄(x_i, π_{x_i}) | X_i ∈ V, ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i))};

function ExpectationMaximization(N_K⁰, Ind_C(A))
1: {The network is first initialized with arbitrary heuristic parameters Θ_G⁰}
2: N_K⁰ = ⟨G, Θ_G⁰⟩, G = ⟨V, E⟩;
3: t ← 0;
4: repeat
5:   {n̄(x_i, π_{x_i})} ← ExpectedCounts(N_K^t, Ind_C(A));
6:   for X_i ∈ V do
7:     for ⟨x_i, π_{x_i}⟩ ∈ vals(X_i, parents(X_i)) do
8:       θ_G^{t+1}(x_i, π_{x_i}) ← n̄(x_i, π_{x_i}) / Σ_{x'_i ∈ vals(X_i)} n̄(x'_i, π_{x_i});
9:     end for
10:  end for
11:  t ← t + 1;
12:  N_K^t = ⟨G, Θ_G^t⟩;
13:  {The EM loop ends when the improvement of the network's log-likelihood falls below a given threshold τ}
14: until L(N_K^t | Ind_C(A)) − L(N_K^{t−1} | Ind_C(A)) ≤ τ;
15: return N_K^t;

we have not yet implemented a method to search for structures other than naïve Bayesian ones.

Example 2 (Terminological Naïve Bayesian Classifier using Robust Bayesian Estimation). The following is a terminological naïve Bayesian classifier using robust Bayesian estimation for inferring posterior probability intervals in the presence of NMAR knowledge. In this network, the conditional probability tables associated to each node contain probability intervals instead of single probability values, each defined by its lower and upper bound.
The network has class node Fa := Father with child nodes Fe := Female (with interval CPDs [\underline{Pr}(Female | Father), \overline{Pr}(Female | Father)] and [\underline{Pr}(Female | ¬Father), \overline{Pr}(Female | ¬Father)]) and HC := ∃hasChild.Person (with interval CPDs [\underline{Pr}(HasChild | Father), \overline{Pr}(HasChild | Father)] and [\underline{Pr}(HasChild | ¬Father), \overline{Pr}(HasChild | ¬Father)]).

Inference, using such a network, can be performed as follows: given a generic individual a such that K ⊨ HC(a), the posterior probability that a is a member of Fa is represented by the probability interval [\underline{Pr}(Fa | HC), \overline{Pr}(Fa | HC)], where:

\underline{Pr}(Fa(a)) = \underline{Pr}(Fa | HC) = \underline{Pr}(HC | Fa) · \underline{Pr}(Fa) / (\underline{Pr}(HC | Fa) · \underline{Pr}(Fa) + \overline{Pr}(HC | ¬Fa) · \overline{Pr}(¬Fa));

\overline{Pr}(Fa(a)) = \overline{Pr}(Fa | HC) = \overline{Pr}(HC | Fa) · \overline{Pr}(Fa) / (\overline{Pr}(HC | Fa) · \overline{Pr}(Fa) + \underline{Pr}(HC | ¬Fa) · \underline{Pr}(¬Fa));

4 Conclusions and Future Work

We presented a Statistical Relational Learning method designed for learning terminological naïve Bayesian classifiers, a ML method based on the naïve Bayes assumption for estimating the probability that a generic individual belongs to a certain target concept, given its membership relation to an induced set of complex Description Logic concepts. We gave a characterisation of the lack of knowledge that may be introduced by the OWA depending on the underlying ignorance model, and handled such missing knowledge under different assumptions on the nature of the missing knowledge itself (i.e. Missing Completely at Random, Missing at Random or Informatively Missing). In the future, we aim at estimating computationally the ignorance model followed by each feature, at developing new methods to exploit the potential information contained in the knowledge's missingness, and at evaluating our methods' effectiveness on real-world ontologies.

References

[1] OWL 2 Web Ontology Language Direct Semantics (October 2009), http://www.w3.org/TR/owl2-direct-semantics/
[2] Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (May 2001)
[3] Bicer, V., et al.: Relational kernel machines for learning from graph-structured rdf data. In: Antoniou, G., et al.
(eds.) ESWC (1). LNCS, vol. 6643, pp. 47–62. Springer (2011)
[4] Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: ICML 2006. pp. 161–168. ACM, New York, NY, USA (2006)
[5] Corani, G., Zaffalon, M.: Naive credal classifier 2: an extension of naive bayes for delivering robust classifications. In: DMIN. pp. 84–90 (2008)
[6] d'Amato, C., Fanizzi, N., Esposito, F.: Query answering and ontology population: an inductive approach. In: ESWC 2008. pp. 288–302. Springer (2008)
[7] Domingos, P., et al.: Uncertainty reasoning for the semantic web i, chap. Just Add Weights: Markov Logic for the Semantic Web, pp. 1–25. Springer (2008)
[8] Domingos, P., Pazzani, M.J.: On the optimality of the simple bayesian classifier under zero-one loss. Machine Learning 29(2-3), 103–130 (1997)
[9] Fanizzi, N., et al.: Reduce: A reduced coulomb energy network method for approximate classification. In: Aroyo, L., et al. (eds.) ESWC. pp. 323–337. Springer (2009)
[10] Fanizzi, N., d'Amato, C., Esposito, F.: Learning with kernels in description logics. In: ILP 2008. pp. 210–225. Springer (2008)
[11] Friedman, N.: The Bayesian structural EM algorithm. In: UAI 1998. pp. 129–138. Morgan Kaufmann Publishers Inc., San Francisco, CA (1998)
[12] Friedman, N., Geiger, D., Goldszmidt, M.: Bayesian network classifiers. Machine Learning 29, 131–163 (1997)
[13] Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press (2007)
[14] Kiefer, C., et al.: Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods. In: ESWC 2008. LNCS, vol. 5021, pp. 478–492. Springer (2008)
[15] Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press (2009)
[16] Laskey, K.J., Laskey, K.B.: Uncertainty reasoning for the world wide web: Report on the urw3-xg incubator group. In: Bobillo, F., et al. (eds.)
URSW. CEUR Workshop Proceedings, vol. 423. CEUR-WS.org (2008)
[17] Lehmann, J., Hitzler, P.: Concept learning in description logics using refinement operators. Mach. Learn. 78, 203–250 (2010)
[18] Luna, J.E.O., Cozman, F.G.: An algorithm for learning with probabilistic description logics. In: Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.) URSW. CEUR Workshop Proceedings, vol. 527, pp. 63–74. CEUR-WS.org (2009)
[19] Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1988)
[20] Ramoni, M., Sebastiani, P.: Robust learning with missing data. Mach. Learn. 45, 147–170 (October 2001)
[21] Richardson, M., Domingos, P.: Markov logic networks. Mach. Learn. 62, 107–136 (February 2006)
[22] Rodrigues de Morais, S., Aussem, A.: Exploiting data missingness in bayesian network modeling. In: IDA 2009. pp. 35–46. Springer (2009)
[23] Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
[24] Tresp, V., et al.: Uncertainty reasoning for the semantic web i, chap. Towards Machine Learning on the Semantic Web, pp. 282–314. Springer (2008)
[25] Tresp, V., Huang, Y., Bundschus, M., Rettinger, A.: Materializing and querying learned knowledge. In: IRMLeS 2009 (2009)
[26] Zaffalon, M., Corani, G., Mauá, D.: Utility-based accuracy measures to empirically evaluate credal classifiers. In: ISIPTA 2011. pp. 401–410. Innsbruck (2011)