<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Terminological Bayesian Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pasquale Minervini</string-name>
          <email>pasquale.minervini@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia d'Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Fanizzi</string-name>
          <email>fanizzig@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LACAM - Dipartimento di Informatica - Universita` degli Studi di Bari “Aldo Moro” via E. Orabona</institution>
          ,
          <addr-line>4 - 70125 Bari -</addr-line>
          <country country="IT">Italia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge available through Semantic Web representation formalisms can be missing, i.e. it is not always possible to infer the truth value of an assertion (due to the Open World Assumption). We propose a method for incrementally inducing terminological (tree-augmented) na¨ıve Bayesian classifiers, which aim at estimating the probability that an individual belongs to a target concept given its membership to a learned set of Description Logic concepts. We then evaluate the impact of employing different methods of handling assertions whose truth value is unknown, each consistent with a different assumption on the ignorance model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Real-world knowledge often involves various degrees of uncertainty. For such reason,
in the context of Semantic Web (SW), difficulties arise when trying to model real-world
domains using purely logical formalisms. The World Wide Web Consortium (W3C),
recognising the need of soundly represent such knowledge, in 2007 created the
Uncertainty Reasoning for the World Wide Web Incubator Group 1 (URW3-XG), with the
aim of identifying the requirements for reasoning with and representing the uncertain
knowledge in Web-based information; URW3-XG provided in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] a number of
situations in which there is a clear need of explicitly represent and reason in presence of
uncertainty. A wide range of approaches to represent and infer with knowledge enriched
with probabilistic information has been proposed: some of them extend knowledge
representation formalisms actually used in the SW, while others rely on probabilistic
enrichment of Description Logics or logic programming formalisms.
      </p>
      <sec id="sec-1-1">
        <title>Motivation</title>
        <p>
          The main problem of applying such approaches in real world settings is given by the
fact that they almost always assume the availability of probabilistic information, while
it is hardly known in advance. Having a method that, by exploiting available knowledge
(such as an already designed and populated ontology) is able to extract both the needed
logic and the probabilistic structure, would be of great benefit. During this process, the
1 http://www.w3.org/2005/Incubator/urw3/
Open World Assumption (OWA) must be taken into account: under OWA, an assertion
is true or false only if its truth value can be formally derived. As a consequence, there
may be reasoning tasks (such as instance checking) for which the truth value cannot
be determined. This is opposed by the commonly employed Closed World Assumption
(CWA), where every statement that cannot be proved to be true, is assumed to be false.
Machine Learning (ML) is already covering a relevant role in the analysis of SW
knowledge bases, to overcome the limitations of purely deductive reasoning [
          <xref ref-type="bibr" rid="ref10 ref17">17, 10</xref>
          ]. In fact,
purely deductive inference does not scale up easily to the size of the web, does not
exploit regularities in data, the construction of a SW knowledge base can be an expensive
process and commonly used SW inference and representation formalisms do not
consider the inherent uncertainty characterizing the knowledge in various domains. In this
paper, we face the problem of finding a (locally optimal) set of logic features (in the
form of Description Logic concepts) that, used within a probabilistic graphical model,
can be used to estimate the probability of a previously unknown concept membership
relation between a generic individual and a target concept. Also, we evaluate different
methods of dealing with missing concept-memberships, each coherent with a different
assumption on the missingness mechanism. We will start by describing Bayesian
Networks (representation, inference and learning) and their extensions towards probability
intervals. Then we will describe our probabilistic-logic model, named terminological
Bayesian classifiers, and the problem of learning them from a set of training individuals
and a Description Logic knowledge base. Also, we will describe our learning algorithm,
and the adaptations to learn under different assumptions on the ignorance model. In the
final part, we will give experimental evidence on the effectiveness of our method.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Related Work</title>
        <p>
          A variety of ML approaches specifically designed for SW knowledge bases have been
proposed; the expressive power of such ontological knowledge representation formalisms
may vary, ranging from languages such as RDF(S) to Description Logics (theoretical
fundation of many OWL variants). A recent survey on this topic is in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. In the class of
multi-relational learning techniques, Statistical Relational Learning [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (SRL) methods
seem particularly appealing, being designed to learn in domains with both a complex
relational and a rich probabilistic structure. There have been proposals for employing
SRL methods when learning from Description Logic knowledge bases: in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], authors
propose to employ Markov Logic Networks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] (MLN) for first-order probabilistic
inference and learning within the SW context; learning concepts in a probabilistic
extension of the ALC Description Logic named CRALC is proposed in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]; in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], the
Infinite Hidden Relational Models [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] framework is extended to also take into account
a set of constraints in the form of (even more expressive) Description Logic concepts
(such as SHOIN (D)). The aforementioned methods rely on probabilistic graphical
models, which offer sound methods for both inferencing and learning in the presence
of latent variables and missing values [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (given some assumptions on the missingness
pattern), providing a way for handling assertions whose truth value is not known due to
the adoption of the OWA. However, in the literature it is not clear whether such
assumptions hold in the SW context: this may be an issue, since from incomplete knowledge
bases by adopting methods not coherent with the nature of the missing knowledge itself
can lead to misleading results with respect to the real model followed by the data [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
2
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Bayesian Networks and Robust Bayesian Estimation</title>
      <p>
        Graphical models [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (GMs) are a popular framework to compactly describe the joint
probability distribution for a set of random variables, by representing the underlying
structure through a series of modular factors. Depending on the underlying semantics,
GMs can be grouped into two main classes: directed graphical models, which found on
directed graphs, and undirected graphical models, which found on undirected graphs.
A Bayesian network (BN) is a directed GM which represents the conditional
dependencies in a set of random variables by using a directed acyclic graph (DAG) G augmented
with a set of conditional probability distributions G (also referred to as parameters)
associated with G’s vertices. In such a graph, each vertex corresponds to a random
variable Xi and each edge indicates a direct influence relation between the two random
variables. A BN stipulates a set of conditional independence assumptions over its set
of random variables: each vertex Xi in the DAG is conditionally independent of any
subset S N d(Xi) of vertices that are not descendants of Xi given a joint state of its
parents, or formally: 8Xi : Pr(Xi j S; parents(Xi)) = Pr(Xi j parents(Xi)), where
the function parents(Xi) returns the parent vertices of Xi in the DAG representing the
BN. The conditional independence assumption allows to represent the joint
probability distribution Pr(X1; : : : ; Xn) defined by a Bayesian network over a set of random
variables fX1; : : : ; Xng as a production of the individual probability distributions,
conditional on their parent variables:
      </p>
      <p>
        Pr(X1; : : : ; Xn) =
n
Y Pr(Xi j parents(Xi)):
i=1
As a result, it is possible to define Pr(X1; : : : ; Xn) by only specifying, for each
vertex Xi in the graph, the conditional probability distribution Pr(Xi j parents(Xi)).
Given a BN specifying a joint probability distribution over a set of variables, it is
possible to evaluate inference queries by marginalization, like calculating the posterior
probability distribution for a set of query variables given some observed event (i.e.
assignment of values to the set of evidence variables). Exact inference for general BNs
is an NP-hard problem, but algorithms exist to efficiently infer in restricted classes
of networks, such as variable elimination, which has linear complexity in the
number of vertices if the BN is a singly connected network [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Approximate inference
methods also exist in literature, such as Monte Carlo algorithms, belief propagation
or variational methods [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The compact parametrization in graphical models allows
for effective learning both model selection (structural learning) and parameter
estimation. In the case of BNs, however, finding a model which is optimal with respect to a
given scoring criterion (which measures how well the model fits observed data) may
be not trivial: the number of possible structures for a BN is super-exponential in the
size of its vertices, making it generally impractical to perform an exhaustive search
through the space of its possible models. For this reason we tried to find an
acceptable trade-off between efficiency and expressiveness, so to make our method suitable
for a context like SW: we focused on particular subclasses of Bayesian networks in
which both inference and structure/parameters learning can be performed in
polynomial time. The first is na¨ıve Bayesian networks, modelling the dependencies between a
set of random variables X = fX1; : : : ; Xng, also called features, and a random
variable C, also called class, so that each pair of features are independent of each other
given the class, i.e. 8Xi; Xj 2 X : i 6= j ) (Xi ?? Xj j C). This type of models
is especially interesting since it proved to be effective also in contexts in which the
underlying independence assumptions do not hold [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], even outperforming more
current approaches [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, the mutual conditional independence assumption behind
na¨ıve Bayesian networks can be quite strong: therefore, we also propose employing
tree-augmented na¨ıve (TAN) Bayesian networks, which also allow a tree structure to
exist between feature variables [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is relevant to note that BNs can be used as
classifiers, by assigning each new, unclassified instance to the class C maximizing the
probability value Pr(C j e), where e indicates the evidence available about the instance and
Pr the probability distribution encoded by the BN. Defining a BN requires a number
of precise probability assessments which, as we will see, will not be always possible
to obtain. A generalisation of na¨ıve Bayesian networks to probability intervals is the
robust Bayesian estimator [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (RBE): each conditional probability in the network is a
probability interval characterised by its lower and upper bounds, defined respectively
as Pr(A) = minPr2P Pr(A) and Pr(A) = maxPr2P Pr(A), where P is a convex
set of probability distributions. An approach very similar to RBE is presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and proposes using Credal networks (which are structurally similar to a BN, but where
the conditional probability densities belong to convex sets of mass functions) to
represent uncertainty about network parameters. A problem with this class of approaches
arises when using such model for classification – in the case of binary classification
with classes C1 and C2, given evidence e for a new, unclassified instance, two posterior
intervals are obtained, i.e. P(C1 j e) and P(C2 j e). If such intervals do not overlap,
the stochastic dominance criterion can be employed, which assigns a new unclassified
instance to class C1 iff P(C1 j e) &gt; P(C2 j e); otherwise, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] proposes using a
weaker criterion, called weak dominance criterion, which is based on representing each
probability interval into a single probability value represented by its middle point. Due
to the low complexity of inferencing and learning in (tree-augmented) na¨ıve Bayesian
networks, we choose to employ such structures to represent dependency relations
between variables in our probabilistic-logic model; also, we attempt to employ RBE to
explicitly encode the uncertainty about parameters introduced by the adoption of the
OWA, and empirically evaluate different approaches to handling missing attributes.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Terminological Bayesian Classifiers</title>
      <p>We introduce a formalism, named terminological Bayesian classifier (TBC), consisting
of a BN defined over a set of variables, each mapped to a (possibly complex)
Description Logic (DL) concept defined over a DL knowledge base (KB). Each of such DL
concepts can be considered as a feature (so we will refer to them as feature concepts)
so that, given a generic individual a defined over a DL KB K, inferring the membership
relation to such concepts allows us, by means of a TBC defined over K, to infer the
membership probability to a given target concept in K if it was previously unknown.
This means that, within the TBC, each input individual is described by its
conceptmembership relation with respect to the feature concepts contained in it. Given a generic
individual a in K, a variable assigned to a DL feature concept F in a TBC defined over
K takes value T rue if K j= D(a), F alse if K j= :D(a) and the variable is considered
not observable otherwise. A more formal definition of TBC can be given as follows:
Definition 1. (Terminological Bayesian Classifier) A terminological Bayesian
classifier NK, with respect to a DL KB K, is defined as a pair hG; G i, representing
respectively the structure and parameters of a BN, in which:
– G = hV; E i is an augmented directed acyclic graph, in which:</p>
      <p>V = fF1; : : : ; Fn; Cg (vertices) is a set of random Boolean variables, each
linked to a DL concept defined over K. Each Fi (i = 1; : : : ; n) is associated to
a feature concept, and C to the target (class) concept (we will use the names
of variables in V to represent the corresponding DL concept for brevity);
E V V is a set of edges, which model the (in)dependence relations among
the variables in V.
–</p>
      <p>G is a set of conditional probability distributions (CPD), one for each variable
V 2 V, representing the conditional probability distribution of the feature concept
given the state of its parents in the graph.</p>
      <p>Given a generic individual a in K, each variable Fi 2 V in the TBC has value T rue
(resp. F alse) if K j= Fi(a) (resp. K j= :Fi(a)); otherwise (i.e. when K 6j= Fi(a) and
K 6j= :Fi(a)) its value is considered as not observable (or missing). If the
conceptmembership relation between a and the target concept C cannot be inferred from K, the
probability of such concept-membership can be estimated by calculating the conditional
posterior probability using regular BN inference algorithms Pr(C j F1; : : : ; Fn) (such
as Variable Elimination).</p>
      <p>In the case of terminological na¨ıve Bayesian classifiers, E = fhC; Fii j i 2 f1; : : : ; ngg,
i.e. each feature variable is independent on other feature variables, given the value of
the target variable. TAN networks relax such independence assumptions by allowing a
tree structure among feature variables: in terminological TAN Bayesian classifiers, E =
fhC; Fii j i 2 f1; : : : ; ngg [ ET , where ET = fhFi; Fj i j i; j 2 f1; : : : ; ng; i 6= jg is
a set of directed edges defining a directed tree structure.</p>
      <p>Example 1. (Example of Terminological Na¨ıve Bayesian Classifier) Given a set of DL
feature concepts F = fF e := F emale; HC := 9hasChild:&gt;; HS := 9hasSibling:&gt;g 2
and a target concept F W S := F atherW ithSibling, a terminological na¨ıve Bayesian
classifier expressing the target concept in terms of the feature concepts is the following:
2 Here DL concepts have been aliased for brevity.</p>
      <p>Pr(FWS)
F W S := F atherW ithSibling
Let K be a DL KB and a a generic individual so that K j= HC(a), and the membership
of a to the concepts F e and HS is not known, i.e. K 6j= F e(a) and K 6j= :F e(a). It
is possible to infer, through the given network, the probability that the individual a is a
member of the target concept F W S:</p>
      <p>Pr(F W S(a)) =</p>
      <p>F W S02fF W S;:F W Sg</p>
      <p>P</p>
      <p>Pr(F W S) Pr(HC j F W S)</p>
      <p>Pr(F W S0) Pr(HC j F W S0)
;
In the following we define the problem of learning a terminological Bayesian classifier
NK, given a DL KB K and a set of positive, negative and neutral training individuals
IndC (K) = IndC+(K) [ IndC (K) [ Ind0C (K).</p>
      <p>Definition 2. (Terminological Bayesian Classifier Learning Problem) The TBC
learning problem consists in finding a TBC NK maximizing a TBC scoring function with
respect of the training individuals IndC (K) organised in positive, negative and neutral
examples, given their concept-membership to the target concept C in K. Formally:
Given the following:
– a target concept C;
– a set of training individuals IndC (K) in a DL KB K such that:
8a 2 IndC+(K) positive example: K j= C(a),
8a 2 IndC (K) negative example: K j= :C(a),
8a 2 Ind0C (K) neutral example: K 6j= C(a) ^ K 6j= :C(a);
– A scoring function specifying a measure of the quality of an induced
terminological Bayesian classifier NK w.r.t. the samples in IndC (K);
Find a network NK maximizing a given scoring function Score wrt the samples:
NK
arg max Score(NK; IndC (K))):</p>
      <p>
        NK
The search space to find the optimal network NK may be too large to explore
exhaustively. For this reason the learning approach proposed here works by incrementally
building the set of feature concepts, with the aim of obtaining a set of concepts
maximizing the score of the induced network; each feature concepts is individually searched
by an inner search process, guided by the scoring function itself, and the whole strategy
of adding and removing feature concepts follows a forward selection/backward
elimination strategy. This approach is motivated by the literature about selective Bayesian
classifiers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], where forward selection of attributes generally increases the classifier’s
accuracy. The algorithm proposed here is organised in two nested loops: the inner loop
is concerned with finding the best feature DL concept addition/removal operation, while
the outer loop implements the abstract greedy feature selection strategy; both are guided
by the network scoring function. In the inner loop, outlined in Alg. 1, the search through
Algorithm 1 Scoring function-driven beam search for a new concept to add to the
terminological Bayesian network.
the space of concept definitions is performed through a beam search, using the cl
re#
finement operator [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] ( cl(C) returns a set of refinements D of C so that D @ C,
#
which we consider only up to a given concept length n). For each new complex concept
being evaluated, the algorithm creates a new set of concepts V0 and finds the optimal
structure, under a given set of constraints (which, in the case of terminological na¨ıve
Bayesian classifiers, is already fixed) and parameters (which may vary depending on the
assumptions on the nature of the ignorance model). Then, the new network is scored,
with respect to a given scoring criterion. In the outer loop, a variety of feature selection
strategies can be implemented [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this particular case, a Forward Selection Backward
Elimination approach is proposed, at each iteration considering to add a new concept
to the network or removing at most a variable number of concepts. We experimented
with two variants of such approach, both implemented through Alg. 2: Forward
Selection (FS) which adds a single concept to the network at each iteration and Fast Forward
Selection Backward Elimination (FFSBE) which at each iteration adds or removes one
concept from the network. In the algorithm, such feature selection methods correspond
to different values of the max parameter in Alg. 2 (representing the maximum number
of concepts that can be removed from the network), i.e. 0 and 1 respectively.
Algorithm 2 Forward Selection Backward Elimination approach for the incremental
construction of terminological Bayesian classifiers.
function F SBE(K; IndC (K); max)
1: NK0 = hG0; G0 i; G0 = hV0 fCg; E0 ;i; t 0;
2: repeat
3: t t + 1;
4: fA new network is selected among a set of possible candidates, obtained by either adding
of removing a set of concepts to the structure, so to maximize the scoring criterion Scoreg
5: Candidates = fExtend(N Kt 1; IndC (K)); Remove(N Kt 1; IndC (K); max)g;
6: N Kt arg maxNK2Candidates Score(NK; IndC (K));
7: until Scorte(1N Kt; IndC (K)) Score(N Kt 1; IndC (K));
8: return NK ;
function Remove(NK; IndC (K); max)
1: fFinds the best network that could be obtained by removing at most max feature concepts
from the network structure, wrt a given scoring criterion Scoreg
2: NK = hG; Gi; G = hV; Ei; BestN etwork NK;
3: for V 0 V : jV j jV 0j max do
4: N K0 BuildOptimalN etwork(V0; IndC (K));
5: if Score(N K0; IndC (K)) Score(BestN etwork; IndC (K)) then
6: BestN etwork N K0;
7: end if
8: end for
9: return BestN etwork;
      </p>
      <sec id="sec-3-1">
        <title>Different Assumptions on the Ignorance Model</title>
        <p>
          However, during the learning process, it may happen that the concept membership
between a training individual and some of the feature concepts may not be known.
Depending on the reason of such missingness, Probabilistic Graphical Models offer a
variety of approaches of handling this [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Formally, the missing data handling method
depends on the probability distribution underlying the missingness pattern [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], which
in turn can be classified on the basis of its behaviour with respect to the variable of
interest.
        </p>
        <p>
          – Missing Completely At Random (MCAR) – in this case, the variable of interest is
independent from its observability, as any other variable in the probabilistic model.
This is the precondition for case deletion to be valid, and missing data does not
usually belong to such class [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
– Missing At Random (MAR) – happens when the observability of the variable of
interest depends on the value of some other variable in the probabilistic model.
– Not Missing At Random/Informatively Missing (NMAR, IM) – here, the actual
value of the variable of interest influences the probability of its observability.
Example 2. (Different Ignorance Models in Terminological Bayesian Classifiers)
Consider the network in Ex. 1: if the probability that the variable F e is observable is
independent on all other variables in the network, then it’s missing completely at random;
if it only depends, for example, on the value of F W S, then it’s missing at random; if it
is dependent on the value F e would have if it was not missing, then it is informatively
missing.
        </p>
        <p>
          Each of the aforementioned assumptions on the missingness pattern implies a different
way of learning both network structure and parameters in presence of partially observed
data. If MCAR holds, Available Case Analysis [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] can be used, where maximum
likelihood network parameters are estimated using only available knowledge (i.e. ignoring
missing data); we are adopting the heuristic used in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] of setting network’s parameters
to their maximum likelihood value, which is both accurate and efficient. As scoring
function, similarly to [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], we adopt the conditional log-likelihood on positive and
negative training individuals, defined as 3:
CLL(NK j IndC (K)) =
log Pr(C(a) j NK)+
        </p>
        <p>log Pr(:C(a) j NK);</p>
        <p>X
a2IndC+(K)</p>
        <p>
          X
a2IndC (K)
A problem with using simply CLL as scoring criterion is that it tends to favour complex
structures [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] that overfit the training data. To avoid overfitting, we penalize the
conditional log-likelihood through the Bayesian Information Criterion (BIC) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], where the
penalty is proportional to the number of independent parameters in a network
(according to the minimum description length principle) and is defined as follows:
BIC(NK j IndC (K)) = CLL(NK j IndC (K))
log N
2
j G j;
(1)
where N is the number of data points and j G j is the number of independent parameters
in the network. Under the na¨ıve Bayes assumption, there is no need to perform a search
for finding the optimal network, since the structure is already fixed (each node except
the target concept node has only one parent, which is the target concept node).
Without constraining the space of possible network structures, finding a structure which is
optimal under some criterion may require an exhaustive search in the space of possible
structures. However, in the case of TAN networks, if the scoring function is
decomposable, there is an efficient method of finding a globally optimal network structure [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
In this work, we create a complete weighted digraph among feature variables, each
directed edge weighted with the BIC score (defined using the model’s log-likelihood)
gain that adding that edge would provide to the network, and then find the
maximum weighted spanning tree structure using Chu-Liu-Edmonds algorithm (which has
a O(V 2) time complexity on dense digraphs, where V is the number of nodes). When
learning network parameters from MAR data, a variety of techniques is available, such
as Expectation-Maximization (EM), MCMC sampling or gradient ascent [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In this
work, EM is used as outlined in Alg. 3: it first initialises network parameters using
estimates that ignore missing data; Then, it considers individuals whose membership to a
generic concept D is not known as several fractional individuals belonging, with
different weights (corresponding to the posterior probability of their concept membership), to
both the components D and :D; such fractional individuals are used to recalculate
network parameters (obtaining the so-called expected counts) and the process is repeated
3 When used to score networks, conditional log-likelihoods are calculated ignoring available
knowledge about the membership between training individuals and the target concept.
until convergence (e.g. when the improvement in log-likelihood is lower than a specific
threshold). At each iteration, the EM algorithm applies the following two steps:
Algorithm 3 Outline for our implementation of the EM algorithm for parameter
learning from MAR data in a terminological Bayesian classifier.
        </p>
        <p>
          For structure learning in TAN Bayesian networks from MAR data, this works used
the Structural EM (SEM) algorithm [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In SEM, outlined in Alg. 4, the maximization
step is performed both in the space of structures G and in the space of parameters G ,
by first searching a better structure (maximizing the expected score of the network)
and then the best parameters associated to the given structure. It can be proven that, if
the search procedure finds a structure that is better than the one used in the previous
iteration wrt a scoring function, then the SEM algorithm will monotonically improve
such score. At each iteration of the SEM algorithm, we find the very same approach we
used with MCAR data, except that we employ the expected value of the BIC score [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
on training individuals.
        </p>
        <p>When data is NMAR/IM it may be harder to model, since we cannot assume that
observed and missing values follow the same distributions. However, it is generally
possible to extend the probabilistic model to produce one where the MAR assumption
holds; if the value of a variable associated to the feature concept Fi is informatively
missing, we can consider its observability as a indicator Boolean variable Oi (such that
Oi = F alse iff K 6j= Fi(a) and K 6j= :Fi(a), Oi = T rue otherwise) and include it
in our probabilistic model, so that Fi’s ignorance model satisfies the MAR assumption
(since the probability of Fi to be observable depends on the always observable indicator
variable Oi). Doing this may however raise some problems, since the induced
probabilistic model will be dependent on the specific ignorance model in the training set, and
changes in such missingness pattern may impact on the model’s effectiveness.</p>
        <p>
          An alternate solution proposed in literature is Robust Bayesian Estimation [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
(RBE), which allows to learn interval-valued conditional probability distributions which
explicitly represent the uncertainty about network parameters. RBE allows to infer
posterior probability intervals instead of single posterior probability values, obtained by
taking in account all the possible fillings of the missing knowledge. Such
intervalvalued and posterior intervals 4 can be calculated in closed form, as described in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
To score each induced network, we empirically choose to calculate posterior intervals,
get their central point and then use them as probability values to calculate e.g. the BIC
score as in Eq. 1. Another evaluation approach has been proposed in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] to compare
credal classifiers, and proposes using a scoring criterion based on discounted accuracy
and a function indicating risk-aversion.
        </p>
        <p>Example 3. (Example of Terminological Na¨ıve Bayesian Classifier using RBE)
Consider again the terminological na¨ıve Bayesian classifier in Example 1: when learning in
4 A posterior interval estimate represents the range of probability values associated to the
membership of an instance to a class.
presence of NMAR data, it can be extended with interval-valued network parameters
for inferring posterior probability intervals instead of single posterior probability values
through Robust Bayesian Estimation. In such class of networks, conditional probability
tables associated to each node contain convex intervals of probability values instead of
single probability values, each defined by its upper and lower bound.</p>
        <p>[Pr(F W S);Pr(F W S)]
F W S := F atherW ithSibling</p>
        <p>Pr(F ejF W S);Pr(F ejF W S)]
[Pr(F ej:F W S);Pr(F ej:F W S)]</p>
        <p>Pr(HCjF W S);Pr(HCjF W S)]
[Pr(HCj:F W S);Pr(HCj:F W S)]</p>
        <p>
          Pr(HSjF W S);Pr(HSjF W S)]
[Pr(HSj:F W S);Pr(HSj:F W S)]
Interval-valued network parameters can be calculated efficiently [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. E.g. the
parameters associated to the feature concept HC can be calculated as follows:
n(HCjF W S)=n(?jF W S)+n(HCj?)+n(?j?); n(HCjF W S)=n(?jF W S)+n(:HCj?)+n(?j?);
        </p>
        <p>Pr(HCjF W S)= n(nH(CFjaF)+a)n+(nH(CHjCFjWF WS)S) ; Pr(HCjF W S)= n(F WnS( H)+Cnj(FHWCSjF) W S) ;
where n(? j F W S) = jfa 2 IndF+W S (K) j K 6j= HC(a) and K 6j= :HC(a)gj,
n(HC j?) = jfa 2 Ind0F W S (K) j K j= HC(a)gj and n(? j?) = jfa 2 Ind0F W S (K) j
K 6j= HC(a)^K 6j= :HC(a)gj. Inference can be performed as follows: given a generic
individual a such that K j= HC(a), the probability that a is a member of concept F W S
belongs to the posterior probability interval [Pr(F W S j HC); Pr(F W S j HC)],
where:</p>
        <p>Pr(F W S j HC)= Pr(HCjF W S)PPrr((FHWCjSF)W+PSr)(PHr(CFj:WFSW) S)Pr(:F W S) ;</p>
        <p>Pr(F W S j HC)= Pr(HCjF W S)PrP(rF(HWCSj)F+WPrS(H)PCr(jF:Fa)W S)Pr(:F W S) ;
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>In this section we empirically evaluate the impact of adopting different missing
knowledge handling methods and search strategies, during the process of learning (na¨ıve and
TAN) TBCs from real world ontologies. Starting from a set of real ontologies 5
(outlined in Table 1), we generated a set of 20 random query concepts for each ontology 6,
so that the number of individuals belonging to the target query concept C (resp. :C)
was at least of 10 elements and the number of individuals in C and :C was in the
5 From TONES Ontology Repository: http://owl.cs.manchester.ac.uk/repository/
6 Using the query concept generation method available at http://lacam.di.uniba.it:
8000/~nico/research/ontologymining.html</p>
      <p>Ontology
MDM0.73</p>
      <p>LEO
FAMILY-TREE</p>
      <p>WINE
BIOPAX (PROTEOMICS)</p>
      <p>
        ALCHOF (D)
ALCHIF (D)
SROIF (D)
SHOIN (D)
ALCHN (D)
same order of magnitude. A DL reasoner 7 was employed to decide on the
theoretical concept-membership of individuals wrt the query concepts. In experiments, we
relearned such concept queries as (na¨ıve and TAN) TBCs, using individuals retrieved by
each query (resp. its complement) as positive (resp. negative) examples. The evaluated
missing knowledge handling methods were Robust Bayesian Estimation (ROBUST)
and, for na¨ıve and TAN networks respectively: Available Case Analysis (ACA and
TACA), the (structural) EM algorithm (EM and SEM), and two additional approaches
aiming at including a features’ observability in the resulting model (IM3 and IM2 for
na¨ıve and TIM3 and TIM2 for TAN structures). The last two approaches build networks
which are dependant on the ignorance model: IM3 and TIM3(where IM here stands for
Informatively Missing) makes use of three-valued feature variables taking a value in
fT rue; F alse; U nknowng when the membership to the associated feature concept is
respectively true, false or not known; while IM2 and TIM2 employ two-valued feature
variables, taking a value in fT rue; Otherg, when the membership to the associated
feature concept is respectively true or either false or not known. During experiments,
refinements were only allowed to contain conjunctions/disjunctions of concepts,
com7 Pellet v2.3.0 – http://clarkparsia.com/pellet/
plements and existential restrictions, and refinements started from concept &gt;. To avoid
overfitting, the greedy network construction was driven by the BIC score in Eq. 1. In
experiments, each of the 20 generated query concepts, was used to obtain a pair of sets
composed by positive and negative examples, selecting the individuals in the ontology
belonging respectively to the query concept and its complement. On each of such pairs
of positive/negative examples, k-fold cross validation (with k = 10) was used to
estimate k Area Under the Precision-Recall Curve [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (AUC-PR) values (for ROBUST
we used the midpoint of each posterior interval was used to associate a probability to
concept-memberships), using inferred concept-membership probability to rank testing
individuals. Results are summarised in Table 2. Parameters depth and maxLength
(indicating resp. the maximum depth of each refinement step and the maximum length
of a feature concept) were both set to 3 (2 in the case of the more complex ontology
FAMILY-TREE). In almost every case, forcing the existence of a maximum (penalized)
likelihood tree structure to exist between feature concepts did not benefit the ranking
capability: for example, in the IM3/TIM3 and IM2/TIM2 cases, na¨ıve Bayesian
networks had significantly greater AUC-PR values than TAN counterparts (with p &lt; 0:01
under a Student’s paired t-test); a reasons for this is that BIC-driven search, because
of the higher cost of adding feature concepts (depending on the higher number of
network parameters), prevents the introduction of discriminative feature concepts in the
network to keep its structure simple. This is also the reason that caused, in all
feature selection strategies, IM2/TIM2 to have higher AUC-PR scores than their IM3/TIM3
counterparts (with p &lt; 0:01). Comparing the methods to learn the parameters of na¨ıve
Bayesian networks on different assumptions on the missingness pattern, it emerged that
IM2 had greater AUC-PR results with all the experimented feature selection approaches
(with p &lt; 0:01), suggesting that the missingness of concept-membership information
was informative (except in the case of the LEO ontology, where other approaches to
dealing with missing informations had similar results). Also, using the midpoint of
Robust Bayesian Estimators’ posterior intervals led to worse results when ranking target
concept-membership probabilities. However, it can still be used to explicitly represent
the uncertainty on parameters caused by missing knowledge within the Semantic Web.
5
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>This paper proposes a method, terminological Bayesian classifiers, to efficiently
estimate the probability that a generic individual belongs to a specific target concept, given
its concept-membership relation to a set of DL feature concepts; this work focused on
network structures which allow for efficient inference and learning, and empirically
evaluated different methods to handle missing data resulting from the adoption of the
OWA. In the future, we aim at exploring other network structure which allow for
efficient inference and learning, at extending this framework towards role-membership
prediction and to evaluate it more extensively on real world ontologies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niculescu-Mizil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An empirical comparison of supervised learning algorithms</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on Machine learning</source>
          . pp.
          <fpage>161</fpage>
          -
          <lpage>168</lpage>
          . ICML '06,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Corani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaffalon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning reliable classifiers from small or incomplete data sets: The naive credal classifier 2</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          ,
          <fpage>581</fpage>
          -
          <lpage>621</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goadrich</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The relationship between precision-recall and roc curves</article-title>
          .
          <source>In: ICML 2006</source>
          . pp.
          <fpage>233</fpage>
          -
          <lpage>240</lpage>
          . ACM, New York, NY, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowd</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kok</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poon</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singla</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Uncertainty reasoning for the semantic web i</article-title>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazzani</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>On the optimality of the simple bayesian classifier under zeroone loss</article-title>
          .
          <source>Machine Learning</source>
          <volume>29</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>103</fpage>
          -
          <lpage>130</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geiger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldszmidt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Provan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Bayesian network classifiers</article-title>
          .
          <source>In: Machine Learning</source>
          . pp.
          <fpage>131</fpage>
          -
          <lpage>163</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Getoor</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning)</article-title>
          . The MIT Press (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Grossman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Learning bayesian network classifiers by maximizing conditional likelihood</article-title>
          . In: Brodley,
          <string-name>
            <surname>C.E</surname>
          </string-name>
          . (ed.) ICML. ACM International Conference Proceeding Series, vol.
          <volume>69</volume>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Guyon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikravesh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>L</given-names>
          </string-name>
          . (eds.): Feature Extraction,
          <source>Foundations and Applications</source>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hitzler</surname>
          </string-name>
          , P.,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A reasonable semantic web</article-title>
          .
          <source>Semantic Web</source>
          <volume>1</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
          </string-name>
          , N.:
          <article-title>Probabilistic Graphical Models: Principles and Techniques</article-title>
          . MIT Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Langley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sage</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Induction of selective bayesian classifiers</article-title>
          . In: de Ma´ntaras,
          <string-name>
            <given-names>R.L.</given-names>
            ,
            <surname>Poole</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (eds.) UAI. pp.
          <fpage>399</fpage>
          -
          <lpage>406</lpage>
          . Morgan Kaufmann (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Laskey</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laskey</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          :
          <article-title>Uncertainty reasoning for the world wide web: Report on the urw3-xg incubator group</article-title>
          .
          <source>In: URSW2008</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Concept learning in description logics using refinement operators</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>78</volume>
          ,
          <fpage>203</fpage>
          -
          <lpage>250</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Luna</surname>
            ,
            <given-names>J.E.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cozman</surname>
            ,
            <given-names>F.G.</given-names>
          </string-name>
          :
          <article-title>An algorithm for learning with probabilistic description logics</article-title>
          . In: Bobillo,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>da</surname>
          </string-name>
          <string-name>
            <given-names>Costa</given-names>
            , P.C.G.,
            <surname>d'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Fanizzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Laskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.B.</given-names>
            ,
            <surname>Laskey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.J.</given-names>
            ,
            <surname>Lukasiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Nickles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Pool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Smrz</surname>
          </string-name>
          , P. (eds.) URSW. pp.
          <fpage>63</fpage>
          -
          <lpage>74</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Ramoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Robust learning with missing data</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>45</volume>
          ,
          <fpage>147</fpage>
          -
          <lpage>170</lpage>
          (
          <year>October 2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Rettinger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Lo¨sch, U.,
          <string-name>
            <surname>Tresp</surname>
          </string-name>
          , V.,
          <string-name>
            <surname>d'Amato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fanizzi</surname>
          </string-name>
          , N.:
          <article-title>Mining the semantic web - statistical learning for next generation knowledge bases. Data Mining and Knowledge Discovery - Special Issue on Web Mining (</article-title>
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Rettinger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nickles</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tresp</surname>
          </string-name>
          , V.:
          <article-title>Statistical relational learning with formal ontologies</article-title>
          . In: Buntine,
          <string-name>
            <given-names>W.L.</given-names>
            ,
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Mladenic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J. (eds.)
          <article-title>ECML/PKDD (2)</article-title>
          . LNCS, vol.
          <volume>5782</volume>
          , pp.
          <fpage>286</fpage>
          -
          <lpage>301</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov logic networks</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>62</volume>
          ,
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          (
          <year>February 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Rodrigues De Morais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aussem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Exploiting data missingness in bayesian network modeling</article-title>
          .
          <source>In: Proceedings of the 8th International Symposium on Intelligent Data Analysis: Advances in Intelligent Data Analysis VIII</source>
          . pp.
          <fpage>35</fpage>
          -
          <lpage>46</lpage>
          . IDA '
          <volume>09</volume>
          ,
          <string-name>
            <surname>Springer</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          :
          <article-title>Inference and missing data</article-title>
          .
          <source>Biometrika</source>
          <volume>63</volume>
          (
          <issue>3</issue>
          ),
          <fpage>581</fpage>
          -
          <lpage>592</lpage>
          (
          <year>1976</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tresp</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          :
          <article-title>Infinite hidden relational models</article-title>
          .
          <source>In: Proceedings of the 22nd International Conference on Uncertainity in Artificial Intelligence (UAI</source>
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Zaffalon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Maua´,
          <string-name>
            <surname>D.</surname>
          </string-name>
          :
          <article-title>Utility-based accuracy measures to empirically evaluate credal classifiers</article-title>
          .
          <source>In: ISIPTA 2011</source>
          . pp.
          <fpage>401</fpage>
          -
          <lpage>410</lpage>
          . Innsbruck
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>