<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Terminological Na¨ıve Bayesian Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pasquale Minervini</string-name>
          <email>pasquale.minervini@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia d'Amato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Fanizzi</string-name>
          <email>fanizzig@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LACAM - Dipartimento di Informatica - Universita` degli Studi di Bari “Aldo Moro” via E. Orabona</institution>
          ,
          <addr-line>4 - 70125 Bari -</addr-line>
          <country country="IT">Italia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge available through Semantic Web standards can easily be missing, generally because of the adoption of the Open World Assumption (i.e. the truth value of an assertion is not necessarily known). However, the rich relational structure that characterizes ontologies can be exploited for handling such missing knowledge in an explicit way. We present a Statistical Relational Learning system designed for learning terminological na¨ıve Bayesian classifiers, which estimate the probability that a generic individual belongs to the target concept given its membership to a set of Description Logic concepts. During the learning process, we consistently handle the lack of knowledge that may be introduced by the adoption of the Open World Assumption, depending on the varying nature of the missing knowledge itself.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        On the Semantic Web (SW) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] difficulties arise when trying to model real-world
domains using purely logical formalisms, since real-world knowledge generally involves
some degree of uncertainty or imprecision. In recognition of the need to soundly
represent uncertain knowledge, the World Wide Web Consortium (W3C) created, in 2007,
the Uncertainty Reasoning for the World Wide Web Incubator Group 1 (URW3-XG),
with the aim of identifying the requirements for reasoning with and representing the
uncertain knowledge in Web-based information.
      </p>
      <p>Several approaches to representation and inference with knowledge enriched with
probabilistic information have been proposed: some extend knowledge representation
formalisms actually used in the SW; others rely on probabilistic enrichment of
Description Logics or logic programming formalisms.</p>
      <sec id="sec-1-1">
        <title>Motivation</title>
        <p>The main problem of applying these approaches in real settings is given by the fact
that they almost always assume the availability of probabilistic information. However,
except of seldom cases, this information would be hardly known in advance. Having a
method that, exploiting available information on the data, i.e. an already designed and
1 http://www.w3.org/2005/Incubator/urw3/
populated ontology, is able to capture the necessary probabilistic information would be
of great help.</p>
        <p>
          Also, when relying on SW knowledge bases for reasoning with the Open World
Assumption (OWA) (e.g. when OWL is considered as a syntactic variant of some
Description Logic [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]), it is not always possible to know the truth value of an assertion:
under OWA, a statement is true or false only if its truth value can be formally derived.
As a consequence, there can be some cases (e.g. determining if an individual is a
member of a given concept) for which the truth value cannot be determined (it cannot be
derived neither that the individual is instance of the considered concept nor that the
individual is instance of the negated concept). This is opposed by the Closed World
Assumption (CWA), employed by a multitude of first order logic fragments and in the
Data Base setting where every statement that cannot be proved to be true, is assumed to
be false.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Related Work</title>
        <p>
          Within the SW, Machine Learning (ML) is going to cover a relevant role in the
analysis of distributed data sources described using SW standards [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], with the aim of
discovering new and refining existing knowledge. A collection of ML approaches
oriented to SW have already been proposed in literature, ranging from propositional and
single-relational (e.g. SPARQL-ML [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], or based on low-rank matrix approximation
techniques such as in [
          <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
          ]) to multi-relational (e.g. distance-based [
          <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
          ] or
kernelbased [
          <xref ref-type="bibr" rid="ref10 ref3">10, 3</xref>
          ]).
        </p>
        <p>
          In the class of multi-relational learning methods, Statistical Relational Learning [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
(SRL) one seem particularly appealing, being designed to learn in domains with both a
complex relational and a rich probabilistic structure; the URW3-XG provided in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] a
large group of situations in which knowledge on the SW needs to represent uncertainty,
ranging from recommendation and extraction/annotation to belief fusion/opinion
pooling and healthcare/life sciences. There have already been some proposals regarding
the adaptation and application of SRL systems to the SW, e.g. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] proposes to employ
Markov Logic Networks [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] for first-order probabilistic inference and learning within
the SW, and [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] proposes to learn first-order probabilistic theories in a probabilistic
extension of the ALC Description Logic named CRALC.
        </p>
        <p>
          However, such ML techniques make strong assumptions about the nature of the
missing knowledge (e.g. both matrix completion methods and the technique proposed in
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] inherently assume data is Missing at Random [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], while Markov Logic Networks
resort to Closed World Assumption during learning). Learning from incomplete
knowledge bases by adopting methods not coherent with the nature of the missing knowledge
itself (e.g. expecting it to be Missing at Random while it is Informatively Missing) can
lead to misleading results with respect to the real model followed by the data [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>We realised a SRL system for incrementally inducing a terminological na¨ıve Bayesian
classifier, i.e. a na¨ıve Bayesian network modelling the conditional dependencies
between a learned set of Description Logic (complex) concepts and a target atomic
concept the system aims to learn. Our system is focused to the SW, being able to learn
classifiers with a structure which is both logically and statistically rich, and to deal with
the missing knowledge resulting from the adoption of the OWA with methods that are
consistent with the assumed nature of the missing knowledge (i.e. Missing Completely
at Random, Missing at Random or Informatively Missing). In the rest of this paper, we
will first describe Bayesian Networks (and some extensions we will employ to deal
with some potentially problematic cases); then we will describe our probabilistic-logic
model, terminological Bayesian classifiers, and the problem of learning it from a set of
training individuals and a Description Logic knowledge base. In the last part, we will
describe our learning algorithm, and the adaptations to learn under different
assumptions on the ignorance model.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Bayesian Networks and Robust Bayesian Estimation</title>
      <p>
        Graphical models [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] (GMs) are a popular framework to compactly describe the joint
probability distribution for a set of random variables, by representing the underlying
structure through a series of modular factors. Depending on the underlying semantics,
GMs can be grouped into two main classes: directed graphical models, which found on
directed graphs, and undirected graphical models, founding on undirected graphs.
      </p>
      <p>A Bayesian network (BN) is a directed GM which represents the conditional
dependencies in a set of random variables by using a directed acyclic graph (DAG) G
augmented with a set of conditional probability distributions G associated with G’s
vertices. In such graph, each vertex corresponds to a random variable Xi (e.g. an
observable quantity, a set of unknown parameters etc.) and each edge indicates a direct
influence relation between the two random variables; this allows to define conditional
independence relationships between the variables, which are independent from any of
their non-descendants, given the value of their parent variables.</p>
      <p>A BN stipulates a set of conditional independence assumptions, also called local
Markov assumptions, over its set of random variables: each vertex Xi in the DAG is
conditionally independent of any subset S N d(Xi) of vertices that are not
descendants of Xi given a joint state of its parents:</p>
      <p>8Xi : Pr(Xi j S; parents(Xi)) = Pr(Xi j parents(Xi));
where the function parents(Xi) returns the parent vertices of Xi in the DAG
representing the BN. The conditional independence assumption allows to represent the joint
probability distribution Pr(X1; : : : ; Xn) defined by a BN over a set of random variables
fX1; : : : ; Xng as a production of the individual probability distributions, conditional on
their parent variables:</p>
      <p>Pr(X1; : : : ; Xn) =
n
Y Pr(Xi j parents(Xi));
i=1</p>
      <p>As a result, it is possible to define Pr(X1; : : : ; Xn) by only specifying, for each
vertex Xi in the graph, the conditional probability distribution Pr(Xi j parents(Xi)).</p>
      <p>Given a BN specifying a joint probability distribution over a set of variables, it is
possible to evaluate inference queries by marginalization, like calculating the posterior
probability distribution for a set of query variables given some observed event (i.e.
assignment of values to the set of evidence variables).</p>
      <p>
        Exact inference for general BNs is an NP-hard problem, but algorithms exist to
efficiently infer in restricted classes of networks, such as variable elimination, which
has linear complexity in the number of vertices if the BN is a singly connected
network [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Approximate inference methods also exist in literature, such as Monte Carlo
algorithms, that provide approximate answers whose accuracy depends on the
number of samples generated. Other methods in this family, such as belief propagation or
variational methods, approximate sums of random variables through their means [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>However, finding an optimal structure for a BN may not be trivial: the number of
possible structures for a DAG is super-exponential (O(2f (n)), with f (n) = n1+ , &gt;
0) in the size of its vertices (r4 = 543, r8 7; 8 1011, r12 5; 2 1026), making
it impractical, in many cases, to perform an exhaustive search through the space of
possible structures. Therefore, in our approach, we tried to find an acceptable trade-off
between efficiency and expressiveness, so to make our method suitable for a context
like SW: we decided to focus on a particular subclass of Bayesian networks, i.e. na¨ıve
Bayesian networks, modelling the dependencies between a set of random variables F =
fF1; : : : ; Fng, also called features, and a random variable C, also called class, so that
each pair of features are independent of each other given the class, i.e. 8Fi; Fj 2 F :
i 6= j ) (Fi ?? Fj j C).</p>
      <p>
        This kind of models is especially interesting since they proved to be effective also
in contexts in which the underlying independence assumptions are violated [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], even
outperforming more current approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        However, defining a BN requires a number of precise probability assessments which,
as we will see, will not be always possible to obtain. A generalisation of na¨ıve Bayesian
networks to probability intervals is the robust Bayesian estimator [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] (RBE): each
conditional probability in the network is a probability interval characterised by its lower
and upper bounds, defined respectively as Pr(A) = min Pr(A) and Pr(A) = max Pr(A).
Pr2P Pr2P
      </p>
      <p>
        The main problem with this approach is assigning class labels, after having
calculated the posterior probability intervals: if the two resulting intervals do not overlap, it is
possible to apply the so called stochastic dominance criterion, which assigns a generic
individual a to a target concept C iff Pr(C(a)) &gt; Pr(:C(a)). If the intervals overlap,
to avoid undecidability, it is still possible to use a weaker criterion, called weak
dominance criterion [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] by representing each probability interval into a single probability
value represented by its middle point, which indeed underlies some assumptions on the
distribution of the missing values.
      </p>
      <p>
        A similar approach, founded on imprecise probability theory, is presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
proposes using a Credal network (structurally similar to a BN, but where the conditional
probability densities belong to convex sets of mass functions) to represent uncertainty
about network parameters.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Terminological Na¨ıve Bayesian Classifiers</title>
      <p>The learning problem we intend to focus on consists in learning a terminological na¨ıve
Bayesian classifier NK; this is defined as a na¨ıve BN modelling the dependency
relations between a set of Description Logic (DL) concepts (also referred to as feature
concepts) and a target atomic concept C, given a set of training individuals. Feature
concepts may eventually be complex, and the training individuals are distinguished in
positive, negative and neutral, belonging respectively to the target concept C, :C and
or whose membership of C is unknown. A DL Knowledge Base (KB) K is typically
constituted by (at least) two main components, a TBox T and an ABox A:
– TBox – which introduces the terminology of an application domain, in terms of
axioms describing concept hierarchies;
– ABox – which contains assertions (ground axioms) about named individuals in
terms of this terminology.</p>
      <p>A terminological Bayesian classifier can be defined as follows:</p>
      <sec id="sec-3-1">
        <title>Definition 1 (Terminological Bayesian Classifier). A terminological Bayesian classi</title>
        <p>fier NK, with respect to a DL KB K, is defined as a pair hG; G i, where:
– G = hV; E i is a directed acyclic graph, in which:</p>
        <p>V = fF1; : : : ; Fn; Cg is a set of vertices, each Fi representing a DL (eventually
complex) concepts defined over K and C representing a target atomic concept;
E V V is a set of edges, modelling the independence relations between the
elements of V;
– G is a set of conditional probability distributions (CPD), one for each V 2 V,
representing the conditional probability of the feature concept given the state of its
parents in the graph.
in which the membership probability of a generic individual a to the target concept C
(or :C) is estimated using BN inference techniques given the membership of a to the
concepts in V.</p>
        <p>In particular, a terminological na¨ıve Bayesian Classifier is characterised by the
following structure: E = fhC; Fii j i 2 f1; : : : ; ngg (i.e. each feature concept is
independent from the other feature concept, given the value of the target atomic concept).
Example 1 (Example of Terminological Na¨ıve Bayesian Classifier). Given a set of DL
feature concepts F = fF emale; HasChild := 9hasChild:P ersong 2 and a target
concept F ather, a terminological na¨ıve Bayesian classifier expressing the target
concept in terms of the feature concepts is the following:</p>
        <p>F ather</p>
        <p>Pr(F emalejF ather)
Pr(F emalej:F ather)
Pr(HasChildjF ather)
Pr(HasChildj:F ather)</p>
        <p>F emale</p>
        <p>HasChild := 9hasChild:P erson</p>
        <p>Let K be a DL KB and a a generic individual so that K j= HasChild(a) and the
membership of a to the concept F emale is not known, i.e. K 6 j=F emale(a) ^ K 6
j=:F emale(a). It is possible to infer, through the given network, the probability that
the individual a is a member of the target atomic concept F ather:
2 In examples, variable names are used instead of complex feature concepts for brevity
Pr(F ather(a)) =</p>
        <p>X
F ather02fF ather;:F atherg</p>
        <sec id="sec-3-1-1">
          <title>Pr(F ather0) Pr(HasChild j F ather0)</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Pr(F ather) Pr(HasChild j F ather)</title>
          <p>;</p>
          <p>In the following we define the problem of learning a terminological Bayesian
classifier NK given a DL KB K and the training individuals IndC (A):</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Definition 2 (Terminological Bayesian Classifier Learning Problem). Our termino</title>
        <p>logical na¨ıve Bayesian classifier learning problem consists in finding a network NK
that maximizes the quality of the network with respect to the training instances and a
specific scoring function; formally:
Given
– a target concept C we aim to learn;
– a DL KB K = hT ; Ai, where the ABox A contains membership assertions
about individuals and C, while the TBox T does not contain assertions
involving C;
– the disjoint sets of of positive, negative and neutral examples for C, denoted
with IndC+(A), IndC (A) and Ind0C (A), so that:
8a 2 IndC+(A) : C(a) 2 A,
8a 2 IndC (A) : :C(a) 2 A,
8a 2 Ind0C (A) : C(a) 62 A ^ :C(a) 62 A;
– a scoring function specifying the quality of an induced terminological Bayesian
classifier NK with respect to the samples in</p>
        <p>IndC (A) = Sv2f+; ;0g IndvC (A) and a scoring criterion;
Find a network NK maximizing the score function with respect to the samples:
NK
arg max score(NK; IndC (A))):</p>
        <p>NK</p>
        <p>
          Our search space, to find the optimal network NK, may be too large to explore
exhaustively; therefore our learning algorithm, outlined in Alg. 1, works by greedily
searching the space of features (i.e. DL complex concepts) for the ones that maximize
the score of the induced network, with respect to a scoring function, and incrementally
building the resulting network. While the features are added one by one, the search in
the space of DL complex concepts is made through a beam search, employing the cl
#
closure of the downward refinement operator # described in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>For each new complex concept being evaluated, the algorithm creates a new set of
concepts V0 and finds the optimal structure (under a given set of constraints) E 0 (which,
in the case of terminological na¨ıve Bayesian classifiers, is already defined) and the
corresponding maximum likelihood parameters G0 (which may vary depending on the
assumptions on the nature of the ignorance model), then scores the new network with
respect to a scoring criterion.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Different Assumptions on the Ignorance Model</title>
        <p>
          Let K = hT ; Ai be a DL KB; under OWA, it is not always possible to know if a
generic DL assertion is or is not entailed by K (i.e. there may be cases in which
K 6 j= ^K 6 j=: ). This allows us to characterize our lack of knowledge about
conceptmemberships through the probability distribution of the ignorance model [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Given
a generic concept C, a generic individual a and a DL KB K , let I be an ignorance
model from which we extract a fragment of K , I(K ) = K (so that 8 : K j=
) K j= ^ K j= 6) K j= ). Let denote NK as a probabilistic model
that, from a DL KB K, calculates the probability that the concept-membership relation
between C and a is unknown. We can say that the ignorance model underlying the
concept-membership relation between a and C in K (with respect to a, K and the
aforementioned probabilistic model) is:
– MCAR (Missing Completely at Random) – when the probability for such
conceptmembership to be missing is independent from the knowledge on a available in K :
Pr(K 6j= C(a) ^ K 6j= :C(a) j K ) = Pr(K 6j= C(a) ^ K 6j= :C(a));
– MAR (Missing At Random) – when the probability for such concept-membership
to be missing depends on the knowledge on a available in K:
        </p>
        <p>Pr(K 6j= C(a) ^ K 6j= :C(a) j K ) = Pr(K 6j= C(a) ^ K 6j= :C(a) j K);
– NMAR (Not Missing At Random, also referred to as IM, Informatively Missing)
– when the probability for such concept-membership to be missing depends on the
knowledge on a available in K :</p>
        <p>Pr(K 6j= C(a) ^ K 6j= :C(a) j K ) 6= Pr(K 6j= C(a) ^ K 6j= :C(a) j K).
Algorithm 1 Algorithm for Learning Terminological Bayesian Classifiers</p>
        <p>Specifically, in our algorithm, the outer loop (lines 1-15) greedily searches for a
new (complex) concept definition whose addition increases the network’s quality on the
given sample instances (determined by a scoring function score). The search through
the space of concept definitions is performed in the inner loop (lines 3-13) through a
beam search: starting from a beginning concept Start, for each refinement level, all
refinements up to a given length are memorized in a priority queue N ewBeam (sorted
according to the score associated to the network generated by adding them to the set
of feature concepts) from which only the k with the highest score are selected, by the
selection function selectF rom, to be refined in the next iteration.</p>
        <p>The functions optimalN etwork and score are used, respectively, to find the
optimal Bayesian network structure between the nodes in the network (eventually under a
set of constraints, like in the na¨ıve Bayes case or some of its extensions) and for scoring
a classifier (to compare its effectiveness with others). However, those two functions are
sensitive to the assumptions made about the ignorance model.</p>
        <p>
          When the assumed ignorance model is MCAR, we are allowed to use an approach
called available case analysis [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], in which we build an unbiased estimator of the
network parameters, based only on available knowledge. A scoring function we realised
for such case is the network’s log-likelihood on training data, calculated only on positive
and negative training individuals, ignoring the available knowledge about the
conceptmembership relations between such individuals and the target concept C, and defined
as:
L(NK j IndC (A)) = log Pr(NK) +
Another approach we implemented consisted in ranking both positive and negative
training individuals a according to P (C(a) j NK), and then calculating the area
under the Precision-Recall curve using different acceptance thresholds.
        </p>
        <p>
          Under the na¨ıve Bayes assumption, there is no need to perform a search for
finding the optimal network, since the structure is already fixed (each node except the target
concept node has only one parent, i.e. the target concept node); otherwise, finding a
network structure which is optimal under some criterion (e.g. the BIC score [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]) may
require an exhaustive search in the space of possible structures. However, tree-augmented
na¨ıve Bayesian networks (which allow for a tree structure among feature nodes), it is
possible to efficiently compute the optimal structure employing the method in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
making it appealing for real-life applications requiring efficiency and scalability.
        </p>
        <p>
          In the MAR case, a possible solution for learning models accounting for missing
knowledge is to use the Expectation-Maximization (EM) algorithm, MCMC sampling
or the gradient ascent method [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. We use EM to learn terminological na¨ıve Bayesian
classifiers from MAR data. In our approach, outlined in Alg. 2, we first heuristically
estimate network’s parameters by only using available data; then, in order to find the
maximum likelihood parameters with respect to both observed and missing knowledge,
we consider individuals whose membership to a particular concept description D is not
known as several fractional individuals belonging, with different weights
(corresponding to the posterior probability of their class membership), to both the components D
and :D.
        </p>
        <p>Formally, the EM algorithm for parameters learning explores the space of possible
parameters through an iterative hill-climbing search, converging to a (local) maximum
likelihood estimate of the unknown parameters, where the (log-)likelihood (which we
also use as scoring criterion) is defined as follows:</p>
        <p>In our use of the EM algorithm, the E-step calculates the concept-membership
posterior probability (inferencing through the network) of each individual whose
conceptmembership relation in unknown, thus completing the data through so called expected
counts. Then, the M-step calculates a new estimate of the network’s conditional
probability distributions by using expected counts, maximizing the log-likelihood of both
available and missing data with respect to a network NK.</p>
        <p>
          About finding optimal structures for networks with less restrictions on their
structure (such as tree-augmented na¨ıve BNs or unrestricted BNs) from MAR data, it is
possible to employ the Structural EM (SEM) algorithm [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In SEM, the
maximization step is performed both in the space of structures G and in the space of parameters
        </p>
        <p>G , by first searching a better structure and then the best parameters associated to the
given structure; it can be proven that, if the search procedure finds a structure that is
better than the one used in the previous iteration with respect to e.g. the BIC score, then
the structural EM algorithm will monotonically improve the score.</p>
        <p>When knowledge is NMAR, it is generally possible to extend the probabilistic
model to produce one where the MAR assumption holds; e.g. if a feature concept Fi
follows a NMAR ignorance model, with respect to a generic individual a and a DL
KB K, we can consider its observability as an additional variable (e.g. Yi = 0 iff
K 6j= Fi(a) ^ K 6j= :Fi(a), Yi = 1 otherwise) in our probabilistic model, so that
Fi’s ignorance model satisfies the MAR assumption (since its missingness depends on
an always observable variable).</p>
        <p>
          An alternate solution is recurring to robust Bayesian estimation [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] (RBE), to learn
conditional probability distributions without making any sort of assumption about the
nature of the missing data. RBE finds probability intervals instead of single probability
values, obtained by taking in account all the possible fillings of the missing knowledge;
the width of inferred intervals is therefore directly proportional to the quantity of
missing knowledge considered during the learning process. To score each new induced
network, we employ the framework proposed in [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] to compare credal networks, while
Algorithm 2 Outline for our implementation of the EM algorithm for parameter
learning in a terminological Bayesian classifier assuming the underlying ignorance model is
MAR.
function ExpectedCounts(NK; IndC (A))
1: NK = hG; G i; G = hV; E i;
2: for Xi 2 V do
3: for hxi; xi i 2 vals(Xi; parents(Xi)) do
4: fn(xi; xi ) contains the expected count for (Xi = xi; parents(Xi) = xi )g
5: n(xi; xi ) 0;
6: end for
7: end for
8: for a 2 IndC (A) do
9: for Xi 2 V do
10: for hxi; xi i 2 vals(Xi; parents(Xi)) do
11: fEach expected count n(xi; xi ) is obtained summing out the probability
assignments to the concept memberships (Xi = xi; parents(Xi) = xi ) for each
individual, calculated using the background knowledge K and, if they are only partially
known, inferring through the network NKg
12: n(xi; xi ) n(xi; xi ) + Pr(xi; xi j NK);
13: end for
14: end for
15: end for
16: return fn(xi; xi ) j Xi 2 V; hxi; xi i 2 vals(Xi; parents(Xi))g;
function ExpectationM aximization(NK0 ; IndC (A))
1: fThe network was first initialized with arbitrary heuristic parameters G0 g
2: NK0 = hG; G0 i; G = hV; E i;
3: t 0;
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>4: repeat</title>
        <p>5: fn(xi; xi )g ExpectedCounts(NK; IndC (A));
6: for Xi 2 V do
7: for hxi; xi i 2 vals(Xi; parents(Xi)) do
8: Px0i2vanl(sx(Xi;i)xni()x0i; xi ) ;
t+1(xi; xi )</p>
        <p>G
9: end for
10: end for
11: t t + 1;
12: N Kt = hG; Gt i;
13: fThe EM loop ends when improvements to the network’s log-likelihood go below a
certain thresholdg
14: until L(NKtt = hG; Gt )i j IndC (A)) L(N Kt 1 = hG; Gt 1i j IndC (A)) ;
15: return NK;
we do not have implemented yet a method to search for structures other than na¨ıve
Bayesian.</p>
        <p>Example 2 (Example of Terminological Na¨ıve Bayesian Classifier using Robust Bayesian
Estimation). The following is a terminological na¨ıve Bayesian classifier using robust
Bayesian estimation for inferring posterior probability intervals in presence of NMAR
knowledge. In this networks, conditional probability tables associated to each node
contain probability intervals instead of probability values, each defined by its upper and
lower bound.</p>
        <p>F a := F ather
[Pr(F emalejF ather);Pr(F emalejF ather)]
[Pr(F emalej:F ather);Pr(F emalej:F ather)]
[Pr(HasChildjF ather);Pr(HasChildjF ather)]
[Pr(HasChildj:F ather);Pr(HasChildj:F ather)]</p>
        <p>F e := F emale
HC := 9hasChild:P erson</p>
        <p>Inference, using such network, can be performed as follows – given a generic
individual a and given that K j= HC(a), the posterior probability interval that a is a
member of F a is represented by the probability interval [Pr(F a j HC); Pr(F a j HC)],
where:</p>
        <p>Pr(F a(a)) = Pr(F a j HC) =
Pr(F a(a)) = Pr(F a j HC) =</p>
        <p>Pr(HC j F a)Pr(F a)
Pr(HC j F a)Pr(F a) + Pr(HC j :F a)Pr(:F a)</p>
        <p>Pr(HC j F a)Pr(F a)
Pr(HC j F a)Pr(F a) + Pr(HC j :F a)Pr(:F a)
;
;
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>We presented a Statistical Relational Learning method designed for learning
terminological na¨ıve Bayesian classifiers, a ML method based on the na¨ıve Bayes assumption
for estimating the probability that a generic individual belongs to a certain target
concept, given its membership relation to an induced set of complex Description Logic
concepts. We gave a characterisation of the lack of knowledge that may be introduced
by the OWA depending on the underlying ignorance model, and handled such missing
knowledge under different assumptions on the nature of missing knowledge itself (i.e.
Missing Completely at Random, Missing at Random or Informatively Missing). In the
future, we aim at estimating computationally the ignorance model followed by each
feature, at developing new methods to exploit the potential information contained in
knowledge’s missingness and evaluate our methods’ effectiveness on real world
ontologies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>OWL</given-names>
            <surname>2 Web Ontology Language Direct Semantics</surname>
          </string-name>
          (
          <year>October 2009</year>
          ), http://www.w3. org/TR/owl2-direct-semantics/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lassila</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>The semantic web</article-title>
          .
          <source>Scientific American</source>
          <volume>284</volume>
          (
          <issue>5</issue>
          ),
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
          (May
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bicer</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Relational kernel machines for learning from graph-structured rdf data</article-title>
          . In: Antoniou,
          <string-name>
            <surname>G.</surname>
          </string-name>
          , et al. (eds.)
          <source>ESWC (1)</source>
          . LNCS, vol.
          <volume>6643</volume>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>62</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niculescu-Mizil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>An empirical comparison of supervised learning algorithms</article-title>
          .
          <source>In: ICML2006</source>
          . pp.
          <fpage>161</fpage>
          -
          <lpage>168</lpage>
          . ACM, New York, NY, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Corani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaffalon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Naive credal classifier 2: an extension of naive bayes for delivering robust classifications</article-title>
          .
          <source>In: DMIN</source>
          . pp.
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>d'Amato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fanizzi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esposito</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Query answering and ontology population: an inductive approach</article-title>
          .
          <source>In: ESWC 2008</source>
          . pp.
          <fpage>288</fpage>
          -
          <lpage>302</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Uncertainty reasoning for the semantic web i. chap. Just Add Weights: Markov Logic for the Semantic Web</article-title>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazzani</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>On the optimality of the simple bayesian classifier under zeroone loss</article-title>
          .
          <source>Machine Learning</source>
          <volume>29</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>103</fpage>
          -
          <lpage>130</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Fanizzi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al.:
          <article-title>Reduce: A reduced coulomb energy network method for approximate classification</article-title>
          . In: Aroyo,
          <string-name>
            <surname>L.</surname>
          </string-name>
          , et al. (eds.) ESWC. pp.
          <fpage>323</fpage>
          -
          <lpage>337</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Fanizzi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Amato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esposito</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Learning with kernels in description logics</article-title>
          .
          <source>In: ILP 2008</source>
          . pp.
          <fpage>210</fpage>
          -
          <lpage>225</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>The Bayesian structural EM algorithm</article-title>
          .
          <source>In: UAI 1998</source>
          . pp.
          <fpage>129</fpage>
          -
          <lpage>138</lpage>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geiger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldszmidt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Provan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Bayesian network classifiers</article-title>
          .
          <source>In: Machine Learning</source>
          . pp.
          <fpage>131</fpage>
          -
          <lpage>163</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Getoor</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning)</article-title>
          . The MIT Press (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kiefer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al.:
          <article-title>Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods</article-title>
          .
          <source>In: ESWC 2008. LNCS</source>
          , vol.
          <volume>5021</volume>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>492</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
          </string-name>
          , N.:
          <article-title>Probabilistic Graphical Models: Principles and Techniques</article-title>
          . MIT Press (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Laskey</surname>
            ,
            <given-names>K.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laskey</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          :
          <article-title>Uncertainty reasoning for the world wide web: Report on the urw3-xg incubator group</article-title>
          . In: Bobillo,
          <string-name>
            <surname>F.</surname>
          </string-name>
          , et al. (eds.)
          <source>URSW. CEUR Workshop Proceedings</source>
          , vol.
          <volume>423</volume>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Concept learning in description logics using refinement operators</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>78</volume>
          ,
          <fpage>203</fpage>
          -
          <lpage>250</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Luna</surname>
            ,
            <given-names>J.E.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cozman</surname>
            ,
            <given-names>F.G.</given-names>
          </string-name>
          :
          <article-title>An algorithm for learning with probabilistic description logics</article-title>
          . In: Baader,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.L.</given-names>
            ,
            <surname>Nardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Patel-Schneider</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.F</surname>
          </string-name>
          . (eds.)
          <source>URSW. CEUR Workshop Proceedings</source>
          , vol.
          <volume>527</volume>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>74</lpage>
          . CEUR-WS.org (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Probabilistic reasoning in intelligent systems: networks of plausible inference</article-title>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Ramoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Robust learning with missing data</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>45</volume>
          ,
          <fpage>147</fpage>
          -
          <lpage>170</lpage>
          (
          <year>October 2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Markov logic networks</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>62</volume>
          ,
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          (
          <year>February 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Rodrigues De Morais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aussem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Exploiting data missingness in bayesian network modeling</article-title>
          .
          <source>In: IDA 2009</source>
          . pp.
          <fpage>35</fpage>
          -
          <lpage>46</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          :
          <article-title>Inference and missing data</article-title>
          .
          <source>Biometrika</source>
          <volume>63</volume>
          (
          <issue>3</issue>
          ),
          <fpage>581</fpage>
          -
          <lpage>592</lpage>
          (
          <year>1976</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Tresp</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Uncertainty reasoning for the semantic web i</article-title>
          .
          <source>chap. Towards Machine Learning on the Semantic Web</source>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>314</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Tresp</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bundschus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rettinger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Materializing and querying learned knowledge</article-title>
          .
          <source>In: IRMLeS</source>
          <year>2009</year>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Zaffalon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Maua´,
          <string-name>
            <surname>D.</surname>
          </string-name>
          :
          <article-title>Utility-based accuracy measures to empirically evaluate credal classifiers</article-title>
          .
          <source>In: ISIPTA 2011</source>
          . pp.
          <fpage>401</fpage>
          -
          <lpage>410</lpage>
          .
          <string-name>
            <surname>Innsbruck</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>