A Graph Regularization Based Approach to Transductive Class-Membership Prediction

Pasquale Minervini, Claudia d'Amato, and Nicola Fanizzi
LACAM Laboratory – Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro" – via E. Orabona, 4 - 70125 Bari - Italia
{ pasquale.minervini, claudia.damato, nicola.fanizzi }@uniba.it

Abstract. Considering the increasing availability of structured, machine-processable knowledge in the context of the Semantic Web, relying on purely deductive inference alone may be limiting. This work proposes a new method for similarity-based class-membership prediction in Description Logic knowledge bases. The underlying idea is to propagate class-membership information among similar individuals; the method is non-parametric in nature and characterised by interesting complexity properties, making it a potential candidate for large-scale transductive inference. We also evaluate its effectiveness with respect to other approaches based on inductive inference in the SW literature.

1 Introduction

Standard Semantic Web (SW) reasoning services rely on purely deductive inference. However, this may be limiting, e.g. due to the complexity of reasoning tasks and the availability and correctness of structured knowledge. Approximate deductive and inductive inference have been discussed as possible approaches to overcome such limitations [19]. Various proposals for extending inductive inference methods towards SW formalisms have appeared in the SW literature: inductive methods can perform a form of approximate and uncertain reasoning, and derive conclusions which are neither derivable from nor refutable by the knowledge base [19].

This work proposes a novel method for transductive inference over Description Logic representations. In the class-membership prediction task, the discriminative methods proposed so far ignore unlabelled problem instances (individuals for which the value of such class-membership is unknown); however, accounting for unlabelled instances during learning can provide more accurate results if certain conditions are met [6, 27]. Generative methods, on the other hand, try to model a joint probability distribution over both instances and labels, thus facing a possibly harder learning problem than merely predicting the most probable label for any given instance.

In section 2 we briefly survey related work and introduce a variant of the classic class-membership prediction problem. In section 3 we introduce the proposed method: the assumptions it relies on, and how it can be used for class-membership prediction on large and Web-scale ontological knowledge bases. In section 4 we provide empirical evidence for the effectiveness of the proposed method with respect to other methods in the SW literature.

2 Preliminaries

A variety of approaches have been proposed in the literature for class-membership prediction, either discriminative or generative [17]. Assuming instances are sampled i.i.d. from a distribution P ranging over a space X × Y (where X is the space of instances and Y a set of labels), generative prediction methods first build an estimate P̂ of the joint probability distribution P(X, Y), and then use it to infer P̂(Y | x) = P̂(Y, x)/P̂(x) for a given, unlabelled instance x ∈ X. Discriminative methods, on the other hand, simply aim at estimating whether P(y | x) ≥ 0.5 for any given (x, y) ∈ X × Y (thus facing a possibly easier problem than estimating a joint probability distribution over X × Y).
The following briefly surveys the class-membership prediction methods proposed so far.

2.1 Discriminative Methods

Some of the approaches proposed for solving the class-membership prediction problem are similarity-based. For instance, methods relying on the k-Nearest Neighbours (k-NN) algorithm are discussed in [7, 19]. A variety of (dis-)similarity measures between either individuals or concepts have been proposed: according to [5], they can be based on features (where objects are characterised by a set of features, as in [15]), on the semantic-network structure (where background information is provided in the form of a semantic network, as in [9, 16]), or on the information content (where both the semantic network structure and the population are considered, as in [8]). Kernel-based algorithms [21] have been proposed for various learning tasks from DL-based representations. This is made possible by the existence of a variety of kernel functions, either for concepts or for individuals (such as [10, 4, 12]). By (implicitly) projecting instances into a high-dimensional feature space, kernel functions make it possible to adapt a multitude of machine learning algorithms to structured representations. The SW literature includes methods for inducing robust classifiers [11] and for learning to rank [13] from DL knowledge bases using kernel methods.

2.2 Generative Methods

For learning from formal ontologies, a generative approach has been discussed in [20]. In this work, each individual is associated with a latent variable which influences its attributes and the relations it participates in. It proposes using Bayesian non-parametrics to avoid fixing the number of possible values for such latent variables (which can be seen as cluster indicators), together with an inference scheme based on Markov Chain Monte Carlo, where posterior sampling is constrained by a pre-defined set of DL axioms. A quite different approach is discussed in [18]: this work focuses on learning theories in a probabilistic extension of the ALC DL named CRALC, using DL refinement operators to efficiently explore the space of concepts. It is inspired by the literature on Bayesian Logic Programs.

2.3 Semi-Supervised and Transductive Learning

Classic discriminative learning methods ignore unlabelled instances. However, real-life scenarios are usually characterised by an abundance of unlabelled instances and only a few labelled ones [27]. This may also be the case for class-membership prediction from formal ontologies: class-membership relations may be difficult to obtain both during ontology engineering tasks (e.g. due to the limited availability of domain experts) and through inference (e.g. since deciding instance-membership may have an intractable time complexity in some languages).

Using unlabelled instances during learning is generally known in the machine learning community as Semi-Supervised Learning (SSL) [6, 27]. A variant of this setting, known as Transductive Learning [23], refers to finding a labelling only for the unlabelled instances provided in the training phase, without necessarily generalising to unseen instances (thus resulting in a possibly simpler learning problem). If the marginal distribution of instances P_X is informative with respect to the conditional probability distribution P(Y | x), accounting for unlabelled instances during learning can provide more accurate results [6, 27]. A possible approach is to include terms that depend on P_X in the objective function.
This results in the two fundamental assumptions [6]:

– Cluster assumption – The joint probability distribution P(X, Y) is structured in such a way that points in the same cluster are likely to have the same label.
– Manifold assumption – Assume that P_X is supported on a low-dimensional manifold: then, P(Y | x) varies smoothly, as a function of x, with respect to the underlying structure of the manifold.

In the following sections, we discuss a similarity-based, non-parametric and computationally efficient method for predicting missing class-membership relations. This method is discriminative in nature, but also accounts for unknown class-memberships during learning. We address a slightly different version of the classic class-membership prediction problem, namely transductive class-membership prediction. It is inspired by the Main Principle in [23]: "If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem". In this setting, the learning algorithm only aims at estimating the class-membership relation of interest for a given training set of individuals, without necessarily being able to generalise to individuals outside this set.

In this work, we formalise transductive class-membership prediction as a cost minimisation problem: given a set of training individuals Ind_C(K) whose class-membership relation to a target concept C is either known or unknown, find a function f* : Ind_C(K) → {+1, −1} defined over the training individuals and returning +1 (resp. −1) if the individual is likely to be a member of C (resp. ¬C), minimising a given cost function. More formally:

Definition 1 (Transductive Class-Membership Prediction). The Transductive Class-Membership Prediction problem can be formalised as follows:

– Given:
  • a target concept C;
  • a set of training individuals Ind_C(K) in a knowledge base K, partitioned into positive, negative and neutral examples or, more formally, such that:
    Ind⁺_C(K) = {a ∈ Ind_C(K) | K ⊨ C(a)} (positive examples),
    Ind⁻_C(K) = {a ∈ Ind_C(K) | K ⊨ ¬C(a)} (negative examples),
    Ind⁰_C(K) = {a ∈ Ind_C(K) | K ⊭ C(a) ∧ K ⊭ ¬C(a)} (neutral examples);
  • a cost function cost(·) : F → R, specifying the cost associated with the set of class-membership relations assigned to the training individuals by f ∈ F, where F is a space of labelling functions of the form f : Ind_C(K) → {+1, −1};
– Find a labelling function f* ∈ F minimising the given cost function over the training individuals Ind_C(K):

    f* ← arg min_{f ∈ F} cost(f).

The function f* can then be used to estimate the class-membership relation with respect to the target concept C for all training individuals a ∈ Ind_C(K): it returns +1 (resp. −1) if an individual is likely to be a member of C (resp. ¬C). Note that the function is defined on the whole set of training individuals; therefore it can possibly contradict already known class-membership relations (thus being able to handle noisy knowledge). If Ind_C(K) is finite, the space of labelling functions F is also finite, and each function f ∈ F can be equivalently expressed as a vector in {−1, +1}ⁿ, where n = |Ind_C(K)|.
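To make this setting concrete, the following is a minimal Python sketch (not part of the original paper; the individual names are hypothetical) of how a problem instance can be encoded as a partially labelled vector, following the convention of listing labelled individuals first:

```python
import numpy as np

# Hypothetical training individuals, partitioned by the class-membership
# relations entailed by the knowledge base K for a target concept C.
pos = ["a1", "a2"]        # K |= C(a)       -> positive examples
neg = ["a3"]              # K |= not C(a)   -> negative examples
neutral = ["a4", "a5"]    # neither entailed nor refuted -> neutral examples

individuals = pos + neg + neutral   # labelled instances first, then unlabelled
l, u = len(pos) + len(neg), len(neutral)
n = l + u                           # n = |Ind_C(K)|

# Partially labelled vector: +1 / -1 for labelled individuals, 0 as a
# placeholder for "unknown" on the neutral ones.
y = np.array([+1.0] * len(pos) + [-1.0] * len(neg) + [0.0] * len(neutral))

# A labelling function f in F is then simply a vector in {-1, +1}^n (later
# relaxed to [-1, +1]^n); learning amounts to picking the one minimising
# a given cost function.
```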
3 Propagating Class-Membership Information Among Individuals

This section discusses a graph-based semi-supervised [27] method for class-membership prediction from DL representations. The proposed method relies on a weighted semantic similarity graph, where nodes represent positive, negative and neutral examples of the transductive class-membership prediction problem, and weighted edges define similarity relations among such individuals.

More formally, let K be a knowledge base, Ind_C(K) a set of training individuals with respect to a target concept C in K, and Y = {−1, +1} a space of labels, each corresponding to a type of class-membership relation with respect to C. Each training individual a ∈ Ind_C(K) is associated with a label, which is +1 (resp. −1) if K ⊨ C(a) (resp. K ⊨ ¬C(a)), and unknown otherwise, thus representing an unlabelled instance.

For defining a cost over functions f ∈ F, the proposed method relies on graph regularisation: the learning process aims at finding a labelling function that is both consistent with the given labels and changes smoothly between similar instances (where similarity relations are encoded in the semantic similarity graph). This can be formalised through a regularisation framework, using a measure of consistency with the given labels as a loss function, and a measure of smoothness over the similarity graph as a regulariser. Several cost functions have been proposed in the SSL literature. An appealing class of functions, from a computational standpoint, relies on the quadratic cost criterion framework [6, ch. 11]: for this class of functions, a closed-form solution to the cost minimisation problem can be found efficiently (subsection 3.2).

3.1 Semantic Similarity Graph

A similarity graph can be represented by a weight matrix W, where the value of W_ij represents the strength of the similarity relation between two training examples x_i and x_j. In the graph-based SSL literature, W is often obtained either as a Nearest Neighbour (NN) graph (where each instance is connected to the k most similar instances in the graph, or to those within a radius ε), or using a kernel function, such as the Gaussian kernel. Finding the best way to construct W is an active area of research: for example, in [6, ch. 20] the authors discuss a method to combine multiple similarity measures in the context of protein function prediction, while [1] proposes a method for data-driven similarity graph construction.

When empirically evaluating the proposed method, we employ the family of dissimilarity measures between individuals in a DL knowledge base defined in [19], since it is not constrained to any particular family of DLs; we refer to the resulting similarity graph among individuals in a formal ontology as the semantic similarity graph. Given a set of concept descriptions F = {F₁, ..., F_n} and a weight vector w, such a family of dissimilarity measures d^F_p : Ind(A) × Ind(A) → [0, 1] is defined in terms of the projections:

    δᵢ(x, y) = 0    if (K ⊨ Fᵢ(x) ∧ K ⊨ Fᵢ(y)) ∨ (K ⊨ ¬Fᵢ(x) ∧ K ⊨ ¬Fᵢ(y)),
    δᵢ(x, y) = 1    if (K ⊨ Fᵢ(x) ∧ K ⊨ ¬Fᵢ(y)) ∨ (K ⊨ ¬Fᵢ(x) ∧ K ⊨ Fᵢ(y)),    (1)
    δᵢ(x, y) = uᵢ   otherwise,

where x, y ∈ Ind(A) and p > 0.
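As an illustration of eq. 1 and of the k-NN graph construction, consider the following Python sketch. The per-concept entailment results are assumed to be precomputed by a reasoner as values in {+1, −1, 0}; the aggregation of the δᵢ values into a single dissimilarity is shown here as a weighted p-norm, which is an assumption standing in for the exact measure of [19], as are the binary edge weights and the choice uᵢ = 0.5.

```python
import numpy as np

def delta(mx_i, my_i, u_i=0.5):
    """delta_i of eq. 1: 0 on agreement, 1 on disagreement, u_i otherwise.
    mx_i, my_i in {+1, -1, 0}: membership of x (resp. y) in the feature
    concept F_i as decided by a reasoner (0 means 'unknown')."""
    if mx_i != 0 and mx_i == my_i:
        return 0.0
    if mx_i != 0 and my_i != 0 and mx_i != my_i:
        return 1.0
    return u_i

def dissimilarity(mx, my, w, p=2):
    """Aggregate the per-concept deltas; a weighted p-norm over a
    normalised weight vector w is assumed here (see lead-in)."""
    d = np.array([delta(a, b) for a, b in zip(mx, my)])
    return float(np.sum(w * d ** p) ** (1.0 / p))

def knn_semantic_similarity_graph(memberships, w, k=3, p=2):
    """Symmetric k-NN graph over individuals, with binary edge weights."""
    n = len(memberships)
    D = np.array([[dissimilarity(memberships[i], memberships[j], w, p)
                   for j in range(n)] for i in range(n)])
    W = np.zeros((n, n))
    for i in range(n):
        neighbours = [j for j in np.argsort(D[i]) if j != i][:k]
        W[i, neighbours] = 1.0
    return np.maximum(W, W.T)  # symmetrise the adjacency
```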
Two examples of (k-NN) semantic similarity graphs over all individuals in the BioPAX (Proteomics) and Leo ontologies, obtained using the aforementioned dissimilarity measure, are provided in Fig. 1.

[Fig. 1: k-Nearest Neighbour semantic similarity graphs for the individuals in the BioPAX (Proteomics) ontology ((a) 3-NN graph, left) and in the Leo ontology ((b) 3-NN graph, right), obtained using the dissimilarity measure in [19]: F was defined as the set of atomic concepts in the ontology (each weighted with its normalised entropy [19]) and p = 2.]

3.2 Quadratic Cost Criteria

In quadratic cost criteria [6, ch. 11], the original label space {−1, +1} (binary classification case) is relaxed to [−1, +1]. This makes it possible to express the confidence associated with a labelling (and may give an indication about P(Y | x)). For this reason, in the proposed method the space of labelling functions F is relaxed to functions of the form f : Ind_C(K) → [−1, +1]. As in subsection 2.3, labelling functions can be equivalently represented as vectors ŷ ∈ [−1, +1]ⁿ.

Let ŷ ∈ [−1, +1]ⁿ be a possible labelling for n instances. We can see ŷ as an (l + u) = n dimensional vector, where the first l indices refer to already labelled instances and the last u to unlabelled instances: ŷ = [ŷ_l, ŷ_u]. The consistency of ŷ with respect to the original labels can be formulated as a quadratic cost:

    Σᵢ₌₁ˡ (ŷᵢ − yᵢ)² = ‖ŷ_l − y_l‖².
Similarly, labellings can be regularised with respect to the graph structure: as in [2], consistency with respect to the geometry of instances can be estimated as

    (1/2) Σᵢ,ⱼ₌₁ⁿ Wᵢⱼ (ŷᵢ − ŷⱼ)² = ŷᵀ L ŷ,

where W is the semantic similarity graph and L = D − W is the unnormalised graph Laplacian, D being the diagonal matrix with Dᵢᵢ = Σⱼ Wᵢⱼ and 0 elsewhere. A different criterion, discussed in [24, 25], measures it as (D^(−1/2) ŷ)ᵀ L (D^(−1/2) ŷ). Another regularisation term of the form ‖ŷ‖² (or ‖ŷ_u‖², as in [24]) can be added to the final cost function to prefer smaller values in ŷ. This is useful e.g. to prevent arbitrary labellings in a connected component of the semantic similarity graph containing no labelled instances.

Putting the pieces together, we obtain two quadratic cost criteria discussed in the literature, namely Regularization on Graph [2] (RG) and the Consistency Method [24] (CM):

    RG: cost(ŷ) = ‖ŷ_l − y_l‖² + µ ŷᵀ L ŷ + µε ‖ŷ‖²;
    CM: cost(ŷ) = ‖ŷ_l − y_l‖² + µ (D^(−1/2) ŷ)ᵀ L (D^(−1/2) ŷ) + ‖ŷ_u‖².

As an example, we now derive a closed-form solution for the problem of finding a (global) minimum of the quadratic cost criterion in RG. Its first-order derivative is:

    (1/2) ∂cost(ŷ)/∂ŷ = (S + µL + µεI) ŷ − S y,

where S = diag(s₁, ..., s_n), with sᵢ = 1 iff i ≤ l and 0 otherwise. Its second-order derivative is a positive definite matrix if ε > 0, since L is positive semi-definite. Therefore, setting the first-order derivative to 0 leads to a global minimum:

    ŷ = (S + µL + µεI)⁻¹ S y,

showing that ŷ can be obtained either by matrix inversion or by solving a (possibly sparse) linear system.

This work leverages quadratic cost criteria to efficiently solve the transductive class-membership prediction problem. Finding a minimiser ŷ of a predefined cost criterion is equivalent to finding a labelling function f* of the form f* : Ind_C(K) → [−1, +1], where the label returned for a generic training individual a ∈ Ind_C(K) corresponds to the value of ŷ at the position mapped to a. This can be done by representing the set of training individuals Ind_C(K) as a partially labelled vector y of length |Ind_C(K)| = n, such that the first l (resp. last u) components correspond to positive and negative (resp. neutral) examples in Ind_C(K). Such a y can then be used to measure consistency with the original labels in a quadratic cost criterion, while the semantic similarity graph can be employed to enforce smoothness of class-membership predictions among similar training individuals.

An advantage of quadratic cost criteria is that their minimisation ultimately reduces to solving a large sparse linear system [24, 6], a well-known problem in the literature whose time complexity is nearly linear in the number of non-zero entries in the coefficient matrix [22]. For large-scale datasets, a subset selection method is described in [6, ch. 18], which makes it possible to greatly reduce the size of the original linear system.
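As an illustration, the closed-form RG solution above can be computed by solving the sparse linear system (S + µL + µεI)ŷ = Sy rather than by matrix inversion. The following is a minimal NumPy/SciPy sketch (not the authors' implementation); the boolean mask generalises the convention that the first l entries of y are labelled:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def regularization_on_graph(W, y, labelled, mu=1.0, eps=1e-2):
    """Global minimiser of the RG criterion, obtained by solving
    (S + mu*L + mu*eps*I) y_hat = S y.

    W        : (n, n) symmetric weight matrix (semantic similarity graph)
    y        : (n,) label vector; entries outside `labelled` are ignored
    labelled : (n,) boolean mask playing the role of S = diag(s_1, ..., s_n)
    Returns the continuous labelling y_hat.
    """
    W = sp.csr_matrix(W)
    n = W.shape[0]
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    L = D - W                                   # unnormalised graph Laplacian
    S = sp.diags(np.asarray(labelled, dtype=float))
    A = S + mu * L + mu * eps * sp.identity(n)
    return spsolve(sp.csc_matrix(A), S @ y)     # sparse solve, no inversion
```

Sparsity of W, as obtained from the k-NN construction, keeps the coefficient matrix sparse, which is what makes the nearly-linear-time solvers mentioned above applicable.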
4 Preliminary Empirical Evaluations

In this section, we evaluate several (inductive and transductive) methods for class-membership prediction, with the aim of comparing the methods discussed in section 3 with other methods in the SW literature. We report evaluations for Regularization on Graph [2] (RG) and the Consistency Method [24] (CM); Label Propagation [26] (LP); three kinds of Support Vector Machines [21] (SVM), namely Hard-Margin SVM (HM-SVM), Soft-Margin SVM with L1 norm (SM-SVM) and Laplacian SVM [3] (LapSVM); and √l-Nearest Neighbours (√l-NN) for class-membership prediction [19].

4.1 Description of Evaluated Methods

LP is a graph-based SSL algorithm relying on the idea of propagating labelling information among similar instances through an iterative process involving matrix operations. It can be equivalently formulated within the quadratic criterion framework [6, ch. 11]. More formally, it associates, with each unlabelled instance in the graph, the probability of performing a random walk until a positively (resp. negatively) labelled example is found.

We also evaluated Support Vector Machines (SVM), which have been proposed for inducing robust classifiers from ontological knowledge bases [12, 19]. SVM classifiers come in different flavours: the classic HM-SVM binary classifier aims at finding the hyperplane in the feature space separating the instances belonging to different classes which maximises the geometric margin between the hyperplane and the nearest training points. The SM-SVM classifier is a relaxation of HM-SVM which allows for some misclassification of training instances (by relaxing the requirement that training instances be perfectly linearly separable in the feature space). LapSVM is a semi-supervised extension of the SM-SVM classifier: given a set of labelled instances and a set of unlabelled instances, it aims at finding a hyperplane that is also smooth with respect to the (estimated) geometry of instances. More formally, let (x_l, y_l) (resp. x_u) be a set of labelled (resp. unlabelled) instances. LapSVM finds a function f in a space of functions H_K determined by the kernel K (called a Reproducing Kernel Hilbert Space [21]) minimising

    (1/l) Σᵢ₌₁ˡ V(xᵢ, yᵢ, f) + γ_L ‖f‖²_{H_K} + γ_M ‖f‖²_M,

where V represents a cost function for the errors committed by f on labelled samples (typically the hinge loss max{0, 1 − yᵢ f(xᵢ)}), ‖·‖_{H_K} imposes smoothness conditions on possible solutions [21], and ‖·‖²_M, intuitively, penalises rapid changes in the classification function between close instances in the similarity graph. It generalises HM-SVM (γ_L → 0, γ_M = 0) and SM-SVM (γ_M = 0). Our implementation of LapSVM follows the algorithm proposed in [3]; for HM-SVM, SM-SVM and LapSVM, we solve the underlying convex optimisation problems using the Gurobi optimiser [14].

    Ontology              Expressivity   #Axioms  #Individuals  #Classes  #ObjectProperties
    BioPAX (Proteomics)   ALCHN(D)       773      49            55        47
    Family-Tree           SROIF(D)       2059     368           22        52
    Leo                   ALCHIF(D)      430      61            32        26
    MDM0.73               ALCHOF(D)      1098     112           196       22
    Wine                  SHOIN(D)       1046     218           142       21

    Table 1: Ontologies considered in the experiments.

RG, CM, LP and LapSVM all rely on a semantic similarity graph W as a representation of the geometry of instances. We first calculate distances employing the dissimilarity measure defined in [19] and outlined in eq. 1, with p = 2; we then obtain W by building a k-Nearest Neighbour graph from such distances (since sparsity in W influences the scalability of quadratic cost criteria, as noted in subsection 3.2). When building the neighbourhood of a node, we handled the cases in which nodes had the same distance by introducing a random ordering between such nodes. The kernel function used for Hard-Margin SVM, Soft-Margin SVM and Laplacian SVM is also defined in [19], and is directly correlated with the aforementioned dissimilarity measure in eq. 1 (given a committee of concepts F and the parameters w and p, the dissimilarity was originally obtained as 1 − k(a, b), where k(a, b) is the value of the kernel function on a pair of individuals (a, b) in the knowledge base).
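For completeness, the relation between kernel and dissimilarity stated above can be inverted to recover a kernel matrix from precomputed dissimilarities. A minimal sketch (an illustration, not the construction in [19] itself), with a basic validity check:

```python
import numpy as np

def kernel_from_dissimilarity(D):
    """Recover k(a, b) = 1 - d(a, b) from a dissimilarity matrix D in [0, 1],
    inverting the relation stated in the text."""
    K = 1.0 - np.asarray(D, dtype=float)
    assert np.allclose(K, K.T), "kernel matrix must be symmetric"
    # Diagnostic only: a valid kernel matrix must be positive semi-definite.
    if np.linalg.eigvalsh(K).min() < -1e-8:
        raise ValueError("resulting matrix is not positive semi-definite")
    return K
```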
We also provide a first evaluation of the k-NN algorithm (with k = √l, where l is the number of labelled instances, as discussed in [19]): to label each unlabelled instance, we simply choose the majority class among its √l most similar individuals.

4.2 Evaluations

Starting from a set of real ontologies¹ (outlined in Table 1), we generated a set of 20 random query concepts for each ontology², such that the number of individuals belonging to the target query concept C (resp. ¬C) was at least 10, and the numbers of individuals in C and ¬C were of the same order of magnitude. A DL reasoner³ was employed to decide the theoretical concept-membership of individuals to query concepts. We employ the evaluation metrics in [7], which take into account the peculiarities deriving from the presence of missing knowledge:

¹ From the TONES Repository: http://owl.cs.manchester.ac.uk/repository/
² Using the methods available at http://lacam.di.uniba.it/˜nico/research/ontologymining.html
³ Pellet v2.3.0 – http://clarkparsia.com/pellet/

    Leo       Match           Omission        Commission      Induction
    RG        1 ± 0           0 ± 0           0 ± 0           0 ± 0
    CM        1 ± 0           0 ± 0           0 ± 0           0 ± 0
    LP        0.942 ± 0.099   0.007 ± 0.047   0.052 ± 0.091   0 ± 0
    SM-SVM    0.963 ± 0.1     0 ± 0           0.037 ± 0.1     0 ± 0
    LapSVM    0.978 ± 0.068   0 ± 0           0.022 ± 0.068   0 ± 0
    √l-NN     0.971 ± 0.063   0 ± 0           0.029 ± 0.063   0 ± 0

    BioPAX (Proteomics)
              Match           Omission        Commission      Induction
    RG        0.986 ± 0.051   0.004 ± 0.028   0.008 ± 0.039   0.002 ± 0.02
    CM        0.986 ± 0.051   0.002 ± 0.02    0.01 ± 0.044    0.002 ± 0.02
    LP        0.982 ± 0.058   0.002 ± 0.02    0.014 ± 0.051   0.002 ± 0.02
    SM-SVM    0.972 ± 0.075   0 ± 0           0.026 ± 0.068   0.002 ± 0.02
    LapSVM    0.972 ± 0.075   0 ± 0           0.026 ± 0.068   0.002 ± 0.02
    √l-NN     0.972 ± 0.075   0 ± 0           0.026 ± 0.068   0.002 ± 0.02

    MDM0.73   Match           Omission        Commission      Induction
    RG        0.953 ± 0.063   0.003 ± 0.016   0.011 ± 0.032   0.015 ± 0.039
    CM        0.953 ± 0.063   0.001 ± 0.009   0.013 ± 0.036   0.018 ± 0.04
    LP        0.942 ± 0.065   0 ± 0           0.026 ± 0.046   0.033 ± 0.054
    SM-SVM    0.793 ± 0.252   0 ± 0           0.174 ± 0.255   0.033 ± 0.054
    LapSVM    0.915 ± 0.086   0 ± 0           0.052 ± 0.065   0.033 ± 0.054
    √l-NN     0.944 ± 0.069   0 ± 0           0.023 ± 0.051   0.033 ± 0.054

    Wine      Match           Omission        Commission      Induction
    RG        0.24 ± 0.03     0 ± 0.005       0.007 ± 0.017   0.5 ± 0.176
    CM        0.242 ± 0.028   0 ± 0.005       0.005 ± 0.015   0.326 ± 0.121
    LP        0.239 ± 0.035   0 ± 0.005       0.008 ± 0.021   0.656 ± 0.142
    SM-SVM    0.235 ± 0.036   0 ± 0           0.012 ± 0.024   0.753 ± 0.024
    LapSVM    0.238 ± 0.033   0 ± 0           0.009 ± 0.021   0.753 ± 0.024
    √l-NN     0.241 ± 0.031   0 ± 0           0.006 ± 0.018   0.753 ± 0.024

    Table 2: Match, Omission, Commission and Induction [19] results for a k-Fold Cross Validation (k = 10) on 20 randomly generated queries. For each experiment, the best parameters were found within the training set using a k-Fold Cross Validation (k = 10).

Match – Case of an individual that received the same label from the reasoner and from the inductive classifier.
Omission Error – Case of an individual for which the inductive method could not determine whether it was relevant to the query concept or not, while it was found relevant by the reasoner.
Commission Error – Case of an individual found to be relevant to the query concept while it logically belongs to its negation, or vice versa.
Induction – Case of an individual found to be relevant to the query concept or to its negation, while neither case is logically derivable from the knowledge base.

Before evaluating on the test set, parameter tuning was performed for each of the methods via a k-Fold Cross Validation (k = 10) within the training set, in order to find the parameters with the lowest classification error in cross-validation.
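A minimal sketch of this inner model-selection loop for RG follows, assuming the regularization_on_graph solver sketched in subsection 3.2 is in scope; the fold splitting and error estimate are illustrative simplifications, not the exact experimental protocol:

```python
import numpy as np

def tune_rg(W, y, labelled, grid_mu, grid_eps, n_folds=10, seed=0):
    """Select (mu, eps) for RG by k-fold cross-validation over the labelled
    instances: each fold is hidden in turn by clearing its entries in the
    labelled mask, and the fraction of sign disagreements on the held-out
    fold is used as the classification error."""
    labelled_idx = np.flatnonzero(labelled)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(labelled_idx), n_folds)
    best_params, best_err = None, np.inf
    for mu in grid_mu:
        for eps in grid_eps:
            fold_errs = []
            for held_out in folds:
                mask = np.array(labelled, dtype=bool)
                mask[held_out] = False  # hide the held-out labels
                y_hat = regularization_on_graph(W, y, mask, mu, eps)
                fold_errs.append(
                    np.mean(np.sign(y_hat[held_out]) != y[held_out]))
            err = float(np.mean(fold_errs))
            if err < best_err:
                best_params, best_err = (mu, eps), err
    return best_params

# Example grids, following the ranges reported below:
# grid_mu = grid_eps = [10.0 ** e for e in range(-4, 5)]
```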
[Fig. 2: Variation of average Match rates with respect to the number of folds used in the training step, during a k-Fold Cross Validation (k = 10), for RG, CM, LP, HM-SVM, SM-SVM, LapSVM and NN: (a) Leo ontology, (b) BioPAX (Proteomics) ontology.]

For LapSVM, the (γ_L, γ_M) parameters were varied in {10⁻⁴, 10⁻³, ..., 10⁴}, while for SM-SVM, which follows the implementation in [21, pg. 223], the C parameter was allowed to vary in {10⁻⁴, 10⁻³, ..., 10⁴}. Similarly, the (µ, ε) parameters in RG and CM were varied in {10⁻⁴, 10⁻³, ..., 10⁴}. The parameter k for building the k-NN semantic similarity graph, used by LapSVM, RG, CM and LP, was varied in {2, 4, 8, 16}. We did not carefully choose the concept committee F defining the dissimilarity measure: we simply used the set of atomic concepts in the ontology, thus ignoring any prior knowledge about the structure of the target concept C or the presence of statistical correlations in the knowledge base. Each concept in the committee F was weighted with its normalised entropy [19].

RG, CM and LP give an indication of the uncertainty associated with a specific labelling by assigning values in the interval [−1, +1] to each node; when such values were ≈ 0 (specifically, when the label fell in the interval [−10⁻⁴, 10⁻⁴]), we decided to leave the node unlabelled, so as to provide more robust label estimates (and thus possibly lower commission error and match rates, and higher omission error rates); a sketch of this decision rule is given at the end of this section. This may happen e.g. when there are no labelled examples within a connected component of the semantic similarity graph.

In Tab. 2 we report average index rates and standard deviations for each of the ontologies in Tab. 1; the only exception is the Family-Tree ontology, which yielded match rates of 0.76 ± 0.13 and induction rates of 0.24 ± 0.13 for all methods (with the exception of LP, for which the induction rates were 0.21 ± 0.14). In general, LapSVM outperformed the other two non-SSL SVM classification methods. This happened with varying quantities of unlabelled data, as shown for example by the behaviour of match rates in subfigure 2a, which reports results obtained in a k-Fold Cross Validation using a varying quantity of labelled instances. However, standard SVM training is O(m³) in general, where m is the number of training instances; therefore, some extra effort may be necessary to make SVM methods scale to SW knowledge bases. These results provide some empirical evidence that inductive methods for formal ontologies may benefit from also accounting for unlabelled instances during learning.
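The abstention rule mentioned above (leaving a node unlabelled when its continuous score falls within [−10⁻⁴, 10⁻⁴]) amounts to the following; a minimal sketch:

```python
import numpy as np

def decide(y_hat, threshold=1e-4):
    """Map continuous scores in [-1, +1] to {+1, -1, 0}: scores within
    [-threshold, +threshold] abstain (0, i.e. the node is left unlabelled),
    contributing to omission rather than commission errors."""
    labels = np.sign(y_hat).astype(int)
    labels[np.abs(y_hat) <= threshold] = 0
    return labels
```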
5 Conclusion and Future Works

This work proposes a method for transductive class-membership prediction from DL representations, based on graph regularisation. It leverages neutral examples by propagating class-membership information among similar individuals in the training set. The proposed method relies on quadratic cost criteria, whose optimisation can be reduced to solving a (possibly sparse) linear system; this is a well-known problem in the literature, with a time complexity that is nearly linear in the number of non-zero entries in the coefficient matrix.

We did not carefully analyse the impact of different choices of (dis-)similarity measure for building the semantic similarity graph. However, the similarity graph has a strong influence on the effectiveness of the methods used [27]. The construction of the similarity graph for class-membership learning tasks can be influenced by factors such as the structure of the target concept C, or by statistical correlations found within the knowledge base. Also, it is not clear whether the continuous labels assigned by the proposed methods correspond to posterior probability estimates from a statistical point of view. In future work, we aim at investigating these two aspects of graph-based transductive and semi-supervised class-membership prediction from DL representations.

References

[1] Alexandrescu, A., Kirchhoff, K.: Data-driven graph construction for semi-supervised graph-based learning in NLP. In: Sidner, C., et al. (eds.) HLT-NAACL, pp. 204–211. The Association for Computational Linguistics (2007)
[2] Belkin, M., Matveeva, I., Niyogi, P.: Regularization and semi-supervised learning on large graphs. In: Shawe-Taylor, J., et al. (eds.) COLT. LNCS, vol. 3120, pp. 624–638. Springer (2004)
[3] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, 2399–2434 (2006)
[4] Bloehdorn, S., Sure, Y.: Kernel methods for mining instance data in ontologies. In: Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, pp. 58–71. ISWC'07/ASWC'07, Springer (2007)
[5] Borgida, A., Walsh, T., Hirsh, H.: Towards measuring similarity in description logics. In: Horrocks, I., et al. (eds.) Description Logics. CEUR Workshop Proceedings, vol. 147. CEUR-WS.org (2005)
[6] Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge, MA (2006)
[7] d'Amato, C., Fanizzi, N., Esposito, F.: Query answering and ontology population: an inductive approach. In: Hauswirth, M., et al. (eds.) Proceedings of the 5th European Semantic Web Conference (ESWC'08). Springer, Tenerife, Spain (2008)
[8] d'Amato, C., Fanizzi, N., Esposito, F.: A semantic similarity measure for expressive description logics. CoRR abs/0911.5043 (2009)
[9] d'Amato, C., Staab, S., Fanizzi, N.: On the influence of description logics ontologies on conceptual similarity. In: Proceedings of the 16th International Conference on Knowledge Engineering: Practice and Patterns, pp. 48–63. EKAW '08, Springer, Berlin, Heidelberg (2008)
[10] Fanizzi, N., d'Amato, C.: Inductive concept retrieval and query answering with semantic knowledge bases through kernel methods. In: Proceedings of the 11th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Part I, pp. 148–155. KES'07, Springer (2007)
[11] Fanizzi, N., d'Amato, C., Esposito, F.: ReduCE: A reduced Coulomb energy network method for approximate classification. In: Proceedings of the 6th European Semantic Web Conference (ESWC'09), pp. 323–337. Springer (2009)
[12] Fanizzi, N., d'Amato, C., Esposito, F.: Statistical learning for inductive query answering on OWL ontologies. In: Proceedings of the 7th International Conference on The Semantic Web, pp. 195–212. ISWC '08, Springer (2008)
[13] Fanizzi, N., d'Amato, C., Esposito, F.: Towards learning to rank in description logics. In: Coelho, H., et al. (eds.) ECAI.
Frontiers in Artificial Intelligence and Applications, vol. 215, pp. 985–986. IOS Press (2010)
[14] Gurobi Optimization, Inc.: Gurobi optimizer reference manual (2012)
[15] Hu, B., Dasmahapatra, S., Lewis, P.: Semantic metrics. Int. J. Metadata Semant. Ontologies 2(4), 242–258 (Jul 2007)
[16] Janowicz, K., Wilkes, M.: SIM-DLA: A novel semantic similarity measure for description logics reducing inter-concept to inter-instance similarity. In: Aroyo, L., et al. (eds.) ESWC. LNCS, vol. 5554, pp. 353–367. Springer (2009)
[17] Lasserre, J., Bishop, C.M.: Generative or discriminative? Getting the best of both worlds. Bayesian Statistics 8, 3–24 (2007)
[18] Ochoa-Luna, J.E., Cozman, F.G.: An algorithm for learning with probabilistic description logics. In: Bobillo, F., et al. (eds.) URSW, pp. 63–74 (2009)
[19] Rettinger, A., Lösch, U., Tresp, V., d'Amato, C., Fanizzi, N.: Mining the Semantic Web – statistical learning for next generation knowledge bases. Data Mining and Knowledge Discovery – Special Issue on Web Mining (2012)
[20] Rettinger, A., Nickles, M., Tresp, V.: Statistical relational learning with formal ontologies. In: Buntine, W.L., et al. (eds.) ECML/PKDD (2). LNCS, vol. 5782, pp. 286–301. Springer (2009)
[21] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA (2004)
[22] Spielman, D.A., Teng, S.H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 81–90. STOC '04, ACM, New York, NY, USA (2004)
[23] Vapnik, V.N.: Statistical Learning Theory. Wiley, 1st edn. (Sep 1998)
[24] Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Advances in Neural Information Processing Systems 16, pp. 321–328. MIT Press (2004)
[25] Zhou, D., Huang, J., Schölkopf, B.: Learning from labeled and unlabeled data on a directed graph. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 1036–1043. ICML '05, ACM (2005)
[26] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Tech. rep., CMU CALD tech report CMU-CALD-02 (2002)
[27] Zhu, X.: Semi-supervised learning literature survey. Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison (2005)