A Graph Regularization Based Approach to Transductive Class-Membership Prediction

Pasquale Minervini, Claudia d'Amato, and Nicola Fanizzi
LACAM Laboratory – Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro" – via E. Orabona, 4 - 70125 Bari - Italia
{ pasquale.minervini, claudia.damato, nicola.fanizzi }@uniba.it

Abstract. Considering the increasing availability of structured, machine-processable knowledge in the context of the Semantic Web, relying on purely deductive inference alone may be limiting. This work proposes a new method for similarity-based class-membership prediction in Description Logic knowledge bases. The underlying idea is to propagate class-membership information among similar individuals; the method is non-parametric in nature and characterised by interesting complexity properties, making it a potential candidate for large-scale transductive inference. We also evaluate its effectiveness with respect to other approaches based on inductive inference in the SW literature.

1 Introduction

Standard Semantic Web (SW) reasoning services rely on purely deductive inference. However, this may be limiting, e.g. due to the complexity of reasoning tasks and the availability and correctness of structured knowledge. Approximate deductive and inductive inference have been discussed as possible approaches to overcome such limitations [19]. Various proposals for extending inductive inference methods towards SW formalisms have appeared in the SW literature: inductive methods can perform a form of approximate and uncertain reasoning, and derive conclusions which are neither derivable from nor refutable by the knowledge base [19].

This work proposes a novel method for transductive inference over Description Logic representations. In the class-membership prediction task, the discriminative methods proposed so far ignore unlabelled problem instances (individuals for which the value of such class-membership is unknown); however, accounting for unlabelled instances during learning can provide more accurate results if certain conditions are met [6, 27]. Generative methods, on the other hand, try to model a joint probability distribution over both instances and labels, thus facing a possibly harder learning problem than merely predicting the most probable label for any given instance.

In section 2 we briefly survey related work and introduce a variant of the classic class-membership prediction problem. In section 3 we introduce the proposed method: the assumptions it relies on, and how it can be used for class-membership prediction on large and Web-scale ontological knowledge bases. In section 4 we provide empirical evidence for the effectiveness of the proposed method with respect to other methods in the SW literature.

2 Preliminaries

A variety of approaches have been proposed in the literature for class-membership prediction, either discriminative or generative [17]. Assuming instances are sampled i.i.d. from a distribution P ranging over a space X × Y (where X is the space of instances and Y a set of labels), generative prediction methods first build an estimate P̂ of the joint probability distribution P(X, Y), and then use it to infer P̂(Y | x) = P̂(Y, x)/P̂(x) for a given, unlabelled instance x ∈ X. Discriminative methods, on the other hand, simply aim at estimating whether P(y | x) ≥ 0.5 for any given (x, y) ∈ X × Y (thus facing a possibly easier problem than estimating a joint probability distribution over X × Y).
The following briefly surveys the class-membership prediction methods proposed so far.

2.1 Discriminative Methods

Some of the approaches proposed for solving the class-membership prediction problem are similarity-based. For instance, methods relying on the k-Nearest Neighbours (k-NN) algorithm are discussed in [7, 19]. A variety of (dis-)similarity measures between either individuals or concepts have been proposed: according to [5], they can be based on features (where objects are characterised by a set of features, as in [15]), on the semantic-network structure (where background information is provided in the form of a semantic network, as in [9, 16]), or on the information content (where both the semantic network structure and the population are considered, as in [8]). Kernel-based algorithms [21] have been proposed for various learning tasks from DL-based representations. This is made possible by the existence of a variety of kernel functions, either for concepts or for individuals (such as [10, 4, 12]). By (implicitly) projecting instances into a high-dimensional feature space, kernel functions make it possible to adapt a multitude of machine learning algorithms to structured representations. The SW literature includes methods for inducing robust classifiers [11] and for learning to rank [13] from DL knowledge bases using kernel methods.

2.2 Generative Methods

For learning from formal ontologies, a generative approach has been discussed in [20]. In this work, each individual is associated with a latent variable which influences its attributes and the relations it participates in. It proposes using Bayesian non-parametrics to avoid fixing the number of possible values for such latent variables (which can be seen as cluster indicators), together with an inference scheme based on Markov Chain Monte Carlo, where posterior sampling is constrained by a pre-defined set of DL axioms. A quite different approach is discussed in [18]: this work focuses on learning theories in a probabilistic extension of the ALC DL named CRALC, using DL refinement operators to efficiently explore the space of concepts. It is inspired by the literature on Bayesian Logic Programs.

2.3 Semi-Supervised and Transductive Learning

Classic discriminative learning methods ignore unlabelled instances. However, real-life scenarios are usually characterised by an abundance of unlabelled instances and only a few labelled ones [27]. This may also be the case for class-membership prediction from formal ontologies: class-membership relations may be difficult to obtain both during ontology engineering tasks (e.g. due to the limited availability of domain experts) and through inference (e.g. since deciding instance-membership may have an intractable time complexity in some languages).

Using unlabelled instances during learning is generally known in the machine learning community as Semi-Supervised Learning (SSL) [6, 27]. A variant of this setting, known as Transductive Learning [23], refers to finding a labelling only for the unlabelled instances provided in the training phase, without necessarily generalising to unseen instances (thus resulting in a possibly simpler learning problem). If the marginal distribution of instances P_X is informative with respect to the conditional probability distribution P(Y | x), accounting for unlabelled instances during learning can provide more accurate results [6, 27]. A possible approach is to include terms that depend on P_X in the objective function.
This results in the two fundamental assumptions [6]:

– Cluster assumption – The joint probability distribution P(X, Y) is structured in such a way that points in the same cluster are likely to have the same label.
– Manifold assumption – Assume that P_X is supported on a low-dimensional manifold: then, P(Y | x) varies smoothly, as a function of x, with respect to the underlying structure of the manifold.

In the following sections, we discuss a similarity-based, non-parametric and computationally efficient method for predicting missing class-membership relations. This method is discriminative in nature, but also accounts for unknown class-memberships during learning. We address a slightly different version of the classic class-membership prediction problem, namely transductive class-membership prediction. It is inspired by the Main Principle in [23]: "If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem". In this setting, the learning algorithm only aims at estimating the class-membership relation of interest for a given training set of individuals, without necessarily being able to generalise to individuals outside this set.

In this work, we formalise transductive class-membership prediction as a cost minimisation problem: given a set of training individuals Ind_C(K) whose class-membership relation to a target concept C is either known or unknown, find a function f* : Ind_C(K) → {+1, −1} defined over the training individuals and returning +1 (resp. −1) if the individual is likely to be a member of C (resp. ¬C), minimising a given cost function. More formally:

Definition 1 (Transductive Class-Membership Prediction). The Transductive Class-Membership Prediction problem can be formalised as follows:

– Given:
  • a target concept C;
  • a set of training individuals Ind_C(K) in a knowledge base K, partitioned into positive, negative and neutral examples or, more formally, such that:
    Ind⁺_C(K) = {a ∈ Ind_C(K) | K ⊨ C(a)} (positive examples),
    Ind⁻_C(K) = {a ∈ Ind_C(K) | K ⊨ ¬C(a)} (negative examples),
    Ind⁰_C(K) = {a ∈ Ind_C(K) | K ⊭ C(a) ∧ K ⊭ ¬C(a)} (neutral examples);
  • a cost function cost(·) : F → R, specifying the cost associated with the set of class-membership relations assigned to the training individuals by f ∈ F, where F is a space of labelling functions of the form f : Ind_C(K) → {+1, −1};
– Find a labelling function f* ∈ F minimising the given cost function over the training individuals Ind_C(K):

    f* ← arg min_{f ∈ F} cost(f).

The function f* can then be used to estimate the class-membership relation with respect to the target concept C for all training individuals a ∈ Ind_C(K): it returns +1 (resp. −1) if an individual is likely to be a member of C (resp. ¬C). Note that the function is defined on the whole set of training individuals; therefore it can possibly contradict already known class-membership relations (thus being able to handle noisy knowledge). If Ind_C(K) is finite, the space of labelling functions F is also finite, and each function f ∈ F can be equivalently expressed as a vector in {−1, +1}ⁿ, where n = |Ind_C(K)|.
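To make this setting concrete, the following is a minimal Python sketch (not part of the original paper; the individual names are hypothetical) of how a problem instance can be encoded as a partially labelled vector, following the convention of listing labelled individuals first:

```python
import numpy as np

# Hypothetical training individuals, partitioned by the class-membership
# relations entailed by the knowledge base K for a target concept C.
pos = ["a1", "a2"]        # K |= C(a)       -> positive examples
neg = ["a3"]              # K |= not C(a)   -> negative examples
neutral = ["a4", "a5"]    # neither entailed nor refuted -> neutral examples

individuals = pos + neg + neutral   # labelled instances first, then unlabelled
l, u = len(pos) + len(neg), len(neutral)
n = l + u                           # n = |Ind_C(K)|

# Partially labelled vector: +1 / -1 for labelled individuals, 0 as a
# placeholder for "unknown" on the neutral ones.
y = np.array([+1.0] * len(pos) + [-1.0] * len(neg) + [0.0] * len(neutral))

# A labelling function f in F is then simply a vector in {-1, +1}^n (later
# relaxed to [-1, +1]^n); learning amounts to picking the one minimising
# a given cost function.
```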
3 Propagating Class-Membership Information Among Individuals

This section discusses a graph-based semi-supervised [27] method for class-membership prediction from DL representations. The proposed method relies on a weighted semantic similarity graph, where nodes represent positive, negative and neutral examples of the transductive class-membership prediction problem, and weighted edges define similarity relations among such individuals.

More formally, let K be a knowledge base, Ind_C(K) a set of training individuals with respect to a target concept C in K, and Y = {−1, +1} a space of labels, each corresponding to a type of class-membership relation with respect to C. Each training individual a ∈ Ind_C(K) is associated with a label, which is +1 (resp. −1) if K ⊨ C(a) (resp. K ⊨ ¬C(a)), and unknown otherwise, thus representing an unlabelled instance.

For defining a cost over functions f ∈ F, the proposed method relies on graph regularisation: the learning process aims at finding a labelling function that is both consistent with the given labels and changes smoothly between similar instances (where similarity relations are encoded in the semantic similarity graph). This can be formalised through a regularisation framework, using a measure of consistency with the given labels as a loss function, and a measure of smoothness over the similarity graph as a regulariser. Several cost functions have been proposed in the SSL literature. An appealing class of functions, from a computational standpoint, relies on the quadratic cost criterion framework [6, ch. 11]: for this class of functions, a closed-form solution to the cost minimisation problem can be found efficiently (subsection 3.2).

3.1 Semantic Similarity Graph

A similarity graph can be represented by a weight matrix W, where the value of W_ij represents the strength of the similarity relation between two training examples x_i and x_j. In the graph-based SSL literature, W is often obtained either as a Nearest Neighbour (NN) graph (where each instance is connected to the k most similar instances in the graph, or to those within a radius ε), or using a kernel function, such as the Gaussian kernel. Finding the best way to construct W is an active area of research: for example, in [6, ch. 20] the authors discuss a method to combine multiple similarity measures in the context of protein function prediction, while [1] proposes a method for data-driven similarity graph construction.

When empirically evaluating the proposed method, we employ the family of dissimilarity measures between individuals in a DL knowledge base defined in [19], since it is not constrained to any particular family of DLs; we refer to the resulting similarity graph among individuals in a formal ontology as the semantic similarity graph. Given a set of concept descriptions F = {F₁, ..., F_n} and a weight vector w, such a family of dissimilarity measures d^F_p : Ind(A) × Ind(A) → [0, 1] is defined in terms of the projections:

    δᵢ(x, y) = 0    if (K ⊨ Fᵢ(x) ∧ K ⊨ Fᵢ(y)) ∨ (K ⊨ ¬Fᵢ(x) ∧ K ⊨ ¬Fᵢ(y)),
    δᵢ(x, y) = 1    if (K ⊨ Fᵢ(x) ∧ K ⊨ ¬Fᵢ(y)) ∨ (K ⊨ ¬Fᵢ(x) ∧ K ⊨ Fᵢ(y)),    (1)
    δᵢ(x, y) = uᵢ   otherwise,

where x, y ∈ Ind(A) and p > 0.
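As an illustration of eq. 1 and of the k-NN graph construction, consider the following Python sketch. The per-concept entailment results are assumed to be precomputed by a reasoner as values in {+1, −1, 0}; the aggregation of the δᵢ values into a single dissimilarity is shown here as a weighted p-norm, which is an assumption standing in for the exact measure of [19], as are the binary edge weights and the choice uᵢ = 0.5.

```python
import numpy as np

def delta(mx_i, my_i, u_i=0.5):
    """delta_i of eq. 1: 0 on agreement, 1 on disagreement, u_i otherwise.
    mx_i, my_i in {+1, -1, 0}: membership of x (resp. y) in the feature
    concept F_i as decided by a reasoner (0 means 'unknown')."""
    if mx_i != 0 and mx_i == my_i:
        return 0.0
    if mx_i != 0 and my_i != 0 and mx_i != my_i:
        return 1.0
    return u_i

def dissimilarity(mx, my, w, p=2):
    """Aggregate the per-concept deltas; a weighted p-norm over a
    normalised weight vector w is assumed here (see lead-in)."""
    d = np.array([delta(a, b) for a, b in zip(mx, my)])
    return float(np.sum(w * d ** p) ** (1.0 / p))

def knn_semantic_similarity_graph(memberships, w, k=3, p=2):
    """Symmetric k-NN graph over individuals, with binary edge weights."""
    n = len(memberships)
    D = np.array([[dissimilarity(memberships[i], memberships[j], w, p)
                   for j in range(n)] for i in range(n)])
    W = np.zeros((n, n))
    for i in range(n):
        neighbours = [j for j in np.argsort(D[i]) if j != i][:k]
        W[i, neighbours] = 1.0
    return np.maximum(W, W.T)  # symmetrise the adjacency
```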
Two examples of (k-NN) semantic similarity graphs over all individuals in the BioPAX (Proteomics) and Leo ontologies, obtained using the aforementioned dissimilarity measure, are provided in Fig. 1.

[Fig. 1: k-Nearest Neighbour semantic similarity graphs for the individuals in the BioPAX (Proteomics) ontology ((a) 3-NN graph, left) and in the Leo ontology ((b) 3-NN graph, right), obtained using the dissimilarity measure in [19]: F was defined as the set of atomic concepts in the ontology (each weighted with its normalised entropy [19]) and p = 2.]

3.2 Quadratic Cost Criteria

In quadratic cost criteria [6, ch. 11], the original label space {−1, +1} (binary classification case) is relaxed to [−1, +1]. This makes it possible to express the confidence associated with a labelling (and may give an indication about P(Y | x)). For this reason, in the proposed method the space of labelling functions F is relaxed to functions of the form f : Ind_C(K) → [−1, +1]. As in subsection 2.3, labelling functions can be equivalently represented as vectors ŷ ∈ [−1, +1]ⁿ.

Let ŷ ∈ [−1, +1]ⁿ be a possible labelling for n instances. We can see ŷ as an (l + u) = n dimensional vector, where the first l indices refer to already labelled instances and the last u to unlabelled instances: ŷ = [ŷ_l, ŷ_u]. The consistency of ŷ with respect to the original labels can be formulated as a quadratic cost:

    Σᵢ₌₁ˡ (ŷᵢ − yᵢ)² = ‖ŷ_l − y_l‖².
Similarly, labellings can be regularised with respect to the graph structure: as in [2], consistency with respect to the geometry of instances can be estimated as

    (1/2) Σᵢ,ⱼ₌₁ⁿ Wᵢⱼ (ŷᵢ − ŷⱼ)² = ŷᵀ L ŷ,

where W is the semantic similarity graph and L = D − W is the unnormalised graph Laplacian, D being the diagonal matrix with Dᵢᵢ = Σⱼ Wᵢⱼ and 0 elsewhere. A different criterion, discussed in [24, 25], measures it as (D^(−1/2) ŷ)ᵀ L (D^(−1/2) ŷ). Another regularisation term of the form ‖ŷ‖² (or ‖ŷ_u‖², as in [24]) can be added to the final cost function to prefer smaller values in ŷ. This is useful e.g. to prevent arbitrary labellings in a connected component of the semantic similarity graph containing no labelled instances.

Putting the pieces together, we obtain two quadratic cost criteria discussed in the literature, namely Regularization on Graph [2] (RG) and the Consistency Method [24] (CM):

    RG: cost(ŷ) = ‖ŷ_l − y_l‖² + µ ŷᵀ L ŷ + µε ‖ŷ‖²;
    CM: cost(ŷ) = ‖ŷ_l − y_l‖² + µ (D^(−1/2) ŷ)ᵀ L (D^(−1/2) ŷ) + ‖ŷ_u‖².

As an example, we now derive a closed-form solution for the problem of finding a (global) minimum of the quadratic cost criterion in RG. Its first-order derivative is:

    (1/2) ∂cost(ŷ)/∂ŷ = (S + µL + µεI) ŷ − S y,

where S = diag(s₁, ..., s_n), with sᵢ = 1 iff i ≤ l and 0 otherwise. Its second-order derivative is a positive definite matrix if ε > 0, since L is positive semi-definite. Therefore, setting the first-order derivative to 0 leads to a global minimum:

    ŷ = (S + µL + µεI)⁻¹ S y,

showing that ŷ can be obtained either by matrix inversion or by solving a (possibly sparse) linear system.

This work leverages quadratic cost criteria to efficiently solve the transductive class-membership prediction problem. Finding a minimiser ŷ of a predefined cost criterion is equivalent to finding a labelling function f* of the form f* : Ind_C(K) → [−1, +1], where the label returned for a generic training individual a ∈ Ind_C(K) corresponds to the value of ŷ at the position mapped to a. This can be done by representing the set of training individuals Ind_C(K) as a partially labelled vector y of length |Ind_C(K)| = n, such that the first l (resp. last u) components correspond to positive and negative (resp. neutral) examples in Ind_C(K). Such a y can then be used to measure consistency with the original labels in a quadratic cost criterion, while the semantic similarity graph can be employed to enforce smoothness of class-membership predictions among similar training individuals.

An advantage of quadratic cost criteria is that their minimisation ultimately reduces to solving a large sparse linear system [24, 6], a well-known problem in the literature whose time complexity is nearly linear in the number of non-zero entries in the coefficient matrix [22]. For large-scale datasets, a subset selection method is described in [6, ch. 18], which makes it possible to greatly reduce the size of the original linear system.
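As an illustration, the closed-form RG solution above can be computed by solving the sparse linear system (S + µL + µεI)ŷ = Sy rather than by matrix inversion. The following is a minimal NumPy/SciPy sketch (not the authors' implementation); the boolean mask generalises the convention that the first l entries of y are labelled:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def regularization_on_graph(W, y, labelled, mu=1.0, eps=1e-2):
    """Global minimiser of the RG criterion, obtained by solving
    (S + mu*L + mu*eps*I) y_hat = S y.

    W        : (n, n) symmetric weight matrix (semantic similarity graph)
    y        : (n,) label vector; entries outside `labelled` are ignored
    labelled : (n,) boolean mask playing the role of S = diag(s_1, ..., s_n)
    Returns the continuous labelling y_hat.
    """
    W = sp.csr_matrix(W)
    n = W.shape[0]
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    L = D - W                                   # unnormalised graph Laplacian
    S = sp.diags(np.asarray(labelled, dtype=float))
    A = S + mu * L + mu * eps * sp.identity(n)
    return spsolve(sp.csc_matrix(A), S @ y)     # sparse solve, no inversion
```

Sparsity of W, as obtained from the k-NN construction, keeps the coefficient matrix sparse, which is what makes the nearly-linear-time solvers mentioned above applicable.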
4 Preliminary Empirical Evaluations

In this section, we evaluate several (inductive and transductive) methods for class-membership prediction, with the aim of comparing the methods discussed in section 3 with other methods in the SW literature. We report evaluations for Regularization on Graph [2] (RG) and the Consistency Method [24] (CM); Label Propagation [26] (LP); three kinds of Support Vector Machines [21] (SVM), namely Hard-Margin SVM (HM-SVM), Soft-Margin SVM with L1 norm (SM-SVM) and Laplacian SVM [3] (LapSVM); and √l-Nearest Neighbours (√l-NN) for class-membership prediction [19].

4.1 Description of Evaluated Methods

LP is a graph-based SSL algorithm relying on the idea of propagating labelling information among similar instances through an iterative process involving matrix operations. It can be equivalently formulated within the quadratic criterion framework [6, ch. 11]. More formally, it associates, with each unlabelled instance in the graph, the probability of performing a random walk until a positively (resp. negatively) labelled example is found.

We also evaluated Support Vector Machines (SVM), which have been proposed for inducing robust classifiers from ontological knowledge bases [12, 19]. SVM classifiers come in different flavours: the classic HM-SVM binary classifier aims at finding the hyperplane in the feature space separating the instances belonging to different classes which maximises the geometric margin between the hyperplane and the nearest training points. The SM-SVM classifier is a relaxation of HM-SVM which allows for some misclassification of training instances (by relaxing the requirement that training instances be perfectly linearly separable in the feature space). LapSVM is a semi-supervised extension of the SM-SVM classifier: given a set of labelled instances and a set of unlabelled instances, it aims at finding a hyperplane that is also smooth with respect to the (estimated) geometry of instances. More formally, let (x_l, y_l) (resp. x_u) be a set of labelled (resp. unlabelled) instances. LapSVM finds a function f in a space of functions H_K determined by the kernel K (called a Reproducing Kernel Hilbert Space [21]) minimising

    (1/l) Σᵢ₌₁ˡ V(xᵢ, yᵢ, f) + γ_L ‖f‖²_{H_K} + γ_M ‖f‖²_M,

where V represents a cost function for the errors committed by f on labelled samples (typically the hinge loss max{0, 1 − yᵢ f(xᵢ)}), ‖·‖_{H_K} imposes smoothness conditions on possible solutions [21], and ‖·‖²_M, intuitively, penalises rapid changes in the classification function between close instances in the similarity graph. It generalises HM-SVM (γ_L → 0, γ_M = 0) and SM-SVM (γ_M = 0). Our implementation of LapSVM follows the algorithm proposed in [3]; for HM-SVM, SM-SVM and LapSVM, we solve the underlying convex optimisation problems using the Gurobi optimiser [14].

    Ontology              Expressivity   #Axioms  #Individuals  #Classes  #ObjectProperties
    BioPAX (Proteomics)   ALCHN(D)       773      49            55        47
    Family-Tree           SROIF(D)       2059     368           22        52
    Leo                   ALCHIF(D)      430      61            32        26
    MDM0.73               ALCHOF(D)      1098     112           196       22
    Wine                  SHOIN(D)       1046     218           142       21

    Table 1: Ontologies considered in the experiments.

RG, CM, LP and LapSVM all rely on a semantic similarity graph W as a representation of the geometry of instances. We first calculate distances employing the dissimilarity measure defined in [19] and outlined in eq. 1, with p = 2; we then obtain W by building a k-Nearest Neighbour graph from such distances (since sparsity in W influences the scalability of quadratic cost criteria, as noted in subsection 3.2). When building the neighbourhood of a node, we handled the cases in which nodes had the same distance by introducing a random ordering between such nodes. The kernel function used for Hard-Margin SVM, Soft-Margin SVM and Laplacian SVM is also defined in [19], and is directly correlated with the aforementioned dissimilarity measure in eq. 1 (given a committee of concepts F and the parameters w and p, the dissimilarity was originally obtained as 1 − k(a, b), where k(a, b) is the value of the kernel function on a pair of individuals (a, b) in the knowledge base).
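For completeness, the relation between kernel and dissimilarity stated above can be inverted to recover a kernel matrix from precomputed dissimilarities. A minimal sketch (an illustration, not the construction in [19] itself), with a basic validity check:

```python
import numpy as np

def kernel_from_dissimilarity(D):
    """Recover k(a, b) = 1 - d(a, b) from a dissimilarity matrix D in [0, 1],
    inverting the relation stated in the text."""
    K = 1.0 - np.asarray(D, dtype=float)
    assert np.allclose(K, K.T), "kernel matrix must be symmetric"
    # Diagnostic only: a valid kernel matrix must be positive semi-definite.
    if np.linalg.eigvalsh(K).min() < -1e-8:
        raise ValueError("resulting matrix is not positive semi-definite")
    return K
```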
We also provide a first evaluation of the k-NN algorithm (with k = √l, where l is the number of labelled instances, as discussed in [19]): to label each unlabelled instance, we simply choose the majority class among its √l most similar individuals.

4.2 Evaluations

Starting from a set of real ontologies¹ (outlined in Table 1), we generated a set of 20 random query concepts for each ontology², such that the number of individuals belonging to the target query concept C (resp. ¬C) was at least 10, and the numbers of individuals in C and ¬C were of the same order of magnitude. A DL reasoner³ was employed to decide the theoretical concept-membership of individuals to query concepts. We employ the evaluation metrics in [7], which take into account the peculiarities deriving from the presence of missing knowledge:

¹ From the TONES Repository: http://owl.cs.manchester.ac.uk/repository/
² Using the methods available at http://lacam.di.uniba.it/˜nico/research/ontologymining.html
³ Pellet v2.3.0 – http://clarkparsia.com/pellet/

    Leo       Match           Omission        Commission      Induction
    RG        1 ± 0           0 ± 0           0 ± 0           0 ± 0
    CM        1 ± 0           0 ± 0           0 ± 0           0 ± 0
    LP        0.942 ± 0.099   0.007 ± 0.047   0.052 ± 0.091   0 ± 0
    SM-SVM    0.963 ± 0.1     0 ± 0           0.037 ± 0.1     0 ± 0
    LapSVM    0.978 ± 0.068   0 ± 0           0.022 ± 0.068   0 ± 0
    √l-NN     0.971 ± 0.063   0 ± 0           0.029 ± 0.063   0 ± 0

    BioPAX (Proteomics)
              Match           Omission        Commission      Induction
    RG        0.986 ± 0.051   0.004 ± 0.028   0.008 ± 0.039   0.002 ± 0.02
    CM        0.986 ± 0.051   0.002 ± 0.02    0.01 ± 0.044    0.002 ± 0.02
    LP        0.982 ± 0.058   0.002 ± 0.02    0.014 ± 0.051   0.002 ± 0.02
    SM-SVM    0.972 ± 0.075   0 ± 0           0.026 ± 0.068   0.002 ± 0.02
    LapSVM    0.972 ± 0.075   0 ± 0           0.026 ± 0.068   0.002 ± 0.02
    √l-NN     0.972 ± 0.075   0 ± 0           0.026 ± 0.068   0.002 ± 0.02

    MDM0.73   Match           Omission        Commission      Induction
    RG        0.953 ± 0.063   0.003 ± 0.016   0.011 ± 0.032   0.015 ± 0.039
    CM        0.953 ± 0.063   0.001 ± 0.009   0.013 ± 0.036   0.018 ± 0.04
    LP        0.942 ± 0.065   0 ± 0           0.026 ± 0.046   0.033 ± 0.054
    SM-SVM    0.793 ± 0.252   0 ± 0           0.174 ± 0.255   0.033 ± 0.054
    LapSVM    0.915 ± 0.086   0 ± 0           0.052 ± 0.065   0.033 ± 0.054
    √l-NN     0.944 ± 0.069   0 ± 0           0.023 ± 0.051   0.033 ± 0.054

    Wine      Match           Omission        Commission      Induction
    RG        0.24 ± 0.03     0 ± 0.005       0.007 ± 0.017   0.5 ± 0.176
    CM        0.242 ± 0.028   0 ± 0.005       0.005 ± 0.015   0.326 ± 0.121
    LP        0.239 ± 0.035   0 ± 0.005       0.008 ± 0.021   0.656 ± 0.142
    SM-SVM    0.235 ± 0.036   0 ± 0           0.012 ± 0.024   0.753 ± 0.024
    LapSVM    0.238 ± 0.033   0 ± 0           0.009 ± 0.021   0.753 ± 0.024
    √l-NN     0.241 ± 0.031   0 ± 0           0.006 ± 0.018   0.753 ± 0.024

    Table 2: Match, Omission, Commission and Induction [19] results for a k-Fold Cross Validation (k = 10) on 20 randomly generated queries. For each experiment, the best parameters were found within the training set using a k-Fold Cross Validation (k = 10).

Match – Case of an individual that received the same label from the reasoner and from the inductive classifier.
Omission Error – Case of an individual for which the inductive method could not determine whether it was relevant to the query concept or not, while it was found relevant by the reasoner.
Commission Error – Case of an individual found to be relevant to the query concept while it logically belongs to its negation, or vice versa.
Induction – Case of an individual found to be relevant to the query concept or to its negation, while neither case is logically derivable from the knowledge base.

Before evaluating on the test set, parameter tuning was performed for each of the methods via a k-Fold Cross Validation (k = 10) within the training set, in order to find the parameters with the lowest classification error in cross-validation.
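A minimal sketch of this inner model-selection loop for RG follows, assuming the regularization_on_graph solver sketched in subsection 3.2 is in scope; the fold splitting and error estimate are illustrative simplifications, not the exact experimental protocol:

```python
import numpy as np

def tune_rg(W, y, labelled, grid_mu, grid_eps, n_folds=10, seed=0):
    """Select (mu, eps) for RG by k-fold cross-validation over the labelled
    instances: each fold is hidden in turn by clearing its entries in the
    labelled mask, and the fraction of sign disagreements on the held-out
    fold is used as the classification error."""
    labelled_idx = np.flatnonzero(labelled)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(labelled_idx), n_folds)
    best_params, best_err = None, np.inf
    for mu in grid_mu:
        for eps in grid_eps:
            fold_errs = []
            for held_out in folds:
                mask = np.array(labelled, dtype=bool)
                mask[held_out] = False  # hide the held-out labels
                y_hat = regularization_on_graph(W, y, mask, mu, eps)
                fold_errs.append(
                    np.mean(np.sign(y_hat[held_out]) != y[held_out]))
            err = float(np.mean(fold_errs))
            if err < best_err:
                best_params, best_err = (mu, eps), err
    return best_params

# Example grids, following the ranges reported below:
# grid_mu = grid_eps = [10.0 ** e for e in range(-4, 5)]
```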
[Fig. 2: Variation of average Match rates with respect to the number of folds used in the training step, during a k-Fold Cross Validation (k = 10), for RG, CM, LP, HM-SVM, SM-SVM, LapSVM and NN: (a) Leo ontology, (b) BioPAX (Proteomics) ontology.]

For LapSVM, the (γ_L, γ_M) parameters were varied in {10⁻⁴, 10⁻³, ..., 10⁴}, while for SM-SVM, which follows the implementation in [21, pg. 223], the C parameter was allowed to vary in {10⁻⁴, 10⁻³, ..., 10⁴}. Similarly, the (µ, ε) parameters in RG and CM were varied in {10⁻⁴, 10⁻³, ..., 10⁴}. The parameter k for building the k-NN semantic similarity graph, used by LapSVM, RG, CM and LP, was varied in {2, 4, 8, 16}. We did not carefully choose the concept committee F defining the dissimilarity measure: we simply used the set of atomic concepts in the ontology, thus ignoring any prior knowledge about the structure of the target concept C or the presence of statistical correlations in the knowledge base. Each concept in the committee F was weighted with its normalised entropy [19].

RG, CM and LP give an indication of the uncertainty associated with a specific labelling by assigning values in the interval [−1, +1] to each node; when such values were ≈ 0 (specifically, when the label fell in the interval [−10⁻⁴, 10⁻⁴]), we decided to leave the node unlabelled, so as to provide more robust label estimates (and thus possibly lower commission error and match rates, and higher omission error rates); a sketch of this decision rule is given at the end of this section. This may happen e.g. when there are no labelled examples within a connected component of the semantic similarity graph.

In Tab. 2 we report average index rates and standard deviations for each of the ontologies in Tab. 1; the only exception is the Family-Tree ontology, which yielded match rates of 0.76 ± 0.13 and induction rates of 0.24 ± 0.13 for all methods (with the exception of LP, for which the induction rates were 0.21 ± 0.14). In general, LapSVM outperformed the other two non-SSL SVM classification methods. This happened with varying quantities of unlabelled data, as shown for example by the behaviour of match rates in subfigure 2a, which reports results obtained in a k-Fold Cross Validation using a varying quantity of labelled instances. However, standard SVM training is O(m³) in general, where m is the number of training instances; therefore, some extra effort may be necessary to make SVM methods scale to SW knowledge bases. These results provide some empirical evidence that inductive methods for formal ontologies may benefit from also accounting for unlabelled instances during learning.
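The abstention rule mentioned above (leaving a node unlabelled when its continuous score falls within [−10⁻⁴, 10⁻⁴]) amounts to the following; a minimal sketch:

```python
import numpy as np

def decide(y_hat, threshold=1e-4):
    """Map continuous scores in [-1, +1] to {+1, -1, 0}: scores within
    [-threshold, +threshold] abstain (0, i.e. the node is left unlabelled),
    contributing to omission rather than commission errors."""
    labels = np.sign(y_hat).astype(int)
    labels[np.abs(y_hat) <= threshold] = 0
    return labels
```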
5 Conclusion and Future Works

This work proposes a method for transductive class-membership prediction from DL representations, based on graph regularisation. It leverages neutral examples by propagating class-membership information among similar individuals in the training set. The proposed method relies on quadratic cost criteria, whose optimisation can be reduced to solving a (possibly sparse) linear system; this is a well-known problem in the literature, with a time complexity that is nearly linear in the number of non-zero entries in the coefficient matrix.

We did not carefully analyse the impact of different choices of (dis-)similarity measure for building the semantic similarity graph. However, the similarity graph has a strong influence on the effectiveness of the methods used [27]. The construction of the similarity graph for class-membership learning tasks can be influenced by factors such as the structure of the target concept C, or by statistical correlations found within the knowledge base. Also, it is not clear whether the continuous labels assigned by the proposed methods correspond to posterior probability estimates from a statistical point of view. In future work, we aim at investigating these two aspects of graph-based transductive and semi-supervised class-membership prediction from DL representations.

References

[1] Alexandrescu, A., Kirchhoff, K.: Data-driven graph construction for semi-supervised graph-based learning in NLP. In: Sidner, C., et al. (eds.) HLT-NAACL, pp. 204–211. The Association for Computational Linguistics (2007)
[2] Belkin, M., Matveeva, I., Niyogi, P.: Regularization and semi-supervised learning on large graphs. In: Shawe-Taylor, J., et al. (eds.) COLT. LNCS, vol. 3120, pp. 624–638. Springer (2004)
[3] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, 2399–2434 (2006)
[4] Bloehdorn, S., Sure, Y.: Kernel methods for mining instance data in ontologies. In: Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, pp. 58–71. ISWC'07/ASWC'07, Springer (2007)
[5] Borgida, A., Walsh, T., Hirsh, H.: Towards measuring similarity in description logics. In: Horrocks, I., et al. (eds.) Description Logics. CEUR Workshop Proceedings, vol. 147. CEUR-WS.org (2005)
[6] Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge, MA (2006)
[7] d'Amato, C., Fanizzi, N., Esposito, F.: Query answering and ontology population: an inductive approach. In: Hauswirth, M., et al. (eds.) Proceedings of the 5th European Semantic Web Conference (ESWC'08). Springer, Tenerife, Spain (2008)
[8] d'Amato, C., Fanizzi, N., Esposito, F.: A semantic similarity measure for expressive description logics. CoRR abs/0911.5043 (2009)
[9] d'Amato, C., Staab, S., Fanizzi, N.: On the influence of description logics ontologies on conceptual similarity. In: Proceedings of the 16th International Conference on Knowledge Engineering: Practice and Patterns, pp. 48–63. EKAW '08, Springer, Berlin, Heidelberg (2008)
[10] Fanizzi, N., d'Amato, C.: Inductive concept retrieval and query answering with semantic knowledge bases through kernel methods. In: Proceedings of the 11th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Part I, pp. 148–155. KES'07, Springer (2007)
[11] Fanizzi, N., d'Amato, C., Esposito, F.: ReduCE: A reduced Coulomb energy network method for approximate classification. In: Proceedings of the 6th European Semantic Web Conference (ESWC'09), pp. 323–337. Springer (2009)
[12] Fanizzi, N., d'Amato, C., Esposito, F.: Statistical learning for inductive query answering on OWL ontologies. In: Proceedings of the 7th International Conference on The Semantic Web, pp. 195–212. ISWC '08, Springer (2008)
[13] Fanizzi, N., d'Amato, C., Esposito, F.: Towards learning to rank in description logics. In: Coelho, H., et al. (eds.) ECAI.
Frontiers in Artificial Intelligence and Applications, vol. 215, pp. 985–986. IOS Press (2010)
[14] Gurobi Optimization, Inc.: Gurobi optimizer reference manual (2012)
[15] Hu, B., Dasmahapatra, S., Lewis, P.: Semantic metrics. Int. J. Metadata Semant. Ontologies 2(4), 242–258 (Jul 2007)
[16] Janowicz, K., Wilkes, M.: SIM-DLA: A novel semantic similarity measure for description logics reducing inter-concept to inter-instance similarity. In: Aroyo, L., et al. (eds.) ESWC. LNCS, vol. 5554, pp. 353–367. Springer (2009)
[17] Lasserre, J., Bishop, C.M.: Generative or discriminative? Getting the best of both worlds. Bayesian Statistics 8, 3–24 (2007)
[18] Ochoa-Luna, J.E., Cozman, F.G.: An algorithm for learning with probabilistic description logics. In: Bobillo, F., et al. (eds.) URSW, pp. 63–74 (2009)
[19] Rettinger, A., Lösch, U., Tresp, V., d'Amato, C., Fanizzi, N.: Mining the Semantic Web – statistical learning for next generation knowledge bases. Data Mining and Knowledge Discovery – Special Issue on Web Mining (2012)
[20] Rettinger, A., Nickles, M., Tresp, V.: Statistical relational learning with formal ontologies. In: Buntine, W.L., et al. (eds.) ECML/PKDD (2). LNCS, vol. 5782, pp. 286–301. Springer (2009)
[21] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA (2004)
[22] Spielman, D.A., Teng, S.H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 81–90. STOC '04, ACM, New York, NY, USA (2004)
[23] Vapnik, V.N.: Statistical Learning Theory. Wiley, 1st edn. (Sep 1998)
[24] Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Advances in Neural Information Processing Systems 16, pp. 321–328. MIT Press (2004)
[25] Zhou, D., Huang, J., Schölkopf, B.: Learning from labeled and unlabeled data on a directed graph. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 1036–1043. ICML '05, ACM (2005)
[26] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Tech. rep., CMU CALD tech report CMU-CALD-02 (2002)
[27] Zhu, X.: Semi-supervised learning literature survey. Tech. Rep. 1530, Computer Sciences, University of Wisconsin-Madison (2005)