Induction of Terminological Cluster Trees

Giuseppe Rizzo, Claudia d'Amato, Nicola Fanizzi, and Floriana Esposito
Via Orabona 4, 70125 Bari, Italy

In this paper, we tackle the problem of clustering individual resources in the context of the Web of Data, which is characterized by a huge amount of data published in a standard data model with a well-defined semantics based on Web ontologies. Clustering methods offer an effective solution to support many complex related activities, such as ontology construction, debugging and evolution, taking into account the inherent incompleteness underlying the representation. Web ontologies already encode a hierarchical organization of the resources by means of the subsumption hierarchy of the classes, which may be expressed explicitly, with proper subsumption axioms, or must be detected indirectly, by reasoning on the available axioms that define the classes (classification). However, it frequently happens that such classes are sparsely populated, as the hierarchy often reflects a view of the knowledge engineer prior to the actual introduction of assertions involving the individual resources. As a result, very general classes are often loosely populated, but this may also happen to specific subclasses, making it more difficult to check the types of a resource (instance checking), even through reasoning services. Among the large number of algorithms proposed in the Machine Learning literature, we propose a clustering method that is able to organize groups of resources hierarchically. Specifically, in this work, we introduce a conceptual clustering approach that combines a distance measure between individuals in a knowledge base with a divide-and-conquer solution that is intended to elicit ex post the underlying hierarchy based on the actual distributions of the instances.

1 Introduction and Motivation

With the growth of the Web of Data, along with the Linked Data initiative [12], a large number of datasets/vocabularies are being published on the ground of a standard data model and connected within a uniform semantic space exploiting RDF and the Web infrastructure. However, this huge quantity of data is also known to be inherently uncertain, being often inconsistent/conflictual, vague and, especially, incomplete [14].

Web ontologies already encode a hierarchical organization of the data by means of the subsumption hierarchy of the classes, which may be expressed explicitly, with proper subsumption axioms, or has to be determined indirectly, by reasoning on the available axioms defining the various classes (classification task). However, it frequently happens that such classes are sparsely populated, as the hierarchy often reflects a view of the knowledge engineer prior to the actual introduction of assertions involving the individual resources. As a result, very general classes are often loosely populated, but this may also happen to specific subclasses, hindering or, at least, making it difficult to check the types of a resource (instance checking), even employing reasoning services. This may have a strong negative impact on services based on query answering, which often rely only on explicit assertions regarding the data.

Machine Learning methods can be employed to elicit implicit pieces of knowledge from the datasets, also in the face of such cases of inherent incompleteness, and provide non-standard inference services [17]. In particular, in this paper we address conceptual clustering [18] as a means to exploit the data to detect further classes that may arise with the underlying hierarchical structure. Specifically, we introduce an approach that combines semantic distance measures over the space of individuals with a divide-and-conquer solution that is intended to elicit, in retrospect, an additional class hierarchy that reflects the actual distributions of the instances.

Clustering is an unsupervised learning task aiming at partitioning a collection of objects into subsets or clusters, so that those within each cluster are more closely related to one another than to the objects assigned to different clusters [1]. In the Web of Data context, clustering can enable the definition of new emerging concepts (concept formation) on the grounds of those employed in a knowledge base; supervised methods can exploit these clusters to induce new concept definitions or to refine existing ones (ontology evolution); intensionally defined groupings may speed up the task of search and discovery; clustering may also induce criteria for ranking the retrieved resources [10, 9].

An object is usually described by a fixed set of features (attribute values) and the most common notion of similarity between the objects is expressed in terms of a distance function based on this set; for example, for datasets made up of objects described by tuples of numeric features, (extensions of) the Euclidean distance are adopted to determine object and cluster similarity.

An important difference among the various clustering techniques is related to the type of membership that is adopted. In the simplest (crisp) case, e.g. k-Means [16], cluster membership can be exclusive: each object belongs to exactly one cluster. Extensions, such as fuzzy c-means [3] or EM [7], admit an overlap between the clusters and a degree of membership (responsibility) of each object to a cluster. Further extensions include non-flat clustering structures such as those induced via hierarchical clustering [1].

In this work, moving on to conceptual clustering, we will require that classes with an intensional definition may account for and define these clusters. To this purpose the algorithms should be able to exploit background knowledge that can be expressed using expressive representation languages¹, such as Description Logics. Again, in such methods, clustering requires the definition of a (dis)similarity measure between the objects to be clustered.

We propose a generalized solution based on logical trees [4], namely Terminological Cluster Trees, as an extension of terminological decision trees [8]. They adopt a pseudo-metric defined over the space of individuals as a criterion to separate groups of objects, rather than information gain or other notions of purity devised for supervised concept learning and inductive classification methods such as terminological decision trees. The proposed solution provides intensional definitions that can be used for describing the individuals in the cluster and, unlike other methods [13, 11], does not have to resort to complex approximations such as the most specific concept [2] as the representative of individuals on the conceptual level. Moreover, terminological cluster trees can autonomously determine the number of clusters to be generated, which is a required parameter for several other methods and has a strong impact on the quality of the partitions obtained.

The rest of the paper is organized as follows: the next section illustrates the basic notions about the underlying Description Logics foundations for the intended representation and reasoning services; Sect. 3 introduces the problem of clustering individuals of a knowledge base, while Sect. 4 presents the approach required for inducing terminological cluster trees. Finally, Sect. 5 illustrates the conclusions and some possible extensions.

2 Basics

In the following, we will borrow notation and terminology from Description Logics (DLs) [2], as the representation and reasoning services for the knowledge bases in the Web of Data ultimately rely on such a family of languages. Hence we will use the terms concept (description) and role as synonyms of class and property², respectively.

In these languages, a domain is modeled through atomic concepts (classes) $N_C$ and roles (relations) $N_R$, which can be used to build complex descriptions regarding individuals (instances, objects), by using specific operators (complement, conjunction and disjunction between concepts) that depend on the adopted language. A knowledge base is a couple $\mathcal{K} = (\mathcal{T}, \mathcal{A})$ where the TBox $\mathcal{T}$ contains axioms concerning concepts and roles (typically inclusion axioms such as $C \sqsubseteq D$) and the ABox $\mathcal{A}$ contains assertions, i.e. axioms regarding the individuals ($C(a)$, resp. $R(a, b)$). The set of individuals occurring in $\mathcal{A}$ is denoted by $\mathrm{Ind}(\mathcal{A})$.

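The semantics of individuals, concepts, and roles is defined through interpretations. An interpretation is a couple $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ where $\Delta^{\mathcal{I}}$ is the domain of the interpretation and $\cdot^{\mathcal{I}}$ is a mapping such that, for each individual $a$, $a^{\mathcal{I}} \in \Delta^{\mathcal{I}}$, for each concept $C$, $C^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$ and for each role $R$, $R^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$. The semantics of complex descriptions descends from the interpretation of the primitive concepts/roles and of the operators employed, depending on the adopted language. $\mathcal{I}$ satisfies an axiom $C \sqsubseteq D$ ($C$ is subsumed by $D$) when $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$ and an assertion $C(a)$ (resp. $R(a, b)$) when $a^{\mathcal{I}} \in C^{\mathcal{I}}$ (resp. $(a^{\mathcal{I}}, b^{\mathcal{I}}) \in R^{\mathcal{I}}$). $\mathcal{I}$ is a model for $\mathcal{K}$ iff it satisfies each axiom/assertion $\alpha$ in $\mathcal{K}$, denoted with $\mathcal{I} \models \alpha$. When $\alpha$ is satisfied w.r.t. these models, we write $\mathcal{K} \models \alpha$.

¹ In the past, various fragments of First-Order Logic have been adopted, such as Clausal Logics, especially in Inductive Logic Programming.
² Datatype properties, i.e. roles ranging on concrete domains, will not be considered in this study.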

We will be interested in the instance-checking inference service: given an individual $a$ and a concept description $C$, determine whether $\mathcal{K} \models C(a)$. Due to the Open World Assumption (OWA), answering a class-membership query is more difficult than in ILP settings, where closed-world reasoning is the standard form. Indeed, one may not be able to prove the truth of either $\mathcal{K} \models C(a)$ or $\mathcal{K} \models \neg C(a)$, as it may be possible to find different interpretations that satisfy either case.

3 Conceptual Clustering for DL Knowledge Bases

As we are targeting conceptual clustering for DL Knowledge Bases, the problem, in a simple formulation, may be formalized as follows:

Definition 3.1 (conceptual clustering - flat case).

Given:
- a knowledge base $\mathcal{K} = \langle \mathcal{T}, \mathcal{A} \rangle$
- a set of training individuals $TI \subseteq \mathrm{Ind}(\mathcal{A})$

Find:
- a partition of $TI$ in $n$ pairwise disjoint clusters $\{C_1, \ldots, C_n\}$
- for each $i = 1, \ldots, n$, a concept description $D_i$ that accounts for $C_i$, i.e. such that $\forall a \in C_i$:
  1. $\mathcal{K} \models D_i(a)$
  2. $\mathcal{K} \models \neg D_j(a) \quad \forall j \in \{1, \ldots, n\},\ j \neq i$

Note that in this setting the number of clusters ($n$) is not required as a parameter. Condition 2 may be relaxed (e.g. $\mathcal{K} \not\models D_j(a)$) to allow some overlap between the clusters/concepts, which may be further extended towards probabilistic clustering methods and models [1].
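As an illustration of the two conditions, the following minimal Python sketch checks whether a candidate set of concept descriptions accounts for a given clustering. It assumes a hypothetical three-valued membership oracle (`True` when the knowledge base entails membership, `False` when it entails the complement, `None` when neither holds under the OWA); the names `accounts_for`, `toy_member` and `FACTS` are illustrative and stand in for calls to a real reasoner.

```python
from typing import Callable, Dict, Hashable, Optional, Set

# Three-valued instance check under the OWA (hypothetical oracle):
# True -> K |= D(a), False -> K |= not D(a), None -> neither is entailed.
Membership = Callable[[str, Hashable], Optional[bool]]


def accounts_for(clusters: Dict[str, Set[Hashable]],
                 member: Membership,
                 strict: bool = True) -> bool:
    """Check conditions 1 and 2 of the definition for each cluster.

    `clusters` maps the name of each concept D_i to its cluster C_i.
    With strict=False, condition 2 is relaxed to 'K does not entail D_j(a)'.
    """
    for d_i, c_i in clusters.items():
        for a in c_i:
            if member(d_i, a) is not True:        # condition 1
                return False
            for d_j in clusters:
                if d_j == d_i:
                    continue
                m = member(d_j, a)
                if strict and m is not False:     # condition 2
                    return False
                if not strict and m is True:      # relaxed condition 2
                    return False
    return True


# Toy oracle: membership of bob in Author is unknown.
FACTS = {("Author", "ann"): True, ("NonAuthor", "ann"): False,
         ("NonAuthor", "bob"): True}


def toy_member(concept: str, individual: Hashable) -> Optional[bool]:
    return FACTS.get((concept, individual))


clusters = {"Author": {"ann"}, "NonAuthor": {"bob"}}
strict_ok = accounts_for(clusters, toy_member)
relaxed_ok = accounts_for(clusters, toy_member, strict=False)
```

In this toy setting the strict check fails because nothing entails that bob is not an Author, while the relaxed check succeeds; the sketch does not verify that the clusters are pairwise disjoint.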

This problem can be regarded as a recursive one, as each cluster, in its turn, might yield its internal partitioning and each sibling sub-cluster would be characterized intensionally by its sub-class.

The decision on whether to partition recursively a given cluster or not generally depends on cohesion metrics, assessing a measure of intra-cluster similarity (within the cluster) w.r.t. the inter-cluster dissimilarity (w.r.t. the sibling partitions).

4 Terminological Cluster Trees

The notion of terminological cluster tree extends logical clustering trees introduced in [6] and learned through C0.5, a system derived from Tilde [4]. The induction of the model combines elements of logical decision tree induction (i.e. the approach based on recursive partitioning and the exploitation of a refinement operator for specializing concept descriptions) with other elements of instance-based learning (i.e. the employment of a distance measure over the instance space).

Definition 4.1 (Terminological Cluster Trees). Given a knowledge base $\mathcal{K}$, a terminological cluster tree (TCT) is a (binary) logical tree where:
- each leaf node stands for a cluster of individuals $C$;
- each node contains a concept description $D$ (over the signature of $\mathcal{K}$);
- each edge from an internal node corresponds to the outcome of the membership test of individuals w.r.t. the concept installed in the node³.

Hence, a tree-node can be represented by a quadruple $\langle D, C, T_{left}, T_{right} \rangle$, indicating the two subtrees connected by either edge.
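The quadruple can be mirrored by a small recursive data structure. The sketch below is only one possible encoding, with concept descriptions kept as plain strings and individuals as names; the class `TCTNode` and the helper `leaves` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class TCTNode:
    """One node <D, C, T_left, T_right> of a terminological cluster tree."""
    concept: str                               # D: concept description
    cluster: FrozenSet[str] = frozenset()      # C: individuals routed here
    left: Optional["TCTNode"] = None           # subtree for positive instances
    right: Optional["TCTNode"] = None          # subtree for negative instances

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaves(node: TCTNode) -> List[TCTNode]:
    """Collect the leaf nodes, i.e. the clusters, from left to right."""
    if node.is_leaf():
        return [node]
    found: List[TCTNode] = []
    for child in (node.left, node.right):
        if child is not None:
            found.extend(leaves(child))
    return found


# A two-leaf toy tree over a hypothetical academic domain.
tree = TCTNode(
    concept="Person",
    cluster=frozenset({"ann", "bob", "iswc"}),
    left=TCTNode("Person", frozenset({"ann", "bob"})),
    right=TCTNode("not Person", frozenset({"iswc"})),
)
leaf_clusters = [leaf.cluster for leaf in leaves(tree)]
```

Keeping the cluster in every node, not only in the leaves, makes it easy to inspect which individuals were routed through each internal test.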

[Fig. 1: tree diagram of a TCT. Node concepts: $\mathsf{Person}$; $\mathsf{Person} \sqcap \exists \mathsf{hasPublication}.\top$; $\mathsf{Person} \sqcap \exists \mathsf{hasPublication}.(\mathsf{SWJ})$. Leaf clusters: $C_1$, $C_2$, $C_3$.]

Example 4.1 (a TCT). Fig. 1 illustrates a simple example of a TCT for describing individuals in the academic domain. The root node of the tree contains the concept $\mathsf{Person}$. Two edges depart from this node: the left branch denotes a positive membership of an individual w.r.t. the concept $\mathsf{Person}$, while the right branch denotes a negative membership w.r.t. the concept. On the one hand, the right child of this node is a leaf that contains a cluster $C_4$ composed of individuals that are not instances of the concept $\mathsf{Person}$. On the other hand, the left child of the root node contains a further complex concept description. Again, there are two edges departing from this node: the right edge links the node to another leaf containing the cluster $C_3$ of individuals that have no publication. □

³ By convention the left branch is for positive instances and the right one is for negative instances.

The details of the algorithms for (a) growing a TCT and (b) deriving intensional definitions are reported in the sequel.

4.1 A Method for Inducing Terminological Cluster Trees

A terminological cluster tree $T$ is induced by means of a recursive strategy, which follows the same schema proposed for terminological decision trees (TDTs) [8]. The sketch is reported in Alg. 1.

Algorithm 1 Routines for inducing a TCT

The main routine is to be invoked as induceTCT($TI$, $C$), where $C$ may be $\top$ or any other general concept the individuals in $TI$ are known to belong to by default.

In this recursive algorithm, the base case depends on a test (via stopCondition) on a threshold over the heuristic employed for growing the tree, measuring the cohesion of the cluster of individuals routed to the node in terms of a given metric. If this value exceeds the threshold, then the branch is marked as completed and the cluster is stored in a leaf node. Further details about the heuristic and the stop condition will be reported later on.

In the inductive step, the current (parent) concept description $C$ has to be specialized by means of a refinement operator ($\rho$) exploring the search space of specializations of $C$. A set $S$ of candidate specializations $\rho(C)$ is obtained. For each $E \in S$, the sets of positive and negative individuals, denoted, resp., by $P$ and $N$, are retrieved.

A tricky situation may occur (line 24) when either $N$ or $P$ is empty for a given concept (e.g. due to a total lack of disjointness axioms). In such a case, the algorithm can re-assign individuals of $I$ to $N$ (resp. $P$) based on the distance between them and the prototype of $P$ (resp. $N$), when this goes beyond a given threshold.

For $P$ and $N$, a representative element is determined as a prototype, i.e. their medoid, a central element in a cluster, having a minimal average distance w.r.t. the other elements. Then, function selectBestConcept evaluates the specializations of the concept according to the closeness w.r.t. the medoids, determined according to a given distance measure (discussed later). The best concept $E^* \in S$ is the one for which the distance between the medoids of the positive and negative instance sets is maximized. Then $E^*$ is installed in the current node.
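The selection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the routine of Alg. 1: individuals are arbitrary hashable objects, the distance is passed in as a function, candidates arrive with their positive and negative sets already retrieved, and candidates with an empty side are simply skipped (the re-assignment described above is not modelled). The names `medoid` and `select_best_concept` are illustrative.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Tuple

Distance = Callable[[Hashable, Hashable], float]


def medoid(items: Sequence[Hashable], d: Distance) -> Hashable:
    """Element with minimal total (hence minimal average) distance to the others."""
    return min(items, key=lambda x: sum(d(x, y) for y in items))


def select_best_concept(
    candidates: Dict[str, Tuple[Sequence[Hashable], Sequence[Hashable]]],
    d: Distance,
) -> Optional[str]:
    """Return the candidate E whose medoids of P and N are farthest apart."""
    best, best_gap = None, -1.0
    for concept, (pos, neg) in candidates.items():
        if not pos or not neg:
            continue  # assumption: skip instead of re-assigning individuals
        gap = d(medoid(pos, d), medoid(neg, d))
        if gap > best_gap:
            best, best_gap = concept, gap
    return best


# Toy setting: individuals are numbers and the distance is |x - y|.
def toy_distance(x: float, y: float) -> float:
    return abs(x - y)


candidates = {
    "E1": ([1, 2, 3], [4, 5, 6]),    # medoids 2 and 5
    "E2": ([1, 2, 10], [3, 4, 5]),   # medoids 2 and 4
}
best = select_best_concept(candidates, toy_distance)
```

Using a medoid rather than a centroid matters here because individuals of a knowledge base have no algebraic structure on which a mean could be computed.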

After the assessment of the best concept $E^*$, the individuals are partitioned by split to be routed along the left or right branch. Differently from TDTs, the routine does not decide the branch where the individuals will be sorted according to a concept membership test (instance check); rather, it splits individuals according to the distance w.r.t. the prototypes of positive and negative membership w.r.t. $E^*$, i.e. the medoids of $P$ and $N$. This divide-and-conquer strategy is applied recursively until the instances routed to a node satisfy the stop condition. Note that the number of clusters is not required as an input but depends on the number of paths generated in the growing phase: the algorithm is able to determine it naturally following the data distribution.

Refinement Operator. The proposed approach relies on a downward refinement operator that can generate the concepts to be installed in child-nodes by performing a specialization process on the concept, say $C$, installed in a parent-node:

1. by adding a concept atom (or its complement) as a conjunct: $C' = C \sqcap (\neg)A$;
2. by adding a general existential restriction (or its complement) as a conjunct: $C' = C \sqcap (\neg)\exists R.\top$;
3. by adding a general universal restriction (or its complement) as a conjunct: $C' = C \sqcap (\neg)\forall R.\top$;
4. by replacing a sub-description $C_i$ in the scope of an existential restriction in $C$ with one of its refinements: $\exists R.C_i' \in \rho(\exists R.C_i) \wedge C_i' \in \rho(C_i)$;
5. by replacing a sub-description $C_i$ in the scope of a universal restriction with one of its refinements: $\forall R.C_i' \in \rho(\forall R.C_i) \wedge C_i' \in \rho(C_i)$.

Note that the cases of 4 and 5 are recursive.
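The five cases can be illustrated with a small syntactic sketch in Python. It is an assumption-laden toy, not the operator used in the implementation: concepts are encoded as nested tuples, the signature is passed in as lists of atom and role names, no reasoner is consulted to discard redundant or unsatisfiable candidates, and restrictions nested under a negation are not refined. The names `refine` and `_refine_inside` are hypothetical.

```python
from typing import List, Sequence, Tuple

Concept = Tuple  # ("top",) | ("atom", A) | ("not", C) | ("and", C1, C2)
                 # | ("some", R, C) | ("all", R, C)
TOP: Concept = ("top",)


def refine(c: Concept, atoms: Sequence[str], roles: Sequence[str]) -> List[Concept]:
    """One downward refinement step: all candidate specializations of c."""
    out: List[Concept] = []
    for a in atoms:                                   # case 1
        out.append(("and", c, ("atom", a)))
        out.append(("and", c, ("not", ("atom", a))))
    for r in roles:                                   # cases 2 and 3
        for quantifier in ("some", "all"):
            restriction = (quantifier, r, TOP)
            out.append(("and", c, restriction))
            out.append(("and", c, ("not", restriction)))
    out.extend(_refine_inside(c, atoms, roles))       # cases 4 and 5
    return out


def _refine_inside(c: Concept, atoms: Sequence[str], roles: Sequence[str]) -> List[Concept]:
    """Replace the filler of a restriction occurring in c with its refinements."""
    tag = c[0]
    if tag in ("some", "all"):
        _, role, filler = c
        return [(tag, role, f) for f in refine(filler, atoms, roles)]
    if tag == "and":
        _, left, right = c
        return ([("and", l, right) for l in _refine_inside(left, atoms, roles)]
                + [("and", left, r) for r in _refine_inside(right, atoms, roles)])
    return []


person = ("atom", "Person")
step1 = refine(person, atoms=["Student"], roles=["hasPublication"])
with_pub = ("and", person, ("some", "hasPublication", TOP))
step2 = refine(with_pub, atoms=["Student"], roles=["hasPublication"])
```

The recursion terminates because each recursive call descends into the filler of a restriction, so its depth is bounded by the nesting depth of the concept being refined.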

Prototypes. Despite the common schema of the algorithms employed for growing TCTs and TDTs, the latter are obtained by selecting the best test in terms of information gain maximization [8], while the clustering procedure for TCTs resorts to a distance-based criterion on the individuals in the knowledge base. Specifically, the heuristic adopted for selecting the best concept description that will be installed as a new node can be defined as follows:

$$E^* = \arg\max_{D \in \rho(C)} d\big(p(P), p(N)\big)$$

where $P$ and $N$ are the sub-clusters obtained from splitting $I$ w.r.t. $D$, $d(\cdot, \cdot)$ is a distance measure between individuals and $p(\cdot)$ is a function which maps a set of individuals to its prototype. As previously mentioned, the algorithm computes the medoids of both the sets of positive and negative instances w.r.t. the test.

Distance Measure. The computation of medoids requires a (possibly) language-independent measure for individuals whose definition should capture aspects of their semantics in the context of the knowledge base. However, individuals do not have an algebraic structure that can be exploited directly. In the TCT induction algorithm a language-independent dissimilarity measure [5] has been adopted.

Given a knowledge base $\mathcal{K}$, the idea is to use the behavior of an individual w.r.t. a set of concepts $\mathsf{C} = \{C_1, C_2, \ldots, C_m\}$ that is dubbed context or committee of features. After $\mathsf{C}$ has been chosen, a projection function for each $C_i \in \mathsf{C}$ can be defined as a mapping $\pi_i : \mathrm{Ind}(\mathcal{A}) \to \{0, \frac{1}{2}, 1\}$ such that $\forall a \in \mathrm{Ind}(\mathcal{A})$

$$\pi_i(a) = \begin{cases} 1 & \text{if } \mathcal{K} \models C_i(a) \\ 0 & \text{if } \mathcal{K} \models \neg C_i(a) \\ \frac{1}{2} & \text{otherwise} \end{cases}$$

For the value of $\pi_i$ associated with the uncertain membership case (owing to the OWA) we adopted a uniform prior probability. It can be set more accurately according to the membership to $C_i$, $\pi_i : \mathrm{Ind}(\mathcal{A}) \to [0, 1]$, with $x \in \mathrm{Ind}(\mathcal{A}) \mapsto \pi_i(x) = P[\mathcal{K} \models C_i(x)]$, if such a value can be estimated. For example, for densely populated ontologies this may be estimated as $|r_{\mathcal{A}}(C_i)| / |\mathrm{Ind}(\mathcal{A})|$, where $r_{\mathcal{A}}(\cdot)$ indicates the retrieval of the argument concept w.r.t. $\mathcal{A}$, i.e. the set of individuals in $\mathcal{A}$ that are known to belong to the concept [2]: $r_{\mathcal{A}}(C) = \{a \in \mathrm{Ind}(\mathcal{A}) \mid \mathcal{K} \models C(a)\}$. For a largely populated concept $C$, this may be estimated on the ground of the assertions contained in $\mathcal{A}$, avoiding the bottleneck of reasoning.
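A projection function of this kind is straightforward to express in code. The following Python sketch assumes that the outcome of the instance check is already available as a three-valued flag (`True`, `False`, or `None` for the uncertain case); `projection` and `retrieval_prior` are illustrative names, and the retrieval-based prior is computed from a plain set of asserted members rather than by querying a reasoner.

```python
from typing import Optional, Set


def projection(membership: Optional[bool], prior: float = 0.5) -> float:
    """1 if K |= C_i(a), 0 if K |= not C_i(a), `prior` in the uncertain case."""
    if membership is True:
        return 1.0
    if membership is False:
        return 0.0
    return prior


def retrieval_prior(asserted_members: Set[str], individuals: Set[str]) -> float:
    """Estimate P[K |= C_i(x)] as |r_A(C_i)| / |Ind(A)| from asserted members."""
    return len(asserted_members & individuals) / len(individuals)


# Toy ABox: four individuals, one of them asserted to be an Author.
individuals = {"ann", "bob", "carl", "iswc"}
author_prior = retrieval_prior({"ann"}, individuals)

uniform_value = projection(None)                         # uniform prior
estimated_value = projection(None, prior=author_prior)   # retrieval-based prior
```

With the uniform prior the uncertain case maps to one half, whereas the retrieval-based estimate pulls it towards the observed population of the concept.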

Through the projection function, it is possible to define a family of distance measures $\{d_p^{\mathsf{C}}\}_{p \in \mathbb{N}}$ for individuals as follows: $d_p^{\mathsf{C}} : \mathrm{Ind}(\mathcal{A}) \times \mathrm{Ind}(\mathcal{A}) \to [0, 1]$ such that [...]. In previous papers, e.g. [9], we also used the form: $\sqrt[p]{\sum_i w_i\, |\pi_i(a) - \pi_i(b)|^p}$. The vector of weights $w$ can be set according to the entropy of the considered concepts computed over the individuals occurring in $\mathcal{K}$ [10].

To speed up the algorithm, the projection functions can be pre-computed (once) before the learning phase.
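The effect of pre-computation can be seen in the following Python sketch, which evaluates the weighted form mentioned above for previous papers (the p-th root of the weighted sum of p-th powers of projection differences) over a table of projection vectors computed once. The uniform weights $1/m$ are an assumption made here so that values stay in $[0, 1]$; entropy-based weights are not modelled. The names `distance` and `projections` are illustrative.

```python
from typing import Dict, Optional, Sequence, Tuple


def distance(pa: Sequence[float], pb: Sequence[float],
             weights: Optional[Sequence[float]] = None, p: int = 2) -> float:
    """(sum_i w_i * |pa_i - pb_i|**p) ** (1/p) over precomputed projections."""
    m = len(pa)
    if weights is None:
        weights = [1.0 / m] * m      # assumption: uniform weights summing to 1
    total = sum(w * abs(x - y) ** p for w, x, y in zip(weights, pa, pb))
    return total ** (1.0 / p)


# Projection vectors w.r.t. a hypothetical committee of three concepts,
# computed once before learning and then reused for every distance call.
projections: Dict[str, Tuple[float, float, float]] = {
    "ann": (1.0, 1.0, 0.5),
    "bob": (1.0, 0.0, 0.5),
    "iswc": (0.0, 0.5, 0.5),
}

d_ann_bob = distance(projections["ann"], projections["bob"], p=1)
d_bob_ann = distance(projections["bob"], projections["ann"], p=1)
d_ann_ann = distance(projections["ann"], projections["ann"])
```

Since each distance evaluation only touches the stored vectors, the reasoner is invoked once per individual and committee concept rather than once per pair of individuals.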

Stop Condition. The growth of a TCT can be stopped by resorting to a heuristic that is similar to the one employed for selecting the best concept description. This can be done by introducing a threshold in $[0, 1]$ for the value of $d(\cdot, \cdot)$. If the value is lower than the threshold, the branch growth is stopped.

4.2 Extracting Concepts from TCTs

Alg. 2 reports the sketch of the function for deriving the concept descriptions describing the clusters obtained through a TCT. Given a TCT $T$, essentially function DeriveConcepts traverses the tree structure to collect the concept descriptions that are used as parents of the leaf-nodes. In this phase, it generates a set of concept descriptions $CS$.

Algorithm 2 Routines for deriving concept definitions from a TCT

    function DeriveConcepts(C, T): set of concepts
    input C: concept name
          T: TCT
    begin
      let T = ⟨D, I, T_left, T_right⟩
      if T_left = T_right = null then   {leaf}
        return {C}
      else
        CS_left ← DeriveConcepts(C ⊓ D, T_left)
        CS_right ← DeriveConcepts(C ⊓ ¬D, T_right)
        return (CS_left ∪ CS_right)
    end

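Example 4.2. The set $CS$ that can be obtained from the tree reported in Fig. 1 contains the following concept descriptions (corresponding to the four clusters): $CS = \{D_1, D_2, D_3, D_4\}$, where

$D_1 = \mathsf{Person} \sqcap \exists \mathsf{hasPublication}.\top \sqcap \exists \mathsf{hasPublication}.(\mathsf{SWJ})$
$D_2 = (\mathsf{Person} \sqcap \exists \mathsf{hasPublication}.\top) \sqcap \neg(\mathsf{Person} \sqcap \exists \mathsf{hasPublication}.(\mathsf{SWJ}))$
$D_3 = \mathsf{Person} \sqcap \neg(\mathsf{Person} \sqcap \exists \mathsf{hasPublication}.\top)$
$D_4 = \neg \mathsf{Person}$

Such concepts might also be simplified before being output. □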
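The traversal can be reproduced with a short Python sketch over the tree of Fig. 1. It is a minimal illustration under stated assumptions: concept descriptions are plain strings with ASCII operators (`AND`, `NOT`, `SOME`, `TOP`), the descriptions are not simplified, and the right branch conjoins the inherited concept with the negated node concept, which is what yields the four descriptions of the example. The names `Node`, `derive_concepts`, `wrap` and `conj` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    concept: str = ""                   # D installed in the node (unused in leaves)
    left: Optional["Node"] = None       # positive branch
    right: Optional["Node"] = None      # negative branch


def wrap(d: str) -> str:
    """Parenthesize complex descriptions, leave concept names untouched."""
    return d if d.isidentifier() else f"({d})"


def conj(c: str, d: str) -> str:
    """Conjunction with the inherited concept, dropping a leading TOP."""
    return d if c == "TOP" else f"{c} AND {d}"


def derive_concepts(c: str, t: Node) -> List[str]:
    """Collect one concept description per leaf, from left to right."""
    if t.left is None and t.right is None:
        return [c]
    out: List[str] = []
    if t.left is not None:
        out += derive_concepts(conj(c, wrap(t.concept)), t.left)
    if t.right is not None:
        out += derive_concepts(conj(c, "NOT " + wrap(t.concept)), t.right)
    return out


# The tree of Fig. 1: leaves are C1, C2, C3, C4 from left to right.
swj = Node("Person AND SOME hasPublication.(SWJ)", Node(), Node())
pub = Node("Person AND SOME hasPublication.TOP", swj, Node())
root = Node("Person", pub, Node())

cs = derive_concepts("TOP", root)
for description in cs:
    print(description)
```

The output keeps the redundant conjuncts inherited along each path, so a simplification step would still be needed to obtain the compact forms of the descriptions.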

Note that the internal nodes also contain concept descriptions that account for the individuals routed to such nodes. Hence it is straightforward to extend the extraction procedure in order to produce a list of subsumption axioms between couples of the node concepts, which might be submitted to a knowledge engineer for validation.

5 Conclusions, Ongoing and Future Work

In this work, we have proposed an extension of terminological decision trees [8], which were originally employed for (supervised) concept learning, in order to solve (unsupervised) conceptual clustering problems in the context of the datasets belonging to the Web of Data.

The algorithm essentially adopts a divide-and-conquer strategy that generates concept descriptions to be installed in the inner nodes via a refinement operator and selects the most promising ones to represent clusters of similar individuals using a suitable distance measure.

This preliminary work can be extended along various possible directions (some extensions are already being carried out), which can be listed as follows:
- a comparison to other clustering methods in order to understand the feasibility of the proposed solution;
- new distance measures between the individuals: in this work we adopted a language-independent distance measure between the individuals of a knowledge base. In this perspective, it may be interesting to investigate other distance measures (e.g. less computationally expensive functions);
- new refinement operators: we adopted a refinement operator used for solving supervised learning problems. Further extensions of this work may consider new refinement operators that can be borrowed from other machine learning algorithms, e.g. DL-Learner [15], or other methods devised for the specific method;
- new heuristics: the approach installs concept descriptions so that the overlap between the corresponding sets of positive and negative individuals is minimized. It may be interesting to investigate a different heuristic that quantifies the degree of overlap between the two sets of individuals;
- discovery of disjointness axioms in order to enrich a knowledge base through the induction of terminological cluster trees, and evaluation of the approach;
- strategies to obtain simpler trees: we plan to investigate the effectiveness of post-pruning procedures;
- inducing different kinds of clusters: we focused on crisp clustering in this work. Another possible extension concerns the integration of theories for uncertainty management, such as probabilistic models or the Dempster-Shafer theory, allowing overlapping clusters;
- scalability: another extension is to adopt solutions to make the proposed method scalable.
Some possible solutions range from the implementation of a distributed version of the TCT induction algorithm to the employment of approximate reasoners, in order to cope with the computational costs related to the underlying reasoning services adopted by the algorithm.

References

1. Aggarwal, C.C., Reddy, C.K.: Data Clustering: Algorithms and Applications. Chapman & Hall/CRC, 1st edn. (2013)
2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.): The Description Logic Handbook. Cambridge University Press, 2nd edn. (2007)
3. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers (1981)
4. Blockeel, H., De Raedt, L.: Top-down induction of first-order logical decision trees. Artif. Intell. 101(1-2), 285-297 (1998)
5. d'Amato, C., Fanizzi, N., Esposito, F.: Query Answering and Ontology Population: An Inductive Approach. In: Bechhofer, S., et al. (eds.) Proceedings of ESWC 2008. LNCS, vol. 5021, pp. 288-302. Springer (2008)
6. De Raedt, L., Blockeel, H.: Using logical decision trees for clustering. In: Lavrač, N., Džeroski, S. (eds.) Proceedings of ILP 1997. LNAI, vol. 1297, pp. 133-140. Springer (1997)
7. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39(1), 1-38 (1977)
8. Fanizzi, N., d'Amato, C., Esposito, F.: Induction of Concepts in Web Ontologies through Terminological Decision Trees. In: Balcázar, J., et al. (eds.) Proceedings of ECML/PKDD 2010. LNAI, vol. 6321, pp. 442-457. Springer (2010)
9. Fanizzi, N., d'Amato, C.: A Hierarchical Clustering Method for Semantic Knowledge Bases. In: Apolloni, B., Howlett, R.J., Jain, L.C. (eds.) Proceedings of KES 2007, Part III. LNCS, vol. 4694, pp. 653-660. Springer (2007)
10. Fanizzi, N., d'Amato, C., Esposito, F.: Evolutionary Conceptual Clustering Based on Induced Pseudo-Metrics. Int. J. Semantic Web Inf. Syst. 4(3), 44-67 (2008)
11. Fanizzi, N., Iannone, L., Palmisano, I., Semeraro, G.: Concept Formation in Expressive Description Logics. In: Boulicaut, J., et al. (eds.) Proceedings of ECML 2004. LNAI, vol. 3201, pp. 99-110. Springer (2004)
12. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web, Morgan & Claypool Publishers (2011)
13. Kietz, J.U., Morik, K.: A Polynomial Approach to the Constructive Induction of Structural Knowledge. Mach. Learn. 14, 193-217 (1994)
14. Laskey, K., Costa, P., Kokar, M., Martin, T., Lukasiewicz, T.: Uncertainty Reasoning for the World Wide Web. Tech. rep., URW3 W3C Incubator Group (2008), http://www.w3.org/2005/Incubator/urw3/XGR-urw3-20080331
15. Lehmann, J.: DL-Learner: Learning Concepts in Description Logics. Journal of Machine Learning Research (JMLR) 10, 2639-2642 (2009)
16. MacQueen, J.B.: Some Methods for Classification and Analysis of MultiVariate Observations. In: Le Cam, L.M., Neyman, J. (eds.) Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281-297. University of California Press (1967)
17. Rettinger, A., Lösch, U., Tresp, V., d'Amato, C., Fanizzi, N.: Mining the Semantic Web - Statistical learning for next generation knowledge bases. Data Min. Knowl. Discov. 24(3), 613-662 (2012)
18. Stepp, R.E., Michalski, R.S.: Conceptual Clustering: Inventing Goal Oriented Classifications of Structured Objects. In: Machine Learning: An Artificial Intelligence Approach, Vol. II. Morgan Kaufmann (1986)