<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring Similarity in Ontologies: A new family of measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tahani Alsubait</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uli Sattler</string-name>
          <email>sattlerg@cs.man.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, The University of Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
<p>Similarity measurement is important for numerous applications, be it classical information retrieval, clustering, ontology matching or various other applications. It is also known that similarity measurement is difficult. This can easily be seen by looking at the several attempts that have been made to develop similarity measures, see for example [2, 4]. The problem is also well-founded in psychology, and a number of psychological models of similarity have already been developed, see for example [3]. Rather than adopting a psychological model of similarity as a foundation, we noticed that some existing similarity measures for ontologies are ad-hoc and unprincipled. In addition, there is still a need for similarity measures which are applicable to expressive Description Logics (DLs) (i.e., beyond EL) and which are terminological (i.e., do not require an ABox). To address these requirements, we have developed a new family of similarity measures which are founded on the feature-based psychological model [3]. The individual measures vary in their accuracy/computational cost based on which features they consider. To date, there has been no thorough empirical investigation of similarity measures. This has motivated us to carry out two separate empirical studies. First, we compare the new measures along with some existing measures against a gold standard. Second, we examine the practicality of using the new measures over an independently motivated corpus of ontologies (the BioPortal library) which contains over 300 ontologies. We also examine whether cheap measures can be an approximation of some more computationally expensive measures. In addition, we explore what could possibly go wrong when using a cheap similarity measure.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>We aim at similarity measures for general OWL ontologies, and thus a naive
implementation of this approach would be trivialised because a concept has
infinitely many subsumers. To overcome this, we present refinements of the
similarity function in which we do not count all subsumers but consider subsumers
from a set of (possibly complex) concepts of a concept language L. Let C and
D be concepts, let O be an ontology and let L be a concept language. We set:</p>
      <p>S(C, O, L) = {D ∈ L(Õ) | O ⊨ C ⊑ D}</p>
      <p>Com(C, D, O, L) = S(C, O, L) ∩ S(D, O, L)</p>
      <p>Union(C, D, O, L) = S(C, O, L) ∪ S(D, O, L)</p>
      <p>Sim(C, D, O, L) = |Com(C, D, O, L)| / |Union(C, D, O, L)|</p>
      <p>To design a new measure, it remains to specify the set L. For example:</p>
      <p>AtomicSim(C, D) = Sim(C, D, O, L_Atomic(Õ)), where L_Atomic(Õ) = Õ ∩ N_C;</p>
      <p>SubSim(C, D) = Sim(C, D, O, L_Sub(Õ)), where L_Sub(Õ) = Sub(O);</p>
      <p>GrSim(C, D) = Sim(C, D, O, L_G(Õ)), where L_G(Õ) = {E | E ∈ Sub(O) or E = ∃r.F for some r ∈ Õ ∩ N_R and F ∈ Sub(O)},</p>
      <p>where Õ is the signature of O, N_C is the set of concept names, N_R is the set of role names, and Sub(O) is the set of concept expressions in O. The rationale of SubSim(·) is that it provides similarity measurements that are sensitive to the modeller's focus. To capture more possible subsumers, one can use GrSim(·), for which the grammar can be extended easily.</p>
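      <p>To make the set-based construction above concrete, the following sketch evaluates Sim as the ratio of common to all subsumers. The concept names and subsumer sets are invented for illustration; in practice S(C, O, L) would be obtained from a DL reasoner, not listed by hand.</p>

```python
# Illustrative sketch of Sim(C, D, O, L): the ratio of common to all
# subsumers, taken over a fixed concept language L. The subsumer sets
# below are hand-made stand-ins for S(C, O, L); a real implementation
# would ask a DL reasoner for the entailed subsumers.

def sim(subsumers_c, subsumers_d):
    """Jaccard ratio |Com| / |Union| of two subsumer sets."""
    common = subsumers_c & subsumers_d
    union = subsumers_c | subsumers_d
    return len(common) / len(union) if union else 1.0

# Toy atomic subsumers (as in AtomicSim) for two concepts.
s_cat = {"Cat", "Felid", "Mammal", "Animal"}
s_dog = {"Dog", "Canid", "Mammal", "Animal"}

print(sim(s_cat, s_dog))  # 2 common subsumers out of 6 in total
```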
    </sec>
    <sec id="sec-2">
      <title>Approximations of similarity measures</title>
      <p>Some measures might be practically inefficient due to the large number of
candidate subsumers. For this reason, it would be useful to examine whether
a "cheap" measure can be a good approximation of a more expensive one.</p>
      <p>Definition 1. Given two similarity functions Sim(·), Sim′(·), we say that:</p>
      <p>– Sim′(·) preserves the order of Sim(·) if ∀A₁, B₁, A₂, B₂ ∈ Õ: Sim(A₁, B₁) ≤ Sim(A₂, B₂) ⟹ Sim′(A₁, B₁) ≤ Sim′(A₂, B₂).</p>
      <p>– Sim′(·) approximates Sim(·) from above if ∀A, B ∈ Õ: Sim(A, B) ≤ Sim′(A, B).</p>
      <p>– Sim′(·) approximates Sim(·) from below if ∀A, B ∈ Õ: Sim(A, B) ≥ Sim′(A, B).</p>
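      <p>The three properties of Definition 1 can be checked mechanically for any two measures. The sketch below assumes the measures are plain Python callables over concept-name pairs; the similarity values are invented for illustration.</p>

```python
# Sketch of the three checks in Definition 1. Real measures would wrap
# a DL reasoner; here they are lookups into invented score tables.

def preserves_order(sim, sim2, names):
    # sim2 preserves the order of sim: whenever sim ranks pair p2 at
    # least as similar as p1, sim2 must agree.
    pairs = [(a, b) for a in names for b in names]
    return all(
        sim2(*p2) >= sim2(*p1)
        for p1 in pairs for p2 in pairs
        if sim(*p2) >= sim(*p1)
    )

def approximates_from_above(sim, sim2, names):
    # sim2 never underestimates sim on any pair.
    return all(sim2(a, b) >= sim(a, b) for a in names for b in names)

def approximates_from_below(sim, sim2, names):
    # sim2 never overestimates sim on any pair.
    return all(sim(a, b) >= sim2(a, b) for a in names for b in names)

# Hypothetical similarity values over two concept names.
cheap = {("A", "A"): 1.0, ("A", "B"): 0.4, ("B", "A"): 0.4, ("B", "B"): 1.0}
rich = {("A", "A"): 1.0, ("A", "B"): 0.6, ("B", "A"): 0.6, ("B", "B"): 1.0}
sim = lambda a, b: cheap[(a, b)]
sim2 = lambda a, b: rich[(a, b)]

print(preserves_order(sim, sim2, ["A", "B"]))         # True
print(approximates_from_above(sim, sim2, ["A", "B"]))  # True
```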
      <p>Consider AtomicSim(·) and SubSim(·). The first thing to notice is that the
set of candidate subsumers for the first measure is actually a subset of the set
of candidate subsumers for the second measure (Õ ∩ N_C ⊆ Sub(O)). However,
we should also note that the numbers of entailed subsumers in the two cases
need not be proportionally related. Hence, the above examples of similarity
measures are, theoretically, non-approximations of each other.</p>
    </sec>
    <sec id="sec-3">
      <title>Empirical evaluation</title>
      <p>
        We carry out a comparison of the three measures GrSim(·), SubSim(·)
and AtomicSim(·) against human similarity judgements. We also include two
existing similarity measures in this comparison (Rada [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Wu &amp; Palmer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
We also study in detail the behaviour of our new family of measures in practice.
GrSim(·) is considered the most expensive and most precise measure in this study.
      </p>
      <p>To study the relation between the different measures in practice, we examine
the following properties: order-preservation, approximation from above/below
and correlation (using Pearson's coefficient).</p>
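      <p>Pearson's coefficient over paired scores can be sketched as follows. The two score vectors are hypothetical examples, not data from the studies reported here.</p>

```python
# Sketch of Pearson's correlation coefficient as used to compare a
# measure's scores with human judgements.

def pearson(xs, ys):
    """Pearson's r for two equal-length score vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human = [0.1, 0.3, 0.5, 0.9]      # hypothetical expert ratings
measure = [0.2, 0.35, 0.55, 0.8]  # hypothetical measure output

print(round(pearson(human, measure), 3))  # strong positive correlation
```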
      <sec id="sec-3-1">
        <title>Experimental set-up</title>
        <p>
          Part 1: Comparison against a gold-standard. The similarity of 19 SNOMED CT concept pairs was calculated using the three methods along with the Rada [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and Wu &amp; Palmer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] measures. We compare these similarities to human judgements taken from the Pedersen et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] test set.
        </p>
        <p>Part 2: Cheap vs. expensive measures. A snapshot of BioPortal from
November 2012 was used as a corpus. It contains a total of 293 ontologies.
We excluded 86 ontologies which have only atomic subsumptions since for such
ontologies the behaviour of the considered measures is identical, i.e., we
already know that AtomicSim(·) is good and cheap. Due to the large number
of classes and the difficulty of spotting interesting patterns by eye, we calculated
the pairwise similarity for a sample of concepts from the corpus. The size of the
sample is 1,843 concepts, with a 99% confidence level. To ensure that the sample
encompasses concepts with different characteristics, we picked 14 concepts from
each ontology. The selection was not purely random. Instead, we picked 2 random
concepts and, for each random concept, we picked some neighbouring concepts.</p>
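        <p>A minimal sketch of this sampling scheme, under the assumption that each ontology exposes a neighbours function over its class hierarchy; the ring-shaped toy hierarchy and the `nbrs` lookup below are hypothetical stand-ins.</p>

```python
# Hypothetical sketch of the sampling described above: from each
# ontology, pick 2 random seed concepts and pad each seed with nearby
# concepts until 14 concepts are collected. `neighbours` is a stand-in
# for a real lookup of adjacent classes in the ontology's hierarchy.
import random

def sample_concepts(concepts, neighbours, n_seeds=2, total=14, rng=None):
    rng = rng or random.Random(0)  # fixed seed for a repeatable sketch
    seeds = rng.sample(concepts, n_seeds)
    chosen = list(seeds)
    for seed in seeds:
        for c in neighbours(seed):
            if len(chosen) >= total:
                break
            if c not in chosen:
                chosen.append(c)
    return chosen[:total]

# Toy "ontology": 40 concept names arranged in a ring, each with the
# next 12 names as its neighbourhood.
concepts = [f"C{i}" for i in range(40)]
nbrs = lambda c: [f"C{(int(c[1:]) + d) % 40}" for d in range(1, 13)]

print(len(sample_concepts(concepts, nbrs)))  # 14
```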
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>How good is the expensive measure? Not surprisingly, GrSim and SubSim
had the highest correlation with the experts' similarity (Pearson's correlation
coefficient r = 0.87, p &lt; 0.001). Next comes AtomicSim with r = 0.86.
Finally come Wu &amp; Palmer and then Rada, with r = 0.81 and r = 0.64 respectively.
Figure 1 shows the similarity curves for the 6 measures used in this comparison.
The new measures, along with the Wu &amp; Palmer measure, preserve the order of human
similarity more often than the Rada measure. They mostly underestimated human
similarity, whereas the Rada measure mostly overestimated it.</p>
        <p>Cost of the expensive measure. The average time per ontology taken to
calculate grammar-based similarities was 2.3 minutes (standard deviation
10.6 minutes, median 0.9 seconds) and the maximum time was 93 minutes
for the Neglected Tropical Disease Ontology, a SRIQ ontology with
1237 logical axioms, 252 concepts and 99 object properties. For this ontology,
the cost of AtomicSim(·) was only 15.545 seconds, and that of SubSim(·) 15.549
seconds. 9 out of 196 ontologies took over 1 hour to process. One thing to note
about these ontologies is their high number of logical axioms and object properties.
Clearly, GrSim(·) is far more costly than the other two measures. This is why we
want to know how good or bad a cheaper measure can be.</p>
        <p>How good is a cheap measure? Although we excluded all ontologies
with only atomic subsumptions from the study, in 12% of the ontologies the three
measures were perfectly correlated (r = 1, p &lt; 0.001). These perfect correlations
indicate that, in some cases, the benefit of using an expensive measure is
negligible.</p>
        <p>AtomicSim(·) and SubSim(·) did not preserve the order of GrSim(·) in 80%
and 73% of the ontologies respectively. Also, they were approximations neither
from above nor from below in 72% and 64% of the ontologies respectively.</p>
        <p>Consider the African Traditional Medicine ontology in Figure 2: SubSim(·)
is 100% order-preserving while AtomicSim(·) is only 99% order-preserving.</p>
        <p>Note also the Platynereis Stage Ontology in Figure 3, in which both AtomicSim(·)
and SubSim(·) are 75% order-preserving. However, AtomicSim(·) was 100%
approximating from above while SubSim(·) was 85% approximating from below.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pakhomov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Patwardhan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Chute</surname>
          </string-name>
          .
          <article-title>Measures of semantic similarity and relatedness in the biomedical domain</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ):
          <fpage>288</fpage>
          –
          <lpage>299</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Rada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bicknell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Blettner</surname>
          </string-name>
          .
          <article-title>Development and application of a metric on semantic nets</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          , volume
          <volume>19</volume>
          , pages
          <fpage>17</fpage>
          –
          <lpage>30</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Tversky</surname>
          </string-name>
          .
          <article-title>Features of similarity</article-title>
          .
          <source>Psychological Review</source>
          ,
          <volume>84</volume>
          (
          <issue>4</issue>
          ),
          <year>July 1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <article-title>Verb semantics and lexical selection</article-title>
          .
          <source>In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL 1994)</source>
          , pages
          <fpage>133</fpage>
          –
          <lpage>138</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>