Evaluation of an Ontology Summarization Approach

Ning Li (n.li@open.ac.uk), Enrico Motta (e.motta@open.ac.uk), Mathieu d'Aquin (m.daquin@open.ac.uk), Zdenek Zdrahal (z.zdrahal@open.ac.uk)
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom, MK7 6AA

ABSTRACT
Ontology summarization is a very useful technique for helping users make sense of ontologies quickly. We have developed a summarization approach that linearly combines a number of criteria, drawn from cognitive science, network topology, and lexical statistics, to produce ontology summaries [1]. Our later findings showed that the approach, in its current form, binds the criteria so tightly that it hinders flexible and optimal usage in different scenarios. Motivated by this, the present work offers an objective evaluation of the approach. It is not just a supplement to the subjective evaluation already done; its more important goal is to evaluate the impact of each criterion and to rank the criteria by importance.

1. INTRODUCTION
As the number and size of ontologies increase, along with the complexity of their taxonomies, ontology summarization has been recognized as an important tool for facilitating ontology understanding in support of tasks such as ontology reuse. We developed such an ontology summarization approach [1], called Key Concept Extraction (KCE). It uses a number of criteria drawn from cognitive science, network topology, and lexical statistics to extract key concepts, which are believed to be the most representative of the ontology. Ontology summaries produced in this way have been shown to correlate significantly with those generated by human experts, referred to as "ground truth". This approach has been used as the basis for a novel ontology navigation and visualization tool, called KC-Viz¹, and also to provide a summary view in Cupboard², an online system for sharing and reusing ontologies.

Although the approach to Key Concept Extraction produced good results, the algorithm, in its current form, has limitations, such as time constraints, when used in different scenarios. The only evaluation so far is a subjective one of the final summarization results, which reflect the accumulated effect of all the criteria used in the algorithm. It is therefore not possible to separate the impact of each criterion, or its contribution to making the results as close as possible to the experts' opinions. Hence, there is a need to evaluate each criterion separately in a comparative manner. A closer look into how the criteria relate to ontology features would also be useful. In addition, such an evaluation indicates how to improve the overall performance of KCE by giving optimal weights to each criterion. These weights were only derived empirically in [1], where a comprehensive analysis of the algorithms and their performance was not carried out.

We start with a review of the current algorithm for Key Concept Extraction in Section 2. We then focus on the main contribution of this paper, an objective and comparative evaluation of the criteria, in Section 3. In Section 4, we analyze and discuss the evaluation results, and Section 5 concludes the paper.

¹ http://neon-toolkit.org/wiki/KC-Viz
² http://kmi-web06.open.ac.uk:8081/cupboard/

2. THE KCE ALGORITHMS
In [1], a number of criteria were considered, and a corresponding number of algorithms were developed, to identify the key concepts of an ontology. In particular, the notion of natural category, drawn from cognitive studies, was used to identify concepts that are information-rich in a psycho-linguistic sense. This notion was realized by two operational measures: name simplicity, which favors concepts that are labeled with simple names, such as Vegetation, while penalizing compounds such as ExoticVegetation; and basic level, which measures how "central" a concept is in the taxonomy of the ontology, i.e. how many times it appears in the middle of a path from the root to a leaf of the branch that contains the concept. Two other criteria were drawn from the topology of an ontology: the notion of density highlights concepts that are richly characterized with properties and taxonomic relationships, such as isA or typeof, while the notion of coverage aims to ensure that no important part of the ontology is neglected. Lastly, the notion of popularity, drawn from lexical statistics, was introduced to identify concepts that are commonly used.

The density and popularity criteria were both decomposed into two sub-criteria: global and local density, and global and local popularity, respectively. While the global measures are normalized with respect to all the concepts in the ontology, the local ones consider the relative density or popularity of a concept with respect to its surrounding concepts. The aim is to ensure that "locally significant" concepts get a higher score, even though they may not rank highly with respect to the global measures. Each of these seven criteria produces a score for each concept in the ontology, and the final score assigned to a concept is a weighted sum of the scores produced by the individual criteria.

3. EVALUATION OF KCE ALGORITHMS
Kendall's tau statistic [2] (abbreviated as tau) is often used to measure the agreement between two measured quantities. Specifically, it is a measure of rank correlation, that is, the similarity of the orderings of the data when ranked by each of the quantities. It has been used in the evaluation of text summarization [3] as well as of an RDF-sentence-based ontology summarization [4].

Here, we use tau to find the correlation between the score vector produced by each criterion (one vector per ontology, with length equal to the number of concepts in that ontology) and the human experts' "ground truth" score vector. Eight people with experience in ontology engineering were asked to select up to 20 key concepts for each ontology. The score vector for each criterion is obtained by running the corresponding algorithm, and that of "ground truth" is obtained by counting the experts' votes on each concept and then normalizing the result with respect to the total number of votes cast in the whole ontology. We use the same four ontologies as in [1], namely biosphere, music, financial, and aktors portal, to find the tau scores and their average. Table 1 shows the results. Each entry in this table is the correlation between the criterion score vector and the "ground truth" score vector. An average over all test ontologies is listed in the bottom row.

Table 1. Algorithms and experts agreement measured by tau

                Coverage  Global   Local    Basic  Name        Global      Local
                          Density  Density  Level  Simplicity  Popularity  Popularity
Biosphere       0.140     0.454    0.449    0.388   0.111      0.300        0.091
Financial       0.053     0.539    0.547    0.448   0.464      0.430        0.310
Music           0.272     0.308    0.307    0.367  -0.048      0.085       -0.019
Aktors portal   0.241     0.378    0.355    0.401   0.136      0.114        0.055
Average         0.177     0.420    0.415    0.401   0.166      0.232        0.109

The resulting tau score does not reflect the precise contribution of each criterion; rather, it offers a relative comparison among the criteria. Increasing values imply increasing agreement between the two sets of rankings. If the rankings are completely independent and uncorrelated, the coefficient has value zero on average. In our case, the higher the score, the stronger the correlation between the corresponding criterion's scores and the experts' scores, and hence the greater the agreement between their choices of key concepts.

It must also be emphasized that the scores are most meaningful when considered per ontology. For example, the global popularity score of the financial ontology should not be compared with the global density score of the music ontology, nor even with the global popularity score of the music ontology, because two ontologies may have very different features which, as will be analyzed next, affect the absolute values of the tau score. Only the comparison among different criteria within one ontology indicates the importance of each criterion. That said, if one criterion consistently produces higher scores than the other criteria across all ontologies, it is reasonable to believe that it is a more important criterion. The average scores listed in the bottom row provide such an indication.

4. ANALYSIS AND DISCUSSION
From the results, we can see that the criteria global density, local density, and basic level show consistently high agreement with "ground truth" across all ontologies, with a similar order of rankings, which indicates that human experts also pay attention to the corresponding features of an ontology. The other criteria, coverage, name simplicity, global popularity, and local popularity, are consistently less important, but the order of rankings among them varies slightly across the four ontologies. Though the average score in the bottom row provides the most comprehensive indication of the importance of each criterion, a closer look into those variations can give insight into the impact of each criterion on ontologies with distinctive features.

For example, name simplicity ranks lower than global popularity in the biosphere ontology but higher in the financial ontology. In other words, why is name simplicity less important than global popularity in the biosphere ontology but more important in the financial ontology? Looking at what is typically contained in the biosphere ontology, we find that a majority of the terms are simple names rather than compounds, and that a high percentage of the terms are not popular words. The "ground truth" contains key concepts like Animal, Bird, Fungi, Insect, Mammal, MarineAnimal, etc., all with very popular names and only one of them a compound. With simple names common throughout the ontology but popular names concentrated among the key concepts, the name simplicity criterion has less impact than global popularity in making the summarization results correlate with "ground truth". In the financial ontology, by contrast, a majority of the terms are labeled with popular words, and it is often the case that a simple name serves as the base of many compound names. With "ground truth" concepts such as Bank, Bond, Broker, Capital, Contract, Dealer, Financial_Market, etc. containing only one compound name, it is not surprising that name simplicity has a larger impact than global popularity in making the summarization results correlate with "ground truth".

Though they are of limited value for comparison, the absolute values of the scores of the different criteria are worth looking into. For example, the global popularity scores of both biosphere and financial are fairly high. This in fact reinforces the subjective evaluation in the original work [1]. The initial design of the algorithm did not include the popularity criterion, and the resulting summaries had very low levels of agreement with the "ground truth". When popularity was added to the existing criteria stack, the resulting summaries all improved significantly: the biosphere and financial ontologies improved by 167% and 100% respectively, more than the music and aktors portal ontologies, which improved by 50% and 20% respectively (see [1]). Hence, it is not surprising that the popularity criterion has relatively higher tau scores for biosphere and financial than for the other two ontologies.

5. CONCLUSIONS
This paper provides an objective evaluation of the Key Concept Extraction algorithms used in an ontology summarization approach. The evaluation results provide a basis for judging the importance of each individual criterion. They help to decide which criterion should be prioritized or given more weight when such a decision is required in a particular use case.

6. REFERENCES
[1] Peroni, S., Motta, E., d'Aquin, M. 2008. Identifying Key Concepts in an Ontology Through the Integration of Cognitive Principles with Statistical and Topological Measures. In 3rd Asian Semantic Web Conference, Bangkok, Thailand.
[2] Sheskin, D.J. 1997. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press.
[3] Donaway, R.L., Drummey, K.W., Mather, L.A. 2000. A Comparison of Rankings Produced by Summarization Evaluation Measures. In ANLP/NAACL Workshop on Automatic Summarization, pp 69–78.
[4] Zhang, X., Cheng, G., Qu, Y. 2007. Ontology Summarization Based on RDF Sentence Graph. In 16th International World Wide Web Conference (WWW2007), Banff, Alberta, Canada, May 8-12.
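
Supplementary sketch. The evaluation procedure of Section 3 (normalizing expert votes into a "ground truth" score vector, then computing Kendall's tau against a criterion's score vector) might be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the paper does not say which variant of tau was used, so the tie-corrected tau-b is assumed here because most concepts receive no votes and are therefore tied; the function names, vote counts, and criterion scores are all hypothetical, and only the concept names come from the biosphere example.

```python
from itertools import combinations
from math import sqrt


def ground_truth_scores(votes):
    """Normalize per-concept vote counts by the total number of votes cast."""
    total = sum(votes.values())
    return {concept: count / total for concept, count in votes.items()}


def kendall_tau_b(x, y):
    """Kendall's tau-b (tie-corrected) for two equal-length score lists."""
    if len(x) != len(y):
        raise ValueError("score vectors must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both vectors: excluded from every count
        if dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    # tau is undefined when one vector is constant; signal that with NaN
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical vote counts and criterion scores (not data from the paper).
votes = {"Animal": 6, "Bird": 4, "Fungi": 2, "MarineAnimal": 0}
criterion = {"Animal": 0.9, "Bird": 0.4, "Fungi": 0.6, "MarineAnimal": 0.1}

truth = ground_truth_scores(votes)
concepts = sorted(votes)  # fix one concept order for both vectors
tau = kendall_tau_b([criterion[c] for c in concepts],
                    [truth[c] for c in concepts])
print(f"tau = {tau:.3f}")
```

In practice a library routine such as scipy.stats.kendalltau would normally be preferred; the explicit pair loop above is quadratic in the number of concepts and is shown only to make the counting of concordant, discordant, and tied pairs visible.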