Exploiting Semantic Relatedness Measures for Multi-label Classifier Evaluation

Christophe Deloo∗                         Claudia Hauff
Delft University of Technology            Delft University of Technology
Delft, The Netherlands                    Delft, The Netherlands
c.p.p.deloo@gmail.com                     c.hauff@tudelft.nl

∗ This research was performed while the author was an intern at GridLine.

ABSTRACT

In the multi-label classification setting, documents can be labelled with a number of concepts (instead of just one). Evaluating the performance of classifiers in this scenario is often as simple as measuring the percentage of correctly assigned concepts. Classifiers that do not retrieve a single concept present in the ground truth annotation are then all considered equally poor. However, some classifiers might perform better than others, in particular those that assign concepts which are semantically similar to the ground truth annotation. Exploiting the semantic relatedness between the classifier-assigned and the ground truth concepts thus leads to a more refined evaluation. A number of well-known algorithms compute the semantic relatedness between concepts with the aid of general-world knowledge bases such as WordNet (http://wordnet.princeton.edu/). When the concepts are domain specific, however, such approaches cannot be employed out-of-the-box. Here, we present a study, inspired by a real-world problem, in which we first investigate the performance of well-known semantic relatedness measures on a domain-dependent thesaurus. We then employ the best performing measure to evaluate multi-label classifiers. We show that (i) measures which perform well on WordNet do not reach a comparable performance on our thesaurus, and (ii) an evaluation based on semantic relatedness yields results which are more in line with human ratings than the traditional F-measure.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords: semantic relatedness, classifier evaluation

DIR 2013, April 26, 2013, Delft, The Netherlands.

1. INTRODUCTION

In this paper, we present a two-part study that is inspired by the following real-world problem: Dutch Parliamentary papers (documents from the Dutch House of Representatives, de Tweede Kamer, the lower house of the bicameral parliament of the Netherlands) are to be annotated with concepts from an existing thesaurus, the Parliament thesaurus (described in Section 3). A multi-label classifier framework exists and each document can be automatically annotated with a number of concepts. Currently, the evaluation of the classifier is conducted as follows: the automatically produced annotations are compared to the ground truth (i.e. the concepts assigned by domain experts) and the binary measures of precision and recall are computed. This means that a document labelled with concepts which do not occur in the ground truth receives a precision/recall of zero, even though the assigned concepts may be semantically very similar to the ground truth concepts. As an example, consider Figure 1: the ground truth of the document consists of three concepts {biofuel, environment, renewable energy} and the classifier annotates the document with the concepts {energy source, solar energy}. Binary precision/recall measures evaluate the classifier's performance as zero, although it is evident that the classifier does capture the content of the document, at least partially.
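To make this limitation concrete, the following minimal Python sketch (not part of the original framework; the helper name is ours) computes exact-match precision, recall and F1 for the Figure 1 example:

```python
def exact_match_scores(assigned: set, ground_truth: set):
    """Binary (exact-match) precision, recall and F1 over two concept sets."""
    hits = len(assigned & ground_truth)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Figure 1 example: no literal overlap, so all three scores are zero,
# even though the assigned concepts are semantically close to the ground truth.
assigned = {"energy source", "solar energy"}
ground_truth = {"biofuel", "environment", "renewable energy"}
print(exact_match_scores(assigned, ground_truth))  # (0.0, 0.0, 0.0)
```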
Thus, we are faced with the following research question: can the evaluation of a multi-label classifier be improved by taking the semantic relatedness of concepts into account?

To this end, we present two studies (Figure 1):

1. We investigate established semantic relatedness measures on the Parliament thesaurus. Are measures that perform well on WordNet or Wikipedia also suitable for this domain-specific thesaurus?

2. Given the best performing relatedness measure, we include the semantic relatedness in the evaluation of the multi-label classifier framework and investigate whether such a semantically enhanced evaluation improves over the binary precision/recall based evaluation.

We find that the best performing measures on WordNet do not necessarily perform as well on a different thesaurus, and thus they should be (re-)evaluated when a novel thesaurus is employed. Our user study also shows that a classifier evaluation which takes the semantic relatedness of the ground truth and the classifier-assigned concepts into account yields results that are closer to those of human experts than traditional binary evaluation measures.

[Figure 1: Overview of the two-step process: (1) we first investigate semantic relatedness measures on the Parliament thesaurus. Then, (2) given a document and its assigned ground truth concepts {g1, g2, g3} (by human annotators), we evaluate the quality of the classifier-assigned concepts {c1, c2}. The classifier evaluation takes the semantic relatedness between the concepts into account.]

2. RELATED WORK

In this section, we first discuss semantic relatedness measures and then briefly describe previous work in multi-label classifier evaluation.

Several measures of semantic relatedness using a variety of lexical resources have been proposed in the literature. In most cases, semantic relations between concepts are either inferred from large corpora of text or from lexical structures such as taxonomies and thesauri. The state-of-the-art relatedness measures can be roughly organised into graph-based measures [11, 6, 19, 4, 16], corpus-based measures [17, 10] and hybrid measures [12, 5, 7, 1]; the latter combine information gathered from the corpus and the graph structure. The majority of relatedness measures are graph-based and were originally developed for WordNet. WordNet is a large lexical database for the English language in which concepts (called synsets) are manually organised in a graph-like structure. While WordNet represents a well structured thesaurus, its coverage is limited. Thus, more recently, researchers have turned their attention to Wikipedia, a much larger knowledge base. Semantic relatedness measures originally developed for WordNet have been validated on Wikipedia, and approaches that exploit structural components specific to Wikipedia have been developed as well [14, 18, 3].

With respect to multi-label classifier evaluation, our work builds in particular on Nowak et al. [9]. The authors study the behaviour of different semantic relatedness measures for the evaluation of an image annotation task and quantify the correctness of the classification by using a matching optimisation procedure that determines the lowest cost between the concept sets of the ground truth and of the classifier. We note that, besides semantic relatedness measures, one can also apply hierarchical evaluation measures to determine the performance of multi-label classifiers, as for instance proposed in [15]. We leave the comparison of these two different approaches for future work.

3. METHODOLOGY

Semantic Relatedness in the Parliament Thesaurus.

We first investigate the performance of known semantic relatedness measures on our domain-specific thesaurus (Figure 1, step (1)). The goal of this experiment is to identify the most promising semantic relatedness measure, i.e. the measure that correlates most closely with human judgements of relatedness. In order to evaluate the different measures, we employ an established methodology: we select a number of concept pairs from our thesaurus and ask human annotators to judge the relatedness of the concepts on a 5-point scale (where 1 means unrelated and 5 means strongly related). We consider these judgements as our ground truth and rank the concept pairs according to their semantic relatedness. Then, we also rank the concept pairs according to the scores they achieve under the different semantic relatedness measures. The agreement between the two rankings is evaluated with the rank correlation measure Kendall's Tau (τ) and the linear correlation coefficient (r).
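This agreement computation can be sketched as follows, assuming SciPy is available; the human ratings and measure scores below are hypothetical placeholders, not data from our study:

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical data: average human rating (1-5) and one measure's relatedness
# score for the same concept pairs, listed in the same order.
human_ratings  = [4.9, 4.7, 4.6, 2.1, 1.2, 1.1]
measure_scores = [0.95, 0.80, 0.85, 0.40, 0.10, 0.15]

tau, _ = kendalltau(human_ratings, measure_scores)  # rank correlation (Kendall's Tau)
r, _ = pearsonr(human_ratings, measure_scores)      # linear correlation coefficient
print(f"tau = {tau:.2f}, r = {r:.2f}")
```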
The Parliament thesaurus contains nearly 8,000 Dutch terms oriented towards political themes such as defense, welfare, healthcare, culture and the environment. As is typical for a thesaurus, the concepts are hierarchically structured and the following three types of relations exist: hierarchical (narrower/broader), synonymy and relatedness. Fifty concept pairs were manually selected by the authors, with the goal of including as many different characteristics as possible, that is, concept pairs of varying path lengths, types of relations, etc. The human ratings were obtained in an electronic survey in which Dutch-speaking people were asked to rate the fifty concept pairs on their relatedness. As stated earlier, on the 5-point scale, the higher the assigned rating, the stronger the perceived relatedness.

The following relatedness measures were selected for our experiments: Rada [11], Leacock & Chodorow [6], Resnik [12], Wu & Palmer [19], Jiang & Conrath [5] and Lin [7]. The measures of Rada, Leacock & Chodorow and Wu & Palmer are all graph-based measures based on path lengths. The path length is calculated by summing up the weights of the edges in the path; the weights typically depend on the type of relation, and the stronger the semantic relation, the lower the weight. Two versions of both Rada's and Leacock & Chodorow's approach were implemented: one including only hierarchical and synonymous relations, and one including all three types of thesaurus relations. The weights of the relations were chosen according to their semantic strength: a weight of 1 was assigned to both hierarchical and related concept relations, and a weight of 0 to synonymous concept relations. The remaining three approaches, which are based on the concept of information content, were implemented using the approach of Seco et al. [13].
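A weighted path-length measure of the kind described above can be sketched as follows. The graph class, the bidirectional traversal and the final transformation of path length into a [0, 1] score are illustrative assumptions of ours; Rada et al. use the raw path length itself and Leacock & Chodorow a log-scaled variant.

```python
import heapq
from collections import defaultdict

# Edge weights follow the scheme described above: 0 for synonyms, 1 otherwise.
WEIGHTS = {"broader": 1.0, "narrower": 1.0, "related": 1.0, "synonym": 0.0}

class Thesaurus:
    def __init__(self):
        self.edges = defaultdict(list)  # concept -> list of (neighbour, weight)

    def add_relation(self, a, b, rel_type):
        w = WEIGHTS[rel_type]
        self.edges[a].append((b, w))
        self.edges[b].append((a, w))  # relations are traversed in both directions

    def path_length(self, start, goal):
        """Weighted shortest-path length between two concepts (Dijkstra)."""
        dist = {start: 0.0}
        queue = [(0.0, start)]
        while queue:
            d, node = heapq.heappop(queue)
            if node == goal:
                return d
            if d > dist.get(node, float("inf")):
                continue
            for neighbour, w in self.edges[node]:
                nd = d + w
                if nd < dist.get(neighbour, float("inf")):
                    dist[neighbour] = nd
                    heapq.heappush(queue, (nd, neighbour))
        return float("inf")

def relatedness(th, a, b):
    # Illustrative transformation of a path length into a [0, 1] relatedness score.
    return 1.0 / (1.0 + th.path_length(a, b))

th = Thesaurus()
th.add_relation("solar energy", "renewable energy", "broader")
th.add_relation("renewable energy", "energy source", "broader")
print(relatedness(th, "solar energy", "energy source"))  # path length 2 -> 0.33...
```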
Multi-label Classifier Evaluation.

Having identified the best performing measure of semantic relatedness on the Parliament thesaurus, we then turn to the evaluation of the existing multi-label classifier framework (Figure 1, step (2)). Matching the concepts from the classifier with the ground truth concepts is performed according to a simplified version (which excludes the ontology and annotator agreement) of the procedure presented in [9]. Nowak et al. define a classification evaluation measure that incorporates the notion of semantic relatedness. The algorithm calculates the degree of relatedness between the set C of classifier concepts and the set E of ground truth concepts with an optimisation procedure. This procedure pairs every label of both sets with a label of the other set in a way that maximises relatedness: each label l_c ∈ C is matched with a label l'_e ∈ E and each label l_e ∈ E is matched with a label l'_c ∈ C. The relatedness values of each of those pairs are summed up and divided by the number of labels occurring in both sets. This yields a value in the interval [0, 1]: the higher the value, the more related the sets. Formally:

    \frac{\sum_{l_c \in C} \max_{l'_e \in E} rel(l_c, l'_e) \; + \; \sum_{l_e \in E} \max_{l'_c \in C} rel(l_e, l'_c)}{|C| + |E|}    (1)
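A direct reading of Equation (1) in Python is sketched below; rel stands for whichever pairwise relatedness function is plugged in (Leacock & Chodorow's in our experiments), and the function name is our own.

```python
from typing import Callable, Set

def set_relatedness(C: Set[str], E: Set[str],
                    rel: Callable[[str, str], float]) -> float:
    """Equation (1): every classifier label is matched with its most related
    ground truth label and vice versa; the summed scores are normalised by
    the total number of labels, yielding a value in [0, 1] when rel is."""
    if not C or not E:
        return 0.0
    forward = sum(max(rel(lc, le) for le in E) for lc in C)
    backward = sum(max(rel(le, lc) for lc in C) for le in E)
    return (forward + backward) / (len(C) + len(E))
```

For the Figure 1 example this yields a non-zero score as soon as rel assigns some relatedness to, for instance, solar energy and renewable energy, in contrast to the exact-match F1 of zero.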
To validate this measure we conducted a study with human experts: three expert users, who are familiar with the thesaurus and the documents, were asked to judge, for twenty-five documents, the relatedness between the ground truth concepts and the classifier-assigned concepts (taking the content of the document into account) on a 5-point scale: very poor, poor, average, good and very good. It should be emphasised that our expert users did not create the ground truth concepts (those were created by library experts employed by the Dutch government). The average rating taken over the three individual expert ratings is considered as the ground truth. The expert evaluations are used to compare the performance of the relatedness-based evaluation measure and the performance of a frequently used binary evaluation measure (the F-measure). We hypothesise that the classifier evaluation which takes the semantic relatedness of the concepts into account will correlate to a larger degree with the expert judgements than the traditional binary evaluation measure.

4. EXPERIMENTS & RESULTS

Semantic Relatedness in the Parliament Thesaurus.

Examples of concept pairs that were selected for the annotation study are shown in Table 1; in particular, the three concept pairs yielding the highest and the lowest human annotator relatedness scores respectively are listed.

Table 1: The three concept pairs from our annotation study achieving the highest and the lowest average rating respectively (in Dutch and English).

  Concept pair                                                                Av. rating   Std. Dev.
  Vaticaanstad / paus (Vatican City / pope)                                   4.86         0.25
  energiebedrijven / elektriciteitsbedrijven
    (power companies / electricity companies)                                 4.72         0.43
  rijbewijzen / rijbevoegdheid (driver licenses / qualification to drive)     4.64         0.55
  ...
  boedelscheiding / gentechnologie (division of property / gene technology)   1.20         0.34
  roken / dieren (smoking / animals)                                          1.17         0.29
  makelaars / republiek (brokers / republic)                                   1.16         0.28

The performance of the relatedness measures on the Parliament thesaurus is listed in Table 2. From these results two aspects stand out: (i) the relatively high correlation obtained for Rada's and Leacock & Chodorow's relatedness measures, and (ii) the relatively poor performance of the remaining measures.

Table 2: Overview of the correlations of the relatedness measures with human judgements of relatedness.

  Measure                             r      τ
  Rada (similarity)                   0.43   0.35
  Rada (relatedness)                  0.73   0.55
  Leacock & Chodorow (similarity)     0.49   0.36
  Leacock & Chodorow (relatedness)    0.73   0.55
  Wu & Palmer                         0.39   0.33
  Resnik                              0.45   0.37
  Jiang & Conrath                     0.48   0.41
  Lin                                 0.45   0.39

Traditionally, semantic relatedness measures have been evaluated on WordNet, the most well-known manually created lexical database. Seco et al. [13] evaluated all measures from our selection (except Rada) in a similar way on the WordNet graph, against a test-bed of human judgements provided by Miller & Charles [8]. They reported significantly higher correlations for the selected relatedness measures: their correlation results range from 0.74 (Wu & Palmer) to 0.84 (Jiang & Conrath) and are in line with similar studies on WordNet such as Budanitsky et al. [2]. We conclude that measures which perform best on WordNet do not perform as well on our domain-dependent Parliament thesaurus.
Multi-label Classifier Evaluation.

In Table 3, two examples of assigned classifier concepts vs. ground truth concepts are shown, together with the average ratings obtained from the three expert users. Across all 25 evaluated documents, the mean rating was 3.28, indicating that the classifier framework performs reasonably well at assigning concepts related to the ground truth concepts.

Table 3: Two examples of assigned classifier concepts vs. ground truth concepts and the average of the ratings obtained from the three expert users.

  Classifier concepts                       Ground truth concepts                                           Av. rating
  toelating vreemdelingen, procedures       vreemdelingenrecht, vreemdelingen, werknemers, vluchtelingen    4.67
  kinderbescherming, kindermishandeling     jeugdigen, gezondheidszorg                                      3.67

The results of the second experiment are summarised in Table 4. Here, we employed Leacock & Chodorow's relatedness measure as it was our best performing approach (Table 2). The results indicate that, for the annotated set of twenty-five documents, the relatedness evaluations correlate more strongly with the expert evaluations than the evaluation based on F1: the coefficients report an increase in correlation of at least 0.16 in favour of the relatedness evaluations.

Table 4: Correlations between the expert ratings and the semantically-enhanced and the binary (F1) classifier evaluation respectively.

  Correlation   Semantically-enhanced   F1
  r             0.67                    0.48
  τ             0.53                    0.37

To emphasise the difference, we also present the scatter plots of the semantically-enhanced (Figure 2) and the binary, F1-based, evaluation (Figure 3). In both plots, the corresponding trend line is drawn in red. It is evident that, in the binary case, the number of F1 = 0 entries has a significant impact on the obtained correlation. Note that the dispersion of the relatedness evaluations in Figure 2 is higher at lower expert evaluations than at higher expert evaluations. Whether this observation is to be attributed to noise is impossible to say due to the small size of the evaluation; we will investigate this issue further in future work.

[Figure 2: Expert versus relatedness evaluations.]

[Figure 3: Expert versus binary evaluations.]
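The comparison behind Table 4 can be sketched as follows: for each evaluated document we have an averaged expert rating, a semantically-enhanced score (Equation (1)) and a binary F1 score, and each automatic score is correlated with the expert ratings. All numbers below are placeholders, not the actual study data.

```python
from scipy.stats import kendalltau, pearsonr

# Placeholder per-document scores (the real study uses 25 documents).
expert_ratings  = [4.7, 3.7, 2.0, 4.1, 1.5]        # averaged 5-point expert judgements
semantic_scores = [0.81, 0.58, 0.35, 0.70, 0.22]   # Equation (1) with Leacock & Chodorow
f1_scores       = [0.50, 0.00, 0.00, 0.40, 0.00]   # binary exact-match F1

for name, scores in [("semantic", semantic_scores), ("F1", f1_scores)]:
    r, _ = pearsonr(expert_ratings, scores)
    tau, _ = kendalltau(expert_ratings, scores)
    print(f"{name:>8}: r = {r:.2f}, tau = {tau:.2f}")
```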
5. CONCLUSIONS

In this paper, we have presented a two-step procedure to tackle a real-world problem: the semantically-enhanced evaluation of multi-label classifiers that assign concepts to documents. We first investigated to what extent semantic relatedness measures that perform well on the most commonly used lexical database (WordNet) also perform well on another thesaurus (our domain-specific Parliament thesaurus). To this end, we conducted a user study in which we let approximately 100 users annotate fifty concept pairs drawn from our thesaurus. We found that the results achieved on WordNet need to be considered with care, and that it is indeed necessary to re-evaluate the measures when a different source is used.

In a second step, we exploited the semantic relatedness measure found to perform best in the multi-label classifier evaluation. Again, we investigated the ability of such an evaluation measure to outperform a standard binary measure (F1) by asking expert users to rate, for a small set of documents, the quality of the classifier concepts when compared to the ground truth concepts. Our results showed that an evaluation which includes the semantic relatedness of concepts yields results that are more in line with human raters than an evaluation based on binary decisions.

Besides the issues already raised, in future work we plan to investigate in which graph/content characteristics WordNet differs from our thesaurus and to what extent these differences can be employed to explain the difference in performance of the various semantic relatedness measures.

6. REFERENCES

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In International Joint Conference on Artificial Intelligence, volume 18, pages 805–810, 2003.
[2] A. Budanitsky and G. Hirst. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.
[3] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, 2007.
[4] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, 13:305–332, 1998.
[5] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. 1997.
[6] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283, 1998.
[7] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, volume 1, pages 296–304, 1998.
[8] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
[9] S. Nowak, A. Llorente, E. Motta, and S. Rüger. The effect of semantic relatedness measures on multi-label classification evaluation. In CIVR '10, pages 303–310, 2010.
[10] S. Patwardhan. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness, 2003.
[11] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30, 1989.
[12] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. Pages 448–453, 1995.
[13] N. Seco, T. Veale, and J. Hayes. An intrinsic information content metric for semantic similarity in WordNet. In ECAI, volume 16, page 1089, 2004.
[14] M. Strube and S. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 1419, 2006.
[15] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), pages 521–528. IEEE, 2001.
[16] M. Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In CIKM '93, pages 67–74. ACM, 1993.
[17] A. Tversky. Features of similarity. Psychological Review, 84(4):327, 1977.
[18] I. Witten and D. Milne. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 25–30, 2008.
[19] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In ACL '94, pages 133–138, 1994.