Exploiting Semantic Relatedness Measures for Multi-label Classifier Evaluation

Christophe Deloo∗                         Claudia Hauff
Delft University of Technology            Delft University of Technology
Delft, The Netherlands                    Delft, The Netherlands
c.p.p.deloo@gmail.com                     c.hauff@tudelft.nl

∗ This research was performed while the author was an intern at GridLine.

ABSTRACT

In the multi-label classification setting, documents can be labelled with a number of concepts (instead of just one). Evaluating the performance of classifiers in this scenario is often as simple as measuring the percentage of correctly assigned concepts. Classifiers that do not retrieve a single concept present in the ground truth annotation are then all considered equally poor. However, some classifiers might perform better than others, in particular those that assign concepts which are semantically similar to the ground truth annotation. Exploiting the semantic relatedness between the classifier-assigned and the ground truth concepts thus leads to a more refined evaluation. A number of well-known algorithms compute the semantic relatedness between concepts with the aid of general-world knowledge bases such as WordNet (http://wordnet.princeton.edu/). When the concepts are domain specific, however, such approaches cannot be employed out-of-the-box. Here, we present a study, inspired by a real-world problem, in which we first investigate the performance of well-known semantic relatedness measures on a domain-dependent thesaurus. We then employ the best performing measure to evaluate multi-label classifiers. We show that (i) measures which perform well on WordNet do not reach a comparable performance on our thesaurus, and (ii) an evaluation based on semantic relatedness yields results which are more in line with human ratings than the traditional F-measure.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords: semantic relatedness, classifier evaluation

DIR 2013, April 26, 2013, Delft, The Netherlands.

1. INTRODUCTION

In this paper, we present a two-part study that is inspired by the following real-world problem: Dutch Parliamentary papers (documents from the Dutch House of Representatives, de Tweede Kamer, the lower house of the bicameral parliament of the Netherlands) are to be annotated with concepts from an existing thesaurus, the Parliament thesaurus (described in Section 3). A multi-label classifier framework exists and each document can be automatically annotated with a number of concepts. Currently, the evaluation of the classifier is conducted as follows: the automatically produced annotations are compared to the ground truth (i.e. the concepts assigned by domain experts) and the binary measures of precision and recall are computed. This means that a document labelled with concepts which do not occur in the ground truth receives a precision/recall of zero, even though the assigned concepts may be semantically very similar to the ground truth concepts. As an example, consider Figure 1: the ground truth of the document consists of three concepts {biofuel, environment, renewable energy} and the classifier annotates the document with the concepts {energy source, solar energy}. Binary precision/recall measures evaluate the classifier's performance as zero, although it is evident that the classifier does capture the content of the document, at least partially.
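To make this limitation concrete, the following minimal Python sketch (not part of the original framework; the helper name is ours) computes exact-match precision, recall and F1 for the Figure 1 example:

```python
def exact_match_scores(assigned: set, ground_truth: set):
    """Binary (exact-match) precision, recall and F1 over two concept sets."""
    hits = len(assigned & ground_truth)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Figure 1 example: no literal overlap, so all three scores are zero,
# even though the assigned concepts are semantically close to the ground truth.
assigned = {"energy source", "solar energy"}
ground_truth = {"biofuel", "environment", "renewable energy"}
print(exact_match_scores(assigned, ground_truth))  # (0.0, 0.0, 0.0)
```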
Thus, we are faced with the following research question: can the evaluation of a multi-label classifier be improved by taking the semantic relatedness of concepts into account?

To this end, we present two studies (Figure 1):

1. We investigate established semantic relatedness measures on the Parliament thesaurus. Are measures that perform well on WordNet or Wikipedia also suitable for this domain-specific thesaurus?

2. Given the best performing relatedness measure, we include the semantic relatedness in the evaluation of the multi-label classifier framework and investigate whether such a semantically enhanced evaluation improves over the binary precision/recall based evaluation.

We find that the best performing measures on WordNet do not necessarily perform as well on a different thesaurus, and thus they should be (re-)evaluated when a novel thesaurus is employed. Our user study also shows that a classifier evaluation which takes the semantic relatedness of the ground truth and the classifier-assigned concepts into account yields results that are closer to those of human experts than traditional binary evaluation measures.

[Figure 1: Overview of the two-step process: (1) we first investigate semantic relatedness measures on the Parliament thesaurus. Then, (2) given a document and its assigned ground truth concepts {g1, g2, g3} (by human annotators), we evaluate the quality of the classifier-assigned concepts {c1, c2}. The classifier evaluation takes the semantic relatedness between the concepts into account.]

2. RELATED WORK

In this section, we first discuss semantic relatedness measures and then briefly describe previous work in multi-label classifier evaluation.

Several measures of semantic relatedness using a variety of lexical resources have been proposed in the literature. In most cases, semantic relations between concepts are either inferred from large corpora of text or from lexical structures such as taxonomies and thesauri. The state-of-the-art relatedness measures can be roughly organised into graph-based measures [11, 6, 19, 4, 16], corpus-based measures [17, 10] and hybrid measures [12, 5, 7, 1]; the latter combine information gathered from the corpus and the graph structure. The majority of relatedness measures are graph-based and were originally developed for WordNet. WordNet is a large lexical database for the English language in which concepts (called synsets) are manually organised in a graph-like structure. While WordNet represents a well structured thesaurus, its coverage is limited. Thus, more recently, researchers have turned their attention to Wikipedia, a much larger knowledge base. Semantic relatedness measures originally developed for WordNet have been validated on Wikipedia, and approaches that exploit structural components specific to Wikipedia have been developed as well [14, 18, 3].

With respect to multi-label classifier evaluation, our work builds in particular on Nowak et al. [9]. The authors study the behaviour of different semantic relatedness measures for the evaluation of an image annotation task and quantify the correctness of the classification by using a matching optimisation procedure that determines the lowest cost between the concept sets of the ground truth and of the classifier. We note that, besides semantic relatedness measures, one can also apply hierarchical evaluation measures to determine the performance of multi-label classifiers, as for instance proposed in [15]. We leave the comparison of these two different approaches for future work.

3. METHODOLOGY

Semantic Relatedness in the Parliament Thesaurus.

We first investigate the performance of known semantic relatedness measures on our domain-specific thesaurus (Figure 1, step (1)). The goal of this experiment is to identify the most promising semantic relatedness measure, i.e. the measure that correlates most closely with human judgements of relatedness. In order to evaluate the different measures, we employ an established methodology: we select a number of concept pairs from our thesaurus and ask human annotators to judge the relatedness of the concepts on a 5-point scale (where 1 means unrelated and 5 means strongly related). We consider these judgements as our ground truth and rank the concept pairs according to their semantic relatedness. Then, we also rank the concept pairs according to the scores they achieve under the different semantic relatedness measures. The agreement between the two rankings is evaluated with the rank correlation measure Kendall's Tau (τ) and the linear correlation coefficient (r).
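This agreement computation can be sketched as follows, assuming SciPy is available; the human ratings and measure scores below are hypothetical placeholders, not data from our study:

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical data: average human rating (1-5) and one measure's relatedness
# score for the same concept pairs, listed in the same order.
human_ratings  = [4.9, 4.7, 4.6, 2.1, 1.2, 1.1]
measure_scores = [0.95, 0.80, 0.85, 0.40, 0.10, 0.15]

tau, _ = kendalltau(human_ratings, measure_scores)  # rank correlation (Kendall's Tau)
r, _ = pearsonr(human_ratings, measure_scores)      # linear correlation coefficient
print(f"tau = {tau:.2f}, r = {r:.2f}")
```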
The Parliament thesaurus contains nearly 8,000 Dutch terms oriented towards political themes such as defense, welfare, healthcare, culture and the environment. As is typical for a thesaurus, the concepts are hierarchically structured and the following three types of relations exist: hierarchical (narrower/broader), synonymy and relatedness. Fifty concept pairs were manually selected by the authors, with the goal of including as many different characteristics as possible, that is, concept pairs of varying path lengths, types of relations, etc. The human ratings were obtained in an electronic survey in which Dutch-speaking people were asked to rate the fifty concept pairs on their relatedness. As stated earlier, on the 5-point scale, the higher the assigned rating, the stronger the perceived relatedness.

The following relatedness measures were selected for our experiments: Rada [11], Leacock & Chodorow [6], Resnik [12], Wu & Palmer [19], Jiang & Conrath [5] and Lin [7]. The measures of Rada, Leacock & Chodorow and Wu & Palmer are all graph-based measures based on path lengths. The path length is calculated by summing up the weights of the edges in the path; the weights typically depend on the type of relation, and the stronger the semantic relation, the lower the weight. Two versions of both Rada's and Leacock & Chodorow's approach were implemented: one including only hierarchical and synonymous relations, and one including all three types of thesaurus relations. The weights of the relations were chosen according to their semantic strength: a weight of 1 was assigned to both hierarchical and related concept relations, and a weight of 0 to synonymous concept relations. The remaining three approaches, which are based on the concept of information content, were implemented using the approach of Seco et al. [13].
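A weighted path-length measure of the kind described above can be sketched as follows. The graph class, the bidirectional traversal and the final transformation of path length into a [0, 1] score are illustrative assumptions of ours; Rada et al. use the raw path length itself and Leacock & Chodorow a log-scaled variant.

```python
import heapq
from collections import defaultdict

# Edge weights follow the scheme described above: 0 for synonyms, 1 otherwise.
WEIGHTS = {"broader": 1.0, "narrower": 1.0, "related": 1.0, "synonym": 0.0}

class Thesaurus:
    def __init__(self):
        self.edges = defaultdict(list)  # concept -> list of (neighbour, weight)

    def add_relation(self, a, b, rel_type):
        w = WEIGHTS[rel_type]
        self.edges[a].append((b, w))
        self.edges[b].append((a, w))  # relations are traversed in both directions

    def path_length(self, start, goal):
        """Weighted shortest-path length between two concepts (Dijkstra)."""
        dist = {start: 0.0}
        queue = [(0.0, start)]
        while queue:
            d, node = heapq.heappop(queue)
            if node == goal:
                return d
            if d > dist.get(node, float("inf")):
                continue
            for neighbour, w in self.edges[node]:
                nd = d + w
                if nd < dist.get(neighbour, float("inf")):
                    dist[neighbour] = nd
                    heapq.heappush(queue, (nd, neighbour))
        return float("inf")

def relatedness(th, a, b):
    # Illustrative transformation of a path length into a [0, 1] relatedness score.
    return 1.0 / (1.0 + th.path_length(a, b))

th = Thesaurus()
th.add_relation("solar energy", "renewable energy", "broader")
th.add_relation("renewable energy", "energy source", "broader")
print(relatedness(th, "solar energy", "energy source"))  # path length 2 -> 0.33...
```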
Multi-label Classifier Evaluation.

Having identified the best performing measure of semantic relatedness on the Parliament thesaurus, we then turn to the evaluation of the existing multi-label classifier framework (Figure 1, step (2)). Matching the concepts from the classifier with the ground truth concepts is performed according to a simplified version (which excludes the ontology and annotator agreement) of the procedure presented in [9]. Nowak et al. define a classification evaluation measure that incorporates the notion of semantic relatedness. The algorithm calculates the degree of relatedness between the set C of classifier concepts and the set E of ground truth concepts with an optimisation procedure. This procedure pairs every label of both sets with a label of the other set in a way that maximises relatedness: each label l_c ∈ C is matched with a label l'_e ∈ E and each label l_e ∈ E is matched with a label l'_c ∈ C. The relatedness values of each of those pairs are summed up and divided by the number of labels occurring in both sets. This yields a value in the interval [0, 1]: the higher the value, the more related the sets. Formally:

    \frac{\sum_{l_c \in C} \max_{l'_e \in E} rel(l_c, l'_e) \; + \; \sum_{l_e \in E} \max_{l'_c \in C} rel(l_e, l'_c)}{|C| + |E|}    (1)
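A direct reading of Equation (1) in Python is sketched below; rel stands for whichever pairwise relatedness function is plugged in (Leacock & Chodorow's in our experiments), and the function name is our own.

```python
from typing import Callable, Set

def set_relatedness(C: Set[str], E: Set[str],
                    rel: Callable[[str, str], float]) -> float:
    """Equation (1): every classifier label is matched with its most related
    ground truth label and vice versa; the summed scores are normalised by
    the total number of labels, yielding a value in [0, 1] when rel is."""
    if not C or not E:
        return 0.0
    forward = sum(max(rel(lc, le) for le in E) for lc in C)
    backward = sum(max(rel(le, lc) for lc in C) for le in E)
    return (forward + backward) / (len(C) + len(E))
```

For the Figure 1 example this yields a non-zero score as soon as rel assigns some relatedness to, for instance, solar energy and renewable energy, in contrast to the exact-match F1 of zero.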
To validate this measure we conducted a study with human experts: three expert users, who are familiar with the thesaurus and the documents, were asked to judge, for twenty-five documents, the relatedness between the ground truth concepts and the classifier-assigned concepts (taking the content of the document into account) on a 5-point scale: very poor, poor, average, good and very good. It should be emphasised that our expert users did not create the ground truth concepts (those were created by library experts employed by the Dutch government). The average rating taken over the three individual expert ratings is considered as the ground truth. The expert evaluations are used to compare the performance of the relatedness-based evaluation measure and the performance of a frequently used binary evaluation measure (the F-measure). We hypothesise that the classifier evaluation which takes the semantic relatedness of the concepts into account will correlate to a larger degree with the expert judgements than the traditional binary evaluation measure.

4. EXPERIMENTS & RESULTS

Semantic Relatedness in the Parliament Thesaurus.

Examples of concept pairs that were selected for the annotation study are shown in Table 1; in particular, the three concept pairs yielding the highest and the lowest human annotator relatedness scores respectively are listed.

Table 1: The three concept pairs from our annotation study achieving the highest and the lowest average rating respectively (in Dutch and English).

  Concept pair                                                                Av. rating   Std. Dev.
  Vaticaanstad / paus (Vatican City / pope)                                   4.86         0.25
  energiebedrijven / elektriciteitsbedrijven
    (power companies / electricity companies)                                 4.72         0.43
  rijbewijzen / rijbevoegdheid (driver licenses / qualification to drive)     4.64         0.55
  ...
  boedelscheiding / gentechnologie (division of property / gene technology)   1.20         0.34
  roken / dieren (smoking / animals)                                          1.17         0.29
  makelaars / republiek (brokers / republic)                                   1.16         0.28

The performance of the relatedness measures on the Parliament thesaurus is listed in Table 2. From these results two aspects stand out: (i) the relatively high correlation obtained for Rada's and Leacock & Chodorow's relatedness measures, and (ii) the relatively poor performance of the remaining measures.

Table 2: Overview of the correlations of the relatedness measures with human judgements of relatedness.

  Measure                             r      τ
  Rada (similarity)                   0.43   0.35
  Rada (relatedness)                  0.73   0.55
  Leacock & Chodorow (similarity)     0.49   0.36
  Leacock & Chodorow (relatedness)    0.73   0.55
  Wu & Palmer                         0.39   0.33
  Resnik                              0.45   0.37
  Jiang & Conrath                     0.48   0.41
  Lin                                 0.45   0.39

Traditionally, semantic relatedness measures have been evaluated on WordNet, the most well-known manually created lexical database. Seco et al. [13] evaluated all measures from our selection (except Rada) in a similar way on the WordNet graph, against a test-bed of human judgements provided by Miller & Charles [8]. They reported significantly higher correlations for the selected relatedness measures: their correlation results range from 0.74 (Wu & Palmer) to 0.84 (Jiang & Conrath) and are in line with similar studies on WordNet such as Budanitsky et al. [2]. We conclude that measures which perform best on WordNet do not perform as well on our domain-dependent Parliament thesaurus.
Multi-label Classifier Evaluation.

In Table 3, two examples of assigned classifier concepts vs. ground truth concepts are shown, together with the average ratings obtained from the three expert users. Across all 25 evaluated documents, the mean rating was 3.28, indicating that the classifier framework performs reasonably well at assigning concepts related to the ground truth concepts.

Table 3: Two examples of assigned classifier concepts vs. ground truth concepts and the average of the ratings obtained from the three expert users.

  Classifier concepts                       Ground truth concepts                                           Av. rating
  toelating vreemdelingen, procedures       vreemdelingenrecht, vreemdelingen, werknemers, vluchtelingen    4.67
  kinderbescherming, kindermishandeling     jeugdigen, gezondheidszorg                                      3.67

The results of the second experiment are summarised in Table 4. Here, we employed Leacock & Chodorow's relatedness measure as it was our best performing approach (Table 2). The results indicate that, for the annotated set of twenty-five documents, the relatedness evaluations correlate more strongly with the expert evaluations than the evaluation based on F1: the coefficients report an increase in correlation of at least 0.16 in favour of the relatedness evaluations.

Table 4: Correlations between the expert ratings and the semantically-enhanced and the binary (F1) classifier evaluation respectively.

  Correlation   Semantically-enhanced   F1
  r             0.67                    0.48
  τ             0.53                    0.37

To emphasise the difference, we also present the scatter plots of the semantically-enhanced (Figure 2) and the binary, F1-based, evaluation (Figure 3). In both plots, the corresponding trend line is drawn in red. It is evident that, in the binary case, the number of F1 = 0 entries has a significant impact on the obtained correlation. Note that the dispersion of the relatedness evaluations in Figure 2 is higher at lower expert evaluations than at higher expert evaluations. Whether this observation is to be attributed to noise is impossible to say due to the small size of the evaluation; we will investigate this issue further in future work.

[Figure 2: Expert versus relatedness evaluations.]

[Figure 3: Expert versus binary evaluations.]
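The comparison behind Table 4 can be sketched as follows: for each evaluated document we have an averaged expert rating, a semantically-enhanced score (Equation (1)) and a binary F1 score, and each automatic score is correlated with the expert ratings. All numbers below are placeholders, not the actual study data.

```python
from scipy.stats import kendalltau, pearsonr

# Placeholder per-document scores (the real study uses 25 documents).
expert_ratings  = [4.7, 3.7, 2.0, 4.1, 1.5]        # averaged 5-point expert judgements
semantic_scores = [0.81, 0.58, 0.35, 0.70, 0.22]   # Equation (1) with Leacock & Chodorow
f1_scores       = [0.50, 0.00, 0.00, 0.40, 0.00]   # binary exact-match F1

for name, scores in [("semantic", semantic_scores), ("F1", f1_scores)]:
    r, _ = pearsonr(expert_ratings, scores)
    tau, _ = kendalltau(expert_ratings, scores)
    print(f"{name:>8}: r = {r:.2f}, tau = {tau:.2f}")
```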
5. CONCLUSIONS

In this paper, we have presented a two-step procedure to tackle a real-world problem: the semantically-enhanced evaluation of multi-label classifiers that assign concepts to documents. We first investigated to what extent semantic relatedness measures that perform well on the most commonly used lexical database (WordNet) also perform well on another thesaurus (our domain-specific Parliament thesaurus). To this end, we conducted a user study in which we let approximately 100 users annotate fifty concept pairs drawn from our thesaurus. We found that the results achieved on WordNet need to be considered with care, and that it is indeed necessary to re-evaluate the measures when a different source is used.

In a second step, we exploited the semantic relatedness measure found to perform best in the multi-label classifier evaluation. Again, we investigated the ability of such an evaluation measure to outperform a standard binary measure (F1) by asking expert users to rate, for a small set of documents, the quality of the classifier concepts when compared to the ground truth concepts. Our results showed that an evaluation which includes the semantic relatedness of concepts yields results that are more in line with human raters than an evaluation based on binary decisions.

Besides the issues already raised, in future work we plan to investigate in which graph/content characteristics WordNet differs from our thesaurus and to what extent these differences can be employed to explain the difference in performance of the various semantic relatedness measures.

6. REFERENCES

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In International Joint Conference on Artificial Intelligence, volume 18, pages 805–810, 2003.
[2] A. Budanitsky and G. Hirst. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.
[3] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, 2007.
[4] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, 13:305–332, 1998.
[5] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. 1997.
[6] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283, 1998.
[7] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, volume 1, pages 296–304, 1998.
[8] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
[9] S. Nowak, A. Llorente, E. Motta, and S. Rüger. The effect of semantic relatedness measures on multi-label classification evaluation. In CIVR '10, pages 303–310, 2010.
[10] S. Patwardhan. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness, 2003.
[11] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30, 1989.
[12] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. Pages 448–453, 1995.
[13] N. Seco, T. Veale, and J. Hayes. An intrinsic information content metric for semantic similarity in WordNet. In ECAI, volume 16, page 1089, 2004.
[14] M. Strube and S. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 1419, 2006.
[15] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), pages 521–528. IEEE, 2001.
[16] M. Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In CIKM '93, pages 67–74. ACM, 1993.
[17] A. Tversky. Features of similarity. Psychological Review, 84(4):327, 1977.
[18] I. Witten and D. Milne. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 25–30, 2008.
[19] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In ACL '94, pages 133–138, 1994.