=Paper=
{{Paper
|id=None
|storemode=property
|title=Exploiting Semantic Relatedness Measures for Multi-label Classifier Evaluation
|pdfUrl=https://ceur-ws.org/Vol-986/paper_4.pdf
|volume=Vol-986
|dblpUrl=https://dblp.org/rec/conf/dir/DelooH13
}}
==Exploiting Semantic Relatedness Measures for Multi-label Classifier Evaluation==
Christophe Deloo∗, Delft University of Technology, Delft, The Netherlands, c.p.p.deloo@gmail.com
Claudia Hauff, Delft University of Technology, Delft, The Netherlands, c.hauff@tudelft.nl
ABSTRACT

In the multi-label classification setting, documents can be labelled with a number of concepts (instead of just one). Evaluating the performance of classifiers in this scenario is often as simple as measuring the percentage of correctly assigned concepts. Classifiers that do not retrieve a single concept existing in the ground truth annotation are all considered equally poor. However, some classifiers might perform better than others, in particular those that assign concepts which are semantically similar to the ground truth annotation. Thus, exploiting the semantic relatedness between the classifier-assigned and the ground truth concepts leads to a more refined evaluation. A number of well-known algorithms compute the semantic relatedness between concepts with the aid of general-world knowledge bases such as WordNet¹. When the concepts are domain specific, however, such approaches cannot be employed out-of-the-box. Here, we present a study, inspired by a real-world problem, in which we first investigate the performance of well-known semantic relatedness measures on a domain-dependent thesaurus. We then employ the best performing measure to evaluate multi-label classifiers. We show that (i) measures which perform well on WordNet do not reach a comparable performance on our thesaurus and that (ii) an evaluation based on semantic relatedness yields results which are more in line with human ratings than the traditional F-measure.

Categories and Subject Descriptors: H.3.3 Information Storage and Retrieval: Information Search and Retrieval

Keywords: semantic relatedness, classifier evaluation

1. INTRODUCTION

In this paper, we present a two-part study that is inspired by the following real-world problem: Dutch Parliamentary papers² are to be annotated with concepts from an existing thesaurus³ (the Parliament thesaurus). A multi-label classifier framework exists and each document can be automatically annotated with a number of concepts. Currently, the evaluation of the classifier is conducted as follows: the automatically produced annotations are compared to the ground truth (i.e. the concepts assigned by domain experts) and the binary measures of precision and recall are computed. This means that a document labelled with concepts which do not occur in the ground truth receives a precision/recall of zero, even though the assigned concepts may be semantically very similar to the ground truth concepts. As an example, consider Figure 1: the ground truth of the document consists of three concepts {biofuel, environment, renewable energy} and the classifier annotates the document with the concepts {energy source, solar energy}. Binary precision/recall measures evaluate the classifier's performance as zero, though it is evident that the classifier does indeed capture the content of the document, at least partially.

Thus, we are faced with the following research question: Can the evaluation of a multi-label classifier be improved when taking the semantic relatedness of concepts into account?

To this end, we present two studies (Figure 1):

1. We investigate established semantic relatedness measures on the Parliament thesaurus. Are measures that perform well on WordNet or Wikipedia also suitable for this domain-specific thesaurus?

2. Given the best performing relatedness measure, we include the semantic relatedness in the evaluation of the multi-label classifier framework and investigate if such a semantically enhanced evaluation improves over the binary precision/recall based evaluation.

We find that the best performing measures on WordNet do not necessarily perform as well on a different thesaurus, and thus, they should be (re-)evaluated when a novel thesaurus is employed. Our user study also shows that a classifier evaluation which takes the semantic relatedness of the ground truth and the classifier-assigned concepts into account yields results which are closer to those of human experts than traditional binary evaluation measures.

∗ This research was performed while the author was an intern at GridLine.
¹ http://wordnet.princeton.edu/
² The documents come from the Dutch House of Representatives (de Tweede Kamer), which is the lower house of the bicameral parliament of the Netherlands.
³ For more details see Section 3.
Figure 1: Overview of the two-step process: (1) we first investigate semantic relatedness measures on the Parliament thesaurus. Then, (2) given a document and its assigned ground truth concepts {g1, g2, g3} (by human annotators), we evaluate the quality of the classifier-assigned concepts {c1, c2}. The classifier evaluation takes the semantic relatedness between the concepts into account.

2. RELATED WORK

In this section, we first discuss semantic relatedness measures and then briefly describe previous work in multi-label classifier evaluation.

Several measures of semantic relatedness using a variety of lexical resources have been proposed in the literature. In most cases semantic relations between concepts are either inferred from large corpora of text or from lexical structures such as taxonomies and thesauri. The state-of-the-art relatedness measures can be roughly organised into graph-based measures [11, 6, 19, 4, 16], corpus-based measures [17, 10] and hybrid measures [12, 5, 7, 1]. The latter combine information gathered from the corpus and the graph structure. The majority of relatedness measures are graph-based and were originally developed for WordNet. WordNet is a large lexical database for the English language in which concepts (called synsets) are manually organised in a graph-like structure. While WordNet represents a well structured thesaurus, its coverage is limited. Thus, more recently, researchers have turned their attention to Wikipedia, a much larger knowledge base. Semantic relatedness measures originally developed for WordNet have been validated on Wikipedia. Approaches that exploit structural components specific to Wikipedia have been developed as well [14, 18, 3].

With respect to multi-label classifier evaluation, our work builds in particular on Nowak et al. [9]. The authors study the behaviour of different semantic relatedness measures for the evaluation of an image annotation task and quantify the correctness of the classification by using a matching optimisation procedure that determines the lowest cost between the concept sets of the ground truth and of the classifier. We note that, besides semantic relatedness measures, one can also apply hierarchical evaluation measures to determine the performance of multi-label classifiers, as for instance proposed in [15]. We leave the comparison of these two different approaches for future work.

3. METHODOLOGY

Semantic Relatedness in the Parliament Thesaurus.

We first investigate the performance of known semantic relatedness measures on our domain-specific thesaurus (Figure 1, step (1)). The goal of this experiment is to identify the most promising semantic relatedness measure, i.e. the measure that correlates most closely with human judgements of relatedness. In order to evaluate the different measures, we employ an established methodology: we select a number of concept pairs from our thesaurus and ask human annotators to judge the relatedness of the concepts on a 5-point scale (where 1 means unrelated and 5 means strongly related). We consider these judgements as our ground truth and rank the concept pairs according to their semantic relatedness. Then, we also rank the concept pairs according to the scores they achieve under the different semantic relatedness measures. The agreement between the two rankings is evaluated with the rank correlation measure Kendall's Tau (τ) and the linear correlation coefficient (r).
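As a concrete illustration of this protocol, the following minimal sketch (not part of the original paper) computes both coefficients with SciPy; the human ratings reuse values from Table 1, while the measure scores are made-up placeholders.

```python
from scipy.stats import kendalltau, pearsonr

# Average human ratings (1-5) for six concept pairs (values from Table 1) and
# hypothetical scores that some relatedness measure assigns to the same pairs.
human_ratings  = [4.86, 4.72, 4.64, 1.20, 1.17, 1.16]
measure_scores = [0.95, 0.80, 0.85, 0.30, 0.15, 0.20]   # placeholder values

r, _   = pearsonr(human_ratings, measure_scores)    # linear correlation (r)
tau, _ = kendalltau(human_ratings, measure_scores)  # rank correlation (Kendall's Tau)

print(f"r = {r:.2f}, tau = {tau:.2f}")
```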
The Parliament thesaurus contains nearly 8,000 Dutch terms oriented towards political themes such as defense, welfare, healthcare, culture and environment. As is typical for a thesaurus, the concepts are hierarchically structured and the following three types of relations exist: hierarchical (narrower/broader), synonymy and relatedness. Fifty concept pairs were manually selected by the authors, with the goal to include as many different characteristics as possible, that is, concept pairs of varying path lengths, types of relations, etc. The human ratings were obtained in an electronic survey where Dutch speaking people were asked to rate the fifty concept pairs on their relatedness. As stated earlier, on the 5-point scale, the higher the assigned rating, the stronger the perceived relatedness.

The following relatedness measures were selected for our experiments: Rada [11], Leacock & Chodorow [6], Resnik [12], Wu & Palmer [19], Jiang & Conrath [5] and Lin [7]. The measures of Rada, Leacock & Chodorow and Wu & Palmer are all graph-based measures based on path lengths. The path length is calculated by summing up the weights of the edges in the path. The weights typically depend on the type of relation. The stronger the semantic relation, the lower the weight. Two versions of both Rada's and Leacock & Chodorow's approach were implemented: one including only hierarchical and synonymous relations, and one including all three types of thesaurus relations. The weights of the relations were chosen according to their semantic strength. A weight of 1 was assigned to both hierarchical and related concept relations and a weight of 0 to synonymous concept relations. The remaining three approaches, which are based on the concept of information content, were implemented using the approach of Seco et al. [13].
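For illustration, the sketch below (not from the original paper) computes a Rada-style distance as a weighted shortest path over a toy thesaurus fragment and derives a Leacock & Chodorow-style score from it; the concepts, edges and maximum depth are made-up assumptions, and only the edge weighting follows the scheme described above.

```python
import math
import networkx as nx

# Toy thesaurus fragment (hypothetical): undirected graph with edge weights
# chosen as described above: 1 for hierarchical and related-term edges,
# 0 for synonymy.
thesaurus = nx.Graph()
thesaurus.add_edge("energy source", "renewable energy", weight=1)  # broader/narrower
thesaurus.add_edge("renewable energy", "solar energy", weight=1)   # broader/narrower
thesaurus.add_edge("renewable energy", "biofuel", weight=1)        # broader/narrower
thesaurus.add_edge("biofuel", "biological fuel", weight=0)         # synonymy
thesaurus.add_edge("solar energy", "environment", weight=1)        # related term

def rada_distance(graph: nx.Graph, a: str, b: str) -> float:
    """Rada-style distance: total weight of the cheapest path between two concepts."""
    return nx.shortest_path_length(graph, a, b, weight="weight")

def leacock_chodorow(graph: nx.Graph, a: str, b: str, max_depth: int) -> float:
    """Leacock & Chodorow-style score: -log(path_length / (2 * max_depth)).
    We add 1 to the distance so that synonyms (distance 0) do not yield log(0);
    this detail is our assumption, not taken from the paper."""
    path_length = rada_distance(graph, a, b) + 1
    return -math.log(path_length / (2 * max_depth))

print(rada_distance(thesaurus, "solar energy", "biofuel"))        # 2
print(leacock_chodorow(thesaurus, "solar energy", "biofuel", 5))  # about 1.2
```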
Multi-label Classifier Evaluation.

Having identified the best performing measure of semantic relatedness on the Parliament thesaurus, we then turn to the evaluation of the existing multi-label classifier framework (Figure 1, step (2)). Matching the concepts from the classifier with the ground truth concepts is performed according to a simplified version (which excludes the ontology and annotator agreement) of the procedure presented in [9]. Nowak et al. define a classification evaluation measure that incorporates the notion of semantic relatedness. The algorithm calculates the degree of relatedness between the set C of classifier concepts and the set E of ground truth concepts with an optimisation procedure. This procedure pairs every label of both sets with a label of the other set in a way that maximises relatedness: each label l_c ∈ C is matched with a label l_e' ∈ E and each label l_e ∈ E is matched with a label l_c' ∈ C. The relatedness values of each of those pairs are summed up and divided by the number of labels occurring in both sets. This yields a value in the interval [0, 1]. The higher the value, the more related the sets. Formally:

\[
\frac{\sum_{l_c \in C} \max_{l_{e'} \in E} rel(l_c, l_{e'}) \; + \; \sum_{l_e \in E} \max_{l_{c'} \in C} rel(l_e, l_{c'})}{|C| + |E|} \qquad (1)
\]
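A minimal sketch of Equation (1), added here for illustration; the pairwise function rel is left as a parameter, since the paper plugs in the best performing relatedness measure from the previous step.

```python
def set_relatedness(classifier_labels, ground_truth_labels, rel):
    """Equation (1): average best-match relatedness between two label sets.

    `rel(a, b)` is any pairwise relatedness function with values in [0, 1];
    it is deliberately left as a placeholder parameter here."""
    C, E = list(classifier_labels), list(ground_truth_labels)
    forward = sum(max(rel(lc, le) for le in E) for lc in C)   # each l_c matched to E
    backward = sum(max(rel(le, lc) for lc in C) for le in E)  # each l_e matched to C
    return (forward + backward) / (len(C) + len(E))

# Toy usage with a crude placeholder relatedness function (not from the paper).
toy_rel = lambda a, b: 1.0 if a == b else 0.5
print(set_relatedness({"energy source", "solar energy"},
                      {"biofuel", "environment", "renewable energy"},
                      toy_rel))  # 0.5
```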
To validate this measure we conduct a study with human experts: three expert users, who are familiar with the thesaurus and the documents, were asked to judge for twenty-five documents the relatedness between the ground truth concepts and the classifier-assigned concepts (taking the content of the document into account) on a 5-point scale: very poor, poor, average, good and very good. It should be emphasised that our expert users have not created the ground truth concepts (those were created by library experts employed by the Dutch government). The average rating taken over all three individual expert ratings is considered as the ground truth. The expert evaluations are used to compare the performance of the relatedness evaluation measure and the performance of a frequently used binary evaluation measure (F-measure). We hypothesise that the classifier evaluation which takes the semantic relatedness of the concepts into account will correlate to a larger degree with the expert judgements than the traditional binary evaluation measure.
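For reference, the binary baseline amounts to set-based precision, recall and F1 over exact concept matches; the sketch below (again an added illustration, not the authors' code) uses the Figure 1 example, which receives an F1 of zero because the two concept sets do not overlap.

```python
def binary_f1(classifier_labels: set, ground_truth_labels: set) -> float:
    """Set-based F1: harmonic mean of precision and recall over exact matches."""
    overlap = len(classifier_labels & ground_truth_labels)
    if overlap == 0:
        return 0.0
    precision = overlap / len(classifier_labels)
    recall = overlap / len(ground_truth_labels)
    return 2 * precision * recall / (precision + recall)

print(binary_f1({"energy source", "solar energy"},
                {"biofuel", "environment", "renewable energy"}))  # 0.0
```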
4. EXPERIMENTS & RESULTS

Semantic Relatedness in the Parliament Thesaurus.

Examples of concept pairs that were selected for the annotation study are shown in Table 1; in particular, the three concept pairs yielding the highest and the lowest human annotator relatedness scores respectively are listed.

Concept pairs (Dutch / English) | Av. rating | Std. Dev.
Vaticaanstad, paus / Vatican City, pope | 4.86 | 0.25
energiebedrijven, elektriciteitsbedrijven / power companies, electricity companies | 4.72 | 0.43
rijbewijzen, rijbevoegdheid / driver licenses, qualification to drive | 4.64 | 0.55
... | |
boedelscheiding, gentechnologie / division of property, gene technology | 1.20 | 0.34
roken, dieren / smoking, animals | 1.17 | 0.29
makelaars, republiek / brokers, republic | 1.16 | 0.28

Table 1: Shown are the three concept pairs from our annotation study achieving the highest and the lowest average rating respectively (in Dutch and English).

The performance of the relatedness measures on the Parliament thesaurus is listed in Table 2. From these results two aspects stand out: (i) the relatively high correlation obtained for Rada's and Leacock & Chodorow's relatedness measure, and (ii) the relatively poor performance of the remaining measures.

Measures | r | τ
Rada (similarity) | 0.43 | 0.35
Rada (relatedness) | 0.73 | 0.55
Leacock & Chodorow (similarity) | 0.49 | 0.36
Leacock & Chodorow (relatedness) | 0.73 | 0.55
Wu & Palmer | 0.39 | 0.33
Resnik | 0.45 | 0.37
Jiang & Conrath | 0.48 | 0.41
Lin | 0.45 | 0.39

Table 2: Overview of the correlations of relatedness measures with human judgements of relatedness.

Traditionally, semantic relatedness measures have been evaluated on WordNet, the most well-known manually created lexical database. Seco et al. [13] evaluated all measures from our selection (except Rada) in a similar way on the WordNet graph against a test-bed of human judgements provided by Miller & Charles [8]. They reported significantly higher correlations for the selected relatedness measures: their correlation results range from 0.74 (Wu & Palmer) to 0.84 (Jiang & Conrath) and are in line with similar studies on WordNet such as Budanitsky et al. [2]. We conclude that measures which perform best on WordNet do not perform as well on our domain-dependent Parliament thesaurus.
Multi-label Classifier Evaluation.

In Table 3 two examples of assigned classifier concepts vs. ground truth concepts are shown, together with the average ratings obtained from the three expert users. Across all 25 evaluated documents, the mean rating was 3.28, indicating that the classifier framework performs reasonably well at assigning concepts related to the ground truth concepts.

Classifier | Ground truth | Av. rating
toelating vreemdelingen, procedures, werknemers, vluchtelingen | vreemdelingenrecht, vreemdelingen | 4.67
kinderbescherming, kindermishandeling | jeugdigen, gezondheidszorg | 3.67

Table 3: Two examples of assigned classifier concepts vs. ground truth concepts and the average of the ratings obtained from the three expert users.

The results of the second experiment are summarised in Table 4. Here, we employed Leacock & Chodorow's relatedness as it was our best performing approach (Table 2). The results indicate that, for the annotated set of twenty-five documents, the relatedness evaluations correlate more strongly with the expert evaluations than the evaluation based on F1. The coefficients report an increase in correlation of at least 0.16 in favour of the relatedness evaluations. To emphasise the difference, we also present the scatter plots of the semantically-enhanced (Figure 2) and the binary, F1-based, evaluation (Figure 3). In both plots, the corresponding trend line is drawn in red.

Correlation | Semantically-enhanced | F1
r | 0.67 | 0.48
τ | 0.53 | 0.37

Table 4: Correlations between the expert ratings and the semantically-enhanced and the binary (F1) classifier evaluation respectively.
It is evident that, in the binary case, the number of F1 = 0 entries has a significant impact on the obtained correlation. Note that the dispersion of relatedness evaluations in Figure 2 is higher at lower expert evaluations compared to higher expert evaluations. Whether this observation is to be attributed to noise is impossible to say due to the small size of the evaluation. We will investigate this issue further in future work.

Figure 2: Expert versus relatedness evaluations.

Figure 3: Expert versus binary evaluations.

5. CONCLUSIONS

In this paper, we have presented a two-step procedure to tackle a real-world problem: namely, the semantically-enhanced evaluation of multi-label classifiers that assign concepts to documents. We first investigated to what extent semantic relatedness measures that perform well on the most commonly used lexical database (WordNet) also perform well on another thesaurus (our domain-specific Parliament thesaurus). To this end, we conducted a user study where we let approximately 100 users annotate fifty concept pairs drawn from our thesaurus. We found that the results achieved on WordNet need to be considered with care, and that it is indeed necessary to re-evaluate them when using a different source.

In a second step, we then exploited the semantic relatedness measure we found to perform best in the multi-label classifier evaluation. Again, we investigated the ability of such an evaluation measure to outperform a standard binary measure (F1) by asking expert users to rate, for a small set of documents, the quality of the classifier concepts when compared to the ground truth concepts. Our results showed that an evaluation which includes the semantic relatedness of concepts yields results which are more in line with human raters than an evaluation based on binary decisions.

Besides the issues already raised, in future work we plan to investigate in which graph/content characteristics WordNet differs from our thesaurus and to what extent these different characteristics can be employed to explain the difference in performance of the various semantic relatedness measures.

6. REFERENCES

[1] S. Banerjee and T. Pedersen. Extended gloss overlaps as a measure of semantic relatedness. In International Joint Conference on Artificial Intelligence, volume 18, pages 805–810, 2003.
[2] A. Budanitsky and G. Hirst. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47, 2006.
[3] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, 2007.
[4] G. Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, 13:305–332, 1998.
[5] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. 1997.
[6] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283, 1998.
[7] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, volume 1, pages 296–304, 1998.
[8] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
[9] S. Nowak, A. Llorente, E. Motta, and S. Rüger. The effect of semantic relatedness measures on multi-label classification evaluation. In CIVR '10, pages 303–310, 2010.
[10] S. Patwardhan. Incorporating dictionary and corpus information into a context vector measure of semantic relatedness, 2003.
[11] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17–30, 1989.
[12] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. pages 448–453, 1995.
[13] N. Seco, T. Veale, and J. Hayes. An intrinsic information content metric for semantic similarity in WordNet. In ECAI, volume 16, page 1089, 2004.
[14] M. Strube and S. Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the National Conference on Artificial Intelligence, volume 21, page 1419, 2006.
[15] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), pages 521–528. IEEE, 2001.
[16] M. Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In CIKM '93, pages 67–74. ACM, 1993.
[17] A. Tversky. Features of similarity. Psychological Review, 84(4):327, 1977.
[18] I. Witten and D. Milne. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 25–30, 2008.
[19] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In ACL '94, pages 133–138, 1994.