Discrimination of Word Senses with Hypernyms

Artem Revenko and Victor Mireles

Semantic Web Company, Vienna, Austria
{artem.revenko,victor.mireles-chavez}@semantic-web.com

Abstract. Languages are inherently ambiguous. Four out of five words in English have more than one meaning. Nowadays there is a growing number of small proprietary thesauri used for knowledge management in different applications. In order to enable the usage of these thesauri for automatic text annotation, we introduce a robust method for discriminating word senses using hypernyms. The method uses collocations to induce word senses and to discriminate the thesaural sense from the other senses by utilizing hypernym entries taken from a thesaurus. The main novelty of this work is the usage of hypernyms already at the stage of sense induction. The hypernyms enable us to cast the task to a binary scenario, namely teasing apart thesaural senses from all the rest. The introduced method outperforms the baseline and achieves accuracy above 80%.

Keywords: thesaurus, controlled vocabulary, word sense induction, entity linking, named entity disambiguation

1 Introduction

Information retrieval can successfully provide services like search or recommendation only once different senses in the corpus are distinguished. Studies show that in the everyday usage of English about 80% of words are ambiguous [19]. Even when controlled vocabularies are available, it is often the case that a label representing a concept has “non-technical” senses, and these senses are also present in the corpus at hand. Thus, the task of word sense disambiguation (WSD) is still called for in the presence of controlled vocabularies.

A controlled vocabulary refers here to a finite, well specified set of concepts that typify a specific domain. Each concept has an associated set of labels. A label can be a word in a broad sense, i.e. it may be a word or words habitually used in a (natural) language, an acronym, or even an arbitrary sequence of symbols taken from a chosen alphabet. A controlled vocabulary can thus be used to capture synonymy, by grouping different synonyms as labels of the same concept. Usually we represent a concept by one of its labels that is chosen in advance and is called the preferred label. Since controlled vocabularies express no semantic relationships between concepts, their use in disambiguating senses is limited.

In order to enrich controlled vocabularies with semantic information, the mutually inverse relations of hyponym vs. hypernym are often considered. The hypernym of a concept x is defined as a concept whose meaning includes the meaning of x, so that any concrete entity that is an instance of concept x is also an instance of its hypernym. We call a thesaurus a tuple consisting of a vocabulary and a list of hypernym/hyponym relations between its concepts.

In particular, thesauri encoding these relations via the skos:narrower and skos:broader predicates following the standardized SKOS vocabulary [3] are commonly found in industry [15, p. 169]¹, in part because they can be built with little effort for domain-specific use-cases. This is in contrast to more general knowledge graphs, which can aid in disambiguating senses (often at high computational cost [2]) but are also costly to compile, and thus not widely adopted.

¹ Though strictly speaking the semantics of the broader/narrower relation can be more general than hypernym/hyponym; for example, the meronym relation can also be encoded as broader/narrower.

In order to leverage the knowledge encoded in a thesaurus for tackling WSD, we begin by noting the following:

– Ambiguity is a characteristic of words: concepts are non-ambiguous.
Thus, it is enough to map a word onto a concept in order to disambiguate its sense. Many concepts, however, including those not present in a thesaurus, may have the same label.

Example 1. The STW thesaurus [4] includes a concept with the label Bond (http://zbw.eu/stw/descriptor/12234-1), which is a hyponym of a concept with the label Securities and a hypernym of a concept with the label Asset-backed securities. Yet, in the phrase We invite our top investors to bond while playing golf, the word bond does not refer to the thesaural concept.

– In the case of domain-specific analysis of texts, it is sufficient to determine whether a given word is being used in the sense encoded in the thesaurus, referred to as the thesaural sense, or not. Casting this as a binary task also simplifies it. Consequently, we are not interested in finding all the possible senses and disambiguating them. Rather, we aim at disambiguating correctly only one chosen (thesaural) sense.

1.1 Problem Statement

We present in this work a method that tackles the domain-specific case of WSD. It is noteworthy that this method does not require all possible senses of a word to be contained in the thesaurus. This makes it especially useful in industrial environments, where usually only small thesauri are available.

Specifically, we take as an input a corpus, a thesaurus, and a concept from the thesaurus, one of whose labels is found throughout the corpus. We call this concept the target entity. The problem that we deal with is to distinguish, for each document in the corpus, whether the target entity is used in the thesaural sense or not. Thus, the end result is a partition of the corpus into two disjoint collections: “this” and “other”. The collection “this” contains the documents that feature the target entity in the thesaural sense.

The contributions of this work are the following:

– We introduce a method for Word Sense Induction (WSI) with the usage of hypernyms.
– We introduce a pipelined work-flow to discriminate between thesaural and non-thesaural senses of the target entity, by utilizing its hypernyms.
– We prepare and carry out an experiment that resembles a real world use case: concepts have multiple labels and the corpus is ambiguous.

2 Related Work

The problem of word sense disambiguation has attracted much attention for several decades now. We refer the reader to [11] for a review. Here we would like to mention that one can roughly distinguish three classes of approaches to WSD: supervised, unsupervised, and knowledge-based. The unsupervised methods are usually characterized by:

– No need for external knowledge; therefore, the system is broadly applicable. However, the results depend a lot on the corpus.
– No expected user involvement.

Our method makes use of the knowledge in a thesaurus; in this sense it is knowledge-based. However, it shares many features of the unsupervised methods in that it does not expect any information about the non-thesaural senses (not even their number) and that no user involvement is expected.

Among the large amount of research that has been done on this problem, some work is tightly connected to the approach presented in this paper; we describe it in the subsequent subsections.

2.1 Knowledge-Based Methods a.k.a. Named Entity Linking
The knowledge-based methods [18, 17] are based on large static knowledge graphs (KG) that are computed a priori, usually with great effort. Such a graph can be, for example, WordNet [12, 23]. These graphs are assumed to include all senses of the target word, either explicitly as labels of the nodes or implicitly. A corpus is then analyzed and, depending on it, the KG is traversed (e.g. with a random walk [1]) and the possible senses are thus discovered. It is important to point out that using a KG exclusively, without the aid of the corpus, is known to lead to poor results, especially in domain-specific cases in which the thesaural sense is unlikely to be part of the knowledge base.

In the problem statement we focus on a single target entity. One can naturally extend the method and run it for all the identified named entities to link them to the correct senses in the KG. The method would benefit from the hypernyms contained in the KG. However, our method makes use of the corpus; therefore it may not be appropriate for disambiguating very short strings, like search queries.

2.2 Word Sense Induction

The task of WSI is a preliminary step for tackling WSD. This task consists of finding (inducing) the senses present in unannotated data. In the terms described above, the input to this task is a corpus and a target word. The outcome is an enumeration of all the senses found for the target word. Of the many ways the WSI problem has been approached, it is important to mention two in the context of this work.

The first is the so-called sense embedding [13, 9]. These methods follow the distributional assumption according to which similar words appear in similar contexts [10], and they apply skip-gram models to reduce the dimensionality of related words and assign them to a given word as its ‘semantic signature’. These context embeddings are then clustered to determine the different senses of each word. We should mention that, in our experience, these methods fail in cases where two senses of a word lead to very similar contexts, such as “Americano”, which can refer to either a cocktail or a coffee.

The second common approach to the WSI task, which is of interest here, is to analyze the text and extract from it collocation graphs: graphs whose nodes are words found in the text close to the target word, and whose edges are weighted to reflect the strength of this collocation. This has the advantage that previously unknown senses can be induced and described with the help of collocations. To weight the edges of the graph, conditional probabilities [21], Dice scores [5], or word co-occurrences relative to a reference corpus [8] have been used. Moreover, discrete features derived from the syntactic use of each word (e.g. [16]) may be used. In the current paper we abstain from relying on any language-specific tools such as part-of-speech taggers or sentence parsers in order to stay language independent. We make use of stemmers; however, they only help to improve results and are not essential for the introduced methods. Moreover, stemmers are among the simplest language-specific tools and are widely available for many languages.
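As an illustration of this second approach, the following is a minimal sketch of how a collocation graph with Dice-score edge weights could be computed. It is our own illustrative code under simplifying assumptions (unigram tokens, a fixed co-occurrence window), not the implementation evaluated in this paper; all names are hypothetical.

```python
from collections import Counter

def collocation_graph(documents, window=10):
    """Build a collocation graph with Dice-score edge weights.

    documents: list of token lists (e.g. stemmed, stopwords removed).
    Returns a dict mapping frozenset({a, b}) to the Dice score
    2 * cooc(a, b) / (count(a) + count(b)).
    """
    counts = Counter()  # token frequencies
    cooc = Counter()    # co-occurrence frequencies within the window
    for tokens in documents:
        counts.update(tokens)
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + 1 + window]:
                if a != b:
                    cooc[frozenset((a, b))] += 1
    return {pair: 2.0 * c / sum(counts[t] for t in pair)
            for pair, c in cooc.items()}
```

An edge weight close to 1 then indicates two words that almost never occur apart.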
The next step in the collocation-based graph algorithms is to identify a collection of sets of nodes, each of which corresponds to a sense. Once this is done, WSD can be carried out by clustering the contexts where the target word appears. There have been several approaches to identifying these senses. Graph clustering via various algorithms has been a popular approach (e.g. in [7, 5, 16]). Another approach, built specifically for inducing senses, is HyperLex [21], which defines senses as hubs (highly connected nodes) in the collocation graph along with their immediate neighbors. To identify senses, HyperLex sorts the nodes of the collocation graph by degree. Senses are induced by taking the hubs from the list and removing them one by one, along with their immediate neighbors. In this paper we use a variant of HyperLex introduced in [5] with PageRank scores in place of degrees.

Other works, such as [16], use hypernyms after the senses are already induced. This work, in contrast, takes hypernyms into account already at the stage of graph clustering. That way we guarantee that

– the thesaural meaning of the target word is captured in a single sense and
– the level of granularity of the sense is defined by the structure of the thesaurus.

We note that in the case of collocation graphs, there is no explicit description of what a particular sense is. This can be contrasted with the discussion in the previous section, in which senses are defined as concepts in a thesaurus with known hypernym/hyponym relations; or with the static KG based approach to WSD, in which senses are pre-existing entities in the graph. This lack of interpretability of a sense is overcome in this work by relating at least one of the collocation-derived senses to a concept in the thesaurus.

3 Method

In this section we introduce a method to solve the task introduced in Section 1.1. We rely on the “one sense per document” assumption [22] and on the assumption that the sense of the entity can be unambiguously deduced from its hypernyms or, more generally, a thesaurus. Our method consists of 2 steps:

1. WSI; the outcome is a set of senses with one distinguished thesaural sense.
2. WSD, i.e. classification of each occurrence of the target word into one of the senses.

3.1 WSI with Hypernyms

As already mentioned in the introduction, for the WSI task we use HyperLex [21], implementing it in a similar way to [5]. However, we also introduce hypernyms in the WSI process; therefore we denote the new modification HyperHyperLex. The process is presented in Figure 1.

In the first step we compute the graph of collocations between all the tokens in the text. In this implementation we use only unigrams; however, the usage of n-grams would clearly improve the performance. In Figure 1 this phase is represented as a graph of collocations; “E” stands for the target entity.

In the second step we compute the PageRank of the nodes in the graph. The nodes with the highest PageRank have a thicker border in the second subfigure of Figure 1.

In the third step we introduce the hypernyms into the process and for each node we compute the following measure:

m(n) := PR(n) · ( CO(E, n) + ( Σ_{h ∈ H} CO(h, n) ) / |H| ),

where PR stands for PageRank, H is the set of hypernyms, and CO is the collocation measure. We use a variant of Dice scores [6] as a word-association measure to capture collocations. We sort the nodes with respect to m(n).

In the fourth step we take the first node in the sorted list of nodes and identify it as a hub. Next we build a cluster around the hub. For this purpose we take each node and assess its involvement in the cluster by summing the Dice scores between itself and the hub, the direct neighbors of the hub, and the hypernyms. This sum is then normalized by the sum of the Dice scores between the node and all of its neighbors. Therefore, the involvement of a node is a number between 0 and 1.
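Both node-level quantities can be stated compactly. Below is a minimal sketch, assuming the Dice-score dictionary of the earlier collocation-graph sketch and a precomputed PageRank dictionary (e.g. as returned by networkx.pagerank); all function and parameter names are ours, not from a released implementation.

```python
def hub_measure(node, target, hypernyms, pagerank, dice):
    """The measure m(n): PageRank weighted by the node's collocation
    with the target entity and, on average, with its hypernyms."""
    co = lambda a, b: dice.get(frozenset((a, b)), 0.0)
    avg_hyper = sum(co(h, node) for h in hypernyms) / len(hypernyms)
    return pagerank[node] * (co(target, node) + avg_hyper)


def involvement(node, hub, hypernyms, neighbors, dice):
    """Involvement (between 0 and 1) of `node` in the cluster around `hub`:
    the Dice mass shared with the hub, the hub's direct neighbors, and the
    hypernyms, normalized by the node's total Dice mass."""
    co = lambda a, b: dice.get(frozenset((a, b)), 0.0)
    cluster = {hub} | set(neighbors[hub]) | set(hypernyms)
    shared = sum(co(node, w) for w in cluster if w != node)
    total = sum(co(node, w) for w in neighbors[node])
    return shared / total if total else 0.0
```

Here `neighbors` maps each node to the nodes it collocates with; it can be read off the keys of the Dice dictionary.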
In contrast to the original HyperLex method, in HyperHyperLex the nodes belong to the clusters to some degree. After the first cluster is induced, we return to the sorted list and take the next node. If the node is not yet involved in other clusters (i.e. its involvement does not surpass a certain threshold), we take it as the next hub. When building the cluster around this hub, we also take only nodes whose involvement in other clusters does not surpass the threshold. When this process selects as hub a node that is not in the thesaurus, the only difference in the above procedure is that its hypernyms are not taken into account when defining the involvement of the nodes in its cluster.

After this process, each node has a score denoting its membership in a cluster. This score is used in the next step for classifying the occurrences of the target entity. The score is defined by

s(n) := PR(n) · I(n),

where I stands for the involvement of the node in said cluster.

Fig. 1. HyperHyperLex. E stands for the target entity, br stands for broaders. Thick nodes identify the hubs, the dashed line marks the hypernym-induced sense.

Example 2 (Americano). After calculating the collocation graph and sorting according to PageRank, we get the following ranking of potential hubs: 1. “campari”, 2. “mix”, 3. “espresso”. However, after taking hypernyms into account, the highest m score is obtained by “mix”. Therefore, the hypernym-induced hub is “mix”; “campari” participates in this sense as it is one of the collocations of the hub. Therefore, after resorting, “espresso” gets the highest score and becomes the first non-hypernym-induced hub and the second overall.

In preliminary tests we found that several senses corresponding to the target sense could be induced. However, none of the senses captured the thesaural sense completely, resulting in misclassification. With the help of the hypernyms it has become possible to capture the thesaural sense in a single sense and take into account the decisions of the data architect. In Figure 2 the sense induced with hypernyms is denoted by a solid circle vs. two dashed circles induced without hypernyms. The hypernym-induced sense contains more words than the thesaural sense. Yet, the intersection of the thesaural and hypernym-induced senses is larger than that of the thesaural and non-hypernym-induced senses. This decreases the classification error.

Fig. 2. WSI with and without hypernyms. The gray area indicates the “real” thesaural sense, the dashed lines mark the two senses induced with HyperLex, the solid line marks the sense induced with HyperHyperLex. Though the hypernym-induced sense may capture several general senses, it better corresponds to the thesaural sense.

3.2 WSD

The result of the previous WSI step is a set of key-value dictionaries, where

– each dictionary corresponds to an induced sense,
– the keys are the words that belong to the sense cluster,
– the values are the scores s(n), i.e. the weights of the words with respect to the sense.

In order to disambiguate an occurrence of the target word, we first extract its context. A context is a set of words surrounding the target word, for example, 10 words before and 10 words after the target. Then we compute the sum of the words’ scores in the context with respect to each sense: for each sense we take the corresponding dictionary and sum up the scores of all the words that are found both in the context and in the sense cluster. Finally, we choose the sense with the highest aggregate score.
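A sketch of this classification step, assuming the senses are given as the dictionaries described above (word to s(n) weight); the names are illustrative only.

```python
def classify_occurrence(context, senses):
    """Assign one occurrence of the target word to an induced sense.

    context: iterable of preprocessed words around the occurrence,
             e.g. 10 words before and 10 words after the target.
    senses:  dict mapping a sense id to its {word: score} dictionary.
    Returns the id of the sense with the highest aggregate score.
    """
    totals = {sid: sum(cluster.get(w, 0.0) for w in context)
              for sid, cluster in senses.items()}
    return max(totals, key=totals.get)

# Toy example with a thesaural ("this") and a non-thesaural ("other") sense:
senses = {"this": {"espresso": 0.9, "coffee": 0.7},
          "other": {"campari": 0.8, "vermouth": 0.6}}
print(classify_occurrence(["hot", "espresso", "drink"], senses))  # -> "this"
```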
4 Experiment

We conduct two separate experiments with very similar setups: cocktails and MeSH [14].

4.1 Data

All the data (corpora and thesauri) can be found at github.com/artreven/thesaural_wsi. The two examples correspond quite well to those found in industrial settings in terms of the size and depth of the thesaurus and the quantity, size, and quality of texts.

Thesauri. In the cocktails example the concepts are taken from the “All about cocktails”² thesaurus. The thesaurus contains various cocktail instances and ingredients. We have only used ambiguous cocktail names for this experiment. We use the concept scheme name “cocktail” and the broaders of the target concept as hypernyms. In the MeSH example we use the MeSH³ thesaurus. We use the preferred labels of the broader concepts as hypernyms. As we only use unigrams in our code, we split the compound labels into unigrams and use only nouns in both experiments.

² vocabulary.semantic-web.at/cocktails
³ www.nlm.nih.gov/mesh/

Corpora. We used the Wikilinks dataset [20] to extract the corpora. The dataset contains documents and the links to Wikipedia pages inside the documents. The texts contain mistakes, which makes them particularly suitable for simulating a real world use case. The corpora contain duplicates. The data is used as is, without any cleaning.

First we identify those cocktail names that are ambiguous and then collect the texts that mention those names. We remove all the texts that refer to the disambiguation page at Wikipedia and all the categories containing less than 5 documents. We follow the same procedure for finding ambiguous labels from MeSH.

The preprocessing phase includes removing stopwords, stemming, removing words that appear less than 3 times, removing words that appear in more than half of the documents, and substituting all digits with a special token; a sketch of this phase is given below.

We collect 13 corpora with a total of 1227 texts for cocktails, see Table 1 for more details. We collect 8 corpora with a total of 784 texts for MeSH labels, see Table 3 for more details.
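The preprocessing phase just described could look roughly as follows. This is a hedged sketch: the regular expression, thresholds, and the externally supplied stopword list and stemmer (e.g. NLTK’s SnowballStemmer) stand in for whatever the repository actually uses.

```python
import re
from collections import Counter

def preprocess(documents, stopwords, stem, min_count=3, max_doc_frac=0.5):
    """Tokenize and filter a corpus as described in Section 4.1.

    documents: list of raw text strings; stopwords: a set of words to drop;
    stem: a stemming function, e.g. SnowballStemmer("english").stem.
    """
    token_docs = []
    for text in documents:
        tokens = [stem(t) for t in re.findall(r"\w+", text.lower())
                  if t not in stopwords]
        # Replace digit-only tokens with a special token.
        token_docs.append(["<num>" if t.isdigit() else t for t in tokens])

    counts = Counter(t for doc in token_docs for t in doc)
    doc_freq = Counter(t for doc in token_docs for t in set(doc))
    max_df = max_doc_frac * len(token_docs)

    # Drop rare words and words occurring in more than half the documents.
    keep = lambda t: counts[t] >= min_count and doc_freq[t] <= max_df
    return [[t for t in doc if keep(t)] for doc in token_docs]
```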
4.2 Results

In the experiment we classify the word occurrences into two categories, “this” and “other”. We should note that many of the considered words have more than 2 meanings; all the non-thesaural meanings fall into the category “other”. For the baseline, everything is classified into the most popular category. This baseline is known to be challenging. In practice there is no guarantee that the thesaural sense would be the most popular; therefore this baseline is better than the results one could expect in practice without WSD.

The results for cocktails are presented in Table 1. Observations:

– As can be seen from the results, even a high number of real senses does not prevent the method from showing high accuracy.
– With large corpora the accuracy is high due to better WSI.
– Even if the number of thesaural mentions is very low (“Cosmopolitan”, “B-52”), the accuracy remains high. We assume that the well represented “other” senses can be induced accurately and the use of hypernyms improves the induction of the thesaural sense.
– The worst results are obtained for the corpora where the thesaural sense (cocktail) is the dominant one: Vesper, Margarita, Martini. Since the other sense(s) are underrepresented, there is a risk of inducing several senses that each capture an individual context. In such senses many general-purpose (not sense-specific) collocations may be present; as a result these senses may score higher in general contexts.

The results for MeSH are presented in Table 3. Observations:

– For “Warts” only one category is induced. This result is due to the underrepresented “other” sense (5 documents). This is another risk for underrepresented senses.
– The MeSH corpora contain many duplicates and dirty texts. For example, sometimes the target words can be found as part of link strings or as the name of a button. Cleaning would definitely improve the results. However, even in the case of dirty data the method still yields acceptable results.

The averaged results are presented in Table 2 and in Table 4. In both cases HyperHyperLex outperforms the challenging baseline. The difference is even larger when the individual texts are taken into account (micro average), because the method benefits from having a larger corpus.

Table 1. Results of discriminating cocktail names using hypernyms

Cocktail name   | Number of senses | Number of texts | Cocktail sense texts | Baseline accuracy | HyperLex | HyperHyperLex
Americano       | 2 | 45  | 13  | 0.711 | 0.844 | 0.889
Aviation        | 2 | 27  | 17  | 0.63  | 0.37  | 0.852
B-52            | 2 | 111 | 8   | 0.928 | 0.946 | 0.982
Bellini         | 3 | 95  | 42  | 0.558 | 0.737 | 0.968
Bloody Mary     | 6 | 352 | 109 | 0.69  | 0.827 | 0.935
Cosmopolitan    | 3 | 125 | 10  | 0.92  | 0.928 | 0.952
Grasshopper     | 4 | 39  | 13  | 0.667 | 0.744 | 0.974
Manhattan       | 4 | 262 | 70  | 0.733 | 0.851 | 0.885
Margarita       | 2 | 41  | 36  | 0.878 | 0.878 | 0.561
Martini         | 2 | 38  | 27  | 0.711 | 0.711 | 0.711
Mimosa          | 3 | 45  | 18  | 0.6   | 0.4   | 0.733
Tequila Sunrise | 2 | 16  | 10  | 0.625 | 0.938 | 0.812
Vesper          | 2 | 31  | 24  | 0.774 | 0.839 | 0.677

Table 2. Averaged results for cocktails

Method        | Macro average | Micro average
Baseline      | 0.725 | 0.737
HyperLex      | 0.784 | 0.821
HyperHyperLex | 0.841 | 0.896

5 Conclusion

We have introduced a method to automatically discriminate between thesaural and non-thesaural usage of entities. The method does not require any senses of the target entity to be provided in advance and does not make use of external resources except for the thesaurus, which is considered an input parameter. In the two experiments the new method has outperformed the baseline and has shown an accuracy of about 0.9 and 0.75, respectively. The method is robust and performs well even in the real-world case of dirty data.

Table 3. Results of discriminating MeSH labels using hypernyms

MeSH label | Number of senses | Number of texts | MeSH sense texts | Baseline accuracy | HyperLex | HyperHyperLex
Amnesia    | 2 | 31  | 18 | 0.581 | 0.806 | 0.806
Dengue     | 2 | 84  | 59 | 0.702 | 0.298 | 0.583
Warts      | 2 | 14  | 9  | 0.643 | 0.643 | 0.643
Delirium   | 2 | 135 | 98 | 0.726 | 0.711 | 0.556
Iris       | 5 | 150 | 29 | 0.807 | 0.433 | 0.720
Kuru       | 2 | 69  | 35 | 0.507 | 0.696 | 0.957
Amygdala   | 2 | 64  | 42 | 0.656 | 0.578 | 0.656
Vertigo    | 5 | 237 | 43 | 0.819 | 0.768 | 0.865

Table 4. Averaged results for MeSH

Method        | Macro average | Micro average
Baseline      | 0.68  | 0.735
HyperLex      | 0.617 | 0.621
HyperHyperLex | 0.723 | 0.739

Remarks:

– We performed WSI using HyperLex for comparison. The results for some words are comparable or even better; however, for other words the accuracy drops significantly. In most cases this was due to the effect shown in Figure 2, namely, several induced senses contained a mixture of the thesaural and other senses.
– The method would benefit from using bi-grams. Indeed, all the texts contain many significant compound named entities.
– The method would benefit from using additional NLP features such as parts of speech or dependencies. This would be especially useful when working with small corpora.

Acknowledgements. We would like to thank Noam Ordan for thoughtful comments, careful proofreading, and valuable suggestions. The work is partially supported by the LDL4HELTA (ldl4.com) project (EUREKA funding program, project ID 9898).

References
1. Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–84, 2014.
2. Eneko Agirre and Aitor Soroa. Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09, pages 33–41, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.
3. Sean Bechhofer and Alistair Miles. SKOS Simple Knowledge Organization System reference. W3C recommendation, W3C, 2009.
4. Timo Borst and Joachim Neubert. Case study: Publishing STW Thesaurus for Economics as linked open data. W3C Semantic Web Use Cases and Case Studies, 2009.
5. Antonio Di Marco and Roberto Navigli. Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3):709–754, 2013.
6. Lee R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
7. Beate Dorow and Dominic Widdows. Discovering corpus-specific word senses. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics - Volume 2, pages 79–82. Association for Computational Linguistics, 2003.
8. Ioannis P. Klapaftis and Suresh Manandhar. Word sense induction using graphs of collocations. In ECAI, pages 298–302, 2008.
9. Jiwei Li and Dan Jurafsky. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070, 2015.
10. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
11. Roberto Navigli. A quick tour of word sense disambiguation, induction and related approaches. SOFSEM 2012: Theory and Practice of Computer Science, pages 115–129, 2012.
12. Roberto Navigli, Stefano Faralli, Aitor Soroa, Oier de Lacalle, and Eneko Agirre. Two birds with one stone: learning semantic models for text categorization and word sense disambiguation. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 2317–2320. ACM, 2011.
13. Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654, 2015.
14. S. J. Nelson. Medical terminologies that work: The example of MeSH. In 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks, pages 380–384, Dec 2009.
15. Julia Neuschmid, Sabrina Kirrane, Elmar Kiesling, and Thomas Thurner. Propelling the Potential of Enterprise Linked Data in Austria. monochrom, 2016.
16. Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. Unsupervised does not mean uninterpretable: the case for word sense induction and disambiguation. Association for Computational Linguistics, 2017.
17. Delip Rao, Paul McNamee, and Mark Dredze. Entity linking: Finding extracted entities in a knowledge base. In Multi-source, Multilingual Information Extraction and Summarization, pages 93–115. Springer, 2013.
18. Lev Ratinov, Dan Roth, Doug Downey, and Mike Anderson. Local and global algorithms for disambiguation to Wikipedia. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1375–1384. Association for Computational Linguistics, 2011.
19. Jennifer Rodd, Gareth Gaskell, and William Marslen-Wilson. Making sense of semantic ambiguity: Semantic competition in lexical access. Journal of Memory and Language, 46(2):245–266, 2002.
20. Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015, University of Massachusetts, Amherst, 2012.
21. Jean Véronis. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252, 2004.
22. David Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196. Association for Computational Linguistics, 1995.
23. Zhi Zhong and Hwee Tou Ng. It Makes Sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83. Association for Computational Linguistics, 2010.