<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>CoKE: Word Sense Induction Using Contextualized Knowledge Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sanjana Ramprasad</string-name>
          <email>sanjana.ramprasad@hiremya.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Maddox</string-name>
          <email>james.maddox@hiremya.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Copyright held by the author(s). In A. Martin, K. Hinkelmann, A.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Gerber</institution>
          ,
          <addr-line>D. Lenat, F. van Harmelen, P. Clark (Eds.)</addr-line>
          ,
          <institution>Proceedings of, the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford, University</institution>
          ,
          <addr-line>Palo Alto, California, USA, March 25-27, 2019.</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mya Systems</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Word embeddings can capture lexico-semantic information but remain flawed in their inability to assign unique representations to different senses of polysemous words. They also fail to include information from well-curated semantic lexicons and dictionaries. Previous approaches that obtain ontologically grounded word-sense representations learn embeddings that are superior at capturing contextual similarity but are outperformed on several word relatedness tasks by single-prototype word embeddings. In this work, we introduce a new approach that can induce polysemy in any pre-defined embedding space by jointly grounding contextualized sense representations learned from sense-tagged corpora and word embeddings to a knowledge base. The advantage of this method is that it integrates ontological information while readily inducing polysemy in pre-defined embedding spaces without the need for re-training. We evaluate our vectors on several word similarity and relatedness tasks, along with two extrinsic tasks, and find that they consistently outperform the current state of the art.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Distributed representations of words
        <xref ref-type="bibr" rid="ref25 ref27">(Mikolov et al. 2013b)</xref>
        have proven successful in addressing various drawbacks of symbolic representations, which treat words as atomic units of meaning. By grouping similar words and capturing analogical and lexical relationships, they are a popular choice in several downstream NLP applications.
      </p>
      <p>
        While these embeddings capture meaningful lexical relationships, they come with their own set of drawbacks. For instance, complete reliance on natural language corpora amplifies the vocabulary bias inherent in datasets. Vocabulary bias is caused by words not seen in the training corpora, and it also extends to bias in word usage, where some words, often morphologically complex ones, are used less frequently than other words or phrases with the same meaning. Embeddings thus suffer from inaccurate modeling of less frequent words, which is evident in the relatively lower performance of word embeddings on the rare word similarity task
        <xref ref-type="bibr" rid="ref21 ref22 ref38">(Luong, Socher, and Manning 2013b)</xref>
        .
        <xref ref-type="bibr" rid="ref13">(Bojanowski et al. 2016a)</xref>
        propose using character n-gram representations to address the problem of out-of-vocabulary and rare words.
        <xref ref-type="bibr" rid="ref12">(Faruqui et al. 2014)</xref>
        also proposed retrofitting vectors to an ontology to deal with the inaccurate modeling of less frequent words. However, these methods do not account for polysemy.
      </p>
      <p>Polysemy is an important feature of language which
causes words to have a different meaning or “sense” based
on the context in which they occur. For instance, the word
bank can refer to a financial institution or land on either
side of a river. A large body of work has gone into
developing word sense disambiguation systems to identify the
correct sense of a word based on its context. Word embeddings,
on the other hand, assign a single vector representation to
a word type, irrespective of polysemy. The availability of
disambiguation systems coupled with the growing reliance
of NLP systems on distributional semantics has led to an
increasing interest in obtaining powerful sense
representations.</p>
      <p>
        Previous work on learning sense representations includes unsupervised techniques that cluster contexts to learn multi-prototype vectors (<xref ref-type="bibr" rid="ref34">(Reisinger and Mooney 2010)</xref>, (Huang et al. 2012), and <xref ref-type="bibr" rid="ref11 ref35 ref41 ref45">(Wu and Giles 2015)</xref>). A common drawback of the cluster-based approach is the difficulty of deciding the number of clusters a priori. <xref ref-type="bibr" rid="ref31">(Neelakantan et al. 2015)</xref>, <xref ref-type="bibr" rid="ref42">(Tian et al. 2014)</xref>, and <xref ref-type="bibr" rid="ref11 ref35 ref41 ref45">(Cheng and Kartsaklis 2015)</xref> also learn multiple embeddings per word by modifying the Skip-Gram model. These approaches yield sense representations with limited interpretability, which makes them challenging to use in downstream tasks. To remedy this, <xref ref-type="bibr" rid="ref11 ref19 ref35 ref41 ref45">(Iacobacci, Pilehvar, and Navigli 2015)</xref> and <xref ref-type="bibr" rid="ref10">(Chen, Liu, and Sun 2014)</xref> use sense-tagged corpora and Word2Vec modifications to obtain sense representations; however, they make use of distributional semantics alone.
      </p>
      <p>
        Previous work combining distributional semantics and knowledge bases includes <xref ref-type="bibr" rid="ref11 ref20 ref35 ref41 ref45">(Jauhar, Dyer, and Hovy 2015)</xref> and <xref ref-type="bibr" rid="ref11 ref35 ref41 ref45">(Rothe and Schütze 2015)</xref>, which ground word embeddings in ontologies to obtain sense representations. As a result of grounding, these techniques drastically improve performance on several similarity tasks, but an observed pattern is that this comes at the cost of performance on word relatedness tasks (<xref ref-type="bibr" rid="ref12">(Faruqui et al. 2014)</xref>, <xref ref-type="bibr" rid="ref11 ref20 ref35 ref41 ref45">(Jauhar, Dyer, and Hovy 2015)</xref>).
      </p>
      <p>
        In this work, we present a novel approach that uses knowledge bases and sense representations to directly induce polysemy in any pre-defined word embedding space. Our approach leads to interpretable, ontologically grounded sense representations that can easily be used with powerful disambiguation systems. The main contributions of this paper are: a) obtaining ontologically grounded sense representations that perform well on both similarity and relatedness tasks; b) automatic sense induction and integration of knowledge base information into any pre-defined embedding space without re-training; c) embeddings that show performance benefits when used with transfer learning methods such as CoVe <xref ref-type="bibr" rid="ref23">(McCann et al. 2017)</xref> and ELMo <xref ref-type="bibr" rid="ref33">(Peters et al. 2018)</xref> on extrinsic tasks; and d) methodologies for knowledge base augmentation, along with an approach to learn more effective sense representations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>Our approach thus relies on: a) sense-tagged corpora, used to obtain contextualized sense representations whose objective is to capture sense relations and interactions in naturally occurring text; these representations are interpretable, have lexical mappings to a knowledge base, and are what we use to induce polysemy in word embedding spaces; b) pre-trained word embeddings, which capture beneficial lexical relationships on account of being trained on large amounts of data; sense representations do not adequately capture these relationships due to the limited size of the sense-tagged corpora used to train them; and c) a knowledge base, used to jointly ground word and sense representations, to account for the vocabulary bias in corpora that causes similar-meaning words to lie farther apart in embedding spaces.</p>
      <p>We thus describe our approach in three parts: a) Lexicon Building, b) Sense-Form Representations, and c) Multi Word-Sense Representations.</p>
      <sec id="sec-2-2">
        <title>a) Lexicon Building</title>
        <p>
          For our knowledge base, we rely on WordNet
          <xref ref-type="bibr" rid="ref29">(Miller 1995)</xref>
          and a thesaurus (https://www.thesaurus.com/). WordNet (WN) is a large lexical database that groups synonyms into synsets and records relations between them in the form of synonyms, hypernyms, and hyponyms. The synsets are highly interpretable since they come with a gloss along with examples. A thesaurus, on the other hand, groups words into clusters based on similarity of meaning.
        </p>
        <p>Thesaurus Inclusion The structure of WordNet (WN) is such that it labels semantic relations among different synsets. While this structure helps determine the degree of similarity between synsets, it leads to a restricted set of synonyms representing each synset. To best combine information from both resources, we augment the synonyms in a WordNet synset using a thesaurus.</p>
        <p>Unlike WordNet, the thesaurus does not have distinct labels for senses. Senses are instead represented by groups of words: given a query word, the thesaurus returns clusters of words, where each cluster represents some sense. Given a WN synset (s), we use the synset’s headword to query the thesaurus and use a simple algorithm to map the most appropriate cluster to the corresponding WN synset by computing each cluster’s probability with respect to (s).</p>
        <p>Probabilities are assigned based on the words in a cluster and the WN structure: if a thesaurus cluster has more words that are “closer”, based on WN structure, to the synset (s), it receives a higher probability. To measure “closeness”, we use the path-similarity metric (p) of WN, which measures the similarity between two synsets by considering the distance between them. It ranges from 0 to 1, with scores towards 1 denoting “closer” synsets. Since path-similarity is calculated between two synsets, given a word (w) in a thesaurus cluster queried using the headword of the WN synset (s), we find the distance-based similarity $d_{w,s}$ between s and w by first obtaining all of the synsets $S_w$ of w in WN and using them to calculate $d_{w,s}$ as follows:</p>
        <p>$d_{w,s} = \max\{p(s, s_i)\ \forall s_i \in S_w\}$</p>
        <p>If a word is not found in WN, we set $d_{w,s}$ to 0.1, the lowest distance-based similarity, implying it is “farthest” from the synset (s) in WN.</p>
        <p>To account for varying cluster sizes in the thesaurus, and to prevent larger clusters from invariably having bigger scores, we divide the words in each cluster (c) into ten discrete bins based on each word’s d score. The bins cover incremental ranges of 0.1 ([0-0.1, 0.11-0.2, ..., 0.91-1.0]), with the highest-scoring bin being 1. We then obtain the cluster score as:</p>
        <p>$\mathrm{score}_{cluster} = \sum_{bin \in bins} \frac{w_{bin}}{\mathrm{count}(bin)}$</p>
        <p>where $w_{bin}$ is the weight of the bin and $\mathrm{count}(bin)$ is the number of the cluster’s words falling in it. We then get the probability of a cluster, $p_{cluster}$, by passing $\mathrm{score}_{cluster}$ through a sigmoid function:</p>
        <p>$p_{cluster} = \frac{\exp(\mathrm{score}_{cluster})}{\exp(\mathrm{score}_{cluster}) + 1}$</p>
        <p>The words in the thesaurus cluster with the highest probability are then picked and augmented into the synonym list of the respective WN synset (s). We outline the procedure in Algorithm 1.</p>
        <sec id="sec-2-2-1">
          <title>Algorithm 1 Thesaurus Inclusion</title>
          <p>Input: WordNet synset (s), corresponding synonym set ($S_w$)
Output: The most probable cluster $C_w^n$ for a word, out of all possible clusters $C_w$ found in the thesaurus for that word
1: $C_w \leftarrow \mathrm{Thesaurus}(w)$
2: if $\mathrm{length}(C_w) = 1$ then
3:   $n \leftarrow 0$
4: else
5:   $p_c(w) \leftarrow \{p_{cluster}\ \forall\, cluster \in C_w\}$
6:   $n \leftarrow \mathrm{index}(p_c(w), \max(p_c(w)))$
7: end if
8: return $C_w^n$</p>
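          <p>The following is a minimal Python sketch of this scoring procedure, assuming NLTK’s WordNet interface; thesaurus_clusters is a hypothetical lookup returning the clusters for a query word, and the bin weights (0.1 to 1.0) follow the reconstruction above.</p>
          <preformat>
# Sketch of Algorithm 1 (thesaurus inclusion), under the assumptions above.
import math
from nltk.corpus import wordnet as wn

def d_score(word, synset):
    """Distance-based similarity d between a thesaurus word and a WN synset."""
    synsets = wn.synsets(word)
    if not synsets:
        return 0.1  # word not in WN: lowest similarity
    return max(synset.path_similarity(s) or 0.1 for s in synsets)

def cluster_probability(cluster, synset):
    """Bin words by d score, sum weight-over-count per bin, squash with sigmoid."""
    bins = {}
    for word in cluster:
        b = min(9, int(d_score(word, synset) * 10))  # ten 0.1-wide bins
        bins[b] = bins.get(b, 0) + 1
    score = sum((b + 1) / 10.0 / count for b, count in bins.items())  # assumed w_bin
    return math.exp(score) / (math.exp(score) + 1.0)

def best_cluster(synset, thesaurus_clusters):
    headword = synset.lemmas()[0].name()
    clusters = thesaurus_clusters(headword)  # hypothetical thesaurus lookup
    if len(clusters) == 1:
        return clusters[0]
    return max(clusters, key=lambda c: cluster_probability(c, synset))
          </preformat>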
          <p>In Table 1 we show the vocabulary and synset cluster changes brought about by this step. The last column records the average number of synonyms linked with a synset in WordNet. Originally, owing to WordNet’s stringent relation structure, there are on average approximately 2 synonyms per synset. This number increases drastically when a thesaurus is used for augmentation.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Words</title>
          <p>Phrases
WordNet 147307
Thesaurus(Introduced) 4026</p>
          <p>To obtain representations that cater to both similarity and relatedness, we modify the synset nodes in WordNet. A synset in WordNet is represented by a set of synonyms. We observe that these synonym sets include words of the same meaning without differentiating between their syntactic forms. For instance, consider the synset operate.v.01, defined as “direct or control; projects, businesses”: it has both run and running in its synonym set. In practice, each syntactic form of a word has a different semantic distribution. For this sense, run is most likely to occur with words such as lead and head, whereas its alternate form running is more likely to appear with words such as managing, administrating, and leading. To account for this difference in semantics, we extend WordNet nodes to include syntactic form information and call a synset, syntactic-form pair a “sense-form.” To obtain the different sense-form nodes, we make use of the OMSTI corpus and record the different forms of a synset based on the syntactic forms of the words associated with it. Each “sense-form” is then linked to the corresponding syntactic form of its synonyms. The extended WordNet (Ext-WN) sense-form nodes and synonyms are depicted in Figure 1.</p>
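          <p>As an illustration, a minimal sketch of deriving sense-form nodes from a sense-tagged corpus, assuming a hypothetical iterator tagged_tokens yielding (surface form, synset id) pairs from OMSTI and an assumed “surface|synset” naming convention:</p>
          <preformat>
from collections import defaultdict

# Map each synset to its observed sense-form nodes, e.g. both
# ("run", "operate.v.01") and ("running", "operate.v.01") become
# distinct nodes "run|operate.v.01" and "running|operate.v.01".
sense_forms = defaultdict(set)
for surface, synset_id in tagged_tokens:  # hypothetical OMSTI iterator
    sense_forms[synset_id].add(surface + "|" + synset_id)
          </preformat>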
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>b) Sense-Form Representations</title>
        <p>
          To obtain sense-form representations, we use a sense-tagged corpus, OMSTI
          <xref ref-type="bibr" rid="ref11 ref35 ref41 ref45">(Taghipour and Ng 2015)</xref>
          . The corpus contains sense-tagged words based on WordNet; each sense-tagged word is associated with the respective synset in WN. We pre-process the corpus by replacing every word and synset pair with a sense-form based on the syntactic form of the tagged word and the synset. We then use the Word2Vec toolkit
          <xref ref-type="bibr" rid="ref25 ref27">(Mikolov et al. 2013b)</xref>
          with the Skip-Gram objective function and Negative Sampling to obtain our contextualized “sense-form” representations.
        </p>
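        <p>A minimal sketch of this training step, assuming gensim’s Word2Vec implementation and a hypothetical omsti_sentences iterator over token lists in which each sense-tagged token has been replaced by its sense-form string:</p>
        <preformat>
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=omsti_sentences,  # pre-processed OMSTI sentences
    vector_size=300,            # matches the 300-d word embeddings used later
    sg=1,                       # Skip-Gram objective
    negative=5,                 # Negative Sampling
    min_count=1,
)
sense_form_vec = model.wv["run|operate.v.01"]  # assumed node naming
        </preformat>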
      </sec>
      <sec id="sec-2-4">
        <title>c) Word-Sense Representation and Induction</title>
        <p>We initialise each sense-form node in WN using the representations obtained from the sense-tagged corpora. Then, for each sense-form and its respective augmented synonym set, we obtain unique multi word-sense representations by jointly grounding the word and sense-form embeddings to WordNet. For a word (w) in the synonym set of a sense (s), we obtain the multi word-sense representation as:</p>
        <p>$v_{w,s} = \alpha_{w,s}\,[u_w ; v_{s,form(s)}]$</p>
        <p>where $u_w$ is the pre-trained word embedding and $v_{s,form(s)}$ is the contextualized sense-form representation of the node learned from the sense-tagged corpora. For grounding, we use WordNet’s synset rank information and graph structure to obtain the scaling factor $\alpha_{w,s}$ as follows:</p>
        <p>$\alpha_{w,s} = 1 - c\log(x)$, where $x = rank_{s,w} + d(w, s)$</p>
        <p>
          For the word (w) in the (w, s) pair, WN gives the list of senses $S_w$ in decreasing order of likelihood. We use this to obtain the rank $rank_{s,w}$ of a sense s with respect to w; the sense with rank 1 in $S_w$ is thus the most likely sense of the word. As outlined in the previous sections, we use a synonym set augmented from a thesaurus for each synset node, which means there are many word-sense pairs in our extended-WN that are not found in WN. For example, the extended-WN includes “hold” as a synonym for the sense “influence.n.01”; this word-sense pair (hold, influence.n.01) is not found in WN, so “influence.n.01” is not part of $S_{hold}$ in the original WN. If a word (w), sense (s) pair from our extended-WN is present in $S_w$, we use the rank directly. If not, we use the rank of the synset in $S_w$ that is “closest” to the sense s in the word-sense pair, where the WN path-similarity metric (p) denotes “closeness”. We would also like to penalise a sense s from our extended-WN pairs more if it is farther, in the WN graph structure, from the original senses $S_w$ given by WN for word w. The intuition is that the closer a sense is to a word in the WN graph, the more relevant it is to the word; the same intuition is followed in retrofitting vectors to lexicons
          <xref ref-type="bibr" rid="ref12">(Faruqui et al. 2014)</xref>
          . $d(w, s)$ is the penalizer in our equation, which obtains the distance between a word and a sense as follows:
        </p>
        <p>$d(w, s) = \min(\{1 - p(s, x)\ \forall x \in S_w\})$</p>
        <p>
          Recall that $p(s, x)$ is the path-similarity score, with a higher score denoting closer pairs; closer pairs therefore get assigned a lower penalizing distance. We use the monotonically decreasing distribution $1 - c\log(x)$, with c as some constant, as found by
          <xref ref-type="bibr" rid="ref1">(Arora et al. 2018)</xref>
          . As a result of feeding the ranks and graph-structure distances between w and s to this distribution, lower-ranked (with one being the highest rank) and farther-away synsets (bigger d) get lower scaling scores, while senses similar in rank and distance get similar scaling scores.
        </p>
        <p>We thus get grounded representations, with the scaling factor $\alpha_{w,s}$ reflecting both likelihood and ontology graph structure.</p>
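        <p>A minimal sketch of the grounding computation, assuming NLTK’s WordNet and numpy; the constant c is left unspecified in the text, so the value below is an arbitrary placeholder:</p>
        <preformat>
import numpy as np
from nltk.corpus import wordnet as wn

C = 0.1  # assumed value for the constant c

def rank_and_distance(word, synset):
    senses = wn.synsets(word)  # ordered by decreasing likelihood in WN
    if synset in senses:
        rank = senses.index(synset) + 1
    else:
        # fall back to the rank of the WN sense closest to the synset
        sims = [(synset.path_similarity(x) or 0.0, i) for i, x in enumerate(senses)]
        rank = max(sims)[1] + 1
    dist = min(1.0 - (synset.path_similarity(x) or 0.0) for x in senses)
    return rank, dist

def word_sense_vector(u_w, v_sf, word, synset):
    """Scale the concatenated [word; sense-form] embedding by alpha."""
    rank, dist = rank_and_distance(word, synset)
    alpha = 1.0 - C * np.log(rank + dist)
    return alpha * np.concatenate([u_w, v_sf])
        </preformat>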
      </sec>
      <sec id="sec-2-5">
        <title>Experiments</title>
        <p>In this section, we describe the experiments conducted to evaluate our multi word-sense embeddings. We use an array of existing word similarity and relatedness datasets for intrinsic evaluation, and 4 datasets across 2 tasks for extrinsic evaluation.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Intrinsic Evaluation</title>
        <p>We test our embeddings intrinsically on similarity,
relatedness and contextual similarity datasets.</p>
        <p>
          Word Representations To run our experiments, we pick two different 300-dimensional embeddings: GloVe
          <xref ref-type="bibr" rid="ref32">(Pennington, Socher, and Manning 2014)</xref>
          and Skip-Gram (SG)
          <xref ref-type="bibr" rid="ref25 ref27 ref38">(Mikolov et al. 2013a)</xref>
          . We use these embeddings for word sense induction in our experiments because they are a popular choice for NLP systems at the time of writing. The resulting CoKE embeddings, after scaling and concatenation with the word embeddings, are 600-dimensional.
        </p>
        <p>
          Similarity Measures Given a pair of words, w with M senses and w′ with N senses, we use the following two metrics proposed by
          <xref ref-type="bibr" rid="ref34">(Reisinger and Mooney 2010)</xref>
          for computing similarity scores without using context:
        </p>
        <p>$AvgSim(w, w') = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \cos(v_{w,i}, v_{w',j})$</p>
        <p>$MaxSim(w, w') = \max_{1 \le i \le M,\ 1 \le j \le N} \cos(v_{w,i}, v_{w',j})$</p>
        <p>AvgSim computes word similarity as the average similarity between all pairs of sense vectors, whereas MaxSim computes the maximum over all pairwise sense vector similarities.</p>
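        <p>A minimal numpy sketch of the two metrics, where senses_w and senses_w2 are assumed to be lists of per-sense vectors for the two words:</p>
        <preformat>
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses_w, senses_w2):
    sims = [cos(vi, vj) for vi in senses_w for vj in senses_w2]
    return sum(sims) / (len(senses_w) * len(senses_w2))

def max_sim(senses_w, senses_w2):
    return max(cos(vi, vj) for vi in senses_w for vj in senses_w2)
        </preformat>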
        <p>Baselines We report two baselines in Table 2 and Table 3, in addition to the baseline score of the single-prototype word embeddings themselves. The first baseline measures performance when concatenating sense embeddings learned from the OMSTI corpus with word embeddings, using WordNet to retrieve the senses for a word; it indicates the scores obtained by concatenating embeddings from two different sources and is denoted +Synset(WN) in the tables. The second baseline, +CoKE(Ext-WN), tracks performance changes when splitting senses into sense-forms and grounding them to the extended-WN. Finally, we show scores for +CoKE(Thes+Ext-WN), which reflects the performance of grounded word-sense representations using sense-forms, the extended WordNet, and the thesaurus.</p>
        <p>
          Word Similarity We evaluate our embeddings on several standard word similarity datasets, namely SimLex
          <xref ref-type="bibr" rid="ref11 ref16 ref35 ref41 ref45">(Hill, Reichart, and Korhonen 2015)</xref>
          (SL-999), WordSim353 (Gabrilovich and Markovitch) (WS-S), MC-30
          <xref ref-type="bibr" rid="ref28">(Miller and Charles 1991)</xref>
          , RG-65
          <xref ref-type="bibr" rid="ref37">(Rubenstein and Goodenough 1965)</xref>
          , YP-130
          <xref ref-type="bibr" rid="ref46">(Yang and Powers 2006)</xref>
          , SimVerb (Gerz et al. 2016) (SV), and Rare Word (RW) similarity
          <xref ref-type="bibr" rid="ref21 ref22 ref38">(Luong, Socher, and Manning 2013a)</xref>
          .
        </p>
        <p>Each dataset contains a list of word pairs with a human-generated score of how similar the two words are. We calculate the Spearman correlation between these labels and the scores generated by our method. For similarity, we use MaxSim as the metric, to find the most similar pair among the different senses of the two words. The results are outlined in Table 2.</p>
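        <p>A minimal sketch of this evaluation loop, assuming scipy and a hypothetical senses(word) lookup returning a word’s CoKE sense vectors:</p>
        <preformat>
import numpy as np
from scipy.stats import spearmanr

def max_sim(vs1, vs2):
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(cos(v1, v2) for v1 in vs1 for v2 in vs2)

def evaluate(pairs, senses):
    """pairs: list of (w1, w2, human_score) triples from a dataset."""
    gold = [score for _, _, score in pairs]
    pred = [max_sim(senses(w1), senses(w2)) for w1, w2, _ in pairs]
    return spearmanr(gold, pred).correlation
        </preformat>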
        <p>We observe that the lower performance of +Synset(WN), obtained by concatenating word embeddings with sense embeddings to get word-sense embeddings, is due to the limited number of synonyms recorded for a synset in WordNet, along with the limited size of the dataset used to learn these embeddings.</p>
        <p>
          The average improvement column in the tables (Avg Improvement) shows a significant improvement in performance from splitting senses into sense-forms and grounding (CoKE(Ext-WN)). The benefits of this approach are reflected mainly on the SimVerb-3500 dataset. This is not a surprising result, since words tend to have more syntactic forms when they occur as verbs; with distributional semantics, the syntactic forms of verbs often remain close, making it hard to capture differences. However, drastic improvements can be seen through
          thesaurus inclusion (CoKE(Thes+Ext-WN)). This is because using WordNet alone leads to limited lexemes, on account of words being represented by fewer senses, as opposed to the large number of senses captured for a word by word embeddings trained on large datasets. On including a thesaurus and augmenting the synonym sets of WordNet synsets, the number of senses that represent a word changes drastically, leading to more lexemes that closely reflect all possible senses of a word. We also note that the improvement for WS-S is relatively lower; we suspect this is because the dataset is designed around association rather than similarity alone. We also observe that as the baselines of the embedding spaces get higher for a dataset, the performance gains reduce, since most of the information is already captured in the embedding space. The same trend is observed in
          <xref ref-type="bibr" rid="ref12">(Faruqui et al. 2014)</xref>
          .
        </p>
        <p>
          Word Relatedness Integration of our vectors also shows improvements on word relatedness tasks. As our benchmark, we evaluate on WS-R (relatedness), MTurk-771
          <xref ref-type="bibr" rid="ref14">(Halawi et al. 2012)</xref>
          , MEN
          <xref ref-type="bibr" rid="ref8">(Bruni et al. 2012)</xref>
          , and SGS-130
          <xref ref-type="bibr" rid="ref40">(Szumlanski, Gomez, and Sims 2013)</xref>
          , which includes phrases. We evaluate the performance of our method against the standard pre-trained word embeddings using Spearman correlation. We use AvgSim as our metric to measure relatedness and report the scores in Table 3.
        </p>
        <p>The baselines we use are the same as for word similarity, described above. We notice that the performance improvements from sense-form splitting are not as drastic as for word similarity. This could be because word relatedness tasks more frequently check for relatedness between objects rather than verbs; sense-form splitting benefits verbs more than nouns, on account of words having more varied forms as verbs.</p>
        <p>
          We are not sure why the overall performance gains are not as high as for similarity, but the scores do reflect gains, as opposed to retrofitting directly to lexicons, which leads to a serious drop in relatedness. The big performance gains on SGS
          <xref ref-type="bibr" rid="ref40">(Szumlanski, Gomez, and Sims 2013)</xref>
          are due to the phrases present in the dataset: by using a thesaurus and WN, we learn multiple phrasal representations not found in the original word embedding space.
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>Word Similarity for Polysemous Words</title>
        <p>We use the SCWS dataset introduced by (Huang et al. 2012), where word pairs are chosen to have variations in meaning for polysemous and homonymous words. We compare our method with other state-of-the-art multi-prototype models and find that our model performs competitively with previous models. We use the Skip-Gram (SG) word embedding with our method to allow for a fair comparison, since previous work uses Skip-Gram for retrofitting to WordNet. The Spearman correlations between the labels and scores are indicated in Table 4.</p>
      </sec>
      <sec id="sec-2-8">
        <title>Extrinsic Evaluation</title>
        <p>
          A lot of the prior work on obtaining sense embeddings shows performance improvements on intrinsic tasks but leaves out testing on downstream tasks, making it difficult to judge the effectiveness of these representations. To bridge this gap, we run experiments on two tasks (Sentiment Analysis and Question Classification) across 4 datasets to provide some insight into the usefulness of our representations.
        </p>
        <p>
          Datasets For sentiment analysis we use the Stanford Sentiment Treebank dataset
          <xref ref-type="bibr" rid="ref21 ref22 ref38">(Socher et al. 2013)</xref>
          . We train and test separately on the binary version (SST-2) and the five-class version (SST-5). For question classification, we evaluate performance on the TREC
          <xref ref-type="bibr" rid="ref43">(Voorhees 2001)</xref>
          question classification dataset, which consists of open-domain questions and semantic categories.
        </p>
      </sec>
      <sec id="sec-2-9">
        <title>Performance Comparisons</title>
        <p>We first run experiments on CoKE by representing words as an average of their respective sense embeddings. Word embeddings have been shown to behave as a weighted sum of their senses <xref ref-type="bibr" rid="ref1">(Arora et al. 2018)</xref>; the intuition behind using averaged embeddings is thus that grounded word-sense representations should lead to better word representations through averaging.</p>
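        <p>A minimal sketch of the averaged word representation, assuming a hypothetical coke_senses(word) lookup returning the word’s grounded sense vectors:</p>
        <preformat>
import numpy as np

def averaged_word_vector(word, coke_senses):
    # mean over the word's grounded CoKE sense embeddings
    return np.mean(coke_senses(word), axis=0)
        </preformat>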
        <p>
          Recent trends have also led to an increasing interest in transfer learning for obtaining superior word representations. CoVe
          <xref ref-type="bibr" rid="ref23">(McCann et al. 2017)</xref>
          and ELMo
          <xref ref-type="bibr" rid="ref33">(Peters et al. 2018)</xref>
          show significant improvements on extrinsic tasks. CoVe uses word representations learned from a machine translation system in combination with GloVe embeddings. ELMo, on the other hand, uses a language model to obtain contextualised word representations. As shown by
          <xref ref-type="bibr" rid="ref33">(Peters et al. 2018)</xref>
          , these systems inherently act as word sense disambiguation and representation systems. They give word representations conditioned on the context the word occurs in and perform on par with state-of-the-art word sense disambiguation systems, but it is unclear how informative their sense representations are. We thus hypothesise that these systems can benefit from better sense representations.
        </p>
        <p>
          Due to the promising performance of CoVe and ELMo as word sense disambiguation systems, and the increasing interest in using them in NLP tasks, we use them as disambiguation systems in our experiments to sense-tag the four benchmark datasets. To get the disambiguated sense tags using CoVe or ELMo, we use the same approach as outlined in
          <xref ref-type="bibr" rid="ref33">(Peters et al. 2018)</xref>
          : we compute each word’s representation in OMSTI using CoVe or ELMo and then average all the representations obtained for a sense to get its sense representation. To disambiguate a sentence, we run the sentence through the CoVe or ELMo architecture to get word representations and tag each word with the nearest-neighbour sense from the corresponding CoVe- or ELMo-computed sense representations. For ELMo, we use the last layer and the publicly available pre-trained version.
        </p>
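        <p>A minimal sketch of this disambiguation step, assuming a hypothetical contextual_vecs(sentence) helper that returns one CoVe or ELMo vector per token, and an omsti iterator yielding (sentence, {token index: sense id}) pairs:</p>
        <preformat>
import numpy as np
from collections import defaultdict

def build_sense_reps(omsti, contextual_vecs):
    """Average the contextual vectors of all tokens tagged with each sense."""
    sums, counts = {}, defaultdict(int)
    for sentence, tags in omsti:
        vecs = contextual_vecs(sentence)
        for idx, sense in tags.items():
            sums[sense] = sums.get(sense, 0.0) + vecs[idx]
            counts[sense] += 1
    return {s: v / counts[s] for s, v in sums.items()}

def tag_sentence(sentence, sense_reps, contextual_vecs):
    """Tag each token with its nearest-neighbour sense (cosine similarity)."""
    senses = list(sense_reps)
    mat = np.stack([sense_reps[s] for s in senses])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    tags = []
    for v in contextual_vecs(sentence):
        v = v / np.linalg.norm(v)
        tags.append(senses[int(np.argmax(mat @ v))])
    return tags
        </preformat>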
        <p>In our experiments, we use the CoKE word-sense embeddings obtained by using GloVe with the thesaurus and extended WordNet for grounding. We pick CoKE with GloVe embeddings to be fair in comparison with CoVe, which is obtained by concatenation with GloVe embeddings.</p>
        <p>We thus compare performance using GloVe, CoVe, and ELMo independently; using an average of CoKE representations to get word representations; and using ELMo or CoVe as disambiguation systems with sense-tagged words represented by CoKE embeddings (CoKE(+CoVe), CoKE(+ELMo)). Note that if a word is not sense-tagged, we use the vanilla GloVe vector concatenated with an unknown vector.</p>
        <p>
          Training Details To test the performance of the different embeddings on these datasets, we implement a single-layer LSTM
          <xref ref-type="bibr" rid="ref17">(Hochreiter and Schmidhuber 1997)</xref>
          with a hidden size of 300 and run our experiments. Parameters were fine-tuned specifically for each task and embedding type.
        </p>
        <p>
          Results As shown in Table 6, using CoKE shows more significant improvements on Classification than on Sentiment Analysis. This is an expected outcome, since our approach focuses on ontology grounding without considering the polarity of words, which is the primary goal of Sentiment Analysis. Classification, on the other hand, is a task more sensitive to representations that cater to similarity and relatedness between sentences. Significant improvements can be seen on classification tasks even when using averaged CoKE embeddings without disambiguation.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Qualitative Analysis</title>
      <p>In this section, we look at some visualisations of the induced senses and show how they are easily interpretable. Since sense tags have lexical mappings to an ontology, they can be looked up to find meanings. Moreover, the semantic distribution of the word-senses also plays a role in obtaining meaningful sense clusters. We analyse two things: 1) the sense clusters induced, and 2) how using different sense-forms affects representations and sense interactions for their respective word forms. For all our analysis, we use the concatenated version of the CoKE + GloVe embeddings and use Principal Component Analysis to perform dimensionality reduction.</p>
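      <p>The projection itself is straightforward; a sketch assuming scikit-learn and a matrix of one word’s CoKE + GloVe sense vectors:</p>
      <preformat>
from sklearn.decomposition import PCA

# sense_matrix: (n_senses, 600) array of concatenated sense vectors
coords = PCA(n_components=2).fit_transform(sense_matrix)
      </preformat>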
      <sec id="sec-3-1">
        <title>Sense Clusters</title>
        <p>We look at the sense clusters formed by our word-specific sense embeddings for the word “rock”.</p>
        <p>The clusters for the word “rock” are depicted in Figure 2. The multiple fine-grained word-sense embeddings for the word “rock” cluster into 5 basic senses, with three distinct clusters dominating. “Cluster#2” can be interpreted as all synsets that speak of rock as a “substance”. In “Cluster#3”, the synsets cluster together to speak of rock as “music”. An interesting property can be observed by comparing “Cluster#1” and “Cluster#5”: the senses found in both of these clusters interpret “rock” as “movement/motion”, yet the two distinct clusters also capture the kind of motion. For instance, the senses roll.v.13 and rock.v.01 in “Cluster#5” map specifically to “sideways movement”, while the senses in “Cluster#1” map to the glosses “sudden movements” (convulse, lurch, move, tremble) and “back and forth movements” (wobble, rock). Another interesting property is depicted by “Cluster#4”: although its senses are more synonymous in meaning to rock as a “substance”, the senses for gravel cluster very closely to senses mapping to “jerking” movements, capturing deeper relations between senses.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Sense Forms</title>
        <p>In this section, we analyse how different sense-form representations interact for synonyms within a synset. We do so by considering the word forms “plan” and “planning”, both of which are synonyms of their respective sense-forms of “mastermind.v.01” (gloss: plan and direct a complex undertaking).</p>
        <p>In order to observe the difference in the sense-form relationships of word forms, we consider only the synsets common to “plan” and “planning” for visualisation and observe their interactions with each other. For the word “plan”, as shown in Figure 3a, we observe that the synset “mastermind” is closer in proximity to synsets that map to words like “plan”, “sketch”, and “prepare”. In contrast, the same synset in the embedding space for “planning”, as shown in Figure 3b, interacts closely with synsets that are analogous to “project planning”, “scheduling”, and “organising”. This shows how using different sense-form representations leads to different and unique interactions among the same group of synsets for each word.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In our work, we explore obtaining multi word-sense representations and inducing senses in embedding spaces by using distributional semantics and a knowledge base. The resulting prototypes are easy to use with WSD systems, are portable enough to be used directly in downstream applications, and are flexible across a wide variety of tasks. Previous work on obtaining sense representations falls into three distinct clusters: unsupervised methods, supervised resource-specific methods, and ontology grounding. By using pre-trained unsupervised embeddings and supervised sense embeddings and jointly grounding them in an ontology, ours is the first approach that lies at the intersection of all three. The code and vectors will be made publicly available.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Arora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; Ma, T.; and
          <string-name>
            <surname>Risteski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Linear algebraic structure of word senses, with applications to polysemy</article-title>
          .
          <source>Transactions of the Association of Computational Linguistics</source>
          <volume>6</volume>
          :
          <fpage>483</fpage>
          -
          <lpage>495</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Athiwaratkun</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Multimodal word distributions</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <fpage>1645</fpage>
          -
          <lpage>1656</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          2016a.
          <article-title>Enriching word vectors with subword information</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>arXiv preprint arXiv:1607</source>
          .
          <fpage>04606</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          2016b.
          <article-title>Enriching word vectors with subword information</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>arXiv preprint arXiv:1607</source>
          .
          <fpage>04606</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Bruni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Boleda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Baroni,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Tran</surname>
          </string-name>
          , N.
          <article-title>-</article-title>
          K.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>Distributional semantics in technicolor</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume</source>
          <volume>1</volume>
          ,
          <fpage>136</fpage>
          -
          <lpage>145</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and Sun,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2014</year>
          .
          <article-title>A unified model for word sense representation and disambiguation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <fpage>1025</fpage>
          -
          <lpage>1035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Cheng</surname>
          </string-name>
          , J., and
          <string-name>
            <surname>Kartsaklis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Syntax-aware multisense word embeddings for deep compositional models of meaning</article-title>
          .
          <source>arXiv preprint arXiv:1508</source>
          .
          <fpage>02354</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Faruqui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dodge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Jauhar,
          <string-name>
            <given-names>S. K.</given-names>
            ;
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Hovy</surname>
          </string-name>
          , E.; and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Retrofitting word vectors to semantic lexicons</article-title>
          .
          <source>arXiv preprint arXiv:1411</source>
          .
          <fpage>4166</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2016. Simverb-3500:
          <article-title>A large-scale evaluation set of verb similarity</article-title>
          .
          <source>arXiv preprint arXiv:1608</source>
          .
          <fpage>00869</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Halawi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Dror,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Gabrilovich</surname>
          </string-name>
          , E.; and Koren,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>Large-scale learning of word relatedness with constraints</article-title>
          .
          <source>In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <fpage>1406</fpage>
          -
          <lpage>1414</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reichart</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Korhonen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2015</year>
          . Simlex-
          <volume>999</volume>
          :
          <article-title>Evaluating semantic models with (genuine) similarity estimation</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>41</volume>
          (
          <issue>4</issue>
          ):
          <fpage>665</fpage>
          -
          <lpage>695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          2012.
          <article-title>Improving word representations via global context and multiple word prototypes</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume</source>
          <volume>1</volume>
          ,
          <fpage>873</fpage>
          -
          <lpage>882</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Iacobacci</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Pilehvar, M. T.; and Navigli,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Sensembed: Learning sense embeddings for word and relational similarity</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          , volume
          <volume>1</volume>
          ,
          <fpage>95</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Jauhar</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hovy</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Ontologically grounded multi-sense representation learning for semantic vector space models</article-title>
          .
          <source>In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <fpage>683</fpage>
          -
          <lpage>693</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Luong</surname>
          </string-name>
          , M.-T.;
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2013a</year>
          .
          <article-title>Better word representations with recursive neural networks for morphology</article-title>
          .
          <source>In CoNLL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2013b</year>
          .
          <article-title>Better word representations with recursive neural networks for morphology</article-title>
          .
          <source>Proceedings of the Seventeenth Conference on Computational Natural Language Learning.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>McCann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bradbury</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Xiong,
          <string-name>
            <surname>C.</surname>
          </string-name>
          ; and Socher,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>Learned in translation: Contextualized word vectors</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          ,
          <volume>6297</volume>
          -
          <fpage>6308</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Corrado</surname>
          </string-name>
          , G.; and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013a</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>arXiv preprint arXiv:1301</source>
          .
          <fpage>3781</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Chen,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            ; and
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <year>2013b</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <volume>3111</volume>
          -
          <fpage>3119</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G. A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Charles</surname>
            ,
            <given-names>W. G.</given-names>
          </string-name>
          <year>1991</year>
          .
          <article-title>Contextual correlates of semantic similarity</article-title>
          .
          <source>Language and cognitive processes 6</source>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>G. A.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>Wordnet: a lexical database for english.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Efficient non-parametric estimation of multiple embeddings per word in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1504</source>
          .
          <fpage>06654</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Socher, R.; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In Proceedings of NAACL</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Reisinger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Multi-prototype vector-space models of word meaning</article-title>
          .
          <source>In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics</source>
          ,
          <fpage>109</fpage>
          -
          <lpage>117</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>Rothe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and Schu¨tze,
          <string-name>
            <surname>H.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Autoextend: Extending word embeddings to embeddings for synsets and lexemes</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <source>arXiv preprint arXiv:1507</source>
          .
          <fpage>01127</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <surname>Rubenstein</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Goodenough</surname>
            ,
            <given-names>J. B.</given-names>
          </string-name>
          <year>1965</year>
          .
          <article-title>Contextual correlates of synonymy</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>8</volume>
          (
          <issue>10</issue>
          ):
          <fpage>627</fpage>
          -
          <lpage>633</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Perelygin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chuang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Manning,
          <string-name>
            <given-names>C. D.</given-names>
            ;
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ; and
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>Recursive deep models for semantic compositionality over a sentiment treebank</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>In Proceedings of the 2013 conference on empirical methods in natural language processing</source>
          ,
          <volume>1631</volume>
          -
          <fpage>1642</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <surname>Szumlanski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sims</surname>
            ,
            <given-names>V. K.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>A new set of norms for semantic relatedness measures</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , volume
          <volume>2</volume>
          ,
          <fpage>890</fpage>
          -
          <lpage>895</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>Taghipour</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>H. T.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>One million sense-tagged instances for word sense disambiguation and induction</article-title>
          .
          <source>In Proceedings of the Nineteenth Conference on Computational Natural Language Learning</source>
          ,
          <fpage>338</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bian</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.-Y.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A probabilistic model for learning multiprototype word embeddings</article-title>
          .
          <source>In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</source>
          ,
          <fpage>151</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>The trec question answering track</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>Natural Language Engineering</source>
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>361</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Sense-aware semantic analysis: A multi-prototype word representation model using Wikipedia</article-title>
          .
          <source>In AAAI</source>
          ,
          <fpage>2188</fpage>
          -
          <lpage>2194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Powers</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Verb similarity on the taxonomy of WordNet</article-title>
          . Masaryk University.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>