    CoKE : Word Sense Induction Using Contextualized Knowledge Embeddings

                          Sanjana Ramprasad                                        James Maddox
                         Mya Systems                                             Mya Systems
               sanjana.ramprasad@hiremya.com                              james.maddox@hiremya.com




                            Abstract

  Word Embeddings can capture lexico-semantic information but remain flawed in their inability to assign unique representations to different senses of polysemous words. They also fail to include information from well-curated semantic lexicons and dictionaries. Previous approaches that obtain ontologically grounded word-sense representations learn embeddings that are superior in understanding contextual similarity but are outperformed on several word relatedness tasks by single-prototype words. In this work, we introduce a new approach that can induce polysemy to any pre-defined embedding space by jointly grounding contextualized sense representations learned from sense-tagged corpora and word embeddings to a knowledge base. The advantage of this method is that it allows integrating ontological information while also readily inducing polysemy to pre-defined embedding spaces without the need for re-training. We evaluate our vectors on several word similarity and relatedness tasks, along with two extrinsic tasks, and find that they consistently outperform the current state-of-the-art.

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

                         Introduction

Distributed representations of words (Mikolov et al. 2013b) have proven successful in addressing various drawbacks of symbolic representations, which treat words as atomic units of meaning. By grouping similar words and capturing analogical and lexical relationships, they are a popular choice in several downstream NLP applications.
   While these embeddings capture meaningful lexical relationships, they come with their own set of drawbacks. For instance, complete reliance on natural language corpora amplifies existing vocabulary bias that is inherent in datasets. Vocabulary bias is caused by words not seen in the training corpora, and also extends to bias in word usage, where some words, often morphologically complex words, are used less frequently than other words or phrases with the same meaning. Thus embeddings suffer from inaccurate modeling of less frequent words, which is evident in the relatively lower performance of word embeddings on the rare word similarity task (Luong, Socher, and Manning 2013b). An approach by (Bojanowski et al. 2016a) proposes using character n-gram representations to address the problem of out-of-vocabulary and rare words. (Faruqui et al. 2014) also proposed retrofitting vectors to an ontology to deal with inaccurate modeling of less frequent words. However, these methods do not account for polysemy.
   Polysemy is an important feature of language which causes words to have a different meaning or "sense" based on the context in which they occur. For instance, the word bank can refer to a financial institution or to land on either side of a river. A large body of work has gone into developing word sense disambiguation systems to identify the correct sense of a word based on its context. Word embeddings, on the other hand, assign a single vector representation to a word type, irrespective of polysemy. The availability of disambiguation systems, coupled with the growing reliance of NLP systems on distributional semantics, has led to an increasing interest in obtaining powerful sense representations.
   Some of the previous work on learning sense representations includes unsupervised learning techniques that cluster contexts and learn multi-prototype vectors ((Reisinger and Mooney 2010), (Huang et al. 2012) and (Wu and Giles 2015)). A common drawback of the cluster-based approach is the difficulty of deciding the number of clusters a priori. ((Neelakantan et al. 2015), (Tian et al. 2014), (Cheng and Kartsaklis 2015)) also learn multiple word embeddings by modifying the Skip-Gram model. These approaches yield sense representations that are limited in terms of interpretability, which makes them challenging to include in downstream tasks. To remedy this, (Iacobacci, Pilehvar, and Navigli 2015) and (Chen, Liu, and Sun 2014) use sense-tagged corpora and Word2Vec modifications to obtain sense representations; however, they only make use of distributional semantics.
   Previous work combining distributional semantics and knowledge bases includes (Jauhar, Dyer, and Hovy 2015) and (Rothe and Schütze 2015), which ground word embeddings to ontologies to obtain sense representations. As a result of grounding, these techniques drastically improved performance on several similarity tasks, but an observed pattern is that this leads to compromised performance on word relatedness tasks ((Faruqui et al. 2014), (Jauhar, Dyer, and Hovy 2015)).
   In this work, we present a novel approach that uses knowledge bases and sense representations to directly induce polysemy to any pre-defined word embedding space. Our approach leads to interpretable, ontologically grounded sense representations that can easily be used with powerful disambiguation systems. The main contributions of this paper are: a) obtaining ontologically grounded sense representations that perform well on both similarity and relatedness tasks; b) automatic sense induction and integration of knowledge base information into any predefined embedding space without re-training; c) performance benefits when our embeddings are used with transfer learning methods like CoVE (McCann et al. 2017) and ELMo (Peters et al. 2018) on extrinsic tasks; and d) methodologies for knowledge base augmentation, along with an approach to learn more effective sense representations.

                      Methodology

In our approach we rely on: a) Sense-tagged corpora, to obtain contextualized sense representations. The objective is to capture sense relations and interactions in naturally occurring corpora. The sense representations are interpretable and have lexical mappings to a knowledge base; we use them to induce polysemy in word embedding spaces. b) Pretrained word embeddings, to capture beneficial lexical relationships that are inherent on account of being trained on large amounts of data. Sense representations do not adequately capture these relationships due to the limited size of the sense-tagged corpora used to train them. c) Lastly, to account for the vocabulary bias in corpora, which causes similar-meaning words to be farther apart in embedding spaces, we use a knowledge base to jointly ground word and sense representations.
   We thus describe our approach in three parts: a) Lexicon Building, b) Sense-Form Representations, and c) Multi Word-Sense Representations.

a) Lexicon Building

For our knowledge base, we rely on WordNet (Miller 1995) and a thesaurus (https://www.thesaurus.com/). WordNet (WN) is a large lexical database that groups synonyms into synsets and records relations between them in the form of synonyms, hypernyms, and hyponyms. The synsets are highly interpretable, since they come with a gloss along with examples. A thesaurus, on the other hand, groups words into different clusters based on similarity of meaning.

Thesaurus Inclusion  The structure of WordNet (WN) is such that it labels semantic relations among different synsets. While this structure helps determine the degree of similarity between synsets, it leads to a restricted set of synonyms representing a synset. To best combine information from both resources, we augment the synonyms in a WordNet synset using a thesaurus.
   Unlike WordNet (WN), the thesaurus does not have distinct labels for senses. Senses are instead represented by a group of words: given a query word, the thesaurus returns clusters of words where each cluster represents some sense. Given a WN synset (s), we use the synset's headword to query the thesaurus and use a simple algorithm to map the most appropriate cluster to the corresponding WN synset, by computing each cluster's probability with respect to (s).
   Probabilities are assigned based on the words in a cluster and the WN structure. Thus, if a thesaurus cluster has more words that are "closer", based on the WN structure, to the synset (s), it receives a higher probability. To measure "closeness", we use the path-similarity (p) metric of WN. Path-similarity measures the similarity between two synsets by considering the distance between them; it ranges from 0 to 1, with scores towards 1 denoting "closer" synsets. Since path-similarity is defined between two synsets, given a word (w) in a thesaurus cluster queried using the headword of the WN synset (s), we find the distance-based similarity d_{w,s} between s and w by first obtaining all of the synsets (S_w) in WN for w, and use them to calculate d_{w,s} as follows:

        d_{w,s} ← max{ p(s, s_i) ∀ s_i ∈ S_w }

If a word is not found in WN, we assign d_{w,s} = 0.1, the lowest distance-based similarity, implying it is "farthest" from the synset (s) in WN.
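This distance computation can be sketched with NLTK's WordNet interface (NLTK here is an assumption for illustration; the paper does not name a specific toolkit):

```python
# A minimal sketch of the distance-based similarity d_{w,s} between a WordNet
# synset s and a thesaurus-cluster word w, using NLTK's WordNet interface
# (an assumption; the paper does not specify its WordNet toolkit).
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def distance_similarity(word, synset):
    """d_{w,s}: max path-similarity between s and any WN synset of w."""
    word_synsets = wn.synsets(word)  # S_w: all WN synsets of w
    if not word_synsets:
        return 0.1  # w not in WN: assign the lowest similarity
    scores = [synset.path_similarity(si) for si in word_synsets]
    scores = [p for p in scores if p is not None]  # cross-POS pairs give None
    return max(scores) if scores else 0.1

# Example: how close is the word "cash" to the synset bank.n.01?
print(distance_similarity("cash", wn.synset("bank.n.01")))
```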
   To account for varying cluster sizes in the thesaurus, and to prevent larger clusters from invariably having bigger scores, we divide the words in each cluster (c) into ten discrete bins based on each word's d score. The bins are in incremental ranges of 0.1 ([0-0.1, 0.11-0.2, ..., 0.91-1.0]), with the highest-score bin having weight 1. We then obtain cluster scores, score_cluster, as:

        score_cluster = Σ_{bin ∈ bins} w_bin · count(bin)

We then get the probability of a cluster (p_cluster) from score_cluster by passing it through a sigmoid function:

        p_cluster = exp(score_cluster) / (exp(score_cluster) + 1)

The words in the thesaurus cluster with the highest probability are then added to the synonym list of the respective WN synset (s). We outline the procedure in Algorithm 1.

Algorithm 1 Thesaurus Inclusion
Input: WordNet synset (s), corresponding synonym set (S_w)
Output: Most probable cluster C_{w,n} for a word, out of all possible clusters C_w found in the thesaurus for the word.

1: C_w ← Thesaurus(w)
2: if length(C_w) = 1 then
3:     n ← 0
4: else
5:     p_c(w) ← {p(cluster) ∀ cluster ∈ C_w}
6:     n ← index(p_c(w), max(p_c(w)))
7: end if
8: return C_{w,n}
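The scoring and selection in Algorithm 1 can be sketched as follows, reusing distance_similarity from the previous sketch. The per-bin weight is assumed to be each bin's upper bound (0.1, 0.2, ..., 1.0), since the paper only fixes the weight of the highest bin at 1:

```python
# A sketch of the cluster-scoring step behind Algorithm 1. The bin weights
# (each bin's upper bound) are an assumption for illustration.
import math
from collections import Counter

def cluster_probability(cluster_words, synset):
    """Score a thesaurus cluster against a WN synset, squashed to (0, 1)."""
    # Bin index 1..10 for d scores in (0, 0.1], (0.1, 0.2], ..., (0.9, 1.0].
    bins = Counter(
        min(10, max(1, math.ceil(distance_similarity(w, synset) * 10)))
        for w in cluster_words)
    score = sum((b / 10.0) * count for b, count in bins.items())
    return math.exp(score) / (math.exp(score) + 1.0)  # sigmoid of the score

def best_cluster(clusters, synset):
    """Pick the thesaurus cluster with the highest probability (Algorithm 1)."""
    return max(clusters, key=lambda cluster: cluster_probability(cluster, synset))
```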
   In Table 1 we show the vocabulary and synset cluster changes brought about by this step. The last column records the average number of synonyms linked with a synset in WordNet. Originally, owing to WordNet's stringent relation structure, there are on average approximately 2 synonyms within a synset. This number drastically increases when using a thesaurus for augmentation.

                             Words      Phrases    Average synonyms (per synset)
  WordNet                    147307     69408      1.75
  Thesaurus (Introduced)     4026       500        7.37

Table 1: Vocabulary and synset cluster changes in WordNet through Thesaurus Inclusion.

WordNet Form Extension  To obtain representations that cater to both similarity and relatedness, we modify the synset nodes in WordNet. A synset in WordNet is represented by a set of synonyms. We observe that these synonym sets include words of the same meaning without differentiating between their syntactic forms. For instance, consider the synset operate.v.01, defined as "direct or control; projects, businesses": it has both run and running in its synonym set. In practice, each syntactic form of a word has different semantic distributions. For instance, for this sense, run is found to most likely occur with words such as lead and head, whereas its alternate form running is more likely to appear with words such as managing, administrating, and leading. To account for this difference in semantics, we extend WordNet nodes to include the syntactic form information and call a synset, syntactic form pair a "sense-form." To obtain the different sense-form nodes, we make use of the OMSTI corpus and record different forms of a synset based on the different syntactic forms of words associated with the synset. Each "sense-form" is then linked to the corresponding syntactic form of its synonyms. The extended WordNet (Ext-WN) sense-form nodes and synonyms are depicted in Figure 1.

Figure 1: WordNet synset nodes split based on syntactic form information.

b) Sense-Form Representations

To obtain sense-form representations, we use a sense-tagged corpus, OMSTI (Taghipour and Ng 2015). The corpus contains sense-tagged words based on WordNet; each sense-tagged word is associated with the respective synset found in WN. We pre-process the corpora by replacing every word and synset pair with a sense-form, based on the syntactic form of the tagged word and the synset. We then use the Word2Vec toolkit (Mikolov et al. 2013b) with the Skip-Gram objective function and Negative Sampling to obtain our contextualized "sense-form" representations.
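This training step could be sketched with gensim's Word2Vec implementation (gensim, the "form|synset" token scheme, and the file name are illustrative assumptions; the paper only states that the Word2Vec toolkit is used on the pre-processed OMSTI corpus):

```python
# A minimal sketch of learning sense-form embeddings via skip-gram with
# negative sampling, under the assumptions stated above.
from gensim.models import Word2Vec

# Each line: a sentence in which every sense-tagged word has been replaced
# by a sense-form token, e.g. "run|operate.v.01".
with open("omsti_senseforms.txt") as f:  # hypothetical pre-processed corpus
    sentences = [line.split() for line in f]

model = Word2Vec(sentences,
                 vector_size=300,  # match the 300-d pre-trained word vectors
                 sg=1,             # skip-gram objective
                 negative=5)       # negative sampling

sense_form_vector = model.wv["run|operate.v.01"]
```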
c) Word-Sense Representation and Induction

We initialise each sense-form node in WN using the representations obtained from the sense-tagged corpora. Then, for each sense-form and the respective augmented synonym set, we obtain unique multi word-sense representations by jointly grounding the word and sense-form embeddings to WordNet. For a word (w) in the synonym set of a sense (s), we obtain multi word-sense representations as follows:

        v_{w,s} = α_{w,s} ([u_w, v_{s,form(s)}])

where u_w is the pre-trained word embedding and v_{s,form(s)} is the contextualized sense-form representation of the node learned from the sense-tagged corpora. For grounding, we use WordNet's synset rank information and graph structure to obtain the scaling factor α_{w,s} as follows:

        α_{w,s} = 1 − c·log(x),  where  x = rank_{s,w} + d(w, s)

For the word (w) in the (w, s) pair, WN gives the list of senses (S_w) in decreasing order of likelihood. We use this to obtain the rank, rank_{s,w}, of a sense s with respect to w. The sense with rank 1 in S_w for a word is thus the most likely sense of the word. As outlined in the previous sections, we use an augmented synonym set, extended from a thesaurus for each synset node, which means there are many word-sense pairs in our extended-WN that are not found in WN. For example, the extended-WN includes "hold" as a synonym for the sense "influence.n.01". This word and sense pair (hold, influence.n.01) is not found in WN; thus "influence.n.01" is not part of S_hold in the original WN. If a word (w), sense (s) pair from our extended-WN is present in S_w, we use the rank directly. If not, we use the rank of the synset in S_w that is "closest" to the sense s in the word-sense pair, where the WN path-similarity (p) metric is used to denote "closeness". We would also like to penalise senses s found in our extended-WN pairs more if they are farther in the WN graph structure from the original senses S_w given by WN for word w. The intuition is that the closer a sense is to a word in the WN graph, the more relevant it is to the word. The same intuition is followed in retrofitting vectors to lexicons as well (Faruqui et al. 2014). d(w, s) is the penaliser in our equation, which obtains the distance between a word and a sense as follows:

        d(w, s) = min{ 1 − p(s, x) ∀ x ∈ S_w }

Recall that p(s, x) is the path-similarity score, with a higher score denoting closer pairs, implying closer pairs get assigned a lower penalizing distance. We use a monotonically decreasing distribution 1 − c·log(x), with c as some constant, as found by (Arora et al. 2018). As a result of feeding the ranks and graph-structure distances between w and s to this distribution, the lower-ranked (with one being the highest) and farther-away synsets (or bigger d) get lower scaling scores. Senses similar in rank and distance thus get similar scaling scores. We thus get grounded representations with the scaling factor α_{w,s} reflecting likelihood and ontology graph structure.
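A compact sketch of this grounding step (the value of c and the use of the natural logarithm are assumptions; the paper leaves c as "some constant"):

```python
# A sketch of the grounding step: scale the concatenated [word, sense-form]
# vector by alpha_{w,s} = 1 - c*log(rank_{s,w} + d(w, s)).
import numpy as np

C = 0.1  # hypothetical constant; not specified in the paper

def word_sense_vector(u_word, v_sense_form, rank, d_ws):
    """v_{w,s} = alpha_{w,s} * [u_w, v_{s,form(s)}] (a 600-d vector)."""
    # rank 1 (the most likely sense) and a small penaliser d give the
    # largest scaling factor alpha.
    alpha = 1.0 - C * np.log(rank + d_ws)
    return alpha * np.concatenate([u_word, v_sense_form])
```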
                      Experiments

In this section, we describe the experiments done to evaluate our multi word-sense embeddings. We use an array of existing word similarity and relatedness datasets to conduct intrinsic evaluation, and 4 datasets across 2 tasks for extrinsic evaluation.

Intrinsic Evaluation

We test our embeddings intrinsically on similarity, relatedness and contextual similarity datasets.

Word Representations  To run our experiments, we pick two different embeddings of 300 dimensions: GLoVE (Pennington, Socher, and Manning 2014) and Skip-Gram (SG) (Mikolov et al. 2013a). We use these embeddings for word sense induction in our experiments because they are a popular choice for NLP systems at the time of writing. The resulting CoKE embeddings, after scaling and concatenation with the word embeddings, are 600-dimensional.

Similarity Measures  Given a pair of words w with M senses and w′ with N senses, we use the following two metrics proposed by (Reisinger and Mooney 2010) for computing similarity scores without using context:

        AvgSim(w, w′) = (1 / MN) Σ_{i=1..M} Σ_{j=1..N} cos(v_{w,i}, v_{w′,j})

        MaxSim(w, w′) = max_{1≤i≤M, 1≤j≤N} cos(v_{w,i}, v_{w′,j})

AvgSim computes word similarity as the average similarity between all pairs of sense vectors, whereas MaxSim computes the maximum over all pairwise sense vector similarities.
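A minimal sketch of the two metrics, assuming each word maps to a list of its sense vectors:

```python
# AvgSim and MaxSim over the sense vectors of two words.
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim(senses_w, senses_w2):
    """Average cosine similarity over all sense pairs of the two words."""
    return float(np.mean([_cos(vi, vj) for vi in senses_w for vj in senses_w2]))

def max_sim(senses_w, senses_w2):
    """Maximum cosine similarity over all sense pairs of the two words."""
    return float(np.max([_cos(vi, vj) for vi in senses_w for vj in senses_w2]))
```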
                                                                   obtained by concatenating word with sense embeddings to
Word Representations To run our experiments,we                     get word-sense embeddings, is because of the limited num-
pick two different embeddings of 300 dimension                     ber of synonyms for a synset recorded in WordNet along
GLoVE(Pennington, Socher, and Manning 2014) ,and                   with the limited size of the dataset used to learn these em-
Skip-Gram(SG)(Mikolov et al. 2013a). We use these                  beddings.
embeddings for word sense induction in our experiments             The average improvement column in the table(Avg Improve-
because they are a popular choice for NLP systems at the           ment), shows a significant improvement in performance on
time of writing the paper. The resulting CoKE embeddings           splitting senses to sense-forms and grounding(CoKE(Ext-
after scaling and concatenation with word embeddings is            WN)). The benefits of this approach are reflected mainly
600 dimension.                                                     on the SimVerb-3500 dataset. This is not a surprising result
                                                                   since words tend to have more syntactic forms when they oc-
Similarity Measures Given a pair of words w with M                 cur as verbs. With distributional semantics, syntactic forms
             0
senses and w with N senses, we use the following two met-          of verbs often remain close making it hard to capture differ-
rics proposed by (Reisinger and Mooney 2010) for comput-           ences. However drastic improvements can be seen through
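The evaluation loop just described can be sketched as follows, reusing max_sim from the earlier sketch (the pairs list of (word1, word2, gold_score) triples and the senses_of lookup are hypothetical stand-ins for the dataset loader):

```python
# A sketch of the intrinsic evaluation: Spearman correlation between the
# human labels and MaxSim scores.
from scipy.stats import spearmanr

def evaluate(pairs, senses_of):
    gold, predicted = [], []
    for w1, w2, gold_score in pairs:
        gold.append(gold_score)
        predicted.append(max_sim(senses_of(w1), senses_of(w2)))
    rho, _ = spearmanr(gold, predicted)
    return rho * 100  # reported as rho x 100, as in Table 4
```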
  Vector                WS-S     RG-65    RW       SL-999   YP       MC       SV-3500   Avg Improvement
  SG                    76.96    74.97    50.33    44.19    55.89    78.80    36.35     -
  +Synset(WN)           -25.76   -11.85   -28.24   +0.59    +5.41    -11.44   +1.1      -10.02
  +CoKE(Ext-WN)         -24.64   -7.96    -27.7    +4.04    +11.75   -9.48    +6.71     -6.75
  +CoKE(Thes+Ext-WN)    +0.21    +10.84   +1.72    +17.69   +11.69   +5.98    +13.51    +8.80
  Glove                 79.43    76.15    45.78    40.82    57.08    78.60    28.32     -
  +Synset(WN)           -23.05   -10.34   -23.03   +0.48    +0.26    -10.24   +0.47     -9.35
  +CoKE(Ext-WN)         -22.11   -4.23    -25.38   +6.96    +7.02    -6.19    +8.06     -5.12
  +CoKE(Thes+Ext-WN)    +0.23    +11.6    +1.51    +18.29   +11.8    +7.27    +17.59    +9.75

Table 2: Performance differences using CoKE on similarity tasks. Baseline scores of the original pre-trained embeddings are included at the top of each block. Synset(WN) indicates concatenation with synset embeddings using the senses of a word from WordNet, CoKE(Ext-WN) represents CoKE obtained using the extended-WordNet, and CoKE(Thes+Ext-WN) is CoKE obtained using the thesaurus-augmented version of the extended-WordNet.
  Vector                WS-R     MEN      MT-771   SGS      Avg Improvement
  SG                    61.75    73.59    67.71    56.61    -
  +Synset(WN)           -12.37   -10.07   -6.15    -13.25   -10.46
  +CoKE(Ext-WN)         -11.65   -8.38    -5.34    -15.72   -10.27
  +CoKE(Thes+Ext-WN)    +0.13    +0.71    +0.19    +8.51    +2.38
  Glove                 66.92    79.88    71.57    58.34    -
  +Synset(WN)           -6.52    -11.31   -4.54    -14.36   -9.18
  +CoKE(Ext-WN)         -6.78    -10.64   -3.8     -14.7    -8.98
  +CoKE(Thes+Ext-WN)    +0.2     +0.49    +0.47    +12.92   +3.52

Table 3: Performance differences using CoKE on word relatedness tasks. Baseline scores of the original pre-trained embeddings are included at the top of each block. Synset(WN) indicates concatenation with synset embeddings using the senses of a word from WordNet, CoKE(Ext-WN) represents CoKE obtained using the extended-WordNet, and CoKE(Thes+Ext-WN) is CoKE obtained using the thesaurus-augmented version of the extended-WordNet.
                          Model                                ρ × 100
        (Jauhar, Dyer, and Hovy 2015)                           61.3
        (Iacobacci, Pilehvar, and Navigli 2015)                 62.4
        (Huang et al. 2012)                                     62.8
        (Athiwaratkun and Wilson 2017)                          65.5
        (Chen, Liu, and Sun 2014)                               66.2
        CoKE + SG (our model)                                   67.3
        (Rothe and Schütze 2015)                                68.9

Table 4: Comparison of our multi word-sense representations with other state-of-the-art representations on the Stanford Contextual Word Similarity (SCWS) dataset, which evaluates polysemous word similarity.


However, drastic improvements can be seen through thesaurus inclusion (CoKE(Thes+Ext-WN)). This is because using WordNet alone leads to limited lexemes, on account of words being represented by fewer senses, as opposed to the large number of senses captured for a word by word embeddings as a result of being trained on large datasets. On including a thesaurus and augmenting the synonym sets of synsets in WordNet, the number of senses representing a word drastically changes, leading to more lexemes that closely reflect all possible senses of a word.
We also note that the improvements for WS-S are relatively lower; we suspect this is because the dataset is designed based on association rather than similarity alone. We also observe that as the baseline of an embedding space gets higher for a dataset, the performance gains reduce, since most of the information is already captured in the embedding space. The same trend is observed in (Faruqui et al. 2014).
                 Dataset     GloVe    CoKE     CoVE     CoKE(+CoVE)    ELMo     CoKE(+ELMo)
                  SST-2      85.99    85.72    88.18    89.41          88.02    89.32
                  SST-5      50.19    50.56    51.4     50.97          51.62    51.60
                 TREC-6      89.90    91.53    90.56    91.15          91.59    92.78
                 TREC-50     83.84    85.5     84.59    85.46          84.31    84.249

Table 6: Accuracy differences on sentiment analysis and question classification tasks of CoKE, CoVE, CoKE(+CoVE), ELMo and CoKE(+ELMo), with GLoVE as the baseline. CoKE improves performance when used alone as well as when used with a disambiguation system. Note that CoVE and ELMo are only used for disambiguation; their representations are not included with CoKE.


Word Relatedness  Integration of our vectors also shows improvements on word relatedness tasks. As our benchmarks, we evaluate on WS-R (relatedness), MTurk-771 (Halawi et al. 2012), MEN (Bruni et al. 2012), and SGS130 (Szumlanski, Gomez, and Sims 2013), which includes phrases. We evaluate the performance of our method against standard pre-trained word embeddings using Spearman correlation. We use AvgSim as our metric to measure relatedness and report the scores in Table 3.
   The baselines we use are the same as for word similarity, as described above. We notice that performance improvements through sense-form splitting are not as drastic as for word similarity. This could be on account of word relatedness tasks more frequently checking for the relatedness of objects rather than verbs; sense-form splitting is more beneficial to verbs than to nouns, on account of the more varied forms of words as verbs.
   We are not sure why the overall performance gains are not as high as for similarity, but the scores do reflect gains, as opposed to retrofitting directly to lexicons, which leads to a serious drop in relatedness. The big performance gains on SGS (Szumlanski, Gomez, and Sims 2013) are due to the phrases present in the dataset. By using a thesaurus and WN, we learn multiple phrasal representations not found in the original word embedding space.

Word Similarity for Polysemous Words  We use the SCWS dataset introduced by (Huang et al. 2012), where word pairs are chosen to have variations in meanings for polysemous and homonymous words. We compare our method with other state-of-the-art multi-prototype models and find that our model performs competitively with previous models. We use the Skip-Gram (SG) word embedding with our method to allow for a fair comparison, since previous work uses Skip-Gram for retrofitting to WordNet. The Spearman correlations between the labels and scores are indicated in Table 4.

Extrinsic Evaluation

A lot of the prior work on obtaining sense embeddings shows performance improvements on intrinsic tasks but leaves out testing on downstream tasks, making it difficult to judge the effectiveness of these representations. To bridge this gap, we run experiments on two tasks (Sentiment Analysis and Question Classification) across 4 datasets to provide some insight into the usefulness of our representations.

Datasets  For sentiment analysis we use the Stanford Sentiment Treebank dataset (Socher et al. 2013). We train separately and test on the binary version (SST-2) as well as the five-class version (SST-5). For question classification, we evaluate performance on the TREC (Voorhees 2001) question classification dataset, which consists of open-domain questions and semantic categories.
Performance Comparisons  We first run experiments on CoKE by representing words as an average of their respective sense embeddings. Words can be viewed as a weighted sum of their senses; the intuition behind using averaged embeddings is that having grounded word-sense representations should lead to better word representations through averaging.
   Recent trends have also led to an increasing interest in transfer learning for obtaining superior word representations. CoVE (McCann et al. 2017) and ELMo (Peters et al. 2018) show significant improvements on extrinsic tasks. CoVE uses word representations learned from a machine translation system in combination with GloVE embeddings. ELMo, on the other hand, uses a language model to obtain contextualised word representations. As shown by (Peters et al. 2018), these systems inherently act as word sense disambiguation and representation systems. They give word representations conditioned on the context they occur in and perform on par with state-of-the-art word sense disambiguation systems, but it is unclear how informative the sense representations are. We thus hypothesise that these systems can benefit from better sense representations.
   Due to the promising performance of CoVE and ELMo as word sense disambiguation systems, and the increasing interest in using them in NLP tasks, we use them as disambiguation systems in our experiments to sense-tag the four benchmark datasets. To get the disambiguated sense tags using CoVE or ELMo, we use the same approach as outlined in (Peters et al. 2018): we compute each word's representation in OMSTI using CoVE or ELMo and then use the average of all the representations obtained for a sense as its respective sense representation. To disambiguate a sentence, we run the sentence through the CoVE or ELMo architecture to get word representations, and tag each word with the nearest-neighbour sense from the corresponding CoVE- or ELMo-computed sense representations. For ELMo, we use the last layer and the publicly available pre-trained version.
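A sketch of this nearest-neighbour tagging step, reusing _cos from the earlier sketch (the encode function stands in for a CoVE or ELMo encoder and sense_centroids for the per-sense averages precomputed over OMSTI; both are assumptions for illustration):

```python
# Nearest-neighbour sense tagging in the style of (Peters et al. 2018).
def tag_senses(tokens, encode, sense_centroids):
    """Tag each token with the sense whose OMSTI centroid is nearest."""
    tagged = []
    for token, rep in zip(tokens, encode(tokens)):  # one vector per token
        candidates = sense_centroids.get(token)     # {sense_id: centroid}
        if not candidates:
            tagged.append(None)  # untagged words fall back to GLoVE + unknown
            continue
        tagged.append(max(candidates, key=lambda s: _cos(rep, candidates[s])))
    return tagged
```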
   In these experiments, we use the CoKE word-sense embeddings obtained by using GLoVE with the thesaurus and extended-WordNet for grounding. We pick CoKE with GLoVE embeddings to be fair in comparison with CoVE, which is obtained by concatenation with GLoVE embeddings.
   We thus compare performance using GLoVE, CoVE and ELMo independently; using an average of CoKE representations to get word representations; and using ELMo/CoVE as disambiguation systems with sense-tagged words represented by CoKE embeddings (CoKE(+CoVE), CoKE(+ELMo)). Note that if a word is not sense-tagged, we use vanilla GLoVE vectors concatenated with an unknown vector.

Training Details  To test the performance of the different embeddings on these datasets, we implement a single-layer LSTM (Hochreiter and Schmidhuber 1997) with a hidden size of 300 and run our experiments. Parameters were fine-tuned specifically for each task and embedding type.

Results  As shown in Table 6, using CoKE shows more significant improvements on classification than on sentiment analysis. This is an expected outcome, since our approach focuses on ontology grounding without considering the polarity of words, which is the primary goal of sentiment analysis. Classification, on the other hand, is more sensitive to representations that cater to similarity and relatedness between sentences. Significant improvements can be seen on classification tasks even by using averaged CoKE embeddings without disambiguation.

                     Qualitative Analysis

In this section, we look at some visualisations of the senses induced and show how they are easily interpretable. Since sense tags have lexical mappings to an ontology, they can be looked up to find meanings. Moreover, the semantic distribution of the word-senses also plays a role in obtaining meaningful sense clusters. We analyse two things: 1) the sense clusters induced, and 2) how using different sense-forms affects representations and sense interactions in their respective word forms. For all our analysis, we use the concatenated version of CoKE + GLoVE embeddings and use Principal Component Analysis (PCA) to perform dimensionality reduction.

Sense Clusters

We look at the sense clusters formed by our word-specific sense embeddings for the word "rock".
   The clusters for the word "rock" are depicted in Figure 2. The multiple fine-grained word-sense embeddings for the word "rock" cluster to form 5 basic senses. We see three distinct clusters that dominate. "Cluster#2" can be interpreted as all synsets that speak of rock as a "substance". In "Cluster#3", the synsets cluster together to speak of rock as "music". An interesting property can be observed comparing "Cluster#1" and "Cluster#5": the senses found in both of these clusters interpret "rock" as "movement/motion", yet the two distinct clusters also capture the kind of motion. For instance, the senses roll.v.13 and rock.v.01 in "Cluster#5" map specifically to "sideways movement", while the senses in "Cluster#1" map to glosses of "sudden movements" (convulse, lurch, move, tremble) and "back and forth movements" (wobble, rock). Another interesting property is depicted by "Cluster#4": although they are more synonymous in meaning to rock as a "substance", the senses for gravel cluster very closely to senses mapping to the gloss "jerking movements", capturing deeper relations between senses.

Sense Forms

In this section, we analyse how different sense-form representations interact for synonyms within a synset. We do so by considering the word-forms "plan" and "planning", both of which are synonyms of their respective sense-forms of "mastermind.v.01" (Gloss: plan and direct, a complex undertaking).
Figure 2: Sense clusters for the word "rock", visualized using PCA.

Figure 3: a) Interactions between different senses of the word "plan". b) Interactions between different senses of the word "planning".
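A sketch of how such visualisations could be reproduced (scikit-learn and matplotlib are assumptions for illustration; the paper only states that PCA is used for dimensionality reduction):

```python
# Project a word's sense vectors to 2-D with PCA and plot them.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_senses(sense_vectors):  # {sense label, e.g. "rock.v.01": vector}
    labels = list(sense_vectors)
    points = PCA(n_components=2).fit_transform(
        [sense_vectors[label] for label in labels])
    plt.scatter(points[:, 0], points[:, 1])
    for label, (x, y) in zip(labels, points):
        plt.annotate(label, (x, y))
    plt.show()
```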


   In order to observe the difference in sense-form relationships of word-forms, we consider only the synsets common to "plan" and "planning" for visualisation and observe their interactions with each other. For the word "plan", as shown in Figure 3.a), we observe that the synset "mastermind" is closer in proximity to synsets that map to words like "plan", "sketch", and "prepare". In contrast, the same synset in the embedding space for "planning", as shown in Figure 3.b), interacts closely with synsets that are analogous to "project planning", "scheduling", and "organising". This shows how using different sense-form representations leads to different and unique interactions among the same group of synsets for each word.
                       Conclusion

In our work, we explore the possibility of obtaining multi word-sense representations and inducing senses into embedding spaces by using distributional semantics and a knowledge base. The prototypes allow ease of use with WSD systems and can easily be used in downstream applications, since they are portable and flexible to use in a wide variety of tasks. Previous work on obtaining sense representations falls under three distinct clusters: unsupervised methods, supervised resource-specific methods, and ontology grounding. By using pre-trained unsupervised embeddings and supervised sense embeddings, and jointly grounding them in an ontology, ours is the first approach that lies in the intersection of all three. The code and vectors will be made publicly available.
                        References

Arora, S.; Li, Y.; Liang, Y.; Ma, T.; and Risteski, A. 2018. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association of Computational Linguistics 6:483–495.
Athiwaratkun, B., and Wilson, A. G. 2017. Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1645–1656.
Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2016a. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Bruni, E.; Boleda, G.; Baroni, M.; and Tran, N.-K. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, 136–145. Association for Computational Linguistics.
Chen, X.; Liu, Z.; and Sun, M. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1025–1035.
Cheng, J., and Kartsaklis, D. 2015. Syntax-aware multi-sense word embeddings for deep compositional models of meaning. arXiv preprint arXiv:1508.02354.
Faruqui, M.; Dodge, J.; Jauhar, S. K.; Dyer, C.; Hovy, E.; and Smith, N. A. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.
Gabrilovich, E., and Markovitch, S. Computing semantic relatedness using wikipedia-based explicit semantic analysis.
Gerz, D.; Vulić, I.; Hill, F.; Reichart, R.; and Korhonen, A. 2016. Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.
Halawi, G.; Dror, G.; Gabrilovich, E.; and Koren, Y. 2012. Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1406–1414. ACM.
Hill, F.; Reichart, R.; and Korhonen, A. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4):665–695.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Huang, E. H.; Socher, R.; Manning, C. D.; and Ng, A. Y. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, 873–882. Association for Computational Linguistics.
Iacobacci, I.; Pilehvar, M. T.; and Navigli, R. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, 95–105.
Jauhar, S. K.; Dyer, C.; and Hovy, E. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 683–693.
Luong, M.-T.; Socher, R.; and Manning, C. D. 2013a. Better word representations with recursive neural networks for morphology. In CoNLL.
Luong, T.; Socher, R.; and Manning, C. 2013b. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning.
McCann, B.; Bradbury, J.; Xiong, C.; and Socher, R. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, 6297–6308.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
Miller, G. A., and Charles, W. G. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6(1):1–28.
Miller, G. A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
Neelakantan, A.; Shankar, J.; Passos, A.; and McCallum, A. 2015. Efficient non-parametric estimation of multiple embeddings per word in vector space. arXiv preprint arXiv:1504.06654.
Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proc. of NAACL.
Reisinger, J., and Mooney, R. J. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 109–117. Association for Computational Linguistics.
Rothe, S., and Schütze, H. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. arXiv preprint arXiv:1507.01127.
Rubenstein, H., and Goodenough, J. B. 1965. Contextual correlates of synonymy. Communications of the ACM 8(10):627–633.
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642.
Szumlanski, S.; Gomez, F.; and Sims, V. K. 2013. A new set of norms for semantic relatedness measures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, 890–895.
Taghipour, K., and Ng, H. T. 2015. One million sense-tagged instances for word sense disambiguation and induction. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, 338–344.
Tian, F.; Dai, H.; Bian, J.; Gao, B.; Zhang, R.; Chen, E.; and Liu, T.-Y. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 151–160.
Voorhees, E. M. 2001. The trec question answering track. Natural Language Engineering 7(4):361–378.
Wu, Z., and Giles, C. L. 2015. Sense-aware semantic analysis: A multi-prototype word representation model using wikipedia. In AAAI, 2188–2194.
Yang, D., and Powers, D. M. 2006. Verb similarity on the taxonomy of WordNet. Masaryk University.