How Contextualized Word Embeddings Represent Word Senses

Rocco Tripodi
University of Bologna
rocco.tripodi@unibo.it

Copyright © 2021 for this paper by its author. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. Contextualized embedding models, such as ELMo and BERT, allow the construction of vector representations of lexical items that adapt to the context in which words appear. It has been demonstrated that the upper layers of these models capture semantic information. This evidence paved the way for the development of sense representations based on words in context. In this paper, we analyze the vector spaces produced by 11 pre-trained models and evaluate these representations on two tasks. The analysis shows that all these representations contain redundant information. The results show the drawbacks of this aspect.

Italiano. Models such as ELMo and BERT make it possible to obtain vector representations of words that adapt to the context in which they appear. The fact that the upper layers of these models store semantic information has led to the development of sense representations based on words in context. In this work we analyze the vector spaces produced with 11 pre-trained models and evaluate how well they represent the different senses of words. The analysis shows that these models contain redundant information. The results highlight the issues inherent in this aspect.

1 Introduction

The introduction of contextualized embedding models, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), allows the construction of vector representations of lexical items that adapt to the context in which words appear. It has been shown that the upper layers of these models contain semantic information (Jawahar et al., 2019) and are more diversified than lower layers (Ethayarajh, 2019). These word representations overcome the meaning conflation deficiency that affects static word embedding techniques (Camacho-Collados and Pilehvar, 2018; Tripodi and Pira, 2017), such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), thanks to their adaptation to the context of use.

The evaluation of these models has been conducted mainly on downstream tasks (Wang et al., 2018; Wang et al., 2019). In extrinsic evaluations, the models are fine-tuned, adapting the vector representations to specific tasks, and the resulting vectors are then used as features in classification problems. This hinders a direct evaluation and analysis of the models, because the evaluation also takes into account the ability of the classifier to learn the task: a model trained in this way may learn only to discriminate among the features that belong to each class, with poor generalization.

The interpretability of neural networks is an emerging line of research in NLP that aims at analyzing the properties of pre-trained language models (Belinkov and Glass, 2019). Different studies have been conducted in recent years to discover what kind of linguistic information is stored in large neural language models. Many of them focus on syntax (Hewitt and Manning, 2019; Jawahar et al., 2019) and attention (Michel et al., 2019; Kovaleva et al., 2019). Concerning semantics, the majority of the studies focus on common knowledge (Petroni et al., 2019) and on inference and role-based event prediction (Ettinger, 2020). Only a few have been devoted to lexical semantics; for example, Reif et al. (2019) show how different representations of the same lexical form tend to cluster according to their sense.
In this work, we propose an in-depth analysis of the properties of the vector spaces induced by different embedding models and an evaluation of their word representations. We present how the properties of the vector space contribute to the success of the models in two tasks: sense induction and word sense disambiguation. In fact, even if contextualized models do not create one representation per word sense (Ethayarajh, 2019), their contextualization creates similar representations for the same word sense that can be easily clustered.

2 Related Work

Given the success (and the opacity) of contextualized embedding models, many works have been proposed to analyze their inner representations. These analyses are based on probing tasks (Conneau et al., 2018) that aim at measuring how useful the information extracted from a pre-trained model is for representing linguistic structures. Probing tasks involve training a diagnostic classifier to determine whether a representation encodes the desired features. Tenney et al. (2019) discovered that specific BERT layers are more suited to representing information useful for solving specific tasks, and that the ordering of its layers resembles the ordering of a traditional NLP pipeline: POS tagging, parsing, NER, semantic role labeling, and coreference resolution. Hewitt and Manning (2019) evaluated whether syntax trees are embedded in a linear transformation of a neural network's word representation space. Hewitt and Liang (2019) raised the problem of interpreting the results derived from probing analysis: in fact, it is difficult to understand whether high accuracy values are due to the representation itself or are instead the result of the ability to learn a specific task during training.

Our work is more in line with works that try to find general properties of the representations generated by different contextualized models. For example, Mimno and Thompson (2017) demonstrated that the vector space produced by a static embedding model is concentrated in a narrow cone and that its concentration depends on the ratio of positive and negative examples. Mu and Viswanath (2018) explored this analysis further, demonstrating that the embedding vectors share the same common vector and have the same main direction. Ethayarajh (2019) demonstrated how the upper layers of a contextualizing model produce more contextualized representations. We build on top of these works, analyzing the vector spaces generated by contextualized models and evaluating them.

3 Construction of the Vector Spaces

We used SemCor (Miller et al., 1993) as the reference corpus for our work. This choice is motivated by the fact that it is the largest dataset manually annotated with sense information, and it is commonly used as a training set for word sense disambiguation. It contains 352 documents whose content words (about 226,000) have been annotated with WordNet (Miller, 1995) senses. In total there are 33,341 unique senses distributed over 22,417 different words. The sense distribution in this corpus is very skewed and follows a power law (Kilgarriff, 2004). This makes the identification of senses challenging. The dataset is also difficult due to the fine granularity of WordNet (Navigli, 2006).
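As a rough illustration of the kind of data used here, the sketch below shows one way to gather the sense-annotated occurrences from SemCor, grouping them by (word, sense) pairs. The paper does not specify how the corpus was read; the NLTK SemCor reader, the function name, and the handling of untagged chunks are assumptions of this sketch.

```python
# Sketch: collecting sense-tagged occurrences from SemCor with NLTK
# (requires nltk.download('semcor') and nltk.download('wordnet')).
# The reader and the chunk handling below are assumptions; the paper
# only states that SemCor annotations are used as ground truth.
from collections import defaultdict
from nltk.corpus import semcor
from nltk.tree import Tree

def collect_sense_occurrences():
    """Map (word, sense) -> list of (sentence_tokens, target_span)."""
    occurrences = defaultdict(list)
    for sent in semcor.tagged_sents(tag="sem"):
        tokens, annotations = [], []
        for chunk in sent:
            if isinstance(chunk, Tree) and hasattr(chunk.label(), "synset"):
                lemma = chunk.label()                  # a WordNet Lemma object
                words = chunk.leaves()
                annotations.append((len(tokens), len(words), lemma))
                tokens.extend(words)
            else:                                      # untagged material
                tokens.extend(chunk if isinstance(chunk, list) else chunk.leaves())
        for start, length, lemma in annotations:
            key = (lemma.name(), lemma.synset().name())
            occurrences[key].append((tokens, (start, start + length)))
    return occurrences
```

Each occurrence keeps the full sentence and the position of the annotated word, which is what the construction of the vector space described next needs.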
To construct the vector space A from SemCor we collected all the senses $S_i$ of a word $w_i$ and, for each sense $s_j \in S_i$, we recovered the sentences $\{Sent_1^{w_i s_j}, Sent_2^{w_i s_j}, \dots, Sent_n^{w_i s_j}\}$ in which this particular sense occurs. These sentences are then fed into a pre-trained model and the token embedding representations of word $w_i$, $\{e_1^{w_i s_j}, e_2^{w_i s_j}, \dots, e_n^{w_i s_j}\}$, are extracted from the last hidden layer. This operation is repeated for all the senses in $S_i$, and for all the tagged words in the vocabulary, V. The vector space corresponds to all the representations of the words in V.
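A minimal sketch of this extraction step with the transformers library (which the paper states it uses) is shown below. The model name, the function name, and the span-based interface are illustrative; a fast tokenizer is assumed so that sub-tokens can be mapped back to words before averaging them, as described above.

```python
# Sketch: last-hidden-layer representation of a target word occurrence,
# averaging sub-token embeddings. Model name and helper names are
# illustrative; a fast tokenizer is assumed for word_ids().
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-cased"          # any of the 11 models can be plugged in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_target(tokens, target_span):
    """Return the last-layer embedding of the word spanning tokens[start:end]."""
    start, end = target_span
    encoding = tokenizer(tokens, is_split_into_words=True,
                         return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**encoding).last_hidden_state[0]   # (seq_len, dim)
    word_ids = encoding.word_ids(0)                       # sub-token -> word index
    positions = [i for i, w in enumerate(word_ids)
                 if w is not None and start <= w < end]
    return hidden[positions].mean(dim=0)                  # average sub-tokens

# Example: the annotated occurrence of "foot" in a short sentence.
vector = embed_target(["He", "hurt", "his", "foot", "."], (3, 4))
```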
A t-SNE visualization of the different embeddings in SemCor for the word foot is presented in Figure 1. In this figure, we can see that the three main senses of foot (i.e., human foot, unit of length, and lower part) occupy definite positions in the vector space, suggesting that the models are able to produce specific representations for the different senses of a word and that these representations lie on defined subspaces. In this work we want to test to what extent this feature is present in language models.

Figure 1: t-SNE representations for the word foot in SemCor, grouped by sense.

Implementation details  The pre-trained models used in this study are: two BERT (Devlin et al., 2019) models, base cased (12-layer, 768-hidden, 12-heads, 110M parameters) and large cased (24-layer, 1024-hidden, 16-heads, 340M parameters); three GPT-2 (Radford et al., 2019) models, base (12-layer, 768-hidden, 12-heads, 117M parameters), medium (24-layer, 1024-hidden, 16-heads, 345M parameters) and large (36-layer, 1280-hidden, 20-heads, 774M parameters); two RoBERTa (Liu et al., 2019) models, base (12-layer, 768-hidden, 12-heads, 125M parameters) and large (24-layer, 1024-hidden, 16-heads, 355M parameters); two XLNet (Yang et al., 2019) models, base (12-layer, 768-hidden, 12-heads, 110M parameters) and large (24-layer, 1024-hidden, 16-heads, 340M parameters); one XLM (Lample et al., 2019) model (12-layer, 2048-hidden, 16-heads); and one CTRL (Keskar et al., 2019) model (48-layer, 1280-hidden, 16-heads, 1.6B parameters). The main features of these models are summarized in Table 1. We used the transformers library (Wolf et al., 2019) and averaged the embeddings of sub-tokens to obtain token-level representations.

Model | training data | vocab. size | n. param. | vec. dim. | objective
BERTbase (Devlin et al., 2019) | 16GB | 30K | 110M | 768 | masked language model and next sentence prediction
BERTlarge (Devlin et al., 2019) | 16GB | 30K | 340M | 1024 | masked language model and next sentence prediction
GPT-2base (Radford et al., 2019) | 40GB | 50K | 117M | 768 | language model
GPT-2medium (Radford et al., 2019) | 40GB | 50K | 345M | 1024 | language model
GPT-2large (Radford et al., 2019) | 40GB | 50K | 774M | 1280 | language model
RoBERTabase (Liu et al., 2019) | 160GB | 50K | 125M | 768 | masked language model
RoBERTalarge (Liu et al., 2019) | 160GB | 50K | 355M | 1024 | masked language model
XLNetbase (Yang et al., 2019) | 126GB | 32K | 110M | 768 | bidirectional language model
XLNetlarge (Yang et al., 2019) | 126GB | 32K | 340M | 1024 | bidirectional language model
XLMenglish | 16GB | 30K | 665M | 2048 | language model
CTRL (Keskar et al., 2019) | 140GB | 250K | 1.63B | 1280 | conditional transformer language model

Table 1: Statistics and hyperparameters of the models.

3.1 Analysis

The first objective of this work is to analyze the vector spaces produced with the models. This analysis is aimed at investigating the properties of the contextualized vectors. A detailed description of the embedding spaces constructed with the pre-trained models is presented in Table 2.

Model | AvgNorm | MeanVecNorm(A) | MeanVecNorm(Â) | avg. MEV | avg. IntSim | avg. ExtSim
BERTbase | 25.78 ± 1.28 | 17.94 | 17.84 | 0.43 ± 0.18 | 0.74 ± 0.05 | 0.69 ± 0.06
BERTlarge | 20.83 ± 2.51 | 12.43 | 11.58 | 0.38 ± 0.18 | 0.66 ± 0.08 | 0.59 ± 0.08
GPT-2base | 125.13 ± 10.25 | 91.46 | 90.99 | 0.46 ± 0.18 | 0.79 ± 0.05 | 0.76 ± 0.05
GPT-2medium | 427.45 ± 38.78 | 371.86 | 360.36 | 0.51 ± 0.18 | 0.85 ± 0.03 | 0.84 ± 0.03
GPT-2large | 290.29 ± 38.56 | 226.39 | 212.97 | 0.43 ± 0.18 | 0.75 ± 0.05 | 0.72 ± 0.05
RoBERTabase | 25.78 ± 0.56 | 22.17 | 22.25 | 0.51 ± 0.17 | 0.87 ± 0.02 | 0.85 ± 0.03
RoBERTalarge | 31.47 ± 0.65 | 26.99 | 27.04 | 0.52 ± 0.18 | 0.88 ± 0.02 | 0.84 ± 0.03
XLNetbase | 47.68 ± 0.66 | 43.28 | 43.26 | 0.53 ± 0.17 | 0.88 ± 0.01 | 0.87 ± 0.02
XLNetlarge | 28.27 ± 1.42 | 19.56 | 19.68 | 0.38 ± 0.17 | 0.66 ± 0.04 | 0.62 ± 0.05
XLMenglish | 44.92 ± 2.61 | 37.13 | 36.7 | 0.45 ± 0.18 | 0.79 ± 0.03 | 0.77 ± 0.03
CTRL | 4443.62 ± 351.98 | 3927.86 | 3879.56 | 0.49 ± 0.18 | 0.84 ± 0.02 | 0.83 ± 0.02

Table 2: Detailed description of the embedding space produced with each model.

We computed the norm of all the vectors in the vector space A and averaged them:

\[ \mathrm{AvgNorm} = \frac{1}{|A|} \sum_{i=1}^{|A|} \lVert e_i \rVert_2 . \tag{1} \]

This measure gives us an intuition of how diverse the semantic spaces constructed with the different models are. In fact, we can see that the magnitude of the vectors constructed with BERT, RoBERTa, XLNet, and XLM is low, while those of GPT-2 and CTRL are very high.

We also computed the norm of the vector obtained by averaging all the vectors in the semantic space:

\[ \mathrm{MeanVecNorm} = \left\lVert \frac{1}{|A|} \sum_{i=1}^{|A|} e_i \right\rVert_2 . \tag{2} \]

All the semantic spaces have non-zero mean and the mean norm is high. This result suggests that the vectors contain redundant information and share a common nonzero vector. This is not only because the vector space contains representations of the same sense: in fact, if we create a new semantic space, Â, by averaging all the representations of the same word sense, the MeanVecNorm of this space is still high for all the models.

Figure 2: The first 500 principal components computed on A and Â.

We also used the Maximum Explainable Variance (MEV) for the representations of each word in V. This measure corresponds to the proportion of the variance in the embeddings that can be explained by their first principal component and was computed as:

\[ \mathrm{MEV}(w) = \frac{\sigma_1^2}{\sum_i \sigma_i^2} , \tag{3} \]

where $\sigma_1^2$ is the variance explained by the first principal component of the vector space. It gives an upper bound on how well contextualized representations can be replaced by a static embedding (Ethayarajh, 2019). The models with the lowest MEV are BERTlarge and XLNetlarge.

The other measures that we used for the evaluation of the vector space are based on the very notion of a cluster, which imposes that the data points inside a cluster must satisfy two conditions: internal similarity and external dissimilarity (Pelillo, 2009). To this end, we used the senses of each word in the vocabulary of SemCor as clusters and extracted the corresponding vectors from V. We then computed the internal similarity of a cluster, c, as:

\[ \mathrm{IntSim}(c) = \frac{1}{n^2 - n} \sum_{j} \sum_{k \neq j} \cos(e_j, e_k) , \tag{4} \]

where n is the number of data points in the cluster. We also computed the external similarity of a cluster c by computing the cosine similarity between each point in c and all the points in the subspace S induced by the senses of a word that has c as one of its senses:

\[ \mathrm{ExtSim}(c) = \frac{1}{n \cdot m} \sum_{j=1}^{n} \sum_{k=1}^{m} \cos(e_j, e_k) , \tag{5} \]

where m is the total number of data points in the subspace S (excluding those in c) and n is the number of points in the cluster c. Our hypothesis is that good representations should have high internal similarity and low external similarity, and that the difference between them should be large.
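The measures in Equations 1-5 are straightforward to compute once the vectors are available; a NumPy sketch is given below. The variable names are illustrative, and the mean-centering before the PCA used for MEV is an implementation assumption not stated in the text.

```python
# Sketch of the measures in Equations 1-5. `space` is a (N, d) matrix of
# contextualized vectors; `cluster` holds the vectors of one sense and
# `subspace` the vectors of the other senses of the same word.
import numpy as np

def avg_norm(space):                      # Eq. 1
    return np.linalg.norm(space, axis=1).mean()

def mean_vec_norm(space):                 # Eq. 2
    return np.linalg.norm(space.mean(axis=0))

def mev(word_vectors):                    # Eq. 3
    # variance explained by the first principal component
    # (mean-centering before the SVD is an assumption of this sketch)
    centered = word_vectors - word_vectors.mean(axis=0)
    variances = np.linalg.svd(centered, compute_uv=False) ** 2
    return variances[0] / variances.sum()

def _cos_matrix(x, y):
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T

def int_sim(cluster):                     # Eq. 4: exclude self-similarities
    n = len(cluster)
    sims = _cos_matrix(cluster, cluster)
    return (sims.sum() - np.trace(sims)) / (n ** 2 - n)

def ext_sim(cluster, subspace):           # Eq. 5: all cross-cluster pairs
    return _cos_matrix(cluster, subspace).mean()
```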
As can be seen from Table 2, the internal similarity is higher than the external similarity for all the models. Despite this, the scores span a wide range. The lowest IntSim is given by BERTlarge and the highest by RoBERTalarge and XLNetbase. The lowest ExtSim is given by BERTlarge and the highest by XLNetbase. The largest difference between the two measures is given by BERTlarge. RoBERTalarge also has a large gap between the two measures; furthermore, their standard deviations are very low. As we will see in Section 4, these last two models perform better than the others in clustering and classification tasks.

4 Evaluation

Sense Induction  This task is aimed at understanding whether representations belonging to different senses can be separated using an unsupervised approach. We hypothesize that a good contextualization process should produce more discriminative representations that can be easily identified by a clustering algorithm.

We used the sense clusters extracted from SemCor as ground truth for this experiment (see Section 3) and grouped them if they are senses of the same word (with a given part of speech). We retained only the groups with at least 20 data points and, for the evaluation with k-means, we also discarded monosemous words. The resulting datasets consist of 1,871 (entire) and 1,499 (without monosemous words) sub-datasets, with 141,074 and 116,019 data points in total, respectively. We computed the accuracy on each sub-dataset by counting the number of data points that were clustered correctly, and averaged the results to measure the performance of each model.

The first algorithm is k-means (Lloyd, 1982). It is a partitioning, iterative algorithm whose objective is to minimize the sum of point-to-centroid distances, summed over all k clusters. We used the k-means++ heuristic (Arthur and Vassilvitskii, 2007) and the cosine distance metric to determine distances. We selected this algorithm because it is simple, non-parametric, and widely used. It is important to notice that k-means requires the number of clusters to be specified in advance; for this reason, we restricted this evaluation to ambiguous words.

The second algorithm is dominant-set (Pavan and Pelillo, 2007). It is a graph-based algorithm that extracts compact structures from graphs, generalizing the notion of maximal clique defined on unweighted graphs to edge-weighted graphs. We selected this algorithm because it is non-parametric, requires only the adjacency matrix of a weighted graph as input and, more importantly, does not require the number of clusters to extract. The clusters are extracted from the graph sequentially, using a peel-off strategy. This feature allows us to also include unambiguous words in the evaluation and to see whether their representations are grouped into a single cluster or partitioned into different ones. We used cosine similarity to weight the edges of the input graph.
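The sketch below illustrates only the k-means side of this evaluation for one ambiguous word. Two points are assumptions of the sketch rather than statements of the paper: scikit-learn's KMeans works with Euclidean distance, so the vectors are L2-normalized to approximate clustering by cosine distance, and the mapping from predicted clusters to gold senses is done with Hungarian matching, since the text only says that correctly clustered points are counted. The dominant-set side would require the peel-off procedure of Pavan and Pelillo (2007), which is not shown here.

```python
# Sketch of the k-means evaluation for one ambiguous word (assumptions:
# L2-normalization as a proxy for cosine distance, Hungarian matching
# between predicted clusters and gold senses).
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_accuracy(vectors, gold_senses, seed=0):
    senses = sorted(set(gold_senses))
    k = len(senses)
    x = normalize(np.asarray(vectors))                 # unit-length rows
    pred = KMeans(n_clusters=k, init="k-means++", n_init=10,
                  random_state=seed).fit_predict(x)
    gold = np.array([senses.index(s) for s in gold_senses])
    # contingency[i, j] = points in predicted cluster i with gold sense j
    contingency = np.zeros((k, k), dtype=int)
    for p, g in zip(pred, gold):
        contingency[p, g] += 1
    rows, cols = linear_sum_assignment(-contingency)   # maximize matches
    return contingency[rows, cols].sum() / len(gold)
```

Averaging this accuracy over all sub-datasets gives the per-model scores reported below.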
This is pre- dataset computing the number of data points that sumably owing to the big gap between the internal have been clustered correctly and averaged the re- and the external similarity produced by this model, sults to measure the performance of each model. as explained in Section 3.1. This evaluation tends to confirm the claim that The first algorithm is k-means (Lloyd, 1982). larger versions of the same model achieve bet- It is a partitioning, iterative algorithm whose ob- ter results. From Table 3, we can also see that jective is to minimize the sum of point-to-centroid the models have more difficulties in identifying distances, summed over all k clusters. We used the different senses of verbs, while nouns and ad- the k-means++ heuristic (Arthur and Vassilvitskii, verbs have higher results. This is probably due 2007) and the cosine distance metric to determine to the different distribution of these word classes distances. We selected this algorithm because it in the training sets of the models and WordNet’s is simple, non-parametric, and is widely used. It fine-granularity. The performances of the models is important to notice that k-means requires the with dominant-set are surprisingly high, consid- number of clusters to extract, for this reason, we ering that the setting of this experiment is com- restricted the evaluation only to ambiguous words. pletely unsupervised. Furthermore, this algorithm The second algorithm used is dominant-set (Pa- is conceived to extract compact clusters and this van and Pelillo, 2007). It is a graph-based algo- feature could drive it to over partition the vector rithm that extracts compact structures from graphs space of monosemous words. Instead, the results generalizing the notion of maximal clique defined suggest the opposite: that the models are able to on unweighted graphs to edge-weighted graphs. produce representations with high internal similar- We selected this algorithm because it is non- ity, positioning their representations on a defined parametric, requires only the adjacency matrix of sub-space. a weighted graph as input, and, more importantly, does not require the number of clusters to extract. Word Sense Disambiguation We used the The clusters are extracted from the graph sequen- method proposed in Peters et al. (2018) to create Model S2 S3 SE07 SE13 SE15 All P R F1 P R F1 P R F1 P R F1 P R F1 P R F1 BERTbase 80.6 67.9 73.7 77.2 68.8 72.8 66.4 63.1 64.7 74.4 62.7 68.1 78.3 68.8 73.2 77.0 66.8 71.5 BERTlarge 81.2 68.4 74.3 80.3 71.5 75.6 68.5 65.1 66.7 75.8 63.9 69.3 79.7 70.1 74.6 77.9 67.5 72.3 GPT-2base 75.6 63.7 69.1 71.5 63.7 67.4 59.3 56.3 57.7 71.8 60.5 65.7 74.4 65.4 69.6 72.4 62.8 67.2 GPT-2medium 76.5 64.5 70.0 72.9 65.0 68.7 62.0 58.9 60.4 74.0 62.3 67.7 76.6 67.3 71.7 74.0 64.2 68.8 GPT-2large 76.4 64.4 69.9 72.1 64.2 67.9 61.8 58.7 60.2 72.8 61.4 66.6 75.6 66.3 70.7 73.4 63.6 68.1 RoBERTabase 82.0 69.1 75.0 79.4 70.7 74.8 66.7 63.3 64.9 75.5 63.7 69.1 79.5 69.9 74.4 78.5 68.0 72.9 RoBERTalarge 82.0 69.1 75.0 80.0 71.2 75.4 70.6 67.0 68.8 77.1 65.0 70.5 81.0 71.1 75.7 79.4 68.9 73.8 XLNetbase 78.8 65.8 71.7 76.2 67.4 71.5 67.3 63.7 65.5 70.7 58.3 63.9 77.5 67.1 71.9 75.4 64.6 69.5 XLNetlarge 80.6 67.9 73.7 78.7 70.1 74.2 67.6 64.2 65.8 75.3 63.5 68.9 80.6 70.8 75.4 78.0 67.7 72.5 CTRL 73.4 61.9 67.1 70.1 62.5 66.1 54.2 51.4 52.8 68.2 57.5 62.4 72.3 63.5 67.6 69.9 60.6 64.9 Table 4: Results indicating precision (P), recall (R) and F1 on each dataset and on their concatenation (All). 
The results of this evaluation are presented in Table 4.

Model | S2 P/R/F1 | S3 P/R/F1 | SE07 P/R/F1 | SE13 P/R/F1 | SE15 P/R/F1 | All P/R/F1
BERTbase | 80.6/67.9/73.7 | 77.2/68.8/72.8 | 66.4/63.1/64.7 | 74.4/62.7/68.1 | 78.3/68.8/73.2 | 77.0/66.8/71.5
BERTlarge | 81.2/68.4/74.3 | 80.3/71.5/75.6 | 68.5/65.1/66.7 | 75.8/63.9/69.3 | 79.7/70.1/74.6 | 77.9/67.5/72.3
GPT-2base | 75.6/63.7/69.1 | 71.5/63.7/67.4 | 59.3/56.3/57.7 | 71.8/60.5/65.7 | 74.4/65.4/69.6 | 72.4/62.8/67.2
GPT-2medium | 76.5/64.5/70.0 | 72.9/65.0/68.7 | 62.0/58.9/60.4 | 74.0/62.3/67.7 | 76.6/67.3/71.7 | 74.0/64.2/68.8
GPT-2large | 76.4/64.4/69.9 | 72.1/64.2/67.9 | 61.8/58.7/60.2 | 72.8/61.4/66.6 | 75.6/66.3/70.7 | 73.4/63.6/68.1
RoBERTabase | 82.0/69.1/75.0 | 79.4/70.7/74.8 | 66.7/63.3/64.9 | 75.5/63.7/69.1 | 79.5/69.9/74.4 | 78.5/68.0/72.9
RoBERTalarge | 82.0/69.1/75.0 | 80.0/71.2/75.4 | 70.6/67.0/68.8 | 77.1/65.0/70.5 | 81.0/71.1/75.7 | 79.4/68.9/73.8
XLNetbase | 78.8/65.8/71.7 | 76.2/67.4/71.5 | 67.3/63.7/65.5 | 70.7/58.3/63.9 | 77.5/67.1/71.9 | 75.4/64.6/69.5
XLNetlarge | 80.6/67.9/73.7 | 78.7/70.1/74.2 | 67.6/64.2/65.8 | 75.3/63.5/68.9 | 80.6/70.8/75.4 | 78.0/67.7/72.5
CTRL | 73.4/61.9/67.1 | 70.1/62.5/66.1 | 54.2/51.4/52.8 | 68.2/57.5/62.4 | 72.3/63.5/67.6 | 69.9/60.6/64.9

Table 4: Results indicating precision (P), recall (R) and F1 on each dataset and on their concatenation (All). All the results are computed using Â as vector space.

The first trend that emerges from the results is the big gap between precision and recall. This is due to the absence of many senses from our training set. We did not want to use back-off strategies or other techniques usually employed in the WSD literature, so as not to influence the performances and the analysis of the results. Despite the simplicity of the approach, it performs surprisingly well. In particular, BERT, RoBERTa, and XLNet (three bidirectional models) achieve very high results. The low performances of CTRL are probably due to its large vocabulary and to its objective, which is designed to solve different tasks.

5 Conclusion and Future Work

We conducted an extensive analysis of the semantic capabilities of contextualized embedding models. We analyzed the vector spaces constructed using pre-trained models and found that their vectors contain redundant information and that their first two principal components are dominant.

The results on sense induction are promising. They demonstrate the effectiveness of contextualized embeddings in capturing semantic information. We did not find higher performances from more complex models; rather, we found that RoBERTa, a model that was developed by simplifying a more complex model, BERT, was one of the best performers. Neither the dimension of the hidden layers, the size of the training data, nor the size of the vocabulary seems to play a big role in modeling semantics. As stated in previous works, adding an anisotropy penalty to the objective function of the models could directly improve the representations. We also noticed that, even if BERT and XLNet have different objectives and are trained on different data, they have similar performances. It emerged that these models are less redundant than others.

The conclusion that we can draw from our analysis and evaluation is that pre-trained language models can capture lexical-semantic information and that unsupervised models can be used to distinguish among their representations. On the other hand, these representations are redundant and anisotropic. We hypothesize that reducing these aspects can lead to better representations. This operation can be carried out post hoc, but we think that training new models with this point in mind could lead to the development of better models.
References

David Arthur and Sergei Vassilvitskii. 2007. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027–1035.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72, March.

José Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res., 63:743–788.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, July. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China, November. Association for Computational Linguistics.

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China, November. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota, June. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy, July. Association for Computational Linguistics.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Adam Kilgarriff. 2004. How dominant is the commonest sense of a word? In Petr Sojka, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, pages 103–111, Berlin, Heidelberg. Springer Berlin Heidelberg.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China, November. Association for Computational Linguistics.

Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2019. Large memory layers with product keys. arXiv preprint arXiv:1907.05242.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Stuart P. Lloyd. 1982. Least squares quantization in PCM. IEEE Trans. Information Theory, 28(2):129–136.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 14014–14024.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, November.

David Mimno and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2873–2878, Copenhagen, Denmark, September. Association for Computational Linguistics.

Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.

Roberto Navigli. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 105–112, Stroudsburg, PA, USA. Association for Computational Linguistics.

Massimiliano Pavan and Marcello Pelillo. 2007. Dominant sets and pairwise clustering. IEEE Trans. Pattern Anal. Mach. Intell., 29(1):167–172.

Marcello Pelillo. 2009. What is a cluster? Perspectives from game theory. In Proc. of the NIPS Workshop on Clustering Theory.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China, November. Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110, Valencia, Spain, April. Association for Computational Linguistics.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 8592–8600.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July. Association for Computational Linguistics.

Rocco Tripodi and Roberto Navigli. 2019. Game theory meets embeddings: a unified framework for word sense disambiguation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 88–99, Hong Kong, China, November. Association for Computational Linguistics.

Rocco Tripodi and Stefano Li Pira. 2017. Analysis of Italian word embeddings. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December 11-13, 2017.

Rocco Tripodi, Sebastiano Vascon, and Marcello Pelillo. 2016. Context aware nonnegative matrix factorization clustering. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pages 1719–1724.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguistics.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 3261–3275.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.