<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How Contextualized Word Embeddings Represent Word Senses</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rocco Tripodi</string-name>
          <email>rocco.tripodi@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. Contextualized embedding models, such as ELMo and BERT, allow the construction of vector representations of lexical items that adapt to the context in which words appear. It has been shown that the upper layers of these models capture semantic information. This evidence paved the way for the development of sense representations based on words in context. In this paper, we analyze the vector spaces produced by 11 pre-trained models and evaluate these representations on two tasks: sense induction and word sense disambiguation. The analysis shows that all these representations contain redundant information, and the results show that this redundancy is a disadvantage.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italian abstract. Models such as ELMo or BERT
make it possible to obtain vector representations
of words that adapt to the context in which
they appear. The fact that the upper layers
of these models store semantic information
has led to the development of sense
representations based on words in
context. In this work we analyze
the vector spaces produced by 11
pre-trained models and evaluate their
performance in representing the different
senses of words. The analyses
show that these models contain
redundant information. The results
highlight the problems inherent in this aspect.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>Copyright © 2021 for this paper by its author. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>
        The introduction of contextualized embedding
models, such as ELMo
        <xref ref-type="bibr" rid="ref29">(Peters et al., 2018)</xref>
        and BERT
        <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
        , allows the
construction of vector representations of lexical items that
adapt to the context in which words appear. It has
been shown that the upper layers of these
models contain semantic information
        <xref ref-type="bibr" rid="ref10">(Jawahar et al.,
2019)</xref>
        and are more diversified than lower
layers
        <xref ref-type="bibr" rid="ref6">(Ethayarajh, 2019)</xref>
        . These word
representations overcame the meaning conflation deficiency
that affects static word embedding techniques
        <xref ref-type="bibr" rid="ref23 ref24 ref3 ref34">(Camacho-Collados and Pilehvar, 2018; Tripodi
and Pira, 2017)</xref>
        , such as word2vec
        <xref ref-type="bibr" rid="ref18">(Mikolov et al.,
2013)</xref>
        or GloVe
        <xref ref-type="bibr" rid="ref28">(Pennington et al., 2014)</xref>
        , thanks to
their adaptation to the context of use.
      </p>
      <p>
        The evaluation of these models has been
conducted mainly on downstream tasks
        <xref ref-type="bibr" rid="ref36 ref37">(Wang et al.,
2018; Wang et al., 2019)</xref>
        . With extrinsic
evaluations, the models are fine-tuned, adapting the
vector representations to specific tasks. The
resulting vectors are then used as features in
classification problems. This hinders a direct evaluation and
analysis of the models because the evaluation also
takes into account the ability of the classifier to
learn the task. A model trained for this kind of task
may learn only to discriminate among the features that
belong to each class, with poor generalization.
      </p>
      <p>
        The interpretability of neural networks is an
emerging line of research in NLP that aims at
analyzing the properties of pre-trained language
models
        <xref ref-type="bibr" rid="ref10 ref13 ref17 ref2 ref32 ref33 ref8 ref9">(Belinkov and Glass, 2019)</xref>
        . Different
studies have been conducted in recent years to
discover what kind of linguistic information is stored
in large neural language models. Many of them
are focused on syntax
        <xref ref-type="bibr" rid="ref10 ref10 ref13 ref17 ref2 ref32 ref33 ref8 ref9">(Hewitt and Manning, 2019;
Jawahar et al., 2019)</xref>
        and attention
        <xref ref-type="bibr" rid="ref13 ref17">(Michel et
al., 2019; Kovaleva et al., 2019)</xref>
        . As far as
semantics is concerned, most studies
focus on common knowledge
        <xref ref-type="bibr" rid="ref30">(Petroni et al., 2019)</xref>
        and on inference and role-based event prediction
        <xref ref-type="bibr" rid="ref7">(Ettinger, 2020)</xref>
        . Only a few of them have been
devoted to lexical semantics; for example, Reif et al.
(2019) show how different representations of the
same lexical form tend to cluster according to their
sense.
      </p>
      <p>
        In this work, we propose an in-depth
analysis of the properties of the vector spaces induced
by different embedding models and an evaluation
of their word representations. We present how
the properties of the vector space contribute to
the success of the models in two tasks: sense
induction and word sense disambiguation. In fact,
even if contextualized models do not create one
representation per word sense
        <xref ref-type="bibr" rid="ref6">(Ethayarajh, 2019)</xref>
        ,
their contextualization creates similar
representations for the same word sense that can be easily
clustered.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 Related Work</title>
      <p>
        Given the success (and the opacity) of
contextualized embedding models, many works have been
proposed to analyze their inner representations.
These analyses are based on probing tasks
        <xref ref-type="bibr" rid="ref4">(Conneau et al., 2018)</xref>
        that aim at measuring how the
information extracted from a pre-trained model is
useful to represent linguistic structures. Probing
tasks involve training a diagnostic classifier to
determine whether a representation encodes the desired features. Tenney et al.
(2019) discovered that specific BERT layers are
better suited to representing the information needed to
solve specific tasks and that the ordering of its
layers resembles the ordering of a traditional NLP
pipeline: POS tagging, parsing, NER, semantic
role labeling, and coreference resolution.
Hewitt and Manning (2019) evaluated whether
syntax trees are embedded in a linear transformation
of a neural network’s word representation space.
Hewitt and Liang (2019) raised the problem of
interpreting the results derived from probing
analysis. In fact, it is difficult to understand whether
high accuracy values are due to the representation
itself or, instead, they are the result of the ability
to learn a specific task during training.
      </p>
      <p>Our work is more in line with works that try
to find general properties of the representations
generated by different contextualized models. For
example, Mimno and Thompson (2017)
demonstrated that the vector space produced by a static
embedding model is concentrated in a narrow
cone and that its concentration depends on the
ratio of positive and negative examples. Mu and
Viswanath (2018) explored this analysis further,
demonstrating that the embedding vectors share
the same common vector and have the same main
direction. Ethayarajh (2019) demonstrated how
upper layers of a contextualizing model produce
more contextualized representations. We build on
these works by analyzing the vector spaces
generated by contextualized models and evaluating
them.</p>
    </sec>
    <sec id="sec-4">
      <title>3 Construction of the Vector Spaces</title>
      <p>
        We used SemCor
        <xref ref-type="bibr" rid="ref19">(Miller et al., 1993)</xref>
        as the reference
corpus for our work. This choice is motivated by
the fact that it is the largest dataset manually
annotated with sense information and is commonly
used as a training set for word sense
disambiguation. It contains 352 documents whose content
words (about 226,000) have been annotated with
WordNet
        <xref ref-type="bibr" rid="ref21">(Miller, 1995)</xref>
        senses. In total there are
33,341 unique senses distributed over 22,417
different words. The sense distribution in this corpus
is very skewed, and follows a power law
        <xref ref-type="bibr" rid="ref12">(Kilgarriff, 2004)</xref>
        . This makes the identification of senses
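challenging. The dataset is also difficult due to the
fine granularity of WordNet
        <xref ref-type="bibr" rid="ref25">(Navigli, 2006)</xref>
        .
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Model</th><th>Training data</th><th>Vocab. size</th><th>N. param.</th><th>Vec. dim.</th><th>Objective</th></tr>
          </thead>
          <tbody>
            <tr><td>BERT<sub>base</sub> <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref></td><td>16GB</td><td>30K</td><td>110M</td><td>768</td><td>masked language model and next sentence prediction</td></tr>
            <tr><td>BERT<sub>large</sub> <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref></td><td>16GB</td><td>30K</td><td>340M</td><td>1024</td><td>masked language model and next sentence prediction</td></tr>
            <tr><td>GPT-2<sub>base</sub> <xref ref-type="bibr" rid="ref22">(Radford et al., 2019)</xref></td><td>40GB</td><td>50K</td><td>117M</td><td>768</td><td>language model</td></tr>
            <tr><td>GPT-2<sub>medium</sub> <xref ref-type="bibr" rid="ref22">(Radford et al., 2019)</xref></td><td>40GB</td><td>50K</td><td>345M</td><td>1024</td><td>language model</td></tr>
            <tr><td>GPT-2<sub>large</sub> <xref ref-type="bibr" rid="ref22">(Radford et al., 2019)</xref></td><td>40GB</td><td>50K</td><td>774M</td><td>1280</td><td>language model</td></tr>
            <tr><td>RoBERTa<sub>base</sub> <xref ref-type="bibr" rid="ref15">(Liu et al., 2019)</xref></td><td>160GB</td><td>50K</td><td>125M</td><td>768</td><td>masked language model</td></tr>
            <tr><td>RoBERTa<sub>large</sub> <xref ref-type="bibr" rid="ref15">(Liu et al., 2019)</xref></td><td>160GB</td><td>50K</td><td>355M</td><td>1024</td><td>masked language model</td></tr>
            <tr><td>XLNet<sub>base</sub> <xref ref-type="bibr" rid="ref39">(Yang et al., 2019)</xref></td><td>126GB</td><td>32K</td><td>110M</td><td>768</td><td>bidirectional language model</td></tr>
            <tr><td>XLNet<sub>large</sub> <xref ref-type="bibr" rid="ref39">(Yang et al., 2019)</xref></td><td>126GB</td><td>32K</td><td>340M</td><td>1024</td><td>bidirectional language model</td></tr>
            <tr><td>XLM<sub>english</sub></td><td>16GB</td><td>30K</td><td>665M</td><td>2048</td><td>language model</td></tr>
            <tr><td>CTRL <xref ref-type="bibr" rid="ref11">(Keskar et al., 2019)</xref></td><td>140GB</td><td>250K</td><td>1.63B</td><td>1280</td><td>conditional transformer language model</td></tr>
          </tbody>
        </table>
      </table-wrap>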
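      <p>To construct the vector space A from
SemCor, we collected all the senses
<inline-formula><tex-math>S_i</tex-math></inline-formula> of a word
<inline-formula><tex-math>w_i</tex-math></inline-formula> and, for each sense
<inline-formula><tex-math>s_j \in S_i</tex-math></inline-formula>, we recovered the sentences
<inline-formula><tex-math>\{\mathit{Sent}^{1}_{w_i s_j}, \mathit{Sent}^{2}_{w_i s_j}, \ldots, \mathit{Sent}^{n}_{w_i s_j}\}</tex-math></inline-formula>
in which this particular sense occurs. These
sentences are then fed into a pre-trained model and
the token embedding representations of word
<inline-formula><tex-math>w_i</tex-math></inline-formula>,
<inline-formula><tex-math>\{e^{1}_{w_i s_j}, e^{2}_{w_i s_j}, \ldots, e^{n}_{w_i s_j}\}</tex-math></inline-formula>,
are extracted from the
last hidden layer. This operation is repeated for
all the senses in <inline-formula><tex-math>S_i</tex-math></inline-formula>, and for all the tagged words in
the vocabulary, V. The vector space corresponds
to all the representations of the words in V.</p>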
      <p>A t-SNE visualization of the different
embeddings in SemCor for the word foot is presented in
Figure 1. In this figure, we can see that the three
main senses of foot (i.e., human foot, unit of length
and lower part) occupy a definite position in the
vector space, suggesting that the models are able
to produce specific representations for the
different senses of a word and that they lie on defined
subspaces. In this work we want to test to what
extent this feature is present in language models.</p>
      <sec id="sec-4-1">
        <title>Implementation details</title>
        <p>
          The pre-trained models used in this study are: two BERT
          <xref ref-type="bibr" rid="ref5">(Devlin et al.,
2019)</xref>
          models, base cased (12-layer, 768-hidden,
12-heads, 110M parameters) and large cased
(24-layer, 1024-hidden, 16-heads, 340M
parameters); three GPT-2
          <xref ref-type="bibr" rid="ref22">(Radford et al., 2019)</xref>
          models, base (12-layer, 768-hidden, 12-heads, 117M
parameters), medium (24-layer, 1024-hidden,
16-heads, 345M parameters) and large (36-layer,
1280-hidden, 20-heads, 774M parameters); two
RoBERTa
          <xref ref-type="bibr" rid="ref15">(Liu et al., 2019)</xref>
          models, base
(12-layer, 768-hidden, 12-heads, 125M parameters)
and large (24-layer, 1024-hidden, 16-heads, 355M
parameters); two XLNet
          <xref ref-type="bibr" rid="ref39">(Yang et al., 2019)</xref>
          models, base (12-layer, 768-hidden, 12-heads, 110M
parameters) and large (24-layer, 1024-hidden,
16-heads, 340M parameters); one XLM
          <xref ref-type="bibr" rid="ref14">(Lample
et al., 2019)</xref>
          model (12-layer, 2048-hidden,
16-heads) and one CTRL
          <xref ref-type="bibr" rid="ref11">(Keskar et al., 2019)</xref>
          model
(48-layer, 1280-hidden, 16-heads, 1.6B
parameters). We used the transformers library
          <xref ref-type="bibr" rid="ref38">(Wolf et al., 2019)</xref>
          . The main features of these models are
summarized in Table 1. We averaged the
embeddings of sub-tokens to obtain token-level
representations.
        </p>
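        <p>A minimal sketch of the sub-token averaging step follows. The function name <monospace>average_subtokens</monospace> and the <monospace>word_ids</monospace> alignment format (one word index per sub-token, <monospace>None</monospace> for special tokens) are assumptions made for illustration and are not taken from the paper; only the Python standard library is used so that the example is self-contained.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def average_subtokens(subtoken_vectors, word_ids):
    """Average the vectors of the sub-tokens that belong to the same word.

    subtoken_vectors: list of equal-length lists of floats, one per sub-token.
    word_ids: for each sub-token, the index of the word it belongs to,
              or None for special tokens (hypothetical alignment format).
    Returns a dict mapping word index -> averaged vector.
    """
    groups = defaultdict(list)
    for vector, word_id in zip(subtoken_vectors, word_ids):
        if word_id is not None:  # skip special tokens
            groups[word_id].append(vector)
    averaged = {}
    for word_id, vectors in groups.items():
        # Component-wise mean over the sub-tokens of this word.
        averaged[word_id] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return averaged


# Toy input: one special token, a word split into two sub-tokens, a whole word.
toy_vectors = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 5.0]]
toy_word_ids = [None, 0, 0, 1]
token_level = average_subtokens(toy_vectors, toy_word_ids)
```
]]></preformat>
        <p>In this toy input, word 0 is represented by the mean of its two sub-token vectors, while word 1 keeps its single vector unchanged.</p>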
      </sec>
      <sec id="sec-4-2">
        <title>3.1 Analysis</title>
        <p>The first objective of this work is to analyze the
vector space produced with the models. This
analysis is aimed at investigating the properties of the
contextualized vectors. A detailed description of
the embedding spaces constructed with the
pretrained models is presented in Table 2. We
computed the norm for all the vectors in the vector
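space A, and averaged them:</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math>\mathrm{AvgNorm} = \frac{1}{|A|} \sum_{i=1}^{|A|} \lVert e_i \rVert_2 .</tex-math>
        </disp-formula>
        <p>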
This measure gives us an intuition on how diverse
the semantic space constructed with the different
models is. In fact, we can see that the magnitude
of the vectors constructed with BERT, RoBERTa,
XLNet, and XLM is low while those of GPT-2 and
CTRL are very high.</p>
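        <p>We also computed the norm of the vector
obtained by averaging all the vectors in the semantic
space A, as:</p>
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math>\mathrm{MeanVecNorm} = \left\lVert \frac{1}{|A|} \sum_{i=1}^{|A|} e_i \right\rVert_2 .</tex-math>
        </disp-formula>
        <p>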
All the semantic spaces have non-zero mean and
the mean norm is high. This result suggests
that the vectors contain redundant information and
share a common nonzero vector. This is not only
because the vector space contains representations
of the same sense. In fact, if we create a new
semantic space, <inline-formula><tex-math>\hat{A}</tex-math></inline-formula>, by averaging all the representations
of the same word sense, the MeanVecNorm of
this space is still high for all the models.</p>
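        <p>The two norm-based measures in Equations (1) and (2) can be sketched as follows. The function names and the toy vectors are illustrative assumptions, not the author's code or data. The second toy space shows why a high MeanVecNorm signals a shared common vector: when every vector points the same way, the norm of the mean equals the mean of the norms.</p>
        <preformat><![CDATA[
```python
import math


def l2_norm(vector):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def avg_norm(space):
    """Equation (1): mean of the L2 norms of all vectors in the space."""
    return sum(l2_norm(e) for e in space) / len(space)


def mean_vec_norm(space):
    """Equation (2): L2 norm of the mean vector of the space."""
    mean = [sum(col) / len(space) for col in zip(*space)]
    return l2_norm(mean)


# Toy spaces (not data from the paper).
centered = [[1.0, 0.0], [-1.0, 0.0]]  # vectors cancel out: zero mean
shifted = [[3.0, 4.0], [3.0, 4.0]]    # vectors share one common direction

centered_stats = (avg_norm(centered), mean_vec_norm(centered))
shifted_stats = (avg_norm(shifted), mean_vec_norm(shifted))
```
]]></preformat>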
        <p>We used the Maximum Explainable Variance
(MEV) for the representations of each word in V.
This measure corresponds to the proportion of the
variance in the embeddings that can be explained
by their first principal component and was
computed as:</p>
        <disp-formula id="eq-3">
          <label>(3)</label>
          <tex-math>\mathrm{MEV}(w) = \frac{\sigma_1^2}{\sum_i \sigma_i^2},</tex-math>
        </disp-formula>
        <p>
          where <inline-formula><tex-math>\sigma_1^2</tex-math></inline-formula> is the variance along the first principal component of the
vector space A. It gives an upper bound on how well
contextualized representations can be replaced by
a static embedding
          <xref ref-type="bibr" rid="ref6">(Ethayarajh, 2019)</xref>
          . The models
with the lowest MEV are BERT<sub>large</sub> and XLNet<sub>large</sub>.
        </p>
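        <p>As an illustration of Equation (3), the sketch below estimates the MEV of a set of occurrence vectors. It makes several assumptions that the paper does not state: the occurrences are mean-centered before the decomposition, the largest eigenvalue of the covariance matrix is obtained by power iteration, and the function name, iteration count and seed are arbitrary.</p>
        <preformat><![CDATA[
```python
import math
import random


def max_explainable_variance(vectors, iterations=200, seed=0):
    """Share of the variance explained by the first principal component.

    vectors: occurrences of one word, as equal-length lists of floats.
    """
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(col) / n for col in zip(*vectors)]
    centered = [[v[d] - mean[d] for d in range(dim)] for v in vectors]
    # Unnormalized covariance C = X^T X; its eigenvalues are the sigma_i^2.
    cov = [[sum(row[a] * row[b] for row in centered) for b in range(dim)]
           for a in range(dim)]
    total = sum(cov[d][d] for d in range(dim))  # trace = sum of all sigma_i^2
    if total == 0.0:
        return 0.0  # identical occurrences: there is no variance to explain
    # Power iteration from a random unit vector converges to the top eigenvector.
    rng = random.Random(seed)
    u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]
    for _ in range(iterations):
        w = [sum(cov[a][b] * u[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        u = [x / norm for x in w]
    # Rayleigh quotient u^T C u approximates sigma_1^2.
    top = sum(u[a] * sum(cov[a][b] * u[b] for b in range(dim))
              for a in range(dim))
    return top / total


# Toy occurrences lying on one line: a single direction explains everything.
collinear = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mev_collinear = max_explainable_variance(collinear)
```
]]></preformat>
        <p>The pure-Python loops are meant for reading rather than speed; for real embeddings with hundreds of dimensions, a linear-algebra library would be the practical choice.</p>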
        <p>
          The other measures that we used for the
evaluation of the vector space are based on the very
notion of a cluster, which imposes that the data points
inside a cluster must satisfy two conditions:
internal similarity and external dissimilarity
          <xref ref-type="bibr" rid="ref27">(Pelillo,
2009)</xref>
          . To this end, we used the senses of each
word in the vocabulary of SemCor as clusters and
extracted the corresponding vectors from V . We
then computed the internal similarity of a cluster,
c, as:
        </p>
        <p>IntSim(c) =
where n is the number of data points in the cluster.
We also computed the external similarity of a
cluster c by computing the cosine similarity between
each point in c and all the points in the subspace S
induced by the senses of a word that has c as one
of its senses:</p>
        <disp-formula id="eq-5">
          <label>(5)</label>
          <tex-math>\mathrm{ExtSim}(c) = \frac{1}{n \cdot m} \sum_{j=1}^{n} \sum_{k=1}^{m} \cos(e_j, e_k),</tex-math>
        </disp-formula>
        <p>
where m is the total number of data points in the
subspace S (excluding those in c) and n is the
number of points in the cluster c. Our
hypothesis is that good representations should have high
internal similarity and low external similarity and
that the difference between them should be high.</p>
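        <p>The two cluster measures can be sketched as follows. The external similarity follows Equation (5). The expression for IntSim is not legible in this copy of the text, so the sketch assumes the mean cosine similarity over ordered pairs of distinct points in the cluster; this normalization, the convention for singleton clusters, and all function names are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def int_sim(cluster):
    """Assumed form: mean cosine similarity over pairs of distinct points."""
    n = len(cluster)
    if n < 2:
        return 1.0  # convention chosen here for singleton clusters
    total = sum(cosine(cluster[j], cluster[k])
                for j in range(n) for k in range(n) if j != k)
    return total / (n * (n - 1))


def ext_sim(cluster, others):
    """Equation (5): mean cosine similarity between the n points of the
    cluster and the m points of the other senses of the same word."""
    n, m = len(cluster), len(others)
    total = sum(cosine(e_j, e_k) for e_j in cluster for e_k in others)
    return total / (n * m)


# Toy example: one sense pointing along x, the other senses along y.
sense_a = [[1.0, 0.0], [1.0, 0.0]]
other_senses = [[0.0, 1.0], [0.0, 2.0]]
gap = int_sim(sense_a) - ext_sim(sense_a, other_senses)
```
]]></preformat>
        <p>Under the hypothesis stated above, a well-separated sense yields a large difference between the two values, as in this toy case.</p>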
        <p>As can be seen from Table 2, the internal
similarity is higher than the external similarity for all the
models. Despite this, the scores span a wide
range. The lowest IntSim is given by BERT<sub>large</sub>
and the highest by RoBERTa<sub>large</sub> and XLNet<sub>base</sub>.
The lowest ExtSim is given by BERT<sub>large</sub> and
the highest by XLNet<sub>base</sub>. The largest difference
between the two measures is given by BERT<sub>large</sub>.
RoBERTa<sub>large</sub> also has a large gap between
the two measures; furthermore, their standard
deviation is very low. As we will see in Section 4,
these last two models perform better than others
in clustering and classification tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Evaluation</title>
      <p>Sense Induction. This task is aimed at
understanding if representations belonging to different
senses can be separated using an unsupervised
approach. We hypothesize that a good
contextualization process should produce more discriminative
representations that can be easily identified by a
clustering algorithm.</p>
      <p>We used the sense clusters extracted from
SemCor as ground truth for this experiment (see
Section 3) and grouped them if they are senses of
the same word (with a given part of speech). We
retained only the groups that have at least 20
data points, and we also discarded monosemous
words for the evaluation with k-means. The
resulting datasets consist of 1871 (entire) and 1499
(without monosemous words) sub-datasets with
141,074 and 116,019 data points in total,
respectively. We computed the accuracy on each
sub-dataset by counting the number of data points that
have been clustered correctly and averaged the
results to measure the performance of each model.</p>
      <p>
        The first algorithm is k-means
        <xref ref-type="bibr" rid="ref16">(Lloyd, 1982)</xref>
        .
It is a partitioning, iterative algorithm whose
objective is to minimize the sum of point-to-centroid
distances, summed over all k clusters. We used
the k-means++ heuristic
        <xref ref-type="bibr" rid="ref1 ref26">(Arthur and Vassilvitskii,
2007)</xref>
        and the cosine distance metric to determine
distances. We selected this algorithm because it
is simple, non-parametric, and widely used. It
is important to note that k-means requires the
number of clusters to extract as input; for this reason, we
restricted its evaluation to ambiguous words.
      </p>
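      <p>The clustering step can be illustrated with a small k-means sketch that uses k-means++ seeding and the cosine distance. It is a simplified stand-in rather than the implementation used in the experiments: vectors are assumed to be non-zero and are projected onto the unit sphere, centroids are re-normalized after each update, and the function names, iteration count and seed are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
import math
import random


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(u, v):
    # Both inputs are unit vectors, so the dot product is the cosine similarity.
    return 1.0 - sum(a * b for a, b in zip(u, v))


def kmeans_cosine(points, k, iterations=50, seed=0):
    """k-means with k-means++ seeding and cosine distance; returns labels."""
    rng = random.Random(seed)
    data = [normalize(p) for p in points]
    # k-means++ seeding: sample the next centroid with probability
    # proportional to the squared distance from the closest chosen centroid.
    centroids = [rng.choice(data)]
    while len(centroids) < k:
        weights = [min(cosine_distance(p, c) for c in centroids) ** 2
                   for p in data]
        if sum(weights) == 0.0:
            centroids.append(rng.choice(data))
        else:
            centroids.append(rng.choices(data, weights=weights, k=1)[0])
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
                  for p in data]
        # Update step: mean of the members, projected back to the unit sphere.
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                if any(mean):
                    centroids[c] = normalize(mean)
    return labels


# Toy data: two directions, so two clusters are expected.
toy_points = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
              [0.0, 1.0], [0.0, 2.0], [0.0, 3.0]]
toy_labels = kmeans_cosine(toy_points, k=2)
```
]]></preformat>
      <p>Because the number of clusters k must be supplied, a sketch like this can only be applied to words whose number of senses is known in advance.</p>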
      <p>
        The second algorithm used is dominant-set
        <xref ref-type="bibr" rid="ref1 ref26">(Pavan and Pelillo, 2007)</xref>
        . It is a graph-based
algorithm that extracts compact structures from graphs
generalizing the notion of maximal clique defined
on unweighted graphs to edge-weighted graphs.
We selected this algorithm because it is
non-parametric, requires only the adjacency matrix of
a weighted graph as input, and, more importantly,
does not require the number of clusters to extract.
The clusters are extracted from the graph
sequentially using a peel-off strategy. This feature
allows us to also include unambiguous words in the
evaluation and to see if their representations
are grouped into a single cluster or partitioned into
different ones. We used cosine similarity to weigh
the edges of the input graph.
      </p>
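      <p>The peel-off strategy can be sketched with discrete-time replicator dynamics, a common way of extracting dominant sets. The paper does not report the optimization procedure or its settings, so the iteration limit, tolerance, membership cutoff and function names below are assumptions, and the similarity matrix is assumed to be symmetric, non-negative and zero on the diagonal.</p>
      <preformat><![CDATA[
```python
def dominant_set(weights, active, iterations=1000, tol=1e-8, cutoff=1e-5):
    """Extract one dominant set from the sub-graph induced by `active`."""
    idx = list(active)
    x = [1.0 / len(idx)] * len(idx)  # start from the barycenter of the simplex
    for _ in range(iterations):
        ax = [sum(weights[i][j] * x[b] for b, j in enumerate(idx)) for i in idx]
        total = sum(xi * axi for xi, axi in zip(x, ax))
        if total == 0.0:
            break  # no similarity left among the active nodes
        new_x = [xi * axi / total for xi, axi in zip(x, ax)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    # Nodes that keep a non-negligible share of the mass form the cluster.
    return [i for i, xi in zip(idx, x) if xi > cutoff]


def peel_off(weights):
    """Cluster all nodes by repeatedly extracting and removing a dominant set."""
    remaining = list(range(len(weights)))
    clusters = []
    while remaining:
        members = dominant_set(weights, remaining)
        if not members:
            members = list(remaining)
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters


# Toy similarity matrix: a tight group of three nodes and a looser pair.
toy_weights = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.0],
]
toy_clusters = peel_off(toy_weights)
```
]]></preformat>
      <p>No number of clusters is passed in: the loop stops when every node has been assigned, which is what allows monosemous words to be included in the evaluation.</p>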
      <p>The results of this evaluation are presented in
Table 3. RoBERTa and BERT have the overall best
performances on this task using both algorithms.
In particular, RoBERTa<sub>large</sub> performs consistently
well on all parts of speech and across algorithms,
while other models perform well only in
combination with one of the two algorithms. This is
presumably due to the large gap between the internal
and the external similarity produced by this model,
as explained in Section 3.1.</p>
      <p>This evaluation tends to confirm the claim that
larger versions of the same model achieve
better results. From Table 3, we can also see that
the models have more difficulty in identifying
the different senses of verbs, while nouns and
adverbs have higher results. This is probably due
to the different distribution of these word classes
in the training sets of the models and to WordNet’s
fine granularity. The performances of the models
with dominant-set are surprisingly high,
considering that the setting of this experiment is
completely unsupervised. Furthermore, this algorithm
is conceived to extract compact clusters and this
feature could drive it to over-partition the vector
space of monosemous words. Instead, the results
suggest the opposite: that the models are able to
produce representations with high internal
similarity, positioning their representations on a defined
sub-space.</p>
      <sec id="sec-5-1">
        <title>Word Sense Disambiguation</title>
        <p>We used the method proposed in Peters et al. (2018) to create
sense vectors from contextualized word vectors.
This method consists of averaging all the
representations of a given sense. The resulting vector
space corresponds to <inline-formula><tex-math>\hat{A}</tex-math></inline-formula> (see Section 3.1). We
evaluated the generated vectors on a standard
benchmark (Raganato et al., 2017) for WSD. It consists
of five datasets that were unified to the same
WordNet version: Senseval-2 (S2), Senseval-3 (S3),
SemEval-2007 (S7), SemEval-2013 and
SemEval-2015, having in total 10,619 target words.</p>
        <p>
          The identification of word senses is conducted
by feeding the entire texts of the datasets into a
pre-trained model and extracting, for each target
word <inline-formula><tex-math>w_i</tex-math></inline-formula>, its embedding representation <inline-formula><tex-math>e^{k}_{w_i}</tex-math></inline-formula>, as was
done for the construction of the semantic space.
Once these representations are available, we
compute the cosine similarities between <inline-formula><tex-math>e^{k}_{w_i}</tex-math></inline-formula> and the
embeddings in <inline-formula><tex-math>\hat{A}</tex-math></inline-formula> constructed with the same model
and select the sense with the highest similarity.
We did not use more sophisticated models such as
WSD-games
          <xref ref-type="bibr" rid="ref10 ref13 ref17 ref2 ref32 ref33 ref35 ref8 ref9">(Tripodi and Navigli, 2019; Tripodi
et al., 2016)</xref>
          because we wanted to keep the
evaluation as simple as possible, so as not to influence the
results.
        </p>
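        <p>The disambiguation procedure described above amounts to a nearest-sense-vector lookup, sketched below. The sense identifiers, the toy vectors and the function names are hypothetical. Restricting the comparison to the candidate senses of the target word, and leaving an instance unanswered when none of them has a sense vector, are assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_sense_vectors(occurrences):
    """Average all the contextual vectors observed for each sense.

    occurrences: dict mapping a sense id to its list of contextual vectors.
    """
    return {sense: mean_vector(vecs) for sense, vecs in occurrences.items()}


def disambiguate(target_vector, candidate_senses, sense_vectors):
    """Return the candidate sense with the most similar sense vector."""
    known = [s for s in candidate_senses if s in sense_vectors]
    if not known:
        return None  # no back-off strategy: the instance stays unanswered
    return max(known, key=lambda s: cosine(target_vector, sense_vectors[s]))


# Toy sense inventory with hypothetical sense ids and 2-d vectors.
toy_occurrences = {
    "foot.body_part": [[1.0, 0.0], [0.8, 0.2]],
    "foot.unit": [[0.0, 1.0], [0.2, 0.8]],
}
toy_sense_vectors = build_sense_vectors(toy_occurrences)
candidates = ["foot.body_part", "foot.unit"]
prediction = disambiguate([0.9, 0.1], candidates, toy_sense_vectors)
```
]]></preformat>
        <p>Unanswered instances lower recall without affecting precision, which is one way a gap between the two scores can arise when many senses are missing from the training data.</p>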
        <p>The results of this evaluation are presented in
Table 4. The first trend that emerges from the
results is the big gap between precision and
recall. This is due to the absence of many senses in
our training set. We did not want to use back-off
strategies or other techniques usually employed in
the WSD literature, so as not to influence the
performances and the analysis of the results. Despite
the simplicity of the approach, it performs
surprisingly well. In particular, BERT, RoBERTa, and
XLNet (three bidirectional models) have very high
results. The low performances of CTRL are
probably due to its large vocabulary and to its objective,
designed to solve different tasks.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Conclusion and Future Work</title>
      <p>We conducted an extensive analysis of the
semantic capabilities of contextualized embedding
models. We analyzed the vector space constructed
using pre-trained models and found that their vectors
contain redundant information and that their first
two principal components are dominant.</p>
      <p>The results on sense induction are promising.
They demonstrated the effectiveness of
contextualized embeddings in capturing semantic
information. We did not find higher performances
from more complex models; rather, we found that
RoBERTa, a model that was developed by
simplifying a more complex model, BERT, was one
of the best performers. Neither the dimension of
the hidden layers, the size of the training data,
nor the size of the vocabulary seems to play a big
role in modeling semantics. As stated in previous
works, adding an anisotropy penalty to the
objective function of the models could directly
improve the representations. We also noticed that,
even if BERT models and XLNet have different
objectives and are trained on different data, they
have similar performances. It emerged that these
models are less redundant than others.</p>
      <p>The conclusion that we can draw from our
analysis and evaluation is that pre-trained
language models can capture lexical-semantic
information and that unsupervised models can be used
to distinguish among their representations. On
the other hand, these representations are
redundant and anisotropic. We hypothesize that
reducing these aspects can lead to better representations.
This operation can be carried out post-hoc but we
think that training new models keeping this point
in mind could lead to the development of better
models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Arthur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sergei</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>k-means++: the advantages of careful seeding</article-title>
          .
          <source>In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007</source>
          , New Orleans, Louisiana, USA, January 7-9, 2007
          , pages
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Yonatan</given-names>
            <surname>Belinkov</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Analysis methods in neural language processing: A survey</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>7</volume>
          :
          <fpage>49</fpage>
          -
          <lpage>72</lpage>
          ,
          <month>March</month>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
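          <string-name>
            <given-names>José</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mohammad Taher</given-names>
            <surname>Pilehvar</surname>
          </string-name>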
          .
          <year>2018</year>
          .
          <article-title>From word to sense embeddings: A survey on vector representations of meaning</article-title>
          .
          <source>J. Artif. Intell. Res.</source>
          ,
          <volume>63</volume>
          :
          <fpage>743</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Conneau</surname>
          </string-name>
          , German Kruszewski, Guillaume Lample, Loïc Barrault, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>What you can cram into a single $&amp;!#* vector: Probing sentence embeddings for linguistic properties</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , pages
          <fpage>2126</fpage>
          -
          <lpage>2136</lpage>
          , Melbourne, Australia, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Kawin</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>55</fpage>
          -
          <lpage>65</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Allyson</given-names>
            <surname>Ettinger</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>8</volume>
          :
          <fpage>34</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>John</given-names>
            <surname>Hewitt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Percy</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Designing and interpreting probes with control tasks</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>2733</fpage>
          -
          <lpage>2743</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>John</given-names>
            <surname>Hewitt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A structural probe for finding syntax in word representations</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4129</fpage>
          -
          <lpage>4138</lpage>
          , Minneapolis, Minnesota, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Ganesh</given-names>
            <surname>Jawahar</surname>
          </string-name>
          , Benoît Sagot, and Djamé Seddah.
          <year>2019</year>
          .
          <article-title>What does BERT learn about the structure of language?</article-title>
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>3651</fpage>
          -
          <lpage>3657</lpage>
          , Florence, Italy, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Nitish Shirish</given-names>
            <surname>Keskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bryan</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lav R.</given-names>
            <surname>Varshney</surname>
          </string-name>
          , Caiming Xiong, and Richard Socher.
          <year>2019</year>
          .
          <article-title>CTRL: A conditional transformer language model for controllable generation</article-title>
          . arXiv preprint arXiv:1909.05858.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Adam</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>How dominant is the commonest sense of a word?</article-title>
          In Petr Sojka, Ivan Kopeček, and Karel Pala, editors,
          <source>Text, Speech and Dialogue</source>
          , pages
          <fpage>103</fpage>
          -
          <lpage>111</lpage>
          , Berlin, Heidelberg. Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Olga</given-names>
            <surname>Kovaleva</surname>
          </string-name>
          , Alexey Romanov, Anna Rogers, and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Revealing the dark secrets of BERT</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>4365</fpage>
          -
          <lpage>4374</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lample</surname>
          </string-name>
          , Alexandre Sablayrolles,
          <string-name>
            <given-names>Marc'Aurelio</given-names>
            <surname>Ranzato</surname>
          </string-name>
          , Ludovic Denoyer, and Hervé Jégou.
          <year>2019</year>
          .
          <article-title>Large memory layers with product keys</article-title>
          . arXiv preprint arXiv:1907.05242.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Stuart P.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          .
          <year>1982</year>
          .
          <article-title>Least squares quantization in PCM</article-title>
          .
          <source>IEEE Trans. Information Theory</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ):
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Michel</surname>
          </string-name>
          , Omer Levy, and
          <string-name>
            <given-names>Graham</given-names>
            <surname>Neubig</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Are sixteen heads really better than one?</article-title>
          <source>In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>
          , 8-14 December 2019, Vancouver, BC, Canada, pages
          <fpage>14014</fpage>
          -
          <lpage>14024</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Gregory S. Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          , Claudia Leacock, Randee Tengi, and
          <string-name>
            <given-names>Ross T.</given-names>
            <surname>Bunker</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>A semantic concordance</article-title>
          .
          <source>In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21-24, 1993</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>George A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>WordNet: A lexical database for English</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          , November.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jeff Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language models are unsupervised multitask learners</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Mimno</surname>
          </string-name>
          and
          <string-name>
            <given-names>Laure</given-names>
            <surname>Thompson</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The strange geometry of skip-gram with negative sampling</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2873</fpage>
          -
          <lpage>2878</lpage>
          , Copenhagen, Denmark, September. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Jiaqi</given-names>
            <surname>Mu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pramod</given-names>
            <surname>Viswanath</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>All-but-the-top: Simple and effective postprocessing for word representations</article-title>
          .
          <source>In 6th International Conference on Learning Representations, ICLR 2018</source>
          , Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Meaningful clustering of senses helps boost word sense disambiguation performance</article-title>
          .
          <source>In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics</source>
          , ACL-44, pages
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          , Stroudsburg, PA, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Massimiliano</given-names>
            <surname>Pavan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marcello</given-names>
            <surname>Pelillo</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Dominant sets and pairwise clustering</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <fpage>167</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Marcello</given-names>
            <surname>Pelillo</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>What is a cluster? Perspectives from game theory</article-title>
          .
          <source>In Proc. of the NIPS Workshop on Clustering Theory.</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          , Doha, Qatar, October. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Peters</surname>
          </string-name>
          , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers), pages
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          , New Orleans, Louisiana, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Petroni</surname>
          </string-name>
          , Tim Rocktäschel, Sebastian Riedel,
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Lewis</surname>
          </string-name>
          , Anton Bakhtin,
          <string-name>
            <given-names>Yuxiang</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language models as knowledge bases?</article-title>
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
        <mixed-citation>
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Raganato</surname>
          </string-name>
          , Jose Camacho-Collados, and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Word sense disambiguation: A unified evaluation framework and empirical comparison</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>110</lpage>
          , Valencia, Spain, April. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Emily</given-names>
            <surname>Reif</surname>
          </string-name>
          , Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, and
          <string-name>
            <given-names>Been</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Visualizing and measuring the geometry of BERT</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>
          , 8-14 December 2019, Vancouver, BC, Canada, pages
          <fpage>8592</fpage>
          -
          <lpage>8600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Ian</given-names>
            <surname>Tenney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dipanjan</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ellie</given-names>
            <surname>Pavlick</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT rediscovers the classical NLP pipeline</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>4593</fpage>
          -
          <lpage>4601</lpage>
          , Florence, Italy, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Rocco</given-names>
            <surname>Tripodi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Game theory meets embeddings: a unified framework for word sense disambiguation</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>88</fpage>
          -
          <lpage>99</lpage>
          , Hong Kong, China, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Rocco</given-names>
            <surname>Tripodi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Li Pira</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Analysis of Italian word embeddings</article-title>
          .
          <source>In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)</source>
          , Rome, Italy, December 11-13, 2017.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Rocco</given-names>
            <surname>Tripodi</surname>
          </string-name>
          , Sebastiano Vascon, and
          <string-name>
            <given-names>Marcello</given-names>
            <surname>Pelillo</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Context aware nonnegative matrix factorization clustering</article-title>
          .
          <source>In 23rd International Conference on Pattern Recognition, ICPR 2016</source>
          , Cancún, Mexico, December 4-8, 2016, pages
          <fpage>1719</fpage>
          -
          <lpage>1724</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanpreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          .
          <source>In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , pages
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          , Brussels, Belgium, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          , Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SuperGLUE: A stickier benchmark for general-purpose language understanding systems</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019</source>
          , 8-14 December 2019, Vancouver, BC, Canada, pages
          <fpage>3261</fpage>
          -
          <lpage>3275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Brew</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>
          . arXiv preprint arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding</article-title>
          . arXiv preprint arXiv:1906.08237.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>