How Contextualized Word Embeddings Represent Word Senses Rocco Tripodi University of Bologna rocco.tripodi@unibo.it Abstract BERT (Devlin et al., 2019), allows the construc- tion of vector representations of lexical items that English. Contextualized embedding mod- adapt to the context in which words appear. It has els, such as ELMo and BERT, allow the been shown that the upper layers of these mod- construction of vector representations of els contain semantic information (Jawahar et al., lexical items that adapt to the context in 2019) and are more diversified than lower lay- which words appear. It was demonstrated ers (Ethayarajh, 2019). These word representa- that the upper layers of these models cap- tions overcame the meaning conflation deficiency ture semantic information. This evidence that affects static word embedding techniques paved the way for the development of (Camacho-Collados and Pilehvar, 2018; Tripodi sense representations based on words in and Pira, 2017), such as word2vec (Mikolov et al., context. In this paper, we analyze the 2013) or GloVe (Pennington et al., 2014) thanks to vector spaces produced by 11 pre-trained the adaptation to the context of use. models and evaluate these representations on two tasks. The analysis shows that The evaluation of these models has been con- all these representations contain redundant ducted mainly on downstream tasks (Wang et al., information. The results show the disad- 2018; Wang et al., 2019). With extrinsic evalua- vantage of this aspect. tions, the models are fine-tuned, adapting the vec- tor representations to specific tasks. The result- Italiano. Modelli come ELMo o BERT ing vectors are then used as features in classifica- consentono di ottenere rappresentazioni tion problems. This hinders a direct evaluation and vettoriali delle parole che si adattano analysis of the models because the evaluation also al contesto in cui queste appaiono. Il takes into account the ability of the classifier to fatto che i livelli alti di questi mod- learn the task. A model trained for this kind of task elli immagazzinino informazione seman- may learn only to discriminate among features that tica ha portato a sviluppare rappresen- belong to each class with poor generalization. tazioni di senso basate su parole nel The interpretability of neural networks is an contesto. In questo lavoro analizziamo emerging line of research NLP that aims at ana- gli spazi vettoriali prodotti con 11 mod- lyzing the properties of pre-trained language mod- elli pre-addestrati e valutiamo le loro els (Belinkov and Glass, 2019). Different stud- prestazioni nel rappresentare i diversi ies have been conducted in recent years to dis- sensi delle parole. Le analisi condotte cover what kind of linguistic information is stored mostrano che questi modelli contengono in large neural language models. Many of them informazioni ridondanti. I risultati eviden- are focused on syntax (Hewitt and Manning, 2019; ziano le criticità inerenti a questo aspetto. Jawahar et al., 2019) and attention (Michel et al., 2019; Kovaleva et al., 2019). For what con- 1 Introduction cerns semantics, the majority of the studies fo- The introduction of contextualized embedding cus on common knowledge (Petroni et al., 2019) models, such as ELMo (Peters et al., 2018) and and inference and role-based event prediction (Et- tinger, 2020). Only a few of them have been de- Copyright © 2021 for this paper by its author. Use per- mitted under Creative Commons License Attribution 4.0 In- voted to lexical semantics, for example, Reif et al. ternational (CC BY 4.0). (2019) show how different representations of the Figure 1: t-SNE representations for the word foot in SemCor, grouped by sense. same lexical form tend to cluster according to their sis. In fact, it is difficult to understand whether sense. high accuracy values are due to the representation In this work, we propose an in-depth analy- itself or, instead, they are the result of the ability sis of the properties of the vector spaces induced to learn a specific task during training. by different embedding models and an evaluation Our work is more in line with works that try of their word representations. We present how to find general properties of the representations the properties of the vector space contribute to generated by different contextualized models. For the success of the models in two tasks: sense in- example, Mimno and Thompson (2017) demon- duction and word sense disambiguation. In fact, strated that the vector space produced by a static even if contextualized models do not create one embedding model is concentrated in a narrow representation per word sense (Ethayarajh, 2019), cone and that its concentration depends on the ra- their contextualization create similar representa- tio of positive and negative examples. Mu and tions for the same word sense that can be easily Viswanath (2018) explored this analysis further, clustered. demonstrating that the embedding vectors share the same common vector and have the same main 2 Related Work direction. Ethayarajh (2019) demonstrated how Given the success (and the opacity) of contextual- upper layers of a contextualizing model produce ized embedding models, many works have been more contextualized representations. We built on proposed to analyze their inner representations. top of these works analyzing the vector space gen- These analyses are based on probing tasks (Con- erated by contextualized models and evaluating neau et al., 2018) that aim at measuring how the them. information extracted from a pre-trained model is 3 Construction of the Vector Spaces useful to represent linguistic structures. Probing tasks involve training a diagnostic classifier to de- We used SemCor (Miller et al., 1993) as reference termine if it encodes desired features. Tenney et al. corpus for our work. This choice is motivated by (2019) discovered that specific BERT’s layers are the fact that it is the largest dataset manually anno- more suited for representing information useful to tated with sense information and it is commonly solve specific tasks and that the ordering of its lay- used as training set for word sense disambigua- ers resembles the ordering of a traditional NLP tion. It contains 352 documents whose content pipeline: POS tagging, parsing, NER, semantic words (about 226, 000) have been annotated with role labeling, and coreference resolution. He- WordNet (Miller, 1995) senses. In total there are witt and Manning (2019) evaluated whether syn- 33, 341 unique senses distributed over 22, 417 dif- tax trees are embedded in a linear transformation ferent words. The sense distribution in this corpus of a neural network’s word representation space. is very skewed, and follows a power law (Kilgar- Hewitt and Liang (2019) raised the problem of in- riff, 2004). This makes the identification of senses terpreting the results derived from probing analy- challenging. The dataset is also difficult due to the Model training data vocab. size n. param. vec. dim. objective BERTbase (Devlin et al., 2019) 16GB 30K 110M 768 masked language model and next sentence prediction BERTlarge (Devlin et al., 2019) 16GB 30K 340M 1024 masked language model and next sentence prediction GPT-2base (Radford et al., 2019) 40GB 50K 117M 768 language model GPT-2medium (Radford et al., 2019) 40GB 50K 345M 1024 language model GPT-2large (Radford et al., 2019) 40GB 50K 774M 1280 language model RoBERTabase (Liu et al., 2019) 160GB 50K 125M 768 masked language model RoBERTalarge (Liu et al., 2019) 160GB 50K 355M 1024 masked language model XLNetbase (Yang et al., 2019) 126GB 32K 110M 768 bidirectional language model XLNetlarge (Yang et al., 2019) 126GB 32K 340M 1024 bidirectional language model XLMenglish 16GB 30K 665M 2048 language model CTRL (Keskar et al., 2019) 140GB 250K 1.63B 1280 conditional transformer language model Table 1: Statistics and hyperparameters of the models. Model AvgNorm MeanVecNorm(A) MeanVecNorm(Â) avg.MEV avg.IntSim avg.ExtSim BERTbase 25.78 ± 1.28 17.94 17.84 0.43 ± 0.18 0.74 ± 0.05 0.69 ± 0.06 BERTlarge 20.83 ± 2.51 12.43 11.58 0.38 ± 0.18 0.66 ± 0.08 0.59 ± 0.08 GPT-2base 125.13 ± 10.25 91.46 90.99 0.46 ± 0.18 0.79 ± 0.05 0.76 ± 0.05 GPT-2medium 427.45 ± 38.78 371.86 360.36 0.51 ± 0.18 0.85 ± 0.03 0.84 ± 0.03 GPT-2large 290.29 ± 38.56 226.39 212.97 0.43 ± 0.18 0.75 ± 0.05 0.72 ± 0.05 RoBERTabase 25.78 ± 0.56 22.17 22.25 0.51 ± 0.17 0.87 ± 0.02 0.85 ± 0.03 RoBERTalarge 31.47 ± 0.65 26.99 27.04 0.52 ± 0.18 0.88 ± 0.02 0.84 ± 0.03 XLNetbase 47.68 ± 0.66 43.28 43.26 0.53 ± 0.17 0.88 ± 0.01 0.87 ± 0.02 XLNetlarge 28.27 ± 1.42 19.56 19.68 0.38 ± 0.17 0.66 ± 0.04 0.62 ± 0.05 XLMenglish 44.92 ± 2.61 37.13 36.7 0.45 ± 0.18 0.79 ± 0.03 0.77 ± 0.03 CTRL 4443.62 ± 351.98 3927.86 3879.56 0.49 ± 0.18 0.84 ± 0.02 0.83 ± 0.02 Table 2: Detailed description of the embedding space produced with each model. fine granularity of WordNet (Navigli, 2006). 12-heads, 110M parameters) and large cased To construct the vector space A from Sem- (24-layer, 1024-hidden, 16-heads, 340M param- Cor we collected all the senses Si of a word eters); three GPT-2 (Radford et al., 2019) mod- wi and for each sense sj ∈ Si we recovered els, base (12-layer, 768-hidden, 12-heads, 117M ws ws ws the sentences {Sent1 i j , Sent2 i j , ..., Sentn i j } parameters), medium (24-layer, 1024-hidden, 16- in which this particular sense occurs. These sen- heads, 345M parameters) and large (36-layer, tences are then fed into a pre-trained model and 1280-hidden, 20-heads, 774M parameters); two the token embedding representations of word wi , RoBERTa (Liu et al., 2019) models, base (12- ws ws ws {e1 i j , e2 i j , ..., en i j }, are extracted from the layer, 768-hidden, 12-heads, 125M parameters) last hidden layer. This operation is repeated for and large (24-layer, 1024-hidden, 16-heads, 355M all the senses in Si , and for all the tagged words in parameters); two XLNet (Yang et al., 2019) mod- the vocabulary, V . The vector space corresponds els, base (12-layer, 768-hidden, 12-heads, 110M to all the representations of the words in V . parameters) and large (24-layer, 1024-hidden, 16- A t-SNE visualization of the different embed- heads, 340M parameters); one XLM (Lample dings in SemCor for the word foot is presented in et al., 2019) model (12-layer, 2048-hidden, 16- Figure 1. In this Figure, we can see that the three heads) and one CTRL (Keskar et al., 2019) model main senses of foot (i.e., human foot, unit of length (48-layer, 1280-hidden, 16-heads, 1.6B parame- and lower part) occupy a definite position in the ters). The main features of these models are sum- vector space, suggesting that the models are able marized in Table 1. We averaged the embed- to produce specific representations for the differ- dings of sub-tokens to obtain token-level represen- ent senses of a word and that they lie on defined tations. subspaces. In this work we want to test to what extent this feature is present in language models. 3.1 Analysis The first objective of this work is to analyze the Implementations details The pre-trained mod- vector space produced with the models. This anal- els used in this study are: two BERT (Devlin et al., ysis is aimed at investigating the properties of the 2019) models, base cased (12-layer, 768-hidden, contextualized vectors. A detailed description of We used the transformers library (Wolf et al., 2019). the embedding spaces constructed with the pre- trained models is presented in Table 2. We com- puted the norm for all the vectors in the vector space A, and averaged them: |A| 1 X AvgN orm = ∥ei ∥2 . (1) |A| i=1 This measure gives us an intuition on how diverse Figure 2: The first 500 principal components com- the semantic space constructed with the different puted on A and Â. models is. In fact, we can see that the magnitude of the vectors constructed with BERT, RoBERTa, then computed the internal similarity of a cluster, XLNet, and XLM is low while those of GPT-2 and c, as: CTRL are very high. 1 XX We computed also the norm of the vector re- IntSim(c) = 2 cos(ej , ek ), (4) sulting in averaging all the vectors in the semantic n −n j k̸=j space V , as: where n is the number of data points in the cluster. |A| We computed also the external similarity of a clus- 1 X ter c by computing the cosine similarity among M eanV ecN orm = ei . (2) |A| each point in c and all the points in the subspace S i=1 2 induced by the senses of a word that has c as one All the semantic spaces have non-zero mean and of its senses: the mean norm is high. This result suggests n m 1 XX that the vectors contain redundant information and ExtSim(c) = cos(ej , ek ), (5) share a common nonzero vector. This is not only n·m j=1 k=1 because the vector space contains representations where m is the total number of data points in the of the same sense. In fact, if we create a new se- subspace S (excluding those in c) and n is the mantic space, Â, averaging all the representations number of points in the cluster c. Our hypothe- of the same word sense, the M eanV ecN orm of sis is that good representations should have high this space is still high for all the models. internal similarity and low external similarity and We used the Maximum Explainable Variance that the difference between them should be high. (MEV) for the representations of each word in V . As it can be seen from Table 2 the internal This measure corresponds to the proportion of the similarity is higher than the external for all the variance in the embeddings that can be explained models. Despite this, the scores are in a wide by their first principal components and was com- range. The lowest IntSim is given by BERTlarge puted as: and the highest by RoBERTalarge and XLNetbase . σ2 M EV (w) = P 1 2 . (3) The lowest ExtSim is given by BERTlarge and i σi the highest by XLNetbase . The largest difference where σi2 1 is the first principal component of the between the two measures is given by BERTlarge . vector space A. It can give an upper bound on how RoBERTalarge gives has also a large gap between contextualized representations can be replaced by the two measures, furthermore, their standard de- a static embedding (Ethayarajh, 2019). The model viation is very low. As we will see in Section 4 with the lowest MEV is BERTlarge and XLNetlarge . these last two models perform better than others The other measures that we used for the evalu- in clustering and classification tasks. ation of the vector space are based on the very no- 4 Evaluation tion of a cluster, which imposes that the data points inside a cluster must satisfy two conditions: inter- Sense Induction This task is aimed at under- nal similarity and external dissimilarity (Pelillo, standing if representations belonging to different 2009). To this end, we used the senses of each senses can be separated using an unsupervised ap- word in the vocabulary of SemCor as clusters and proach. We hypothesize that a good contextualiza- extracted the corresponding vectors from V . We tion process should produce more discriminative model k-means dominant-set N V A R All N V A R All BERTbase 57.2 50.6 56.2 62.0 54.9 ± 14.8 55.7 45.3 51.7 45.8 51.0 ± 17.5 BERTlarge 59.3 51.9 56.9 59.0 56.2 ± 15.3 53.4 42.6 46.8 39.9 47.8 ± 17.1 GPT-2base 54.1 48.3 55.6 56.8 52.3±14.7 54.3 45.3 50.2 46.3 50.1 ± 17.2 GPT-2medium 53.9 49.1 56.2 59.8 52.8 ± 14.5 59.7 49.8 58.7 54.8 56.0 ± 18.8 GPT-2large 53.8 49.4 58.1 58.8 53.0 ± 14.8 50.2 44.1 46.1 44.1 47.1 ± 16.0 RoBERTabase 56.4 51.4 56.7 59.7 54.8 ± 14.7 65.3 55.1 64.8 61.4 61.6 ± 19.2 RoBERTalarge 58.5 53.0 58.6 62.7 56.7±14.9 66.7 56.6 66.3 64.2 63.2±19.3 XLNetbase 54.2 49.1 53.8 56.8 52.2 ± 14.4 67.2 55.0 68.7 63.8 62.7±20.7 XLNetlarge 57.6 52.5 57.9 60.8 55.9±14.4 51.0 44.8 47.5 40.9 47.6±15.0 XLMenglish 56.3 50.1 56.5 62.1 54.3 ± 15.1 60.4 51.3 59.5 55.9 57.0 ± 18.1 CTRL 53.8 47.0 56.5 57.4 51.9 ± 15.4 60.4 49.4 61.7 56.3 56.8 ± 19.2 Table 3: Results (as average accuracy) on clustering divided by algorithm and part of speech: nouns (N), verbs (V), adjectives (A), adverbs (R) and on the concatenations of all datasets (All). representations that can be easily identified by a tially using a peel-off strategy. This feature al- clustering algorithm. lows us to include in the evaluation also unam- We used the sense clusters extracted from Sem- biguous words and to see if their representations Cor as ground truth for this experiment (see Sec- are grouped into a single cluster or partitioned into tion 3) and grouped them if they are senses of different ones. We used cosine similarity to weigh the same word (with a given part of speech). We the edges of the input graph. retained only the groups that have at least 20 The results of this evaluation are presented in data points and we discarded also monosemous Table 3. RoBERTa and BERT have the overall best words for the evaluation on k-means. The re- performances on this task using both algorithms. sulting datasets consist of 1871 (entire) and 1499 In particular, RoBERTalarge performs consistently (without monosemous words) sub-datasets with well on all parts of speech and across algorithms, 141, 074 and 116, 019 data points in total, respec- while other models perform well only in combina- tively. We computed the accuracy on each sub- tion with one of the two algorithms. This is pre- dataset computing the number of data points that sumably owing to the big gap between the internal have been clustered correctly and averaged the re- and the external similarity produced by this model, sults to measure the performance of each model. as explained in Section 3.1. This evaluation tends to confirm the claim that The first algorithm is k-means (Lloyd, 1982). larger versions of the same model achieve bet- It is a partitioning, iterative algorithm whose ob- ter results. From Table 3, we can also see that jective is to minimize the sum of point-to-centroid the models have more difficulties in identifying distances, summed over all k clusters. We used the different senses of verbs, while nouns and ad- the k-means++ heuristic (Arthur and Vassilvitskii, verbs have higher results. This is probably due 2007) and the cosine distance metric to determine to the different distribution of these word classes distances. We selected this algorithm because it in the training sets of the models and WordNet’s is simple, non-parametric, and is widely used. It fine-granularity. The performances of the models is important to notice that k-means requires the with dominant-set are surprisingly high, consid- number of clusters to extract, for this reason, we ering that the setting of this experiment is com- restricted the evaluation only to ambiguous words. pletely unsupervised. Furthermore, this algorithm The second algorithm used is dominant-set (Pa- is conceived to extract compact clusters and this van and Pelillo, 2007). It is a graph-based algo- feature could drive it to over partition the vector rithm that extracts compact structures from graphs space of monosemous words. Instead, the results generalizing the notion of maximal clique defined suggest the opposite: that the models are able to on unweighted graphs to edge-weighted graphs. produce representations with high internal similar- We selected this algorithm because it is non- ity, positioning their representations on a defined parametric, requires only the adjacency matrix of sub-space. a weighted graph as input, and, more importantly, does not require the number of clusters to extract. Word Sense Disambiguation We used the The clusters are extracted from the graph sequen- method proposed in Peters et al. (2018) to create Model S2 S3 SE07 SE13 SE15 All P R F1 P R F1 P R F1 P R F1 P R F1 P R F1 BERTbase 80.6 67.9 73.7 77.2 68.8 72.8 66.4 63.1 64.7 74.4 62.7 68.1 78.3 68.8 73.2 77.0 66.8 71.5 BERTlarge 81.2 68.4 74.3 80.3 71.5 75.6 68.5 65.1 66.7 75.8 63.9 69.3 79.7 70.1 74.6 77.9 67.5 72.3 GPT-2base 75.6 63.7 69.1 71.5 63.7 67.4 59.3 56.3 57.7 71.8 60.5 65.7 74.4 65.4 69.6 72.4 62.8 67.2 GPT-2medium 76.5 64.5 70.0 72.9 65.0 68.7 62.0 58.9 60.4 74.0 62.3 67.7 76.6 67.3 71.7 74.0 64.2 68.8 GPT-2large 76.4 64.4 69.9 72.1 64.2 67.9 61.8 58.7 60.2 72.8 61.4 66.6 75.6 66.3 70.7 73.4 63.6 68.1 RoBERTabase 82.0 69.1 75.0 79.4 70.7 74.8 66.7 63.3 64.9 75.5 63.7 69.1 79.5 69.9 74.4 78.5 68.0 72.9 RoBERTalarge 82.0 69.1 75.0 80.0 71.2 75.4 70.6 67.0 68.8 77.1 65.0 70.5 81.0 71.1 75.7 79.4 68.9 73.8 XLNetbase 78.8 65.8 71.7 76.2 67.4 71.5 67.3 63.7 65.5 70.7 58.3 63.9 77.5 67.1 71.9 75.4 64.6 69.5 XLNetlarge 80.6 67.9 73.7 78.7 70.1 74.2 67.6 64.2 65.8 75.3 63.5 68.9 80.6 70.8 75.4 78.0 67.7 72.5 CTRL 73.4 61.9 67.1 70.1 62.5 66.1 54.2 51.4 52.8 68.2 57.5 62.4 72.3 63.5 67.6 69.9 60.6 64.9 Table 4: Results indicating precision (P), recall (R) and F1 on each dataset and on their concatenation (All). All the results are computed using  as vector space. sense vectors from contextualized word vectors. 5 Conclusion and Future Work This method consists in averaging all the repre- We conducted an extensive analysis of the seman- sentations of a given sense. The resulting vector tic capabilities of contextualized embedding mod- space corresponds to  (see Section 3.1). We eval- els. We analyzed the vector space constructed us- uated the generated vectors on a standard bench- ing pre-trained models and found that their vectors mark (Raganato et al., 2017) for WSD. It consists contain redundant information and that their first of five datasets that were unified to the same Word- two principal components are dominant. Net version: Senseval-2 (S2), Senseval-3 (S3), The results on sense induction are promising. SemEval-2007 (S7), SemEval-2013 and SemEval- They demonstrated the effectiveness of contex- 2015, having in total 10, 619 target words. tualized embeddings to capture semantic infor- The identification of word senses is conducted mation. We did not find higher performances by feeding the entire texts of the datasets into a from more complex models, rather, we found that pre-trained model and extracting, for each target RoBERTa, a model that was developed by sim- word wi , its embedding representation ew i k as was plifying a more complex model, BERT, was one done for the construction of the semantic space. of the best performers. Neither the dimension of Once these representations are available, we com- the hidden layers, the size of the training data, pute the cosine similarities among ew i k and the em- nor the size of the vocabulary seems to play a big beddings in  constructed with the same model role in modeling semantics. As stated in previous and selected the sense with the highest similarity. works, inserting an anisotropy penalty to the ob- We did not use more sophisticated models such as jective function of the models could improve di- WSD-games (Tripodi and Navigli, 2019; Tripodi rectly the representations. We also noticed that, et al., 2016) because we wanted to keep the evalu- even if BERT models and XLNet have different ation as simple as possible as not to influence the objectives and are trained on different data, they evaluation of the results. have similar performances. It emerged that these The results of this evaluation are presented in models are less redundant than others. Table 4. The first trend that emerges from the The conclusion that we can draw from our results is the big gap between precision and re- analysis and evaluation is that pre-trained lan- call. This is due to the absence of many senses in guage models can capture lexical-semantic infor- our training set. We did not want to use back-off mation and that unsupervised models can be used strategies or other techniques usually employed in to distinguish among their representations. On the WSD literature, to not influence the perfor- the other hand, these representations are redun- mances and the analysis of the results. Despite dant and anisotropic. We hypothesize that reduc- the simplicity of the approach, it performs surpris- ing these aspects can lead to better representations. ingly well. In particular, BERT, RoBERTa, and This operation can be carried out post-hoc but we XLNet (three bidirectional models) have very high think that training new models keeping this point results. The low performances of CTRL are proba- in mind could lead to the development of better bly due to its large vocabulary and to its objective, models. designed to solve different tasks. References the North American Chapter of the Association for Computational Linguistics: Human Language Tech- David Arthur and Sergei Vassilvitskii. 2007. k- nologies, Volume 1 (Long and Short Papers), pages means++: the advantages of careful seeding. In Pro- 4129–4138, Minneapolis, Minnesota, June. Associ- ceedings of the Eighteenth Annual ACM-SIAM Sym- ation for Computational Linguistics. posium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages Ganesh Jawahar, Benoı̂t Sagot, and Djamé Seddah. 1027–1035. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Yonatan Belinkov and James Glass. 2019. Analysis Meeting of the Association for Computational Lin- methods in neural language processing: A survey. guistics, pages 3651–3657, Florence, Italy, July. As- Transactions of the Association for Computational sociation for Computational Linguistics. Linguistics, 7:49–72, March. José Camacho-Collados and Mohammad Taher Pile- Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, hvar. 2018. From word to sense embeddings: A Caiming Xiong, and Richard Socher. 2019. Ctrl: A survey on vector representations of meaning. J. Ar- conditional transformer language model for control- tif. Intell. Res., 63:743–788. lable generation. arXiv preprint arXiv:1909.05858. Alexis Conneau, German Kruszewski, Guillaume Adam Kilgarriff. 2004. How dominant is the common- Lample, Loı̈c Barrault, and Marco Baroni. 2018. est sense of a word? In Petr Sojka, Ivan Kopeček, What you can cram into a single $&!#* vector: and Karel Pala, editors, Text, Speech and Dialogue, Probing sentence embeddings for linguistic proper- pages 103–111, Berlin, Heidelberg. Springer Berlin ties. In Proceedings of the 56th Annual Meeting of Heidelberg. the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 2126–2136, Melbourne, Olga Kovaleva, Alexey Romanov, Anna Rogers, and Australia, July. Association for Computational Lin- Anna Rumshisky. 2019. Revealing the dark secrets guistics. of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing Jacob Devlin, Ming-Wei Chang, Kenton Lee, and and the 9th International Joint Conference on Natu- Kristina Toutanova. 2019. BERT: Pre-training of ral Language Processing (EMNLP-IJCNLP), pages deep bidirectional transformers for language under- 4365–4374, Hong Kong, China, November. Associ- standing. In Proceedings of the 2019 Conference of ation for Computational Linguistics. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- Guillaume Lample, Alexandre Sablayrolles, nologies, Volume 1 (Long and Short Papers), pages Marc’Aurelio Ranzato, Ludovic Denoyer, and 4171–4186, Minneapolis, Minnesota, June. Associ- Hervé Jégou. 2019. Large memory layers with ation for Computational Linguistics. product keys. arXiv preprint arXiv:1907.05242. Kawin Ethayarajh. 2019. How contextual are contex- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- tualized word representations? comparing the ge- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, ometry of BERT, ELMo, and GPT-2 embeddings. Luke Zettlemoyer, and Veselin Stoyanov. 2019. In Proceedings of the 2019 Conference on Empir- Roberta: A robustly optimized bert pretraining ap- ical Methods in Natural Language Processing and proach. arXiv preprint arXiv:1907.11692. the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55– Stuart P. Lloyd. 1982. Least squares quantization in 65, Hong Kong, China, November. Association for PCM. IEEE Trans. Information Theory, 28(2):129– Computational Linguistics. 136. Allyson Ettinger. 2020. What bert is not: Lessons Paul Michel, Omer Levy, and Graham Neubig. 2019. from a new suite of psycholinguistic diagnostics for Are sixteen heads really better than one? In Ad- language models. Transactions of the Association vances in Neural Information Processing Systems for Computational Linguistics, 8:34–48. 32: Annual Conference on Neural Information Pro- cessing Systems 2019, NeurIPS 2019, 8-14 Decem- John Hewitt and Percy Liang. 2019. Designing and in- ber 2019, Vancouver, BC, Canada, pages 14014– terpreting probes with control tasks. In Proceedings 14024. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th Interna- Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. tional Joint Conference on Natural Language Pro- Corrado, and Jeffrey Dean. 2013. Distributed rep- cessing (EMNLP-IJCNLP), pages 2733–2743, Hong resentations of words and phrases and their com- Kong, China, November. Association for Computa- positionality. In Advances in Neural Information tional Linguistics. Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Pro- John Hewitt and Christopher D. Manning. 2019. A ceedings of a meeting held December 5-8, 2013, structural probe for finding syntax in word represen- Lake Tahoe, Nevada, United States, pages 3111– tations. In Proceedings of the 2019 Conference of 3119. George A. Miller, Claudia Leacock, Randee Tengi, and Processing and the 9th International Joint Confer- Ross T. Bunker. 1993. A semantic concordance. ence on Natural Language Processing (EMNLP- In HUMAN LANGUAGE TECHNOLOGY: Proceed- IJCNLP), pages 2463–2473, Hong Kong, China, ings of a Workshop Held at Plainsboro, New Jersey, November. Association for Computational Linguis- March 21-24, 1993. tics. George A. Miller. 1995. Wordnet: A lexical database Alec Radford, Jeff Wu, Rewon Child, David Luan, for english. Commun. ACM, 38(11):39–41, Novem- Dario Amodei, and Ilya Sutskever. 2019. Language ber. models are unsupervised multitask learners. David Mimno and Laure Thompson. 2017. The Alessandro Raganato, Jose Camacho-Collados, and strange geometry of skip-gram with negative sam- Roberto Navigli. 2017. Word sense disambigua- pling. In Proceedings of the 2017 Conference tion: A unified evaluation framework and empiri- on Empirical Methods in Natural Language Pro- cal comparison. In Proceedings of the 15th Confer- cessing, pages 2873–2878, Copenhagen, Denmark, ence of the European Chapter of the Association for September. Association for Computational Linguis- Computational Linguistics: Volume 1, Long Papers, tics. pages 99–110, Valencia, Spain, April. Association for Computational Linguistics. Jiaqi Mu and Pramod Viswanath. 2018. All-but-the- top: Simple and effective postprocessing for word Emily Reif, Ann Yuan, Martin Wattenberg, Fer- representations. In 6th International Conference on nanda B. Viégas, Andy Coenen, Adam Pearce, and Learning Representations, ICLR 2018, Vancouver, Been Kim. 2019. Visualizing and measuring the BC, Canada, April 30 - May 3, 2018, Conference geometry of BERT. In Advances in Neural Infor- Track Proceedings. mation Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Roberto Navigli. 2006. Meaningful clustering of NeurIPS 2019, 8-14 December 2019, Vancouver, senses helps boost word sense disambiguation per- BC, Canada, pages 8592–8600. formance. In Proceedings of the 21st International Conference on Computational Linguistics and the Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. 44th Annual Meeting of the Association for Com- BERT rediscovers the classical NLP pipeline. In putational Linguistics, ACL-44, pages 105–112, Proceedings of the 57th Annual Meeting of the Asso- Stroudsburg, PA, USA. Association for Computa- ciation for Computational Linguistics, pages 4593– tional Linguistics. 4601, Florence, Italy, July. Association for Compu- tational Linguistics. Massimiliano Pavan and Marcello Pelillo. 2007. Dom- inant sets and pairwise clustering. IEEE Trans. Pat- Rocco Tripodi and Roberto Navigli. 2019. Game tern Anal. Mach. Intell., 29(1):167–172. theory meets embeddings: a unified framework for word sense disambiguation. In Proceedings of the Marcello Pelillo. 2009. What is a cluster? perspectives 2019 Conference on Empirical Methods in Natu- from game theory. In Proc. of the NIPS Workshop on ral Language Processing and the 9th International Clustering Theory. Joint Conference on Natural Language Process- ing (EMNLP-IJCNLP), pages 88–99, Hong Kong, Jeffrey Pennington, Richard Socher, and Christopher China, November. Association for Computational Manning. 2014. Glove: Global vectors for word Linguistics. representation. In Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language Rocco Tripodi and Stefano Li Pira. 2017. Analysis Processing (EMNLP), pages 1532–1543, Doha, of italian word embeddings. In Proceedings of the Qatar, October. Association for Computational Lin- Fourth Italian Conference on Computational Lin- guistics. guistics (CLiC-it 2017), Rome, Italy, December 11- 13, 2017. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Rocco Tripodi, Sebastiano Vascon, and Marcello Zettlemoyer. 2018. Deep contextualized word rep- Pelillo. 2016. Context aware nonnegative ma- resentations. In Proceedings of the 2018 Confer- trix factorization clustering. In 23rd International ence of the North American Chapter of the Associ- Conference on Pattern Recognition, ICPR 2016, ation for Computational Linguistics: Human Lan- Cancún, Mexico, December 4-8, 2016, pages 1719– guage Technologies, Volume 1 (Long Papers), pages 1724. 2227–2237, New Orleans, Louisiana, June. Associ- ation for Computational Linguistics. Alex Wang, Amanpreet Singh, Julian Michael, Fe- lix Hill, Omer Levy, and Samuel Bowman. 2018. Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, GLUE: A multi-task benchmark and analysis plat- Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and form for natural language understanding. In Pro- Alexander Miller. 2019. Language models as ceedings of the 2018 EMNLP Workshop Black- knowledge bases? In Proceedings of the 2019 Con- boxNLP: Analyzing and Interpreting Neural Net- ference on Empirical Methods in Natural Language works for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguis- tics. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural In- formation Processing Systems 32: Annual Con- ference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancou- ver, BC, Canada, pages 3261–3275. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow- icz, and Jamie Brew. 2019. Huggingface’s trans- formers: State-of-the-art natural language process- ing. ArXiv, abs/1910.03771. Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- bonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretrain- ing for language understanding. arXiv preprint arXiv:1906.08237.