Analysis of the Semantic Vector Space Induced by a Neural
Language Model and a Corpus
Xinying Chen¹, Jan Hůla¹ and Antonín Dvořák¹

¹ Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, CE IT4Innovations, 30. dubna 22, 701 03 Ostrava, Czech Republic


Abstract

Although contextual word representations produced by transformer-based language models (e.g., BERT) have proven to be very successful in different kinds of NLP tasks, there is still little knowledge about how these contextual embeddings are connected to word meanings or semantic features. In this article, we provide a quantitative analysis of the semantic vector space induced by the XLM-RoBERTa model and the Wikicorpus. We study the geometric properties of vector embeddings of selected words. We use the HDBSCAN clustering algorithm and propose a score called the Cluster Dispersion Score, which reflects how dispersed the collection of clusters is. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of the embeddings of this word in the semantic vector space induced by the language model and a corpus. Some observations about the division of clusters of embeddings for several selected words are provided.

Keywords

semantic vector space, neural language models, vector embeddings, clustering analysis, polysemy



ITAT'22: Information Technologies – Applications and Theory, September 23–27, 2022, Zuberec, Slovakia
cici13306@gmail.com (X. Chen); jan.hula@osu.cz (J. Hůla); antonin.dvorak@osu.cz (A. Dvořák)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Contextual word representations (embeddings) produced by transformer-based language models, such as BERT, have proven to be valuable and very successful in different kinds of NLP tasks, including machine translation, text generation, word sense disambiguation, etc. However, there is still little knowledge about how these contextual embeddings are connected to word meanings or semantic features.

We believe that if we better understand the relation of these embeddings to the semantics of the corresponding words, we will be able to figure out the way in which transformer-based models learn and represent natural language. It can also help to design more robust methods for word sense disambiguation, analysis of semantic change, and related tasks.

In this article, we provide a quantitative analysis of the semantic vector space induced by a popular language model called XLM-RoBERTa [1] and a text corpus called Wikicorpus [2]. Concretely, we study the geometric properties of vector embeddings of selected polysemous (e.g., “developer”) and monosemous (e.g., “sheet”) words.¹ For a given word, we collect all sentences containing this word, process these sentences by the language model, and collect word-specific embeddings. We then use the UMAP algorithm to reduce the dimensionality of the embeddings and apply the HDBSCAN clustering algorithm to cluster these embeddings.

To study the geometric properties of this collection of clusters of word-specific embeddings, we propose a measure called Cluster Dispersion Score. We provide figures and descriptions of the results for several selected words. We also quantify the correlation between the score and the number of meanings of a given word. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of the embeddings of this word in the semantic vector space induced by the language model and a corpus.

The paper is structured as follows. Section 2 discusses related work on the usage and properties of embeddings obtained by transformer models. In Section 3, we describe the methods we use, including the selection of words we investigate, the computation of embeddings, clustering, the computation of the cluster dispersion score, and cluster summarization. The description of our experiments and results can be found in Section 4. It also contains a more detailed description of the results for several selected target words. Then, a discussion of the interpretation of the results is provided in Section 5. Finally, Section 6 contains conclusions and directions for further research.

¹ For details on how we differentiate between monosemous and polysemous words see Section 3.1.

2. Related Work

Although neural network language models are well recognized for their ability to capture contextual semantics, in-depth discussions about the relationships between word vector representations and word meanings are not so common.
The majority of works concentrate on improving the performance of language models for Word Sense Disambiguation (WSD) tasks, and only a few investigate how language models encode and recover word senses.

As a semantic disambiguation task [3], WSD has progressed greatly since the appearance of neural network language models [4]. This is especially true for transformer-based models [5]. For instance, BERT and its derivatives (BERT family models) have proven to be very successful for WSD, and word embeddings produced by these models can deliver rather satisfying results even with a simple non-parametric approach (e.g., nearest neighbors) and a small training set [6, 7]. However, with the priority of improving WSD performance, such studies offer little insight into word vector organization.

A few works have attempted to discuss in more depth how transformer-based language models encode semantic knowledge, such as the semantic information provided by WordNet (a predefined word sense inventory). Loureiro et al. [7] provided a quantitative and qualitative analysis of different classes of words (with different numbers of meanings) in the BERT model and found that BERT can capture high-level or coarse-grained sense distinctions, but it does not capture fine-grained sense distinctions. In reality, it sometimes even fails in the coarse-grained setting due to problems such as the availability of training data and computing resources. Loureiro et al. also gave a detailed investigation of the BERT model regarding lexical ambiguity and different semantic knowledge-based benchmarks, but they did not put much emphasis on the relationship between vector spaces and semantic knowledge. In order to better understand the emergent semantic space, Yenicelik et al. [8] investigated the vectors of polysemous words by using cluster analysis. Their study shows a similar result: BERT can to some extent distinguish different meanings of polysemous words, but with challenges that cannot be ignored. The work of Yenicelik et al. is informative about the relation between BERT embeddings and semantic knowledge, but suffers from small sample sizes (using SemCor data with approximately 500 embeddings per word) and a missing control group (monosemous words).

Unlike the above studies, the work of Garí Soler and Apidianaki [9] shows that BERT can detect the polysemy level of words as well as their sense partitionability. However, its performance is not universal. English BERT embeddings are more likely to contain polysemy-related information, but models in other languages can also distinguish between words at different polysemy levels. With carefully designed experiments, they discussed several closely related tasks: lexical polysemy detection, polysemy level prediction, the impact of frequency and POS², classification by polysemy level, and word sense clusterability. The study focuses on the macroscopic discussion of whether language models can detect word polysemy level, and does not probe deeply into the fine-grained differences within different clusters of embeddings.

² A part-of-speech (POS) is a category of words that have similar grammatical properties, for example, noun, verb, adjective, adverb, pronoun, preposition, etc. For more details, see https://en.wikipedia.org/wiki/Part_of_speech

Finally, how semantic clusters are formed and connected in language models has been addressed more qualitatively than quantitatively [10, 11, 6], and there are still no agreed-upon answers to these questions.

Our work differs in that we are trying to understand the geometric properties of word-specific embeddings and how they connect to semantic knowledge by conducting quantitative and qualitative analyses with the Wikicorpus.

3. Methods

In this section, we describe all the steps we follow in our analysis. Concretely, we describe the selection of target words for the analysis, the creation of contextual embeddings, the clustering of the embeddings, the computation of the Cluster Dispersion Score (CDS) and, finally, the summarization of each cluster.

3.1. Selection of Target Words

For the analysis described in this contribution, we selected 43 unique words (target words). The selection process reflected two requirements: 1. the selected words should have approximately the same frequency within the given corpus (to be sure that our analysis is not influenced by the frequency); 2. the selected list of words should contain examples of words with only one unique meaning (monosemous words) and words with multiple meanings (polysemous words). To satisfy the second requirement, we used the SemCor corpus [12], which is a textual corpus with each word labeled by a specific meaning from the WordNet ontology [13]. We selected 1000 words that have only one specific meaning within the SemCor corpus, and 1000 words that have more than one meaning. From these, we filtered only words with a frequency in the range of 5700–6000.

Another important criterion for selecting words is to choose words that remain the same after the tokenization process. The language model that we use for this study is XLM-RoBERTa [1], which is a transformer-based model pre-trained on a large corpus (2.5TB of filtered CommonCrawl data) in a self-supervised fashion. The model uses a tokenizer based on SentencePiece [14], and it sometimes tokenizes one word into two or more pieces; a minimal check for this condition is sketched below.
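A possible form of such a single-token filter, using the Hugging Face tokenizer for XLM-RoBERTa (the function name and the candidate list are illustrative and not taken from the authors' code):

```python
from transformers import AutoTokenizer

# Tokenizer of the pre-trained XLM-RoBERTa model (SentencePiece-based).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def is_single_token(word: str) -> bool:
    # The leading space makes SentencePiece treat the word as a standalone
    # word rather than as a continuation of a preceding token.
    ids = tokenizer.encode(" " + word, add_special_tokens=False)
    return len(ids) == 1

# Illustrative usage: keep only words that survive tokenization unchanged.
candidates = ["developer", "sheet", "configuration"]
target_words = [w for w in candidates if is_single_token(w)]
```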
After filtering out words with this tokenization condition, we finally obtained a list of 43 target words for this study. The resulting list contains 15 words from the monosemous category and 28 words from the polysemous category. The concrete words are listed in Table 1.

3.2. Computing the Contextual Embeddings

Our analysis of the embedding space is carried out on the Wikicorpus [2], which contains a large portion of the 2006 Wikipedia dump. It contains parallel content in three languages, namely Catalan, Spanish, and English. The size of the corpus is more than 750 million words. For our experiment, we used only the English content for the analysis.

To compute contextual embeddings for a given target word, we first collect all sentences from the Wikicorpus that contain this word. Each sentence is then processed by the neural language model. For our experiments, we use a transformer-based model called XLM-RoBERTa [1] because of its popularity in the NLP community and the available pre-trained implementation.³ The model produces a vector embedding for every word within the sentence by taking the other words in the sentence into account. This allows the embeddings to be contextual, in contrast to Word2Vec [4] embeddings, which are fixed and independent of the context. We collect only the embeddings that correspond to the target word. Each embedding has a dimension of 768.
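As an illustration, the extraction of word-specific embeddings could look like the following sketch. The paper does not state which layer of the model is used; the sketch takes the last hidden layer, and all names are ours, not the authors':

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def target_embeddings(sentences, target_word):
    """Collect the contextual vector of `target_word` from every sentence.

    Assumes the target word is kept as a single token (see Section 3.1).
    """
    target_id = tokenizer.encode(" " + target_word, add_special_tokens=False)[0]
    vectors = []
    with torch.no_grad():
        for sentence in sentences:
            enc = tokenizer(sentence, return_tensors="pt", truncation=True)
            hidden = model(**enc).last_hidden_state[0]  # shape: (seq_len, 768)
            for pos, token_id in enumerate(enc["input_ids"][0]):
                if token_id.item() == target_id:
                    vectors.append(hidden[pos].numpy())
    return vectors  # list of 768-dimensional vectors
```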
3.3. Clustering and Visualization

Our hypothesis was that distinct meanings of a given target word will form well-separated clusters in the embedding space. We wanted to detect these clusters in an unsupervised way, without specifying the number of clusters in advance. For this purpose, we used the UMAP algorithm [15] to reduce the dimensionality of each embedding to 50 and the HDBSCAN clustering algorithm [16] to cluster the reduced embeddings. We set the hyperparameters of these algorithms to fixed values,⁴ but we note that for the analysis described in this paper, one could tweak the hyperparameters for each word separately. For the visualization of the clusters shown in Figure 4, we use the UMAP algorithm with the same hyperparameters, except that the embeddings are projected into the 2D space.

³ https://huggingface.co/roberta-base
⁴ For UMAP: n_neighbors = 30, min_dist = 0.0; for HDBSCAN: min_samples = 40, min_cluster_size = 50.
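A minimal sketch of this step, assuming the umap-learn and hdbscan packages and the hyperparameters from the footnote above; `vectors` is the output of the extraction sketch in Section 3.2:

```python
import numpy as np
import umap
import hdbscan

def cluster_embeddings(vectors):
    """Reduce the embeddings to 50 dimensions with UMAP, then cluster with HDBSCAN."""
    X = np.array(vectors)
    X_50d = umap.UMAP(n_components=50, n_neighbors=30, min_dist=0.0).fit_transform(X)
    labels = hdbscan.HDBSCAN(min_samples=40, min_cluster_size=50).fit_predict(X_50d)
    # HDBSCAN labels outliers with -1; a separate 2D projection with the
    # same settings is used only for visualizations such as Figure 4.
    X_2d = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.0).fit_transform(X)
    return labels, X_2d
```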
3.4. Cluster Dispersion Score

As part of our analysis, we introduce a score which should measure how varied the usage of a given target word is. We call it the cluster dispersion score or, shortly, the dispersion score. It reflects the average distance between the discovered clusters and also their size. First, we introduce a simple notation used in the definition of the score.

Let $X = \{X_1, \ldots, X_n\}$ be a set of embedding vectors of a given target word and $C = \{c_1, \ldots, c_m\}$ the set of indices of the clusters discovered by the clustering algorithm. We denote the distance between two clusters $c_i$, $c_j$ by $d_{cl}(c_i, c_j)$ and the embeddings corresponding to the cluster $c_i$ by $X(c_i)$. At a high level, the score has the following form:

$$CDS(X) = \sum_{c_i, c_j \in C,\; c_i < c_j} d_{cl}(c_i, c_j) \cdot W_{ij}.$$

It is the sum of weighted distances over all pairs of distinct clusters. If $m = 0$, the score is defined to be equal to 0. The weights and distances are symmetric; therefore, we ignore pairs with $c_i \geq c_j$. To compute the distance between two clusters, we first select the 20 most similar pairs of vectors $(X_{i_k}, X_{j_k})$, where $X_{i_k} \in X(c_i)$ and $X_{j_k} \in X(c_j)$. For the similarity of two vectors, we use the cosine distance and compute it in the original 768-dimensional space. The distance between the two clusters is then the average over the 20 pairs:

$$d_{cl}(c_i, c_j) = \frac{1}{20} \sum_{k=1}^{20} d_{cos}(X_{i_k}, X_{j_k}).$$

It is a variation of the single linkage distance [17], which is obtained by setting $k = 1$. Averaging over the 20 most similar pairs makes the computation more robust to outliers.

The rationale behind using the closest pairs to calculate the distance, instead of computing the distance between cluster centers, is that the clustering algorithm sometimes splits one large cluster into multiple smaller ones, as seen in Figure 4. This is not a problem if we use the closest pairs to compute the distance, because the distance will be negligible in this case and will not influence the score significantly.

The weight $W_{ij}$ for a pair of two clusters $c_i$, $c_j$ is a product of two terms:

$$W_{ij} = S_{ij} \cdot H_{ij}.$$

$S_{ij}$ quantifies the proportion of embeddings contained in these two clusters. It is computed by:

$$S_{ij} = \frac{|X(c_i)| + |X(c_j)|}{\sum_{c_k, c_l \in C,\; c_k < c_l} \left( |X(c_k)| + |X(c_l)| \right)}.$$

The sum in the denominator normalizes the size with respect to all possible pairs. The intuition behind $S_{ij}$ is that we want the score to be influenced more if the two clusters contain a large proportion of the embeddings, compared to the case when the clusters are the same distance apart but contain only a few embeddings. In the second case, the clusters could correspond to a very rare usage of a given word or to outliers in the given corpus.⁵

The value of $H_{ij}$ reflects how imbalanced the proportion of the cluster $c_i$ is with respect to the size of the cluster $c_j$. This imbalance is captured by the binary entropy function $H_b$:

$$H_{ij} = H_b\!\left(\frac{|X(c_i)|}{|X(c_i)| + |X(c_j)|}\right).$$

The intuition behind $H_{ij}$ is that we want the score to be influenced more if the two distinct clusters have approximately similar sizes, compared to the case when one cluster contains, say, 95% and the other 5% of the embeddings.
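A straightforward implementation of the score could follow the sketch below. It is our reading of the definitions above, not the authors' code: `X` holds the original 768-dimensional embeddings of one target word, `labels` comes from the clustering sketch, and the outlier cluster -1 is assumed to be excluded from the score.

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cdist

def binary_entropy(p):
    """Binary entropy H_b(p) in bits; defined as 0 for p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def cluster_distance(A, B, k=20):
    """Average cosine distance over the k most similar cross-cluster pairs."""
    distances = cdist(A, B, metric="cosine").ravel()
    return np.sort(distances)[:k].mean()

def cluster_dispersion_score(X, labels):
    """Cluster Dispersion Score of one target word (0 if fewer than two clusters)."""
    clusters = [X[labels == c] for c in sorted(set(labels)) if c != -1]
    if len(clusters) < 2:
        return 0.0
    pair_sizes = {(i, j): len(clusters[i]) + len(clusters[j])
                  for i, j in combinations(range(len(clusters)), 2)}
    total = sum(pair_sizes.values())  # normalization over all pairs (S_ij denominator)
    score = 0.0
    for (i, j), size in pair_sizes.items():
        S = size / total
        H = binary_entropy(len(clusters[i]) / size)
        score += cluster_distance(clusters[i], clusters[j]) * S * H
    return score
```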
3.5. Cluster Summarization

In order to produce a summary of each cluster, we list the 10 words with the highest TF-IDF score (Term Frequency – Inverse Document Frequency) [18, 19, 20]. TF-IDF is a popular score used in information retrieval that is intended to reflect how important a given word is to a document in a collection of documents. It is a product of two statistics: term frequency (how many times a given word appears in a document relative to all words in this document) and inverse document frequency (how rare the word is across all documents). In our case, we concatenate all sentences within one cluster together to form a document and then apply the TF-IDF to all clusters/documents of a given word. Before applying the TF-IDF, we remove the stop words.
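This summarization step could be sketched with scikit-learn as follows. The sketch is ours and makes assumptions not stated in the paper (e.g., scikit-learn's English stop word list); `sentences_by_cluster` maps each cluster label of one target word to the list of its sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize_clusters(sentences_by_cluster, top_k=10):
    """Return the top_k TF-IDF words for each cluster of one target word."""
    labels = sorted(sentences_by_cluster)
    # One "document" per cluster: all of its sentences concatenated.
    documents = [" ".join(sentences_by_cluster[c]) for c in labels]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents).toarray()
    vocabulary = vectorizer.get_feature_names_out()
    summaries = {}
    for row, c in zip(tfidf, labels):
        top = row.argsort()[::-1][:top_k]
        summaries[c] = [vocabulary[i] for i in top]
    return summaries
```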
4. Data and Experiments

In this section, we present the experimental results with a discussion.

For this study, we selected 43 target words: 15 monosemous words and 28 polysemous words. For each target word, we conducted the clustering analysis based on the extracted embeddings. Then we calculated the dispersion score (Section 3.4) to measure how dispersed the clusters of a target word are; see Table 1.

word          | SemCor NM | WordNet NPOS | WordNet NM | score
keyboard      | 1  | 1 | 2  | 0.0009
mystery       | 1  | 1 | 2  | 0.0013
buying        | 1  | 2 | 6  | 0.0012
conversation  | 1  | 1 | 1  | 0.0008
lots          | 1  | 3 | 11 | 0.0025
basically     | 1  | 1 | 1  | 0.0009
clothes       | 1  | 2 | 4  | 0.0006
patron        | 1  | 1 | 3  | 0.0016
obviously     | 1  | 1 | 1  | 0.0007
quest         | 1  | 2 | 7  | 0.0004
celebrity     | 1  | 1 | 2  | 0.0012
sky           | 1  | 2 | 2  | 0.0010
successive    | 1  | 1 | 1  | 0.0015
developer     | 1  | 1 | 2  | 0.0030
everyday      | 1  | 1 | 3  | 0.0015
companion     | 2  | 2 | 4  | 0.0015
tag           | 4  | 2 | 10 | 0.0036
quiet         | 10 | 4 | 13 | 0.0004
depression    | 4  | 1 | 10 | 0.0013
coin          | 2  | 2 | 3  | 0.0015
afternoon     | 2  | 1 | 2  | 0.0017
carefully     | 2  | 1 | 2  | 0.0010
installation  | 2  | 1 | 3  | 0.0011
initiative    | 2  | 2 | 3  | 0.0014
cruise        | 2  | 2 | 5  | 0.0014
export        | 2  | 2 | 4  | 0.0014
topic         | 2  | 1 | 2  | 0.0017
tight         | 7  | 2 | 16 | 0.0020
sheet         | 3  | 2 | 10 | 0.0026
girlfriend    | 2  | 1 | 2  | 0.0012
rap           | 2  | 2 | 10 | 0.0006
seal          | 5  | 2 | 15 | 0.0020
evident       | 2  | 1 | 2  | 0.0013
sweet         | 9  | 3 | 16 | 0.0008
span          | 3  | 2 | 7  | 0.0031
spin          | 2  | 2 | 13 | 0.0018
stem          | 4  | 2 | 10 | 0.0032
conductor     | 3  | 1 | 4  | 0.0011
employ        | 3  | 2 | 3  | 0.0015
configuration | 2  | 1 | 2  | 0.0002
stick         | 6  | 2 | 25 | 0.0026
comment       | 4  | 2 | 6  | 0.0009
confidence    | 3  | 1 | 5  | 0.0012

Table 1
The overview of target words. NM: number of meanings, NPOS: number of POS. The category of monosemous words consists of words which have the value 1 in the SemCor NM column.

⁵ For example, there is a small cluster in the embeddings of the word ‘tag’ which contains only phrases ‘list by a tag’.

Comparing the dispersion scores of monosemous words and polysemous words in Figure 1 and Table 2, we can see that polysemous words have a larger mean and median. These results are in line with intuition: there should be distinct clusters of meanings for a polysemous word, and the distance between these clusters should be greater than that between clusters for monosemous words. Although the polysemous word group has a larger standard deviation, this might be caused by some outliers.

For a more rigorous comparison, we ran a statistical test. We first looked at the distributions of the scores; see Figure 2. The dispersion score distributions of monosemous and polysemous words seem not to follow the normal distribution. Therefore, we applied the Rank Sum Test to see whether there were significant differences between these two groups.
With the statistic = −1.4015 and p-value = 0.1611, the statistical test shows that there are no significant differences between the dispersion scores of monosemous and polysemous words (p-value > 0.05). This result contradicts our intuition and the descriptive statistics. Therefore, in terms of the dispersion score, we cautiously conclude that it is unclear whether there are real differences between the two groups of words. With more samples and experiments in the future, we might be able to reach a more reliable conclusion.

Figure 1: Boxplot of the dispersion score of monosemous and polysemous words.

Figure 2: Distributions of cluster dispersion scores.

descriptive statistics | mono   | poly
mean                   | 0.0013 | 0.0016
median                 | 0.0012 | 0.0014
standard deviation     | 0.0007 | 0.0008

Table 2
Descriptive statistics of the dispersion scores of monosemous and polysemous words.
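The group comparison reported above could be reproduced along the lines of the following sketch (our illustration; the two score lists are the values reported in Table 1, and the exact test variant used by the authors is not stated beyond "Rank Sum Test"):

```python
from scipy.stats import ranksums

# Dispersion scores of the two groups, taken from Table 1.
mono_scores = [0.0009, 0.0013, 0.0012, 0.0008, 0.0025, 0.0009, 0.0006,
               0.0016, 0.0007, 0.0004, 0.0012, 0.0010, 0.0015, 0.0030, 0.0015]
poly_scores = [0.0015, 0.0036, 0.0004, 0.0013, 0.0015, 0.0017, 0.0010,
               0.0011, 0.0014, 0.0014, 0.0014, 0.0017, 0.0020, 0.0026,
               0.0012, 0.0006, 0.0020, 0.0013, 0.0008, 0.0031, 0.0018,
               0.0032, 0.0011, 0.0015, 0.0002, 0.0026, 0.0009, 0.0012]

# Wilcoxon rank-sum test; it does not assume normally distributed scores.
statistic, p_value = ranksums(mono_scores, poly_scores)
print(statistic, p_value)  # the paper reports statistic = -1.4015, p = 0.1611
```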
Furthermore, we would like to know whether there is a correlation between the dispersion score and the number of meanings a word has. Table 1 presents the numbers of meanings of the target words. We believe that there are two different kinds of meaning: static meanings (in an index such as WordNet or a dictionary) and dynamic meanings (in actual texts). Table 3 and Figure 3 show that there are no strong correlations. The dispersion of clusters (representing different usages) does not correlate with the number of meanings (and POS) a word has. Word A, for example, may have only two meanings while word B may have ten; however, the cluster distances of word A may be greater than those of word B. The reason may be that word A has two very distinct meanings and contexts, whereas word B has ten meanings and contexts that are more similar. A closer look at the clusters will help us understand the factors that influence the dispersion scores.

Figure 3: The scatter plot of the dispersion scores.

       | NM_SemCor | NM_WordNet | NPOS_WordNet
DS     | 0.1371    | 0.3924     | 0.1499

Table 3
The correlation coefficient. DS: dispersion score, NM: number of meanings, NPOS: number of POS.
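One entry of such a correlation table could be computed as in the sketch below. The paper does not state which correlation coefficient was used, so the choice of Pearson correlation here is an assumption of ours:

```python
from scipy.stats import pearsonr

def correlation_with_score(dispersion_scores, meaning_counts):
    """Correlation between dispersion scores and one column of Table 1.

    Assumption: Pearson correlation; the paper only reports a
    'correlation coefficient' without naming the variant.
    """
    r, p_value = pearsonr(dispersion_scores, meaning_counts)
    return r
```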
4.1. Closer Look at the Selected Words

Looking at the monosemous words in Table 1 (those having the value 1 in the SemCor NM column), we can see that there are two outliers (“lots” and “developer”) that have a dispersion score much higher than the other words in this category. In Figure 4, we show the UMAP visualization of these two words together with two words from the polysemous category (“stick” and “sheet”). The clusters are colored according to the labels assigned by the clustering algorithm. Next to each cluster, we display the 10 words (or 5 for the word “stick”) with the highest TF-IDF score. As can be seen in the plot for the word “lots”, there are three distinct clusters, two of them larger and one smaller. The two larger clusters correspond to the following meanings: lots as “parcels of land” and lots as in “lots of people, money, etc.”; the smaller cluster contains sentences with “parking lots”. Clusters in the other three plots can be interpreted in a similar way.

Figure 4: This figure shows a UMAP visualization of embeddings of four selected words. The embeddings are colored according to the class assigned by the clustering algorithm. The dark red color corresponds to the cluster ‘-1’, which contains outliers. The clustering was done in 50-dimensional space and therefore the 2D visualization may distort the geometry used for the clustering. Next to each cluster, we display 10 (or 5 in the case of the word ‘stick’) words with the highest TF-IDF score.



5. Discussion

After taking a closer look at the discovered clusters of each word, we can see that it is not clear when to distinguish one meaning as separate from another. For example, for the word developer, there is a well-separated cluster corresponding to the sentences containing the phrase “game developer” and another cluster corresponding to sentences about software developers. Similar nuances can also be seen for several other words. This observation questions the completeness of manually defined lists of word meanings, such as those given by WordNet and other sources. One could also realize that the clusters are largely determined by the given corpus, which is a small snapshot of the language used at a specific time and place. It reflects distinctions that are important to the people who wrote the texts contained in the corpus. Such distinctions arise because of the real needs of the people using the language (e.g., the Inuit having a large number of distinct words for different types of snow). As can be seen in Figure 4, neural language models can discover these distinctions just by learning to predict a word from
its context.

We also mention a few problematic points of our method. The most problematic point is that the dispersion score is unstable with respect to larger changes of the hyperparameters of the clustering algorithm. We tried to design the score to be stable with respect to splits of larger clusters into multiple smaller ones, but more work would need to be done in order to really achieve this stability.

Next, as discovered by Timkey et al. [21], the similarity of embeddings created by transformer-based language models may be greatly influenced by very few dimensions of the embedding. These dimensions apparently distort the cosine similarity and prevent distinguishing nuanced meanings. Timkey et al. suggest normalizing the embeddings before measuring the cosine similarity as a simple way to mitigate this problem. In our experiments, we have not seen this problem, as the clusters were often well separated, but we plan to use the proposed normalization in the future.
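One common form of such a correction is to standardize each embedding dimension before computing cosine similarities; a minimal sketch of this idea (our interpretation, not the exact procedure of Timkey et al. [21]):

```python
import numpy as np

def standardize_embeddings(X, eps=1e-8):
    """Standardize each dimension so that a few 'rogue' dimensions with a
    large mean or variance do not dominate the cosine similarity."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)
```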
Lastly, the range of selected words is very limited due to the requirement of similar frequency and no subword tokenization, as mentioned in Section 3.1. In the future, we plan to conduct a more extensive analysis without these limitations.

6. Conclusion

In this contribution, we provided a quantitative and qualitative analysis of the semantic vector space induced by a neural language model and a corpus. We showed that the contextual embeddings created by the language model often form well-separated clusters that correspond to different meanings of the word. As part of our analysis, we introduced a score that reflects how dispersed the collection of clusters for a given word is. Our analysis shows that the score is not directly correlated with the number of meanings as defined by WordNet. After closer inspection of several words, we concluded that it is not clear when one meaning should be separated from another and that manually defined lists of different meanings of a word are not complete or fine-grained enough. Our analysis also shows the possibility of developing applications that would create a list of different usages of a word in an automatic, data-driven way. We envision that such applications may be useful for foreign language learners.

References

[1] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[2] S. Reese, G. Boleda, M. Cuadros, L. Padró, G. Rigau, Wikicorpus: A word-sense disambiguated multilingual Wikipedia corpus, in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA), Valletta, Malta, 2010. URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/222_Paper.pdf.
[3] M. T. Pilehvar, J. Camacho-Collados, WiC: the word-in-context dataset for evaluating context-sensitive meaning representations, in: Proceedings of NAACL-HLT, 2019, pp. 1267–1273.
[4] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[6] G. Wiedemann, S. Remus, A. Chawla, C. Biemann, Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings, arXiv preprint arXiv:1909.10430 (2019).
[7] D. Loureiro, K. Rezaee, M. T. Pilehvar, J. Camacho-Collados, Analysis and evaluation of language models for word sense disambiguation, Computational Linguistics 47 (2021) 387–443.
[8] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? A closer look at polysemous words, in: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020, pp. 156–162.
[9] A. Garí Soler, M. Apidianaki, Let's play mono-poly: BERT can reveal words' polysemy level and partitionability into senses, Transactions of the Association for Computational Linguistics 9 (2021) 825–844.
[10] E. Reif, A. Yuan, M. Wattenberg, F. B. Viegas, A. Coenen, A. Pearce, B. Kim, Visualizing and measuring the geometry of BERT, Advances in Neural Information Processing Systems 32 (2019).
[11] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: ACL 2019 – 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[12] G. A. Miller, C. Leacock, R. Tengi, R. T. Bunker, A semantic concordance, in: Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey, March 21–24, 1993, 1993, pp. 303–308.
[13] G. A. Miller, WordNet: a lexical database for English, Communications of the ACM 38 (1995) 39–41.
[14] T. Kudo, J. Richardson, SentencePiece: A simple and
language independent subword tokenizer and detokenizer for Neural Text Processing, arXiv preprint arXiv:1808.06226 (2018).
[15] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[16] R. J. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2013, pp. 160–172.
[17] R. Sibson, SLINK: an optimally efficient algorithm for the single-link cluster method, The Computer Journal 16 (1973) 30–34.
[18] A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011.
[19] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation (1972).
[20] H. P. Luhn, A statistical approach to mechanized encoding and searching of literary information, IBM Journal of Research and Development 1 (1957) 309–317.
[21] W. Timkey, M. van Schijndel, All bark and no bite: Rogue dimensions in transformer language models obscure representational quality, arXiv preprint arXiv:2109.04404 (2021).