<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinying Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Hůla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonín Dvořák</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, CE IT4Innovations</institution>
          ,
          <addr-line>30. dubna 22, 701 03 Ostrava</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although contextual word representations produced by transformer-based language models (e.g., BERT) have proven to be very successful in different kinds of NLP tasks, there is still little knowledge about how these contextual embeddings are connected to word meanings or semantic features. In this article, we provide a quantitative analysis of the semantic vector space induced by the XLM-RoBERTa model and the Wikicorpus. We study the geometric properties of vector embeddings of selected words. We use the HDBSCAN clustering algorithm and propose a score called the Cluster Dispersion Score, which reflects how dispersed the collection of clusters is. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of embeddings of this word in the semantic vector space induced by the language model and a corpus. Some observations about the division of clusters of embeddings for several selected words are provided.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic vector space</kwd>
        <kwd>neural language models</kwd>
        <kwd>vector embeddings</kwd>
        <kwd>clustering analysis</kwd>
        <kwd>polysemy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Contextual word representations (embeddings) produced
by transformer-based language models, such as BERT,
have proven to be valuable and very successful in different
kinds of NLP tasks, including machine translation,
text generation, word sense disambiguation, etc. However,
there is still little knowledge about how these contextual
embeddings are connected to word meanings or
semantic features.</p>
      <p>We believe that if we better understand the relation of
these embeddings to the semantics of the corresponding
words, we will be able to figure out the way in which
transformer-based models learn and represent natural language. It
can also help to design more robust methods for word
sense disambiguation, analysis of semantic change, and
related tasks.</p>
      <p>In this article, we provide a quantitative analysis of
the semantic vector space induced by a popular language
model called XLM-RoBERTa [1] and a text corpus called
Wikicorpus [2]. Concretely, we study the geometric
properties of vector embeddings of selected polysemous (e.g.,
“developer”) and monosemous (e.g., “sheet”) words. For
a given word, we collect all sentences containing this
word, process these sentences by the language model,
and collect word-specific embeddings. We then use the
UMAP algorithm to reduce the dimensionality of the
embeddings and apply the HDBSCAN clustering algorithm
to cluster these embeddings.</p>
      <p>To study the geometric properties of this collection of
clusters of word-specific embeddings, we propose a
measure called the Cluster Dispersion Score. We provide figures
and descriptions of the results for several selected words.
We also quantify the correlation between the score and
the number of meanings of a given word. Our analysis
shows that the number of meanings of a word is not
directly correlated with the dispersion of the embeddings
of this word in the semantic vector space induced by the
language model and a corpus.</p>
      <p>The paper is structured as follows. Section 2 discusses
related work on the usage and properties of embeddings
obtained by transformer models. In Section 3, we describe
the methods we use, including the selection of words we
investigate, the computation of embeddings, clustering,
the computation of the cluster dispersion score, and
cluster summarization. The description of our experiments
and results can be found in Section 4. It also contains
a more detailed description of the results for several
selected target words. Then, a discussion of the
interpretation of the results is provided in Section 5. Finally,
Section 6 contains conclusions and directions for further
research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Most existing studies focus on
improving the performance of language models for Word
Sense Disambiguation (WSD) tasks, and only a few
investigate how language models encode and recover
word senses.</p>
      <p>As a semantic disambiguation task [3], WSD has
progressed greatly since the appearance of neural network
language models [4]. This is especially true for
transformer-based models [5]. For instance, BERT and
its derivatives (BERT family models) have proven to be
very successful for WSD, and word embeddings produced
by these models can deliver rather satisfying results even
with a simple non-parametric approach (e.g., nearest
neighbors) and a small training set [6, 7]. However, with
the priority of improving WSD performance, such studies
offer little insight into word vector organizations.</p>
      <p>A few works have attempted to discuss more in-depth
how transformer-based language models encode semantic
knowledge, such as semantic information provided by
WordNet (a predefined word sense inventory). Loureiro
et al. [7] provided quantitative and qualitative analysis
of different classes of words (with different numbers of
meanings) in the BERT model and found that BERT can
capture high-level or coarse-grained sense distinctions,
but it does not capture fine-grained sense distinctions. In
reality, it sometimes even fails in the coarse-grained
setting due to problems such as availability of training
data and computing resources. Loureiro et al. also gave a
detailed investigation of the BERT model regarding lexical
ambiguity and different semantic knowledge-based
benchmarks. But they did not put much emphasis on the
relationship between vector spaces and semantic knowledge.
In order to better understand the emergent semantic
space, Yenicelik et al. [8] investigated the vectors of
polysemous words by using cluster analysis. Their study
shows a similar result: BERT can to some extent distinguish
different meanings of polysemous words, but with
challenges that cannot be ignored. The work of Yenicelik
et al. is informative about the relation between BERT
embeddings and semantic knowledge, but suffers from small
sample sizes (using SemCor data with approximately 500
embeddings per word) and a missing control group
(monosemous words).</p>
      <p>Unlike the above studies, the work of Garí Soler and
Apidianaki [9] shows that BERT can detect the polysemy
level of words as well as their sense partitionability.
However, its performance is not universal. English BERT
embeddings are more likely to contain polysemy-related
information, but models in other languages can also
distinguish between words at different polysemy levels. With
carefully designed experiments, they discussed several
closely related tasks: lexical polysemy detection, polysemy
level prediction, the impact of frequency and POS (a
part-of-speech is a category of words that have similar
grammatical properties, e.g., noun, verb, adjective, adverb,
pronoun, or preposition; see
https://en.wikipedia.org/wiki/Part_of_speech),
classification by polysemy level, and word sense clusterability.
The study focuses on the macroscopic discussion of
whether language models can detect the word polysemy
level, and does not probe deeply into the fine-grained
differences within different clusters of embeddings.</p>
      <p>Finally, how semantic clusters are formed and
connected in language models has been addressed more
qualitatively than quantitatively [10, 11, 6], and there
are still no agreed-upon answers to these questions.
Our work differs in that we are trying to understand
the geometric properties of word-specific embeddings
and how they connect to semantic knowledge by
conducting quantitative and qualitative analyses with the
Wikicorpus.</p>
    </sec>
    <sec id="sec-2a">
      <title>3. Methods</title>
      <p>In this section, we describe all the steps we follow in
our analysis. Concretely, we describe the selection of
target words for the analysis, the creation of contextual
embeddings, the clustering of the embeddings, the
computation of the Cluster Dispersion Score (CDS) and, finally,
the summarization of each cluster.</p>
      <sec id="sec-2a-1">
        <title>3.1. Selection of Target Words</title>
        <p>For the analysis described in this contribution, we
selected 43 unique words (target words). The selection
process reflected two requirements: 1. The selected words
should have approximately the same frequency within
the given corpus (to be sure that our analysis is not
influenced by the frequency); 2. The selected list of words
should contain examples of words with only one unique
meaning (monosemous words) and words with multiple
meanings (polysemous words). To satisfy the second
requirement, we used the SemCor corpus [12], which is
a textual corpus with each word labeled by a specific
meaning from the WordNet ontology [13]. We selected
1000 words that have only one specific meaning within
the SemCor corpus, and 1000 words that have more than
one meaning. From these, we kept only words with a
frequency in the range of 5700–6000.</p>
        <p>Another important criterion for selecting words is to
choose words that remain the same after the tokenization
process. The language model that we use for this study is
XLM-RoBERTa [1], which is a transformer-based model
pre-trained on a large corpus (2.5TB of filtered CommonCrawl
data) in a self-supervised fashion. The model uses
a tokenizer based on SentencePiece [14], and it sometimes
tokenizes one word into two or more pieces. After
filtering out words with this tokenization condition, we
finally obtained a list of 43 target words for this study. The
resulting list contains 15 words from the monosemous
category and 28 words from the polysemous category.
The concrete words are listed in Table 1.</p>
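        <p>The selection procedure can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' code: it assumes NLTK's SemCor and WordNet data, the Hugging Face tokenizer for XLM-RoBERTa, and a hypothetical freq dictionary standing in for Wikicorpus word frequencies.</p>
        <preformat>
# Sketch of the target-word selection from Section 3.1 (assumptions:
# NLTK SemCor, Hugging Face tokenizer; `freq` maps a word to its
# Wikicorpus frequency and is assumed to be precomputed).
from collections import defaultdict

import nltk
from nltk.corpus import semcor
from transformers import AutoTokenizer

nltk.download("semcor", quiet=True)
nltk.download("wordnet", quiet=True)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Count the distinct SemCor senses observed for every word.
senses = defaultdict(set)
for sent in semcor.tagged_sents(tag="sem"):
    for chunk in sent:
        if isinstance(chunk, nltk.Tree) and hasattr(chunk.label(), "synset"):
            for leaf in chunk.leaves():
                senses[leaf.lower()].add(chunk.label().synset().name())

def is_single_piece(word):
    # Keep only words that SentencePiece leaves as one token.
    return len(tokenizer.tokenize(word)) == 1

def select_targets(freq, lo=5700, hi=6000):
    mono, poly = [], []
    for word, n in freq.items():
        if lo &lt;= n &lt;= hi and word in senses and is_single_piece(word):
            (mono if len(senses[word]) == 1 else poly).append(word)
    return mono, poly
        </preformat>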
      </sec>
      <sec id="sec-2a-2">
        <title>3.2. Computing the Contextual Embeddings</title>
        <p>Our analysis of the embedding space is carried out on
the Wikicorpus [2], which contains a large portion of the
Wikipedia 2006 dump. It contains parallel contents of
three languages, namely, Catalan, Spanish, and English.
The size of the corpus is more than 750 million words.
For our experiment, we used only the English content for
analysis.</p>
        <p>To compute contextual embeddings for a given target
word, we first collect all sentences from the Wikicorpus
that contain this word. Each sentence is then processed
by the neural language model. For our experiments, we
use a transformer-based model called XLM-RoBERTa [1]
because of its popularity in the NLP community and the
available pre-trained implementation
(https://huggingface.co/roberta-base). The model produces
a vector embedding for every word within the sentence
by taking other words in the sentence into account.
This allows the embeddings to be contextual, in contrast
to Word2Vec [4] embeddings, which are fixed and
independent of the context. We collect only the embeddings
that correspond to the target word. Each embedding has
a dimension of 768.</p>
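        <p>A minimal sketch of this extraction step, using the Hugging Face transformers library, is shown below. The checkpoint name, the use of the last hidden layer, and the occurrence-matching strategy are our assumptions; the paper only specifies the model family and the 768-dimensional output.</p>
        <preformat>
# Sketch of the contextual-embedding extraction from Section 3.2.
# Assumptions: the `xlm-roberta-base` checkpoint and the last hidden
# layer as the embedding; the paper does not state either detail.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def target_embeddings(sentences, target):
    """One 768-dimensional vector per occurrence of `target`."""
    # The target words were selected to survive tokenization as a
    # single SentencePiece token, so a single-id match suffices here.
    target_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(target)[0])
    vectors = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
        for pos, tok_id in enumerate(enc["input_ids"][0].tolist()):
            if tok_id == target_id:
                vectors.append(hidden[pos])
    return torch.stack(vectors) if vectors else torch.empty(0, 768)
        </preformat>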
      </sec>
      <sec id="sec-2a-3">
        <title>3.3. Clustering and Visualization</title>
        <p>Our hypothesis was that distinct meanings of a given
target word will form well-separated clusters in the
embedding space. We wanted to detect these clusters in
an unsupervised way without specifying the number
of clusters in advance. For this purpose, we used the
UMAP algorithm [15] to reduce the dimensionality of
each embedding to 50 and the HDBSCAN clustering
algorithm [16] to cluster the reduced embeddings. We set
the hyperparameters of these algorithms to fixed values
(for UMAP: n_neighbors = 30, min_dist = 0.0; for HDBSCAN:
min_samples = 40, min_cluster_size = 50), but we note
that for the analysis described in this paper, one could
tweak the hyperparameters for each word separately.
For the visualization of the clusters shown in Figure 4,
we use the UMAP algorithm with the same
hyperparameters, except that the embeddings are projected
into the 2D space.</p>
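        <p>The clustering pipeline, with the fixed hyperparameters listed above, can be sketched as follows using the umap-learn and hdbscan packages.</p>
        <preformat>
# Sketch of the clustering pipeline from Section 3.3, with the fixed
# hyperparameters reported in the paper (UMAP: n_neighbors=30,
# min_dist=0.0; HDBSCAN: min_samples=40, min_cluster_size=50).
import hdbscan
import numpy as np
import umap

def cluster_embeddings(embeddings: np.ndarray):
    """Reduce 768-d embeddings to 50-d with UMAP, then cluster."""
    reduced = umap.UMAP(
        n_components=50, n_neighbors=30, min_dist=0.0,
    ).fit_transform(embeddings)
    labels = hdbscan.HDBSCAN(
        min_samples=40, min_cluster_size=50,
    ).fit_predict(reduced)  # label -1 marks noise points
    return reduced, labels

def project_2d(embeddings: np.ndarray):
    """Same UMAP settings, but projected to 2D for plots like Figure 4."""
    return umap.UMAP(
        n_components=2, n_neighbors=30, min_dist=0.0,
    ).fit_transform(embeddings)
        </preformat>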
      <sec id="sec-2-1">
        <title>3.4. Cluster Dispersion Score</title>
        <p>As part of our analysis, we invent a score which should
measure how varied the usage of a given target word is. The sum in the denominator normalizes the size with
We call it cluster dispersion score or shortly dispersion score. respect to all possible pairs. The intuition behind 
It reflects the average distance between the discovered is that we want the score to be influenced more if the
3https://huggingface.co/roberta-base. two clusters contain a large proportion of embeddings,
4For UMAP: n_neighbors = 30, min_dist = 0.0, and for HDBSCAN: compared to the case when the clusters are the same
min_samples = 40, min_cluster_size = 50. distance apart but contain only few embeddings. In the
cluster  . This imbalance is captured by the binary
entropy function :
 = 
︂(</p>
        <p>|()|</p>
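        <p>The whole score can be sketched as follows. Because parts of the printed formulas had to be reconstructed, the exact forms of the size term and the entropy term should be read as our interpretation rather than the authors' reference implementation.</p>
        <preformat>
# Sketch of the Cluster Dispersion Score from Section 3.4. The exact
# normalization of s_ij and the argument of h_ij follow our reading
# of the formulas above and are therefore an assumption.
from itertools import combinations

import numpy as np
from scipy.spatial.distance import cdist

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def cluster_distance(a, b, n_pairs=20):
    """Average cosine distance over the `n_pairs` closest cross-cluster
    pairs, computed in the original 768-dimensional space."""
    d = cdist(a, b, metric="cosine").ravel()
    k = min(n_pairs, d.size)
    return float(np.sort(d)[:k].mean())

def dispersion_score(embeddings, labels):
    # HDBSCAN marks noise points with label -1; they are excluded.
    clusters = [embeddings[labels == c] for c in sorted(set(labels)) if c != -1]
    if len(clusters) &lt; 2:
        return 0.0  # the score is defined as 0 when there is no pair
    sizes = [len(c) for c in clusters]
    pair_norm = sum(sizes[i] + sizes[j]
                    for i, j in combinations(range(len(clusters)), 2))
    score = 0.0
    for i, j in combinations(range(len(clusters)), 2):
        s_ij = (sizes[i] + sizes[j]) / pair_norm                  # size term
        h_ij = binary_entropy(sizes[i] / (sizes[i] + sizes[j]))   # balance term
        score += cluster_distance(clusters[i], clusters[j]) * s_ij * h_ij
    return score
        </preformat>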
      </sec>
      <sec id="sec-2-2">
        <title>3.5. Cluster Summarization</title>
        <p>In order to produce a summary of each cluster, we list
10 words with the highest TF-IDF (Term Frequency –
Inverse Document Frequency) score [18, 19, 20]. TF-IDF is
a popular score used in information retrieval that is
intended to reflect how important a given word is to a
document in a collection of documents. It is a product of
two statistics: term frequency (how many times a given
word appears in a document relative to all words in this
document) and inverse document frequency (how rare
the word is across all documents). In our case, we
concatenate all sentences within one cluster to form a
document and then apply TF-IDF to all
clusters/documents of a given word. Before applying
TF-IDF, we remove the stop words.</p>
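        <p>A compact version of this summarization step, assuming scikit-learn's TfidfVectorizer and its built-in English stop-word list, might look as follows.</p>
        <preformat>
# Sketch of the cluster summarization from Section 3.5: each cluster's
# concatenated sentences form one document; scikit-learn's vectorizer
# and its English stop-word list are our assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize_clusters(cluster_sentences, top_k=10):
    """cluster_sentences: one list of sentences per cluster.
    Returns the `top_k` highest-TF-IDF words for each cluster."""
    docs = [" ".join(sents) for sents in cluster_sentences]
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)  # shape: (n_clusters, vocab_size)
    vocab = vec.get_feature_names_out()
    summaries = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:top_k]
        summaries.append([vocab[i] for i in top])
    return summaries
        </preformat>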
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Data and Experiments</title>
      <p>In this section, we present the experimental results with
discussion.</p>
      <p>For this study, we selected 43 target words, comprising
15 monosemous words and 28 polysemous words. For
each target word, we conducted the clustering analysis
based on the extracted embeddings. Then we calculated
the dispersion score (Section 3.4) to measure how dispersed
the clusters of a target word are; see Table 1.</p>
      <p>Comparing the dispersion scores of monosemous
words and polysemous words in Figure 1 and Table 2, we
can see that polysemous words have a larger mean and
median. These results are in line with intuition. There
should be distinct clusters of meanings for a polysemous
word and the distance between these clusters should
be greater than that between clusters for monosemous
words. Although the polysemous word group has a larger
standard deviation, it might be caused by some outliers.</p>
      <p>[Table 1 lists the 43 target words together with their
number of meanings in SemCor (NM SemCor) and in
WordNet (NM WordNet), their number of parts of speech
in WordNet (NPOS), and their dispersion score (DS). The
target words are: conversation, keyboard, mystery, buying,
lots, basically, clothes, patron, obviously, quest, celebrity,
sky, successive, developer, everyday, companion, tag, quiet,
depression, coin, afternoon, carefully, installation,
initiative, girlfriend, cruise, export, topic, tight, sheet, rap,
seal, evident, sweet, span, spin, stem, conductor, employ,
configuration, stick, comment, confidence.]</p>
      <p>For a more rigorous comparison, we ran a statistical
test. We first looked at the distributions of the scores,
which do not follow a normal distribution. Therefore, we
applied the Rank Sum Test to see whether there were
significant differences between these two groups. With
the test statistic equal to −1.4015 and the p-value equal
to 0.1611, the statistical test shows that there are no
significant differences between the dispersion scores of
monosemous and polysemous words (p-value &gt; 0.05).
This result contradicts our intuition and the results from
descriptive statistics. Therefore, in terms of the dispersion
score, we cautiously conclude that it is unclear whether
there are real differences between the two groups of words.
With more samples and experiments in the future, we
might be able to reach a more reliable conclusion.</p>
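      <p>For reference, the test can be reproduced in a few lines; we assume SciPy's ranksums here, since the paper does not name the exact implementation of the Rank Sum Test it used.</p>
      <preformat>
# Sketch of the group comparison from Section 4; scipy.stats.ranksums
# (Wilcoxon rank-sum test) is our assumption for the implementation.
from scipy.stats import ranksums

def compare_groups(mono_scores, poly_scores, alpha=0.05):
    stat, p = ranksums(mono_scores, poly_scores)
    # The paper reports a statistic of -1.4015 and p = 0.1611, i.e.
    # no significant difference at alpha = 0.05.
    return stat, p, p &lt; alpha
      </preformat>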
      <p>Furthermore, we would like to know whether there is a
correlation between the dispersion score and the number
of meanings a word has. Table 1 presents the number
of meanings of the target words. We believe that there
are two different kinds of meanings: static meanings (in
an index such as WordNet or a dictionary) and dynamic
meanings (in actual texts; for example, there is a small
cluster in the embeddings of the word “tag” which contains
only phrases such as “list by a tag”). Table 3 and Figure 3
show that there are no strong correlations. The dispersion
of clusters (representing different usages) does not correlate
with the number of meanings (and POS) a word has. Word
A, for example, may only have two meanings while word
B may have ten. However, the cluster distances of word
A may be greater than those of word B. The reason may
be that word A has two very distinct meanings and
contexts, whereas word B has ten meanings and contexts
that are more similar. A closer look at the clusters will
help us understand the factors that influence dispersion
scores.</p>
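      <p>A hedged sketch of this correlation check is given below; the paper does not state which correlation coefficient underlies Table 3, so Spearman's rank correlation is our assumption.</p>
      <preformat>
# Sketch of the correlation analysis behind Table 3. The choice of
# Spearman's rho is an assumption; the paper does not name the
# coefficient it uses.
from scipy.stats import spearmanr

def score_meaning_correlations(scores, nm_semcor, nm_wordnet, npos):
    return {
        "NM_SemCor": spearmanr(scores, nm_semcor).correlation,
        "NM_WordNet": spearmanr(scores, nm_wordnet).correlation,
        "NPOS_WordNet": spearmanr(scores, npos).correlation,
    }
      </preformat>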
      <sec id="sec-3-4">
        <title>NM_SemCor</title>
      </sec>
      <sec id="sec-3-5">
        <title>NM_WordNet</title>
      </sec>
      <sec id="sec-3-6">
        <title>NPOS_WordNet</title>
        <p>DS</p>
        <sec id="sec-3-6-1">
          <title>4.1. Closer Look at the Selected Words</title>
          <p>Looking at the monosemous words in Table 1 (those
having the value 1 in the SemCor NM column), we can
see that there are two outliers (“lots” and “developer”)
whose dispersion scores are much higher than those of
other words in this category. In Figure 4, we show the
UMAP visualization of these two words together with two
words from the polysemous category (“stick” and “sheet”).
The clusters are colored according to the labels assigned by
the clustering algorithm. Next to each cluster, we display
10 words (or 5 for the word “stick”) with the highest
TF-IDF score. As can be seen in the plot for the word “lots”,
there are three distinct clusters, two of them larger and
one smaller. The two larger clusters correspond to the
following meanings: lots as “parcels of land” and lots as
in “lots of people, money, etc.”, while the smaller cluster
contains sentences with “parking lots”. Clusters in the
other three plots can be interpreted in a similar way.</p>
        </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Discussion</title>
      <p>After taking a closer look at the discovered clusters of
each word, we can see that it is not clear when to
distinguish one meaning as separate from another. For
example, for the word “developer”, there is a well-separated
cluster corresponding to the sentences containing the
phrase “game developer” and another cluster corresponding
to sentences about software developers. Similar nuances
can also be seen in several other words. This observation
questions the completeness of manually defined lists of
word meanings, such as those given by WordNet and
other sources. One could also realize that the clusters
are largely determined by the given corpus, which is a
small snapshot of the language used at a specific time
and place. It reflects distinctions that are important to
the people who wrote the texts contained in the corpus.
Such distinctions arise because of real needs of the people
using the language (e.g., Inuits having a large number
of distinct words for different types of snow). As can be
seen in Figure 4, neural language models can discover
these distinctions just by learning to predict a word from
its context.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In this contribution, we provided a quantitative and
qualitative analysis of the semantic vector space induced by a
neural language model and a corpus. We showed that the
contextual embeddings created by the language model
often form well-separated clusters that correspond to
different meanings of the word. As part of our analysis, we
introduced a score that reflects how dispersed the
collection of clusters for a given word is. Our analysis shows
that the score is not directly correlated with the number
of meanings as defined by WordNet. After closer
inspection of several words, we concluded that it is not clear
when one meaning should be separated from another
and that manually defined lists of different meanings of
a word are not complete or fine-grained enough. Our
analysis also shows the possibility of developing
applications that will create a list of different usages of a word
in an automatic, data-driven way. We envision that such
applications may be useful for foreign language learners.</p>
      <p>We also mention a few problematic points in our
method. The most problematic point is that the
dispersion score is unstable with respect to larger changes of
hyperparameters of the clustering algorithm. We tried
to design the score to be stable with respect to splits of
larger clusters into multiple smaller ones, but more work
would need to be done in order to really achieve this
stability.</p>
      <p>Next, as discovered by Timkey et al. [21], the similarity
of embeddings created by transformer-based language
models may be greatly influenced by very few
dimensions of the embedding. These dimensions apparently
distort the cosine similarity and prevent distinguishing
nuanced meanings. Timkey et al. suggest normalizing
the embeddings before measuring the cosine similarity
as a simple way to mitigate this problem. In our
experiments, we have not seen this problem, as the clusters
were often well separated, but we plan to use the
proposed normalization in the future.</p>
      <p>Lastly, the range of selected words is very limited due
to the requirement of similar frequency and no subword
tokenization, as mentioned in Section 3.1. In the future,
we plan to conduct a more extensive analysis without
these limitations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-14"><mixed-citation>[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, arXiv preprint arXiv:1808.06226 (2018).</mixed-citation></ref>
      <ref id="ref-15"><mixed-citation>[15] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).</mixed-citation></ref>
      <ref id="ref-16"><mixed-citation>[16] R. J. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2013, pp. 160–172.</mixed-citation></ref>
      <ref id="ref-17"><mixed-citation>[17] R. Sibson, SLINK: an optimally efficient algorithm for the single-link cluster method, The Computer Journal 16 (1973) 30–34.</mixed-citation></ref>
      <ref id="ref-18"><mixed-citation>[18] A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011.</mixed-citation></ref>
      <ref id="ref-19"><mixed-citation>[19] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation (1972).</mixed-citation></ref>
      <ref id="ref-20"><mixed-citation>[20] H. P. Luhn, A statistical approach to mechanized encoding and searching of literary information, IBM Journal of Research and Development 1 (1957) 309–317.</mixed-citation></ref>
      <ref id="ref-21"><mixed-citation>[21] W. Timkey, M. van Schijndel, All bark and no bite: Rogue dimensions in transformer language models obscure representational quality, arXiv preprint arXiv:2109.04404 (2021).</mixed-citation></ref>
    </ref-list>
  </back>
</article>