<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Xinying</forename><surname>Chen</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Institute for Research and Applications of Fuzzy Modeling</orgName>
								<orgName type="institution">University of Ostrava</orgName>
								<address>
									<addrLine>CE IT4Innovations, 30. dubna 22</addrLine>
									<postCode>701 03</postCode>
									<settlement>Ostrava</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jan</forename><surname>Hůla</surname></persName>
							<email>jan.hula@osu.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for Research and Applications of Fuzzy Modeling</orgName>
								<orgName type="institution">University of Ostrava</orgName>
								<address>
									<addrLine>CE IT4Innovations, 30. dubna 22</addrLine>
									<postCode>701 03</postCode>
									<settlement>Ostrava</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Antonín</forename><surname>Dvořák</surname></persName>
							<email>antonin.dvorak@osu.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Institute for Research and Applications of Fuzzy Modeling</orgName>
								<orgName type="institution">University of Ostrava</orgName>
								<address>
									<addrLine>CE IT4Innovations, 30. dubna 22</addrLine>
									<postCode>701 03</postCode>
									<settlement>Ostrava</settlement>
									<country key="CZ">Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Analysis of the Semantic Vector Space Induced by a Neural Language Model and a Corpus</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">80ACD2A26CBAE703892307AA684D6929</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:15+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>semantic vector space</term>
					<term>neural language models</term>
					<term>vector embeddings</term>
					<term>clustering analysis</term>
					<term>polysemy</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Although contextual word representations produced by transformer-based language models (e.g., BERT) have proven to be very successful in different kinds of NLP tasks, there is still little knowledge about how these contextual embeddings are connected to word meanings or semantic features. In this article, we provide a quantitative analysis of the semantic vector space induced by the XLM-RoBERTa model and the Wikicorpus. We study the geometric properties of vector embeddings of selected words. We use the HDBSCAN clustering algorithm and propose a score, called the Cluster Dispersion Score, which reflects how dispersed the collection of clusters is. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of the embeddings of this word in the semantic vector space induced by the language model and a corpus. Some observations about the division of clusters of embeddings for several selected words are also provided.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Contextual word representations (embeddings) produced by transformer-based language models, such as BERT, have proven to be valuable and very successful in different kinds of NLP tasks, including machine translation, text generation, word sense disambiguation, etc. However, there is still little knowledge about how these contextual embeddings are connected to word meanings or semantic features.</p><p>We believe that if we better understand the relation of these embeddings to the semantics of the corresponding words, we will be able to figure out the way in which transformer-based models learn and represent natural language. It can also help to design more robust methods for word sense disambiguation, analysis of semantic change, and related tasks.</p><p>In this article, we provide a quantitative analysis of the semantic vector space induced by a popular language model called XLM-RoBERTa <ref type="bibr" target="#b0">[1]</ref> and a text corpus called Wikicorpus <ref type="bibr" target="#b1">[2]</ref>. Concretely, we study the geometric properties of vector embeddings of selected polysemous (e.g., "developer") and monosemous (e.g., "sheet") words. 1 For a given word, we collect all sentences containing this word, process these sentences with the language model, and collect word-specific embeddings. We then use the UMAP algorithm to reduce the dimensionality of the embeddings and apply the HDBSCAN clustering algorithm to cluster these embeddings.</p><p>To study the geometric properties of this collection of clusters of word-specific embeddings, we propose a measure called the Cluster Dispersion Score. We provide figures and descriptions of the results for several selected words. We also quantify the correlation between the score and the number of meanings of a given word. Our analysis shows that the number of meanings of a word is not directly correlated with the dispersion of the embeddings of this word in the semantic vector space induced by the language model and a corpus.</p><p>The paper is structured as follows. Section 2 discusses related work on the usage and properties of embeddings obtained by transformer models. In Section 3, we describe the methods we use, including the selection of words we investigate, the computation of embeddings, clustering, the computation of the Cluster Dispersion Score, and cluster summarization. The description of our experiments and results can be found in Section 4, which also contains a more detailed description of the results for several selected target words. A discussion of the interpretation of the results is provided in Section 5. Finally, Section 6 contains conclusions and directions for further research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Although neural network language models are well recognized for their ability to capture contextual semantics, in-depth discussions about the relationships between word vector representations and word meanings are not so common. The majority of works concentrate on improving the performance of language models for Word Sense Disambiguation (WSD) tasks, and only a few investigate how language models encode and recover word senses.</p><p>As a semantic disambiguation task <ref type="bibr" target="#b2">[3]</ref>, WSD has progressed greatly since the appearance of neural network language models <ref type="bibr" target="#b3">[4]</ref>. This is especially true for transformer-based models <ref type="bibr" target="#b4">[5]</ref>. For instance, BERT and its derivatives (BERT family models) have proven to be very successful for WSD, and word embeddings produced by these models can deliver rather satisfying results even with a simple non-parametric approach (e.g., nearest neighbors) and a small training set <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>. However, because such studies prioritize improving WSD performance, they offer little insight into how word vectors are organized.</p><p>A few works have attempted to discuss in more depth how transformer-based language models encode semantic knowledge, such as the semantic information provided by WordNet (a predefined word sense inventory). Loureiro et al. <ref type="bibr" target="#b6">[7]</ref> provided a quantitative and qualitative analysis of different classes of words (with different numbers of meanings) in the BERT model and found that BERT can capture high-level or coarse-grained sense distinctions, but it does not capture fine-grained sense distinctions. In fact, it sometimes fails even in the coarse-grained setting due to problems such as the availability of training data and computing resources. Loureiro et al. also gave a detailed investigation of the BERT model regarding lexical ambiguity and different semantic knowledge-based benchmarks, but they did not put much emphasis on the relationship between vector spaces and semantic knowledge. In order to better understand the emergent semantic space, Yenicelik et al. <ref type="bibr" target="#b7">[8]</ref> investigated the vectors of polysemous words by using cluster analysis. Their study shows a similar result: BERT can to some extent distinguish different meanings of polysemous words, but with challenges that cannot be ignored. The work of Yenicelik et al. is informative about the relation between BERT embeddings and semantic knowledge, but it suffers from small sample sizes (using SemCor data with approximately 500 embeddings per word) and the lack of a control group (monosemous words).</p><p>Unlike the above studies, the work of Garí Soler and Apidianaki <ref type="bibr" target="#b8">[9]</ref> shows that BERT can detect the polysemy level of words as well as their sense partitionability. However, its performance is not universal: English BERT embeddings are more likely to contain polysemy-related information, but models in other languages can also distinguish between words at different polysemy levels. With carefully designed experiments, they discussed several closely related tasks: lexical polysemy detection, polysemy level prediction, the impact of frequency and POS<ref type="foot" target="#foot_0">2</ref>, classification by polysemy level, and word sense clusterability.
The study focuses on the macroscopic question of whether language models can detect the polysemy level of a word, and does not probe deeply into the fine-grained differences within different clusters of embeddings.</p><p>Finally, how semantic clusters are formed and connected in language models has been addressed more qualitatively than quantitatively <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b5">6]</ref>, and there are still no agreed-upon answers to these questions.</p><p>Our work differs in that we try to understand the geometric properties of word-specific embeddings and how they connect to semantic knowledge by conducting quantitative and qualitative analyses with the Wikicorpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this section, we describe all the steps we follow in our analysis. Concretely, we describe the selection of target words for the analysis, the creation of contextual embeddings, the clustering of the embeddings, the computation of the Cluster Dispersion Score (CDS) and, finally, the summarization of each cluster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Selection of Target Words</head><p>For the analysis described in this contribution, we selected 43 unique words (target words). The selection process reflected two requirements: 1. The selected words should have approximately the same frequency within the given corpus (to ensure that our analysis is not influenced by frequency), 2. The selected list of words should contain examples of words with only one meaning (monosemous words) and words with multiple meanings (polysemous words). To satisfy the second requirement, we used the SemCor corpus <ref type="bibr" target="#b11">[12]</ref>, a textual corpus in which each word is labeled with a specific meaning from the WordNet ontology <ref type="bibr" target="#b12">[13]</ref>. We selected 1000 words that have only one meaning within the SemCor corpus and 1000 words that have more than one meaning. From these, we kept only words with a frequency in the range of 5700-6000.</p><p>Another important criterion is to choose words that are not split into multiple pieces by the tokenizer. The language model that we use for this study is XLM-RoBERTa <ref type="bibr" target="#b0">[1]</ref>, a transformer-based model pre-trained on a large corpus (2.5TB of filtered CommonCrawl data) in a self-supervised fashion. The model uses a tokenizer based on SentencePiece <ref type="bibr" target="#b13">[14]</ref>, and it sometimes tokenizes one word into two or more pieces. After filtering out words that are split in this way, we obtained a final list of 43 target words for this study. The resulting list contains 15 words from the monosemous category and 28 words from the polysemous category. The concrete words are listed in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
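<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of this subword filter, assuming the Hugging Face transformers tokenizer for the xlm-roberta-base checkpoint; the candidate word list below is a hypothetical placeholder for the frequency-filtered SemCor candidates.</p><code lang="python">
# Keep only candidate words that the XLM-RoBERTa SentencePiece tokenizer
# maps to a single token (i.e., words that are not split into pieces).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def is_single_token(word):
    # Tokenize the word as it would appear inside a sentence (with a
    # leading space); a word that survives unsplit yields exactly one piece.
    return len(tokenizer.tokenize(" " + word)) == 1

candidate_words = ["developer", "sheet", "lots", "configuration"]  # placeholder sample
target_words = [w for w in candidate_words if is_single_token(w)]
</code></div>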
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Computing the Contextual Embeddings</head><p>Our analysis of the embedding space is carried out on the Wikicorpus <ref type="bibr" target="#b1">[2]</ref>, which contains a large portion of the 2006 Wikipedia dump. It contains parallel content in three languages, namely Catalan, Spanish, and English, and its size is more than 750 million words. For our experiments, we use only the English content.</p><p>To compute contextual embeddings for a given target word, we first collect all sentences from the Wikicorpus that contain this word. Each sentence is then processed by the neural language model. For our experiments, we use a transformer-based model called XLM-RoBERTa <ref type="bibr" target="#b0">[1]</ref> because of its popularity in the NLP community and the available pre-trained implementation.<ref type="foot" target="#foot_2">3</ref> The model produces a vector embedding for every word within the sentence by taking the other words in the sentence into account. This makes the embeddings contextual, in contrast to Word2Vec <ref type="bibr" target="#b3">[4]</ref> embeddings, which are fixed and independent of the context. We collect only the embeddings that correspond to the target word. Each embedding has a dimension of 768.</p></div>
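<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following sketch illustrates how one contextual embedding of the target word can be extracted from a sentence, assuming the Hugging Face transformers implementation of XLM-RoBERTa (base variant, hidden size 768) and assuming that the last hidden layer is used; the choice of layer is illustrative, since the text does not prescribe one.</p><code lang="python">
# Extract the contextual embedding of a target word from one sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embedding_for(sentence, target_word):
    # The target word is assumed to survive tokenization as a single piece
    # (see Section 3.1), so its position can be looked up directly.
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    target_piece = tokenizer.tokenize(" " + target_word)[0]
    position = tokens.index(target_piece)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden[0, position]                   # 768-dimensional vector

vec = embedding_for("The developer released a new version.", "developer")
</code></div>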
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Clustering and Visualization</head><p>Our hypothesis was that distinct meanings of a given target word would form well-separated clusters in the embedding space. We wanted to detect these clusters in an unsupervised way, without specifying the number of clusters in advance. For this purpose, we used the UMAP algorithm <ref type="bibr" target="#b14">[15]</ref> to reduce the dimensionality of each embedding to 50 and the HDBSCAN clustering algorithm <ref type="bibr" target="#b15">[16]</ref> to cluster the reduced embeddings. We set the hyperparameters of these algorithms to fixed values,<ref type="foot" target="#foot_3">4</ref> but we note that for the analysis described in this paper, one could also tune the hyperparameters for each word separately. For the visualization of the clusters shown in Figure <ref type="figure" target="#fig_3">4</ref>, we use the UMAP algorithm with the same hyperparameters, except that the embeddings are projected into 2D space.</p></div>
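<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of this clustering pipeline, assuming the umap-learn and hdbscan Python packages; the hyperparameter values are the fixed values reported in the footnote, while the input file name is a hypothetical stand-in for the word-specific embeddings.</p><code lang="python">
# Reduce the 768-dimensional embeddings to 50 dimensions with UMAP,
# cluster them with HDBSCAN, and build a 2D projection for plotting.
import numpy as np
import umap
import hdbscan

embeddings = np.load("embeddings_developer.npy")  # hypothetical file, shape (n, 768)

reducer_50d = umap.UMAP(n_components=50, n_neighbors=30, min_dist=0.0)
reduced = reducer_50d.fit_transform(embeddings)

clusterer = hdbscan.HDBSCAN(min_samples=40, min_cluster_size=50)
labels = clusterer.fit_predict(reduced)  # label -1 marks points treated as outliers

# The 2D projection for the visualizations uses the same UMAP settings.
reducer_2d = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.0)
coords_2d = reducer_2d.fit_transform(embeddings)
</code></div>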
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Cluster Dispersion Score</head><p>As part of our analysis, we introduce a score that measures how varied the usage of a given target word is. We call it the Cluster Dispersion Score or, for short, the dispersion score. It reflects the average distance between the discovered clusters and also their sizes. First, we introduce the notation used in the definition of the score.</p><p>Let $X = \{X_1, \ldots, X_n\}$ be the set of embedding vectors of a given target word and $C = \{c_1, \ldots, c_m\}$ the set of indices of the clusters discovered by the clustering algorithm. We denote the distance between two clusters $c_i, c_j$ by $d_{cl}(c_i, c_j)$ and the embeddings corresponding to the cluster $c_i$ by $X(c_i)$. At a high level, the score has the following form:</p><formula xml:id="formula_0">CDS(X) = \sum_{c_i, c_j \in C,\, c_i &lt; c_j} d_{cl}(c_i, c_j) \cdot W_{ij}.</formula><p>It is the sum of weighted distances over all pairs of distinct clusters. If $m = 0$, the score is defined to be 0. The weights and distances are symmetric; therefore, we ignore pairs with $c_i \geq c_j$. To compute the distance between two clusters, we first select the 20 most similar pairs of vectors $(X_{i_k}, X_{j_k})$, where $X_{i_k} \in X(c_i)$ and $X_{j_k} \in X(c_j)$. For the similarity of two vectors, we use the cosine distance and compute it in the original 768-dimensional space. The distance between the two clusters is then the average over the 20 pairs:</p><formula xml:id="formula_1">d_{cl}(c_i, c_j) = \frac{1}{20} \sum_{k=1}^{20} d_{cos}(X_{i_k}, X_{j_k}).</formula><p>This is a variation of the single linkage distance <ref type="bibr" target="#b16">[17]</ref>, which is obtained by setting $k = 1$. Averaging over the 20 most similar pairs makes the computation more robust to outliers.</p><p>The rationale behind using the closest pairs to calculate the distance, instead of computing the distance between cluster centers, is that the clustering algorithm sometimes splits one large cluster into multiple smaller ones, as seen in Figure <ref type="figure" target="#fig_3">4</ref>. This is not a problem if we use the closest pairs to compute the distance, because in this case the distance will be negligible and will not influence the score significantly.</p><p>The weight $W_{ij}$ for a pair of clusters $c_i, c_j$ is a product of two terms:</p><formula xml:id="formula_2">W_{ij} = S_{ij} \cdot H_{ij}.</formula><p>$S_{ij}$ quantifies the proportion of embeddings contained in these two clusters. It is computed by:</p><formula xml:id="formula_3">S_{ij} = \frac{|X(c_i)| + |X(c_j)|}{\sum_{c_k, c_l \in C,\, c_k &lt; c_l} \left( |X(c_k)| + |X(c_l)| \right)}.</formula><p>The sum in the denominator normalizes the size with respect to all possible pairs. The intuition behind $S_{ij}$ is that we want the score to be influenced more if the two clusters contain a large proportion of the embeddings, compared to the case when the clusters are the same distance apart but contain only a few embeddings. In the second case, the clusters could correspond to a very rare usage of a given word or to outliers in the given corpus.<ref type="foot" target="#foot_5">5</ref> The value of $H_{ij}$ reflects how imbalanced the size of the cluster $c_i$ is with respect to the size of the cluster $c_j$. This imbalance is captured by the binary entropy function $H_b$:</p><formula xml:id="formula_4">H_{ij} = H_b\left( \frac{|X(c_i)|}{|X(c_i)| + |X(c_j)|} \right).</formula><p>The intuition behind $H_{ij}$ is that we want the score to be influenced more if the two distinct clusters have approximately equal size, compared to the case when one cluster contains, say, 95% and the other 5% of the embeddings.</p></div>
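<div xmlns="http://www.tei-c.org/ns/1.0"><p>The following sketch computes the dispersion score from the embeddings and the HDBSCAN labels, assuming that the outlier label -1 is excluded from the set of clusters (this handling of outliers is our assumption, not stated above) and using SciPy for the pairwise cosine distances in the original 768-dimensional space.</p><code lang="python">
from itertools import combinations
import numpy as np
from scipy.spatial.distance import cdist

def binary_entropy(p):
    # H_b(p) in bits; defined as 0 at the endpoints.
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))

def cluster_distance(A, B, k=20):
    # Average cosine distance over the k most similar cross-cluster pairs.
    d = np.sort(cdist(A, B, metric="cosine").ravel())
    return float(d[:min(k, d.size)].mean())

def cluster_dispersion_score(embeddings, labels):
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    clusters = [embeddings[labels == c] for c in sorted(set(labels.tolist())) if c != -1]
    pairs = list(combinations(range(len(clusters)), 2))
    if not pairs:
        return 0.0
    total = sum(len(clusters[i]) + len(clusters[j]) for i, j in pairs)
    score = 0.0
    for i, j in pairs:
        n_i, n_j = len(clusters[i]), len(clusters[j])
        d_cl = cluster_distance(clusters[i], clusters[j])
        s_ij = (n_i + n_j) / total                 # size weight S_ij
        h_ij = binary_entropy(n_i / (n_i + n_j))   # balance weight H_ij
        score += d_cl * s_ij * h_ij
    return score
</code></div>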
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Cluster summarization</head><p>In order to produce a summary of each cluster, we list the 10 words with the highest TF-IDF (Term Frequency - Inverse Document Frequency) score <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>. TF-IDF is a popular score used in information retrieval that is intended to reflect how important a given word is to a document in a collection of documents. It is the product of two statistics: term frequency (how many times a given word appears in a document, relative to all words in this document) and inverse document frequency (how rare the word is across all documents). In our case, we concatenate all sentences within one cluster to form a document and then apply TF-IDF to all clusters/documents of a given word. Before applying TF-IDF, we remove stop words.</p></div>
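<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of this summarization step with scikit-learn; the built-in English stop-word list is an assumption, since the concrete stop-word list is not specified here.</p><code lang="python">
# Concatenate the sentences of each cluster into one document, compute
# TF-IDF over these documents, and keep the 10 highest-scoring words.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize_clusters(cluster_sentences, top_k=10):
    # cluster_sentences: one list of sentences per cluster
    documents = [" ".join(sentences) for sentences in cluster_sentences]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(documents).toarray()
    vocabulary = np.array(vectorizer.get_feature_names_out())
    return [list(vocabulary[np.argsort(row)[::-1][:top_k]]) for row in tfidf]
</code></div>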
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Data and Experiments</head><p>In this section, we present the experimental results together with a discussion.</p><p>For this study, we selected 43 target words: 15 monosemous words and 28 polysemous words. For each target word, we conducted the clustering analysis based on the extracted embeddings. Then we calculated the dispersion score (Section 3.4) to measure how dispersed the clusters of a target word are; see the overview of the target words in Table <ref type="table" target="#tab_0">1</ref> (NM: number of meanings, NPOS: number of POS; the category of monosemous words consists of words which have the value 1 in the SemCor NM column).</p><p>Comparing the dispersion scores of monosemous words and polysemous words in Figure <ref type="figure" target="#fig_0">1</ref> and Table <ref type="table">2</ref>, we can see that polysemous words have a larger mean and median. These results are in line with intuition: there should be distinct clusters of meanings for a polysemous word, and the distance between these clusters should be greater than that between clusters of a monosemous word. Although the polysemous word group has a larger standard deviation, this might be caused by a few outliers.</p><p>For a more rigorous comparison, we ran a statistical test. We first looked at the distributions of the scores; see Figure <ref type="figure" target="#fig_1">2</ref>. The dispersion score distributions of monosemous and polysemous words do not seem to follow the normal distribution. Therefore, we applied the Rank Sum Test to see whether there are significant differences between these two groups. With statistic = −1.4015 and p-value = 0.1611, the test shows that there are no significant differences between the dispersion scores of monosemous and polysemous words (p-value &gt; 0.05). This result contradicts our intuition and the results from the descriptive statistics. Therefore, in terms of the dispersion score, we cautiously conclude that it is unclear whether there are real differences between the two groups of words. With more samples and experiments in the future, we might be able to reach a more reliable conclusion.</p><p>Furthermore, we would like to know whether there is a correlation between the dispersion score and the number of meanings a word has. Table <ref type="table" target="#tab_0">1</ref> presents the number of meanings of the target words. We believe that there are two different kinds of meaning: static meanings (in an index such as WordNet or a dictionary) and dynamic meanings (in actual texts). Table <ref type="table">3</ref> and Figure <ref type="figure" target="#fig_2">3</ref> show that there are no strong correlations: the dispersion of clusters (representing different usages) does not correlate with the number of meanings (and POS) a word has. Word A, for example, may have only two meanings while word B may have ten, and yet the cluster distances of word A may be greater than those of word B. The reason may be that word A has two very distinct meanings and contexts, whereas word B has ten meanings whose contexts are more similar. A closer look at the clusters will help us understand the factors that influence the dispersion scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Descriptive statistics of the dispersion scores of monosemous and polysemous words.</p><table><row><cell>descriptive statistics</cell><cell>mono</cell><cell>poly</cell></row><row><cell>mean</cell><cell>0.0013</cell><cell>0.0016</cell></row><row><cell>median</cell><cell>0.0012</cell><cell>0.0014</cell></row><row><cell>standard deviation</cell><cell>0.0007</cell><cell>0.0008</cell></row></table></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>The correlation coefficients. DS: dispersion score, NM: number of meanings, NPOS: number of POS.</p><table><row><cell></cell><cell>NM_SemCor</cell><cell>NM_WordNet</cell><cell>NPOS_WordNet</cell></row><row><cell>DS</cell><cell>0.1371</cell><cell>0.3924</cell><cell>0.1499</cell></row></table></div>
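<div xmlns="http://www.tei-c.org/ns/1.0"><p>The statistical comparison above can be reproduced along the following lines, assuming SciPy's rank-sum test and Pearson's correlation coefficient (the text does not name the correlation estimator, so Pearson is an assumption); the score and meaning-count lists are illustrative placeholders rather than the full data of Table 1.</p><code lang="python">
from scipy.stats import ranksums, pearsonr

# Placeholder dispersion scores for the two groups of target words.
mono_scores = [0.0009, 0.0013, 0.0012, 0.0008, 0.0025]
poly_scores = [0.0015, 0.0036, 0.0004, 0.0013, 0.0026]

statistic, p_value = ranksums(mono_scores, poly_scores)  # Rank Sum Test

# Correlation between dispersion scores and numbers of meanings (placeholder NM values).
all_scores = mono_scores + poly_scores
num_meanings = [1, 1, 1, 1, 1, 2, 4, 10, 4, 3]
correlation, _ = pearsonr(all_scores, num_meanings)
</code></div>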
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Closer Look at the Selected Words</head><p>Looking at the monosemous words in Table <ref type="table" target="#tab_0">1</ref> (those having the value 1 in the SemCor NM column), we can see that there are two outliers ("lots" and "developer") whose dispersion scores are much higher than those of the other words in this category. In Figure <ref type="figure" target="#fig_3">4</ref>, we show the UMAP visualization of these two words together with two words from the polysemous category ("stick" and "sheet"). The clusters are colored according to the labels assigned by the clustering algorithm. Next to each cluster, we display the 10 words (or 5 for the word "stick") with the highest TF-IDF score. As can be seen in the plot for the word "lots", there are three distinct clusters: two larger ones and one smaller one. The two larger clusters correspond to the following meanings: lots as "parcels of land" and lots as in "lots of people, money, etc.", while the smaller cluster contains sentences with "parking lots". The clusters in the other three plots can be interpreted in a similar way.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>After taking a closer look at the discovered clusters of each word, we can see that it is not clear when one meaning should be distinguished as separate from another. For example, for the word developer, there is a well-separated cluster corresponding to sentences containing the phrase "game developer" and another cluster corresponding to sentences about software developers. Similar nuances can also be seen for several other words. This observation questions the completeness of manually defined lists of word meanings, such as those given by WordNet and other sources. One should also note that the clusters are largely determined by the given corpus, which is a small snapshot of the language used at a specific time and place. It reflects distinctions that are important to the people who wrote the texts contained in the corpus. Such distinctions arise because of real needs of the people using the language (e.g., Inuits having a large number of distinct words for different types of snow). As can be seen in Figure <ref type="figure" target="#fig_3">4</ref>, neural language models can discover these distinctions just by learning to predict a word from its context.</p><p>We also mention a few problematic points of our method. The most problematic point is that the dispersion score is unstable with respect to larger changes in the hyperparameters of the clustering algorithm. We tried to design the score to be stable with respect to splits of larger clusters into multiple smaller ones, but more work would need to be done in order to really achieve this stability.</p><p>Next, as discovered by Timkey et al. <ref type="bibr" target="#b20">[21]</ref>, the similarity of embeddings created by transformer-based language models may be greatly influenced by very few dimensions of the embedding. These dimensions apparently distort the cosine similarity and prevent distinguishing nuanced meanings. Timkey et al. suggest normalizing the embeddings before measuring the cosine similarity as a simple way to mitigate this problem. In our experiments, we have not observed this problem, as the clusters were often well separated, but we plan to use the proposed normalization in the future.</p><p>Lastly, the range of selected words is very limited due to the requirement of similar frequency and no subword tokenization, as mentioned in Section 3.1. In the future, we plan to conduct a more extensive analysis without these limitations.</p></div>
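<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of such a normalization, the sketch below standardizes every dimension of the word-specific embeddings before the cosine similarity is computed; per-dimension standardization is our assumption of one simple concrete form of the correction, not necessarily the exact procedure of Timkey et al.</p><code lang="python">
import numpy as np

def standardize(embeddings):
    # Z-score each dimension over the sample so that no single dimension
    # dominates the cosine similarity.
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0) + 1e-8
    return (embeddings - mean) / std

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
</code></div>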
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this contribution, we provided a quantitative and qualitative analysis of the semantic vector space induced by a neural language model and a corpus. We showed that the contextual embeddings created by the language model often form well-separated clusters that correspond to different meanings of a word. As part of our analysis, we introduced a score that reflects how dispersed the collection of clusters for a given word is. Our analysis shows that the score is not directly correlated with the number of meanings as defined by WordNet. After a closer inspection of several words, we concluded that it is not clear when one meaning should be separated from another and that manually defined lists of the different meanings of a word are not complete or fine-grained enough. Our analysis also shows the possibility of developing applications that would create a list of the different usages of a word in an automatic, data-driven way. We envision that such applications may be useful for foreign language learners.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Boxplot of the dispersion score of monosemous and polysemous words.</figDesc><graphic coords="5,119.80,84.19,142.36,90.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distributions of cluster dispersion scores.</figDesc><graphic coords="5,119.80,323.53,142.35,90.06" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The scatter plot of the dispersion scores.</figDesc><graphic coords="5,312.79,84.19,183.03,118.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: This figure shows a UMAP visualization of embeddings of four selected words. The embeddings are colored according to the class assigned by the clustering algorithm. The dark red color corresponds to the cluster '-1', which contains outliers. The clustering was done in 50-dimensional space and therefore the 2D visualization may distort the geometry used for the clustering. Next to each cluster, we display 10 (or 5 in the case of the word 'stick') words with the highest TF-IDF score.</figDesc><graphic coords="6,89.29,84.19,416.72,404.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>word</cell><cell cols="3">SemCor WordNet NM NPOS NM</cell><cell>score</cell></row><row><cell>keyboard</cell><cell>1</cell><cell>1</cell><cell>2</cell><cell>0.0009</cell></row><row><cell>mystery</cell><cell>1</cell><cell>1</cell><cell>2</cell><cell>0.0013</cell></row><row><cell>buying</cell><cell>1</cell><cell>2</cell><cell>6</cell><cell>0.0012</cell></row><row><cell>conversation</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>0.0008</cell></row><row><cell>lots</cell><cell>1</cell><cell>3</cell><cell>11</cell><cell>0.0025</cell></row><row><cell>basically</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>0.0009</cell></row><row><cell>clothes</cell><cell>1</cell><cell>2</cell><cell>4</cell><cell>0.0006</cell></row><row><cell>patron</cell><cell>1</cell><cell>1</cell><cell>3</cell><cell>0.0016</cell></row><row><cell>obviously</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>0.0007</cell></row><row><cell>quest</cell><cell>1</cell><cell>2</cell><cell>7</cell><cell>0.0004</cell></row><row><cell>celebrity</cell><cell>1</cell><cell>1</cell><cell>2</cell><cell>0.0012</cell></row><row><cell>sky</cell><cell>1</cell><cell>2</cell><cell>2</cell><cell>0.0010</cell></row><row><cell>successive</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>0.0015</cell></row><row><cell>developer</cell><cell>1</cell><cell>1</cell><cell>2</cell><cell>0.0030</cell></row><row><cell>everyday</cell><cell>1</cell><cell>1</cell><cell>3</cell><cell>0.0015</cell></row><row><cell>companion</cell><cell>2</cell><cell>2</cell><cell>4</cell><cell>0.0015</cell></row><row><cell>tag</cell><cell>4</cell><cell>2</cell><cell>10</cell><cell>0.0036</cell></row><row><cell>quiet</cell><cell>10</cell><cell>4</cell><cell>13</cell><cell>0.0004</cell></row><row><cell>depression</cell><cell>4</cell><cell>1</cell><cell>10</cell><cell>0.0013</cell></row><row><cell>coin</cell><cell>2</cell><cell>2</cell><cell>3</cell><cell>0.0015</cell></row><row><cell>afternoon</cell><cell>2</cell><cell>1</cell><cell>2</cell><cell>0.0017</cell></row><row><cell>carefully</cell><cell>2</cell><cell>1</cell><cell>2</cell><cell>0.0010</cell></row><row><cell>installation</cell><cell>2</cell><cell>1</cell><cell>3</cell><cell>0.0011</cell></row><row><cell>initiative</cell><cell>2</cell><cell>2</cell><cell>3</cell><cell>0.0014</cell></row><row><cell>cruise</cell><cell>2</cell><cell>2</cell><cell>5</cell><cell>0.0014</cell></row><row><cell>export</cell><cell>2</cell><cell>2</cell><cell>4</cell><cell>0.0014</cell></row><row><cell>topic</cell><cell>2</cell><cell>1</cell><cell>2</cell><cell>0.0017</cell></row><row><cell>tight</cell><cell>7</cell><cell>2</cell><cell>16</cell><cell>0.0020</cell></row><row><cell>sheet</cell><cell>3</cell><cell>2</cell><cell>10</cell><cell>0.0026</cell></row><row><cell>girlfriend</cell><cell>2</cell><cell>1</cell><cell>2</cell><cell>0.0012</cell></row><row><cell>rap</cell><cell>2</cell><cell>2</cell><cell>10</cell><cell>0.0006</cell></row><row><cell>seal</cell><cell>5</cell><cell>2</cell><cell>15</cell><cell>0.0020</cell></row><row><cell>evident</cell><cell>2</cell><cell>1</cell><cell>2</cell><cell>0.0013</cell></row><row><cell>sweet</cell><cell>9</cell><cell>3</cell><cell>16</cell><cell>0.0008</cell></row><row><cell>span</cell><cell>3</cell><cell>2</cell><cell>7</cell><cell>0.0031</cell></row><row><cell>spin</cell><cell>2</cell><cell>2</cell><cell>13</cell><cell>0.0018</cell></row><row><cell
>stem</cell><cell>4</cell><cell>2</cell><cell>10</cell><cell>0.0032</cell></row><row><cell>conductor</cell><cell>3</cell><cell>1</cell><cell>4</cell><cell>0.0011</cell></row><row><cell>employ</cell><cell>3</cell><cell>2</cell><cell>3</cell><cell>0.0015</cell></row><row><cell cols="2">configuration 2</cell><cell>1</cell><cell>2</cell><cell>0.0002</cell></row><row><cell>stick</cell><cell>6</cell><cell>2</cell><cell>25</cell><cell>0.0026</cell></row><row><cell>comment</cell><cell>4</cell><cell>2</cell><cell>6</cell><cell>0.0009</cell></row><row><cell>confidence</cell><cell>3</cell><cell>1</cell><cell>5</cell><cell>0.0012</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">A part-of-speech (POS) is a category of words that have similar grammatical properties, for example, noun, verb, adjective, adverb, pronoun, preposition, etc. For more details, see https://en.wikipedia.org/wiki/Part_of_speech</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://huggingface.co/roberta-base.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">For UMAP: n_neighbors = 30, min_dist = 0.0, and for HDBSCAN: min_samples = 40, min_cluster_size = 50.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5">For example, there is a small cluster in the embeddings of the word 'tag' which contains only phrases like 'list by a tag'.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Unsupervised crosslingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>CoRR abs/1911.02116</idno>
		<ptr target="http://arxiv.org/abs/1911.02116" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Wikicorpus: A word-sense disambiguated multilingual Wikipedia corpus</title>
		<author>
			<persName><forename type="first">S</forename><surname>Reese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Boleda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cuadros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Padró</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rigau</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2010/pdf/222_Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC&apos;10), European Language Resources Association (ELRA)</title>
				<meeting>the Seventh International Conference on Language Resources and Evaluation (LREC&apos;10), European Language Resources Association (ELRA)<address><addrLine>Valletta, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">WiC: the word-in-context dataset for evaluating contextsensitive meaning representations</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Pilehvar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Camacho-Collados</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of NAACL-HLT</title>
				<meeting>NAACL-HLT</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1267" to="1273" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<title level="m">Efficient estimation of word representations in vector space</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Wiedemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Remus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.10430</idno>
		<title level="m">Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Analysis and evaluation of language models for word sense disambiguation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Loureiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rezaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Pilehvar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Camacho-Collados</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="387" to="443" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">How does BERT capture semantics? A closer look at polysemous words</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yenicelik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kilcher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="156" to="162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Let&apos;s play mono-poly: BERT can reveal words&apos; polysemy level and partitionability into senses</title>
		<author>
			<persName><forename type="first">A</forename><surname>Garí Soler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Apidianaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="825" to="844" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Visualizing and measuring the geometry of BERT</title>
		<author>
			<persName><forename type="first">E</forename><surname>Reif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wattenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">B</forename><surname>Viegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Coenen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pearce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">What does BERT learn about the structure of language?</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jawahar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seddah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL 2019-57th Annual Meeting of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A semantic concordance</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Leacock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tengi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Bunker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Human Language Technology: Proceedings of a Workshop Held at</title>
				<meeting><address><addrLine>Plainsboro, New Jersey</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1993">March 21-24, 1993</date>
			<biblScope unit="page" from="303" to="308" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">WordNet: a lexical database for English</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="39" to="41" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kudo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Richardson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1808.06226</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>McInnes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Healy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Melville</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.03426</idno>
		<title level="m">UMAP: Uniform manifold approximation and projection for dimension reduction</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Density-based clustering based on hierarchical density estimates</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Campello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Moulavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Pacific-Asia Conference on Knowledge Discovery and Data Mining</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="160" to="172" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">SLINK: an optimally efficient algorithm for the single-link cluster method</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sibson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Computer Journal</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="30" to="34" />
			<date type="published" when="1973">1973</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Rajaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Ullman</surname></persName>
		</author>
		<title level="m">Mining of Massive Datasets</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A statistical interpretation of term specificity and its application in retrieval</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Documentation</title>
		<imprint>
			<date type="published" when="1972">1972</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A statistical approach to mechanized encoding and searching of literary information</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Luhn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IBM Journal of Research and Development</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="309" to="317" />
			<date type="published" when="1957">1957</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Timkey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Schijndel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.04404</idno>
		<title level="m">All bark and no bite: Rogue dimensions in transformer language models obscure representational quality</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
