Sense Tree: Discovery of New Word Senses with Graph-based Scoring

Jan Ehmüller¹, Lasse Kohlmeyer¹, Holly McKee¹, Daniel Paeschke¹, Tim Repke², Ralf Krestel², and Felix Naumann²
Hasso Plattner Institute, University of Potsdam, Germany
¹ first.last@student.hpi.uni-potsdam.de, ² first.last@hpi.uni-potsdam.de

Abstract. Language is dynamic and constantly evolving: both the usage context and the meaning of words change over time. Identifying words that acquired new meanings and the point in time at which new word senses emerged is elementary for word sense disambiguation and entity linking in historical texts. For example, cloud once stood mostly for the weather phenomenon and only recently gained the new sense of cloud computing. We propose a clustering-based approach that computes sense trees, showing how the meanings of words change over time. The produced results are easy to interpret and explain using a drill-down mechanism. We evaluate our approach qualitatively on the Corpus of Historical American English (COHA), which spans two hundred years.

Copyright © 2020 by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 The Evolution of Language

As language evolves, words develop new senses. Detecting these new senses over time is useful for several applications, such as word sense disambiguation and entity linking in historical texts. Current machine learning models that represent and disambiguate word senses or link entities are trained on recent datasets. Therefore, these models are not aware of how language has changed over the last decades or even centuries. When disambiguating words in historical texts, it is not possible for such a model to know which senses a word could have had at that time. For instance, the word cloud was once used mostly in the newspaper section of weather forecasts. Nowadays, it is increasingly used in the context of cloud computing. Hence, when disambiguating cloud in texts from the 19th century with a model trained on current data, the model would not be aware that the meaning of cloud computing did not yet exist at that point in time.

New word senses enter colloquial language from many aspects of human life, such as popular culture, new technologies, world events, and social movements. In 2019 alone, over 1,100 new words and meanings were added to the Merriam-Webster Dictionary. Because the usage of context words differs over time, we can discover word senses by taking advantage of this temporal aspect. Discovering new or different senses of a word is known as word sense induction (WSI). An example of WSI can be found in the context of depression. Depression is commonly known in the sense of a mental health disorder, but it also has the sense of an economic crisis, as in the term Great Depression. The aim of WSI is to find out which senses the word depression has. Word sense disambiguation (WSD), on the other hand, is the automated disambiguation of a word sense within a text [15]. While WSI aims to discover the different senses of one word, WSD aims to decide which sense a word has in a specific context, such as in a sentence.

We propose a WSI method that detects new senses by creating multiple co-occurrence graphs over time and extracting word senses based on so-called ego-networks. For a given word, it uses graph clustering to extract word senses from the word's ego-network. These word senses are matched over time to create a forest of so-called "sense trees".
This forest can be explored to find out if and when a word gained new senses over time. With the help of linguists, we annotated a list of 112 words for evaluation¹ and tested our approach on the Corpus of Historical American English (COHA), with 400 million words the largest corpus of its type [4].

¹ https://hpi.de/naumann/s/language-evolution

2 Related Work

Research on word sense induction and word sense disambiguation addresses how to improve information retrieval in search engines, information extraction for specific domains, entity linking over time, machine translation, and lexicography [15]. Our research focuses on WSI and aims to discover the emergence of new word meanings over time. Approaches can be divided roughly into vector space models, which include word embeddings [5], and graph clustering techniques [19], which include word clustering [5] and co-occurrence graphs [14].

Word embeddings are vector representations of words in a semantic space that allow word meanings to be induced from context. Word embeddings are trained by a neural network that learns to predict a word based on its context words or vice versa. Hence, words with similar contexts have similar vectors and are thus in close proximity to each other. The initially proposed word2vec model [13] computes only one vector per word and thus allows for only one meaning. Natural language, however, is ambiguous, and most words are polysemous: they have multiple senses. Sense embeddings, as proposed by Song et al. [17], allow for multiple senses by representing each sense instance as its own vector.

However, neither traditional word embeddings, like word2vec, nor sense embeddings consider the temporal aspect; they assume that words are static across time. This creates a challenge in the face of dynamic and changing natural language [1]. Word embeddings can be used for temporal tasks if multiple embeddings are trained separately for separate time periods. As those embeddings are trained separately, they do not lie in the same semantic embedding space [10]. To ensure that they are in the same space and thus are comparable, they need to be aligned with each other [1]. To resolve such alignment issues, dynamic or diachronic word embeddings were introduced [10]. Kim et al. solve this by training their model on the earliest time period first. Using the obtained weights as the initial state for the next training phase, they move through subsequent periods, allowing the embeddings to gain complexity and pick up new senses [9]. Yao et al. present another idea, called dynamic word embeddings, where alignment is enforced by simultaneously training the word embeddings for different time periods [21]. The disadvantage of these approaches is that each addresses either the temporal semantic shift or the multi-sense aspect of words, but not both. In contrast, our approach takes both aspects into consideration.

Another method for extracting word senses is the use of graph clustering or community detection. This method builds a graph based on co-occurrence features from a corpus. Such graphs represent the relations of words to each other and allow for the extraction of sense clusters through graph clustering. Tahmasebi and Risse show that automatic word sense change detection based on curvature clustering can help understand the different senses in historical archives [18]; their manual evaluation covers 23 terms with known sense changes (e.g., "gay"). Hope and Keller introduce the soft clustering algorithm MaxMax [7].
They identify word senses by transforming the co-occurrence graph around a given word into an unweighted, directed graph.

Mitra et al. present an approach that builds graphs based on distributional thesauri for separate time periods [14]. From those graphs they extract so-called "ego-networks". With the randomized graph-clustering algorithm Chinese Whispers, sense clusters are induced for specific words from their ego-networks [2]. A similar and more recent approach by Ustalov et al. performs the clustering step with the meta-algorithm Watset. This algorithm uses hard clustering algorithms, such as Louvain or Chinese Whispers, to perform a soft clustering [19]. Hard clustering algorithms assign nodes to exactly one cluster, whereas soft clustering produces a probability distribution of cluster assignments.

Besides embeddings and graph models, topic modeling has been applied to WSI as well. Lau et al. model word senses as topics of a word by applying latent Dirichlet allocation (LDA) as well as non-parametric hierarchical Dirichlet processes (HDP). They also applied their approach to novel sense detection by comparing the induced senses of words across a diachronic corpus containing two time periods, using a self-developed dataset of 10 words [11]. Jatowt and Duh provide a framework for discovering and visualizing semantic changes with respect to individual words, word pairs, and word sentiment. They use n-gram frequencies, positional information, and latent semantic analysis to construct word vectors for each decade of Google Books and COHA. To derive changes of word senses, they calculate the cosine similarity of vector representations of the same word at different points in time. The authors show results of a case study on semantic changes of single words, inter-decade word similarity, and contrasting word pairs in 16 experiments with mostly different words [8].

Our approach is similar to that of Mitra et al. [14], but differs in that we build a simpler co-occurrence graph and compare three different clustering algorithms. Additionally, our approach automatically interprets the results of comparing sense clusters to rank a word according to the likelihood of having gained a new sense over time. Furthermore, we use more fine-grained time slices to narrow in more precisely on the point in time at which a new sense emerges, whereas Mitra et al. merely oppose two time slices at a time. This point can be visualized by our drill down, which shows the forest of sense trees built from ego-networks. This ability makes our approach valuable as an exploration tool in the field of historical and diachronic linguistics, specifically for hypothesis testing.

3 A Forest of Word Senses

Our approach can be divided into the three parts visualized in Figure 1 using a fictitious example: (i) construction of a weighted co-occurrence graph, (ii) extraction of word senses from a word's ego-network in the form of sense clusters, and (iii) matching them over time to create a forest of sense trees. We separate the corpus into n time slices C_t, which are handled as subcorpora. Similar to Mitra et al. [14], our analysis is limited to nouns. For each subcorpus, we construct a filtered co-occurrence graph and extract sense clusters for each word. Finally, these clusters are connected across time slices to form sense trees.
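To make the preprocessing concrete, the following minimal Python sketch slices a corpus into decade subcorpora and reduces each sentence to its nouns. It is illustrative only: the paper does not specify which part-of-speech tagger was used, so spaCy is assumed here, and `documents` is a hypothetical iterable of (year, text) pairs.

```python
from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger; not specified in the paper

def slice_corpus(documents, slice_size=10):
    """Group (year, text) pairs into decade subcorpora C_t."""
    slices = defaultdict(list)
    for year, text in documents:
        slices[year - year % slice_size].append(text)
    return dict(sorted(slices.items()))

def noun_sentences(texts):
    """Yield each sentence reduced to its noun lemmas, the unit we later
    use for counting co-occurrences."""
    for doc in nlp.pipe(texts):
        for sent in doc.sents:
            nouns = [tok.lemma_.lower() for tok in sent if tok.pos_ == "NOUN"]
            if len(nouns) > 1:  # a single noun cannot co-occur with anything
                yield nouns
```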
3.1 Building a Co-Occurrence Graph

To build a weighted co-occurrence graph for time slice C_t, we need to count how often words appear in the same context in C_t. Two nouns co-occur if they are part of the same sentence and the word distance between them is at most n_window. Therefore, the parameter n_window controls the complexity of the resulting co-occurrence graph. A smaller window size results in fewer co-occurrence edges and hence in a sparser graph. Having computed the co-occurrences, we can create the graph G_t = (V_t, E_t) for each time slice C_t. The sets of nodes and edges of G_t for time slice C_t are defined as

V_t = {u ∈ C_t | u is a noun ∧ tf(u) ≥ α_tf}
E_t = {{u, v} | u, v ∈ V_t ∧ u ≠ v ∧ cooc(u, v) ≥ α_cooc}    (1)

where tf(u) is the term frequency of u in the given time slice and cooc(u, v) is the number of times words u and v appear within the same window. We use the threshold parameters α_tf and α_cooc to exclude rarely occurring words since, depending on the window size n_window, the number of edges in this graph would otherwise increase rapidly.

Oftentimes, a corpus has an unbalanced distribution of data across the time slices. As a result, some slices have much higher raw co-occurrence values. This especially affects frequently occurring words with generic meanings, such as man and time. Our initial filtering does not account for that. We therefore use the pointwise mutual information (PMI) [20]

pmi(u, v) = cooc(u, v) / (tf(u) · tf(v))

to reduce the importance of frequent words. After filtering further edges and possibly unconnected nodes, the final co-occurrence graph G'_t = (V'_t, E'_t) for time slice C_t, depicted as one column in Figure 1a, is defined by

E'_t = {{u, v} ∈ E_t | pmi(u, v) ≥ α_pmi},   V'_t = ⋃ E'_t    (2)

where α_pmi is a threshold parameter to remove the most loosely associated words in the given time slice.

Fig. 1: Example of the construction of a simple sense forest for the word mouse: (a) co-occurrence graph, (b) ego-network for the word mouse, (c) forest of sense trees.
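A minimal sketch of the graph construction of Eqs. (1) and (2), operating on the noun sentences from the previous sketch. The values of α_tf and α_cooc are illustrative placeholders (the paper reports only α_pmi = 0.01 in Section 5.3), and the window is applied over the noun sequence of a sentence, an approximation of the word distance.

```python
from collections import Counter

def build_cooc_graph(sentences, window=5, alpha_tf=10, alpha_cooc=5, alpha_pmi=0.01):
    """Build the PMI-filtered co-occurrence graph G'_t for one time slice.
    `sentences` is an iterable of noun lists; returns adjacency sets."""
    tf, cooc = Counter(), Counter()
    for nouns in sentences:
        tf.update(nouns)
        for i, u in enumerate(nouns):
            for v in nouns[i + 1 : i + 1 + window]:  # word distance <= window
                if u != v:
                    cooc[frozenset((u, v))] += 1
    graph = {}
    for pair, count in cooc.items():
        u, v = tuple(pair)
        # Eq. (1): term-frequency and co-occurrence thresholds
        if count < alpha_cooc or tf[u] < alpha_tf or tf[v] < alpha_tf:
            continue
        # Eq. (2): PMI filter exactly as defined in the paper (no logarithm)
        if count / (tf[u] * tf[v]) < alpha_pmi:
            continue
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph  # V'_t is implicitly the key set: only nodes with edges remain
```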
3.2 Word Sense Extraction

We use G'_t to extract information about word contexts in time slice C_t. We hypothesize that the context of a word is an indicator of its senses, as suggested by Lindén, and hence use clustering to extract those senses [12]. The context of word w can be extracted in the form of an ego-network, which contains all neighbors of w and the edges among them, but not w itself. Figure 1b shows such a network for the word mouse. The different colors indicate the different clusters of a time slice that were produced by a graph-clustering algorithm. Formally, we define the ego-network Ĝ_w = (V̂_w, Ê_w) of word w in time slice C_t as

V̂_w = {u ∈ V'_t | {u, w} ∈ E'_t}
Ê_w = {{u, v} ∈ E'_t | u ∈ V̂_w ∧ v ∈ V̂_w}    (3)

To extract the different senses of w, we cluster the nodes of its ego-network Ĝ_w. In Section 4 we compare different clustering strategies. Each of these clustering algorithms produces a set of p disjoint clusters S_w^t = {c_1, ..., c_p} from the ego-network of word w in time slice C_t. Following Mitra et al. [14], we assume that each of the resulting clusters represents a "sense cluster". Ideally, each meaning of w in C_t is represented by exactly one sense cluster. By relaxing this condition and allowing more than one sense cluster per sense, we obtain better and more fine-grained results from the clustering algorithms.
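The ego-network of Eq. (3) and the clustering step can be sketched with networkx (Louvain requires networkx ≥ 2.8). Choosing Louvain here anticipates the comparison in Section 4; Chinese Whispers or Girvan-Newman could be substituted.

```python
import networkx as nx
from networkx.algorithms import community

def ego_network(graph, word):
    """Ego-network of `word` (Eq. 3): its neighbors and the edges among
    them, excluding the ego node itself."""
    neighbors = graph.get(word, set())
    g = nx.Graph()
    g.add_nodes_from(neighbors)
    g.add_edges_from((u, v) for u in neighbors for v in graph[u]
                     if v in neighbors and u < v)  # each undirected edge once
    return g

def sense_clusters(graph, word, seed=0):
    """Cluster the ego-network into disjoint sense clusters S_w^t."""
    ego = ego_network(graph, word)
    if ego.number_of_nodes() == 0:
        return []
    # Louvain performed best in our comparison (Section 4).
    return [frozenset(c) for c in community.louvain_communities(ego, seed=seed)]
```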
3.3 Matching Word Senses Over Time

In the last step, sense clusters are matched across time slices. We use the Jaccard similarity to compare sets of words, which are given by the clusters of word w from two neighboring time slices C_t and C_{t-1}. Let c_i ∈ S_w^t be a sense cluster for word w in time slice C_t. After a pairwise comparison of all sense clusters across the two time slices, we use a greedy approach to iteratively match those with the highest score. However, if there is no cluster c'_j ∈ S_w^{t-1} that shares any words with c_i ∈ S_w^t, c_i remains unmatched. In this case, it becomes the root cluster of a new sense tree.

A disadvantage of this matching strategy is that a word sense might simply not occur in some time slices and thus interrupt the lineage of that word sense. This may happen with sense clusters whose words have low frequencies, such as words that appear only in specific scientific literature. Because they cannot always be matched between neighboring time slices, we match them across a longer time span. Sense clusters in S_w^t are matched not only to the sense clusters in S_w^{t-1}, but also to any previous sense cluster in S_w^1, ..., S_w^{t-2} that remained unmatched. We call this matching strategy leaf-matching:

S'_w^{t-1} = S_w^{t-1} ∪ ⋃_{i=1}^{t-2} {c' ∈ S_w^i | c' not matched to any c'' ∈ ⋃_{j=i+1}^{t-1} S_w^j}    (4)

Performing this comparison for each time slice produces a forest of sense trees F_w = (V_w, E_w):

V_w = ⋃_{i=1}^{n} S_w^i;   E_w = {(c, c') | c ∈ S_w^i ∧ c' ∈ S_w^j ∧ i < j ∧ c matched to c'}    (5)

F_w contains sense clusters without incoming edges. These clusters are root clusters and represent the beginning of a sense tree. Given a root cluster r, the respective sense tree F_w^r = (V_w^r, E_w^r) is defined as follows:

V_w^r = {c ∈ V_w | ∃ path p = ((r, c'_1), ..., (c'_k, c))}    (6)
E_w^r = {(c, c') ∈ E_w | c, c' ∈ V_w^r}

Such a sense tree represents a distinct sense of the word w. For example, in Figure 1c, the matched clusters make up three sense trees, referring to three different meanings of the word mouse.
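One possible realization of the greedy matching and of leaf-matching (Eq. 4), as a minimal sketch. Clusters are plain frozensets here, so identical word sets from different slices would collide; a production version should tag each cluster with its time slice.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def match_slices(candidates, current):
    """Greedy matching: repeatedly link the highest-scoring pair of a
    candidate parent and a current cluster; each side matches at most once."""
    pairs = sorted(((jaccard(p, c), p, c) for p in candidates for c in current),
                   key=lambda t: -t[0])
    used_p, used_c, edges = set(), set(), []
    for score, p, c in pairs:
        if score > 0 and p not in used_p and c not in used_c:
            used_p.add(p)
            used_c.add(c)
            edges.append((p, c))
    return edges  # current clusters left unmatched become new tree roots

def build_forest(clusters_by_slice):
    """Leaf-matching (Eq. 4): candidates are the previous slice's clusters
    plus clusters from older slices that never found a match."""
    edges, candidates = [], []
    for clusters in clusters_by_slice:
        new_edges = match_slices(candidates, clusters)
        edges.extend(new_edges)
        matched = {p for p, _ in new_edges}
        candidates = [p for p in candidates if p not in matched] + list(clusters)
    return edges  # the edge set E_w of the forest F_w (Eq. 5)
```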
4 Comparison of Algorithms for Sense Clustering

In this section we introduce the following algorithms for the clustering step in Section 3.2: Chinese Whispers [2], Girvan-Newman [6], and Louvain [3]. We also describe the insights gained from experiments in which we used our drill down to inspect the clusters these algorithms produce.

Chinese Whispers is an agglomerative clustering algorithm introduced by Biemann [2]. It does not have a fixed number of clusters, a property that suits word senses, whose number is not known a priori. Its only parameter is the number of iterations. Depending on the size of the graph, a small number of iterations might never produce larger sense clusters. We chose 1,000 iterations to ensure that the algorithm converges, as suggested by Biemann [2]. Inspecting the computed clusters, we discovered that Chinese Whispers produces one very large sense cluster that contains nearly every word in the ego-network, along with very few other small sense clusters. Since the large sense cluster contains more than one meaning, the results do not fit our use case.

Girvan-Newman is a community detection algorithm named after its authors [6]. The algorithm is a hierarchical method that iteratively removes edges from the graph to find communities. It always removes the edge with the highest betweenness centrality or a custom metric, such as the co-occurrence frequency. Hence, the initially computed sense cluster contains all nodes of the ego-network; with each iteration, it produces more detailed sense clusters. This algorithm also produces a variable number of clusters and hence fits well for the task of identifying an unknown number of word senses. However, in our configuration, the computational costs of Girvan-Newman are around 25-30 times higher than those of Louvain and Chinese Whispers. Due to this high computational cost, we chose only three iterations. Because the granularity of the hierarchy depends on the number of iterations, Girvan-Newman effectively produces one large cluster that contains most words of an ego-network in addition to many one-node sense clusters. As with Chinese Whispers, these results do not fit the use case of identifying multiple meanings of a word.

Louvain is a hierarchical community detection algorithm proposed by Blondel et al. [3]. It uses the Louvain modularity, which measures the difference between the edge density inside communities and the edge density outside communities, to identify communities in a graph. Similarly to the previous algorithms, its number of detected clusters is not fixed. From the computed hierarchy it greedily chooses the graph partition that optimizes the algorithm's modularity measure. While investigating the computed sense clusters, we found that this partitioning produced sense clusters of fairly balanced size. The results are applicable to our use case, since in most cases sense clusters can be assigned to exactly one meaning of a word. However, in the later time slices, with an increasing number of documents and thus co-occurrences, the computed sense clusters are not partitioned as well. In some cases Louvain produces quite large sense clusters that contain multiple meanings of a word.

5 Evaluation

The evaluation of WSI approaches is an open challenge, as there are no standardized experimental settings or gold standards. Kutuzov et al. address the need for standardized, robust test sets of semantic shifts in their 2018 survey [10]. Other researchers created manually selected word lists, which can vary widely and make it difficult to compare the accuracy of approaches [14,21]. In this section, we introduce our evaluation data, which we compiled by aggregating the different approaches used in related work and annotated with the help of expert linguists. We also discuss the impact of the hyperparameters of our approach and demonstrate its effectiveness qualitatively.

5.1 Compiled Word List for Evaluation

The word list compiled by Yarowsky [22] is used in several publications on word sense disambiguation [16]. Yarowsky developed an algorithm that disambiguates the following twelve words with two distinct meanings each: axes, bass, crane, drug, duty, motion, palm, plant, poach, sake, space, and tank. We extended this list by adding new dictionary entries, novel technical terms, and words based on related work [15]. It also contains words that did not gain new meanings in the last 200 years. The words were annotated with respect to new sense gain using word entries from the Oxford English and Merriam-Webster Dictionaries, which include references to the first known use of words with a particular meaning. WordNet², a commonly used lexical resource for computational linguistics, groups words with regard to word form as well as meaning. To merge the results obtained from the two approaches, we use a logical OR operation: a word is considered to have gained a new sense if it is labeled positively by either of them. In total, we compiled a set of 112 words with either a single sense, multiple but stable senses, or at least one newly gained sense.

² https://wordnet.princeton.edu/

Each candidate word was annotated by 15 linguists (C1 and C2 level) as "Gained an additional sense since 1800" or "No additional sense since 1800". We measure an overall agreement of 61% and a free-marginal Fleiss kappa of 0.42 (fixed-marginal: 0.22). Based on a simple majority vote, 42 words gained an additional sense, whereas 70 did not. For some words, such as android, beef, or pot, we saw a high annotator agreement of over 75%. When using this as a threshold, 14 words gained an additional sense, whereas 31 did not. The linguists were particularly undecided on the words cat, honey, power, and state. The annotations are available on our website¹.

5.2 Corpus of Historical American English (COHA)

For demonstrating our proposed approach, we use the Corpus of Historical American English (COHA)³. It is one of the largest temporal corpora covering one of the longest time ranges [4]. COHA spans texts from the years 1810 to 2009 and contains more than 100,000 single texts from fiction, popular magazines, newspapers, and non-fiction books, with a total of 400 million words.

³ https://www.english-corpora.org/coha/

We split the corpus by decade to generate the time slices, since vastly different vocabularies would result in different word contexts. Ideally, the vocabulary and word distribution are relatively stable across all time slices. Since the number of tokens increases with each decade, we measure the vocabulary overlap. In our measurements (not shown due to space constraints), neighboring decades share 20-30% of their vocabulary, while decades that are further apart share only 5-15% of their vocabulary for both measures. Very high values produced by the cosine similarity highlight that frequently occurring words appear in most decades. We conclude that using a frequency- and co-occurrence-based approach to extract information about changes of word senses over time is feasible. For a deeper statistical analysis of the corpus, we refer readers to the work by Jatowt and Duh [8].

5.3 Hyperparameter Evaluation

For the evaluation, we derive a score for how likely it is that a word gained a new sense over time. To this end, we count the number of sense trees for a word and consider their distribution over time; sense trees with a length of 1 are ignored. This count is used to rank words such that words that gained a new sense are at the top. Using our annotated data, we calculate the precision@k with k ∈ {5, 10, 20}. The following parameter settings were found by optimizing our approach on the annotated word list: we set the pointwise mutual information (PMI) threshold parameter α_pmi = 0.01, we used the Jaccard similarity as the similarity measure, and we used leaf-matching to match sense clusters across time slices.

We compare the three graph clustering algorithms introduced in Section 4: Chinese Whispers, Girvan-Newman, and Louvain. Table 1 presents the results.

Table 1: Comparison of graph clustering algorithms

Algorithm          Precision@5   Precision@10   Precision@20
Chinese Whispers   0.4           0.4            0.35
Girvan-Newman      0.2           0.5            0.55
Louvain            0.6           0.6            0.4

Louvain outperforms the other two algorithms at k = 5 and k = 10. Girvan-Newman has a much lower precision at k = 5, but performs much better at k = 10 and k = 20. Chinese Whispers does not perform as well. We suggest Louvain as the clustering algorithm, because it best fits our use case that a sense cluster should represent only a single meaning.
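The scoring and its evaluation can be sketched as follows; `gold_positive` is a hypothetical set of annotated words that gained a sense, and trees of length 1 are ignored as described above.

```python
def sense_tree_count(edges, all_clusters):
    """Number of non-trivial sense trees: root clusters (no incoming edge)
    that start a tree of length > 1, i.e., were matched at least once."""
    children = {c for _, c in edges}
    parents = {p for p, _ in edges}
    roots = (c for c in all_clusters if c not in children)
    return sum(1 for r in roots if r in parents)

def precision_at_k(ranking, gold_positive, k):
    """Fraction of the top-k ranked words annotated as having gained a sense."""
    return sum(1 for w in ranking[:k] if w in gold_positive) / k

# Usage sketch: score every candidate word and rank in descending order.
# scores = {w: sense_tree_count(edges[w], clusters[w]) for w in words}
# ranking = sorted(scores, key=scores.get, reverse=True)
# for k in (5, 10, 20):
#     print(k, precision_at_k(ranking, gold_positive, k))
```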
5.4 Qualitative Evaluation

To qualitatively evaluate our approach, we take a closer look at the drill-down example for the word monitor. Our approach produced five sense trees, each of which can be matched to a single meaning of the word: hall monitor starting in the 1830s, the warship of that name starting in the 1860s (two sense trees), monitor in the technological sense of a screen starting in the 1920s, and the surveillance sense starting in the 1960s. The time periods at which the new senses emerged are accurate. For example, the warship was created in the 1860s, and that is also the time slice in which we detect that sense. One weakness of the approach is that although the earliest sense tree can easily be interpreted as the sense of hall monitor, a closer look reveals that the clusters are only matched by the word master. This shows that the matching can be influenced easily by just a few words. Another weakness is that two sense trees in fact represent the meaning of warship. The later sense trees represent distinct senses of monitor: the technological sense and the surveillance sense. However, the branches of these sense trees are not separated sufficiently.

Our graph representation of a text corpus can be used to visually explore linguistic features. Figure 2 shows the forest of sense trees for the word acid. Each of the boxes represents a set of words c_i ∈ S_w^t as defined before. Each sense tree begins with a green box; the following clusters that were matched across time slices are connected by straight lines. Interestingly, we can see the discovery of DNA in the late 19th century and LSD in the 1960s.

Fig. 2: Sense trees for the word acid across the decades 1810-2000. Green boxes represent the initial cluster for a sense; matched clusters are connected by straight lines. The clusters include, for example, {layer, cell, brain, gram, organism, spirit, weight, body}, {secretion, tissue, cytoplasm, accumulation, ribonucleic, nucleic, gland, constituent}, {thiamin, diet, vitamin, mineral, content, flour, coefficient, ascorbic, protein}, and {diethylamide}.

5.5 Limitations

Although our approach is able to identify different senses of some words, there is still room for improvement: better graph clustering algorithms and matching strategies, an evaluation of different window sizes used during the co-occurrence graph creation, and a survey to obtain a word dataset that can serve as a gold standard for evaluating temporal WSI approaches. Further extensions would make our approach more useful as an explorative tool: adjustments to the underlying language models to make our approach applicable to historical language, a quantitative comparison to word embeddings, verification of the stability of our approach, custom slicing of time periods, sensitivity to differently spelled variants of the same word, and detecting not only the birth of new word senses but also the death of word senses. Moreover, our approach struggles to create well-partitioned sense clusters for different senses, which also affects the matching strategies. Additionally, these strategies do not prevent sense drifting, and sense trees may change their meaning significantly over multiple time slices.

6 Conclusion

We proposed an approach to identify words that gained new meanings over time. Additionally, our approach is interpretable and produces intermediate results that can be used to investigate and understand how the sense gain score for a specific word was constructed.
We presented a drill down into specific words with two different visualizations that allow key components of our approach to be easily understood. It enables seeing both the created sense clusters and the sense trees, and thus allows one to find the point in time at which new senses of a word emerged. We applied our approach to COHA, which spans 200 years and is the largest corpus of its kind. We showed anecdotal evidence for the functioning of our approach by manually annotating and evaluating 112 words. We also evaluated our approach qualitatively by using our drill down to inspect the intermediate results for the word monitor. We found that our approach was able to successfully identify word senses in sense trees.

References

1. Bamler, R., Mandt, S.: Dynamic word embeddings. In: Proceedings of the International Conference on Machine Learning (ICML). pp. 380-389. JMLR Inc. and Microtome Publishing (2017)
2. Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Proceedings of the Workshop on Graph-based Methods for NLP. pp. 73-80. ACL (2006)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10) (2008)
4. Davies, M.: Expanding horizons in historical linguistics with the 400-million word corpus of historical American English. Corpora 7(2), 121-157 (2012)
5. Feuerbach, T., Riedl, M., Biemann, C.: Distributional semantics for resolving bridging mentions. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP). pp. 192-199. ACL (2015)
6. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12), 7821-7826 (2002)
7. Hope, D., Keller, B.: MaxMax: A graph-based soft clustering algorithm applied to word sense induction. In: Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing). pp. 368-381. Springer-Verlag (2013)
8. Jatowt, A., Duh, K.: A framework for analyzing semantic change of words across time. In: Proceedings of the IEEE/ACM Joint Conference on Digital Libraries (JCDL). pp. 229-238 (2014)
9. Kim, Y., Chiu, Y.I., Hanaki, K., Hegde, D., Petrov, S.: Temporal analysis of language through neural language models. In: Proceedings of the Workshop on Language Technologies and Computational Social Science. pp. 61-65. ACL (2014)
10. Kutuzov, A., Øvrelid, L., Szymanski, T., Velldal, E.: Diachronic word embeddings and semantic shifts: a survey. In: Proceedings of the International Conference on Computational Linguistics (COLING). pp. 1384-1397. ACL (2018)
11. Lau, J.H., Cook, P., McCarthy, D., Newman, D., Baldwin, T.: Word sense induction for novel sense detection. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). pp. 591-601. ACL (2012)
12. Lindén, K.: Evaluation of linguistic features for word sense disambiguation with self-organized document maps. Computers and the Humanities 38, 417-435 (2004)
13. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations (ICLR). pp. 1-12 (2013)
14.
Mitra, S., Mitra, R., Maity, S.K., Riedl, M., Biemann, C., Goyal, P., Mukherjee, A.: An automatic approach to identify word sense changes in text media across timescales. Natural Language Engineering 21(5), 773-798 (2015)
15. Navigli, R.: Word sense disambiguation: A survey. ACM Computing Surveys 41(2), 1-69 (2009)
16. Rapp, R.: Word sense discovery based on sense descriptor dissimilarity. In: Proceedings of the Machine Translation Summit (MT Summit). pp. 315-322. European Association for Machine Translation (2003)
17. Song, L., Wang, Z., Mi, H., Gildea, D.: Sense embedding learning for word sense induction. In: Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM). The *SEM Organizing Committee (2016)
18. Tahmasebi, N., Risse, T.: On the uses of word sense change for research in the digital humanities. In: Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL). pp. 246-257. Springer-Verlag (2017)
19. Ustalov, D., Panchenko, A., Biemann, C., Ponzetto, S.P.: Watset: Local-global graph clustering with applications in sense and frame induction. Computational Linguistics 45(3), 423-479 (2019)
20. Yang, H., Callan, J.: A metric-based framework for automatic taxonomy induction. In: Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP). pp. 271-279. ACL (2009)
21. Yao, Z., Sun, Y., Ding, W., Rao, N., Xiong, H.: Dynamic word embeddings for evolving semantic discovery. In: Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). pp. 673-681. ACM (2018)
22. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). pp. 189-196. ACL (1995)