=Paper=
{{Paper
|id=Vol-2989/long_paper26
|storemode=property
|title=Adjusting Scope: A Computational Approach to Case-Driven Research on Semantic Change
|pdfUrl=https://ceur-ws.org/Vol-2989/long_paper26.pdf
|volume=Vol-2989
|authors=Lauren Fonteyn,Enrique Manjavacas
|dblpUrl=https://dblp.org/rec/conf/chr/FonteynM21
}}
==Adjusting Scope: A Computational Approach to Case-Driven Research on Semantic Change==
Lauren Fonteyn, Enrique Manjavacas
Leiden University Centre for Linguistics, Department of English Language and Culture, Arsenaalstraat 1, 2311 CT, Leiden, the Netherlands

Abstract

Computational studies of semantic change are often wide in scope, aiming to capture and quantify semantic change in language at large in a data-driven, ‘hands-off’ way. Case-driven, corpus-linguistic studies of semantic change, by contrast, generally aim to tackle questions about the development of specific linguistic phenomena. Due to its narrower scope, case-driven research is more restricted in terms of the data it may employ, and at the same time it requires a more fine-grained description of the targeted linguistic developments. As a result, case-driven studies face particular methodological challenges that are not at play in more wide-scoped approaches. The aim of this paper is to set out a ‘hands-off’ computational procedure to study specific cases of semantic change. The case we address is the development of the phrasal expression to death from a literal, resultative phrase (e.g. he was beaten to death) into an intensifier (e.g. We were just pleased to death to see her). We deploy hierarchical clustering algorithms over distributed meaning representations in order to capture the evolution of the semantic space of verbs that collocate with to death. We then describe the arising diachronic processes by means of monotonic effects, providing a more accurate picture than customary linear regression models. The methodology we outline may help tackle some common challenges in the use of vector representations to study similar cases of semantic change. We end the discussion by pinpointing (remaining) challenges that case-driven research may encounter.

Keywords

Linguistics, Semantic Change, Grammaticalization, Distributional Semantics, Bayesian Modeling

1. Introduction

Over the past decade, computational approaches to semantic change have experienced a surge in popularity. This is largely due to the rise of an increasingly powerful body of models that aim to approximate the meaning of words over time by encoding their linguistic context (or ‘distributional properties’) into (diachronic) word embeddings [see, among many others: 47, 19, 35, 18, 1, 44, 46, 29, 26, 13, 51, 11, 48, 5, 50]. A characteristic of many of these studies is that their research questions are very wide in scope: their aim is not to address questions about any specific word or construction, but rather to capture and quantify some aspect of semantic change at large in a data-driven way. As such, these studies tend to approach semantic change in bulk, with sample sizes ranging from hundreds [e.g. 35, 13, 48] to thousands of linguistic items [e.g. 18], and with specific examples of semantic change predominantly serving as (straightforward) illustrations of a more general pattern or trend.

CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands. Contact: l.fonteyn@hum.leidenuniv.nl (L. Fonteyn); enrique.manjavacas@gmail.com (E. Manjavacas). ORCID: 0000-0001-5706-8418 (L. Fonteyn); 0000-0002-3942-7680 (E. Manjavacas). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Yet, vector-based models are also used in more narrow-scoped, case-driven research. In this type of research, which is perhaps most common in (corpus) linguistics, the aim is to approach a specific case study of semantic change in a largely automated, data-driven manner. The motivation for doing so is that introspective data annotation (which is not only labour-intensive, but potentially problematically subjective [47]) is avoided or minimized. Additionally, the use of vector-based models allows researchers to operationalize theoretical concepts in quantifiable terms in order to verify or falsify hypotheses on the nature and causes of semantic change in the case under scrutiny [e.g. 21, 10, 41, 40].

Despite the fact that the two approaches have an obvious common ground, case-driven investigations are clearly distinct from wide-scope computational studies in a number of ways. Most importantly, case-driven research generally emerges from a desire to tackle questions about the development of a specific linguistic phenomenon (often during a specific time window). Consequently, compared to wide-scope computational research into semantic change, case-driven research is relatively inflexible in terms of the data it may employ to attain its goals. Furthermore, the very reason why the researcher is compelled to undertake case-driven research is that the specific phenomenon under scrutiny constitutes a complex challenge. As such, while it is not uncommon for case-driven studies to use computational models and metrics proposed in wide-scoped studies on semantic change, the specific cases they scrutinize may present methodological challenges that are either not at play in, or glossed over by, the studies they draw from.

The aim of the present contribution is to tackle a case-driven study by means of computational methods. In doing so, our work ties in with earlier explorative work aimed at pinpointing where challenges may lie for case-driven research [e.g. 49]. The specific case of semantic change we address is the historical development of the phrasal expression to death from a literal, resultative phrase (e.g. he was beaten to death) into an intensifier (e.g. We were bored/pleased to death) [23, 33].

1.1. Aims

More specifically, we aim to delineate a step-wise procedure that:

1. minimizes manual work, so that it is more feasible for case-driven research to maximally exploit available data and maximize the data-driven character of case-driven research;
2. flags and discusses remaining pitfalls and challenges that future case-driven work may encounter.

1.2. Outline

To analyse the development of to death with minimal manual interference, we suggest a procedure consisting of the following steps:

1. Surveying work on the linguistic construction (and related cases) under scrutiny to (i) delineate a time window, and (ii) formulate hypotheses or expectations that can be verified by means of computational methods (Section 2);
2. Compiling (and curating) a sufficiently large diachronic corpus collection (Section 3), from which examples of the construction can be sampled (Section 3.1);
3. Computing and evaluating distributed meaning representations (Section 3.2);
4. Conducting a diachronic cluster analysis, in which we optimize the number of clusters across time for silhouette score in order to trace changes in to death’s contextual distribution (Section 4.1);
5. Conducting a sentiment analysis to capture to death’s decreasing negativity (Section 4.2);
6. Assessing the output of the statistical model against the formulated expectations (Section 5).

After describing the procedure and results, we highlight and discuss the following remaining pitfalls (Section 6):

1. Because case-driven research aims to examine a fixed (set of) linguistic construction(s) in a specific time window, researchers may run into issues of data sparsity and balance that may be difficult to circumvent.
2. The extent to which a reliable, completely automated, ‘hands-off’ approach is possible and extendable to new cases not yet analysed remains an open question. While we are optimistic about incorporating computational models into the study of case-driven semantic change, manual interference may still be desirable or even required.

2. Related Work

The methods adopted in the present study are similar to those used in large-scale computational studies of semantic change. Previous work has suggested various ways of improving the models that generate (diachronic) word embeddings [e.g. 46, 44], determining (predictive) laws of (lexical) semantic change at large [e.g. 19, 14], and developing statistical measures that help detect different types of semantic change (e.g. specification vs. broadening; cultural change vs. linguistic change) in a data-driven manner [e.g. 47, 35, 18, 11, 13, 48, 16]. In other work, computational models are applied to map changes in specific word classes (e.g. intensifiers [31]), (groups of) concepts in particular lexical domains (e.g. ‘racism’, ‘knowledge’ [49, 2]), or registers (e.g. ‘scientific language’ [5, 50]).

In terms of its focus, aims, scope and granularity, this study is reminiscent of research in corpus linguistics and construction grammar, where a single case of linguistic change is considered. The development of to death from a phrase that expresses the result of an action (e.g. He was beaten/stabbed/shot to death) to an intensifying or ‘amplifying’ expression (e.g. We were thrilled/pleased/shocked to death to see you [42]) has been described as a process of grammaticalization, which took place over the course of the Early and Late Modern English period (ca. 1500–present). As explained by Margerie [33], this grammaticalization process, in which to death developed a less literal and more ‘grammatical’ reading of amplification, crucially involved ‘host-class expansion’ [22]. More specifically, the development can be broken down into three stages.

STAGE 1 Initially, to death functioned as an adverbial complement of verbs expressing physical harm that may result in death (e.g. beat, bleed, burn, etc.).

STAGE 2 Over the course of the 16th and 17th centuries, to death sporadically started occurring in contexts where a literal, death-resulting reading is ruled out (e.g. That book bored me to death). It was not until the 18th century, however, that to death was frequently used in such non-literal, intensifying cases [33, 23]. Notably, as is common in intermediate stages of grammaticalization [25, 30], to death still retained some of its original meaning of a ‘negative end result’ [33, p. 129]. At this stage, the vast majority of its collocate verbs have negative connotations (e.g. bore, scare, worry).

STAGE 3 Despite its persistent preference for negative situations, to death started to expand further [33]. In the 19th and 20th century, to death began to combine with more positively oriented verbs (e.g. amuse, love, thrill).
The expansion process spawned by the grammaticalization of to death would seem to lend itself well to computational analysis. A template for the general research design can be found in the work of Perek [41, 40]. With an eye on quantifying processes related to host-class expansion, Perek relies on semantic vector representations of the verb types occurring in the open verb slot of the hell-construction (e.g. [beat/scare/hug] the hell out of someone) and the way-construction (e.g. [swim/beat/smile] one’s way to something), employing cluster density measures in order to quantify the diachronic process.

Crucially, Perek demonstrates that, from a linguistic perspective, it is important to approach processes of host-class expansion in a way that distinguishes changes in lexical diversity (measured by the number of unique lexical items that occur in a construction) from semantic diversity (measured by the semantic similarity between those lexical items). This is also relevant for the study of to death, because changes in lexical diversity alone may not be indicative of linguistic change, but of cultural change. It may be the case, for instance, that different modes of execution have become prevalent or obsolete, or that the specificity and lexical diversity with which causes of death are described may increase or decrease as the topic becomes more or less taboo. In these scenarios, the set of lexical items to death collocates with may indeed shrink or expand, while the semantics of the phrase remain stable. At the same time, such cultural change may happen alongside the grammaticalization of to death into an intensifier. Thus, the reality of case-driven research may be that the distinction between cultural and linguistic change is not a matter of ‘either/or’ [18], but of ‘and’.

The distributed meaning representations that are fed into the clustering algorithm by Perek [41, 40] do, however, fail to distinguish synonymy (e.g. hate & despise) from antonymy (e.g. hate & love). Hence, they will not capture the final stage of expansion of to death and other intensifying constructions, which commonly involves an erosion of their original negative (or positive) polarity [30].

3. Data

For the purposes of the present study, we gathered a collection of diachronic English corpora, spanning the period from 1550 to 1949. These corpora include Early English Books Online (EEBO), the Corpus of Late Modern English Texts (version 3.1; CLMET3.1), the Evans Early American Imprints Collection (EVANS), Eighteenth Century Collections Online (ECCO), the Corpus of Historical American English (COHA), and the Hansard corpus (Hansard). In terms of text types, these corpora are varied, covering an array of literary works, religious and legal texts, and news reports. The sole exception is Hansard, which offers transcriptions of British parliamentary debates (starting in 1800).

All corpora were submitted to the following pre-processing pipeline. First, we applied a language identification module in order to sort out foreign text. We relied on two language identification modules – Google’s Compact Language Identifier (v3)¹ and the FastText Language Identification system [17] – which we combined to maximize the retrieval precision of the foreign text. For a given fragment of 500 characters, we flagged the text as foreign if both systems indicated a language other than English as the highest-probability language. Manual inspection of a random sample indicated a sufficiently low false positive rate for the filtering to be effective (while throwing out an insignificant amount of English text).

¹ The Compact Language Identifier is available at: https://github.com/google/cld3/releases/tag/3.0.13.
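The dual-filter rule is straightforward to reproduce. The following is a minimal sketch, not the authors’ code, assuming the gcld3 and fasttext Python bindings and a downloaded copy of the pretrained fastText language-ID model (lid.176.bin); all names are ours:

```python
import gcld3
import fasttext

# assumed setup: gcld3 and fasttext installed, plus the pretrained
# fastText language-ID model lid.176.bin in the working directory
cld3 = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
ft = fasttext.load_model("lid.176.bin")

def is_foreign(fragment: str) -> bool:
    """Flag a ~500-character fragment as foreign only if BOTH systems
    rank a language other than English as most probable."""
    cld3_lang = cld3.FindLanguage(text=fragment).language
    ft_labels, _ = ft.predict(fragment.replace("\n", " "))
    ft_lang = ft_labels[0].replace("__label__", "")
    return cld3_lang != "en" and ft_lang != "en"
```

Requiring agreement between the two systems trades recall for precision, which matches the stated goal of maximizing the retrieval precision of foreign text.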
Second, we tokenized and sentence-tokenized the remaining text using the Punkt tokenizers provided by the NLTK package [4]. After tokenization, we enriched all text with part-of-speech tags, using an in-house tagger for historical English. The tagger was trained on the PCEEME [36] – a corpus of letters from 1410 to 1695 that amounts to about 2.2M labelled tokens – using a neural Conditional Random Field (CRF) tagger implemented with PIE [32], and obtained an overall test-set accuracy above 96%. The resulting patchwork corpus consists of a total of 3.9B tokens, which we utilized in various ways in subsequent steps of the research process.

3.1. Dataset: to death

The attestations of to death were retrieved from the corpus collection (excepting the specialized Hansard corpus). As is common in linguistic research, the data was divided into fixed-width bins. Each bin represents a 50-year period, which results in a total of 8 bins. As not all corpora in the collection are balanced in terms of the amount of text a single author may contribute, we applied an additional sampling step to ensure that no author contributed more than 25% of the instances in a particular bin. The total number of instances retrieved from each corpus per bin is listed in Table 1.

Table 1: Distribution of to death per bin (by corpus) and verb type frequency in the sample (last row).

| Corpus    | 1550 | 1600 | 1650 | 1700 | 1750 | 1800 | 1850 | 1900 | Total |
|-----------|------|------|------|------|------|------|------|------|-------|
| CLMET3.1  |      |      |      |   39 |   45 |  182 |  100 |   26 |   392 |
| COHA      |      |      |      |      |      |  488 |  700 |  774 |  2764 |
| ECCO      |      |      |      |   78 |  395 |   12 |      |      |   485 |
| EEBO      |  800 |  800 |  794 |  413 |    2 |      |      |      |  2859 |
| EVANS     |      |      |    6 |  211 |  360 |  116 |      |      |   693 |
| Total     |  800 |  800 |  800 |  741 |  800 |  800 |  800 |  800 |  7193 |
| Type Freq |   87 |  101 |   93 |   95 |   97 |  150 |  131 |  135 |   372 |

In the bin covering the period between 1700 and 1749, the total corpus size (and hence, the token frequency of to death) was substantially lower than for other bins. To ensure that any observed differences in the number of verb types that collocate with to death across bins are not affected by large differences in sample size, we decided to cap the maximum number of tokens sampled per bin at 800.

After removing any duplicates, we identified the verb that collocates with each instance of to death by relying on part-of-speech tags. Each instance of to death was assumed to collocate with the verb in closest proximity (using a window of 15 words). In a number of cases, the tagger failed to find a collocate verb. These cases included instances where the copula be was used in combination with an adjective (e.g. be frozen/sick to death), which were subsequently corrected and included in the dataset. Cases where to death functioned as a prepositional modifier of a noun (e.g. on her way to death), fixed expressions (e.g. from birth to death, be nigh to death), and cases where the verb was illegible (e.g. And when my mother euen before my sighte, Was (-) to death; 1550, EEBO) were discarded. In total, 109 examples were discarded.
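The nearest-verb heuristic can be sketched in a few lines. In the sketch below, the (token, POS) representation and the Penn-style verb tags are our assumptions for illustration, not necessarily the tagset of the in-house tagger:

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # assumed tagset

def nearest_verb(tagged: list[tuple[str, str]], hit: int, window: int = 15):
    """Return the verb closest to the `to death` hit at token index `hit`,
    searching up to `window` tokens in either direction."""
    candidates = [
        (abs(i - hit), tok.lower())
        for i, (tok, pos) in enumerate(tagged)
        if pos in VERB_TAGS and abs(i - hit) <= window
    ]
    return min(candidates)[1] if candidates else None

sent = [("he", "PRP"), ("was", "VBD"), ("beaten", "VBN"),
        ("to", "TO"), ("death", "NN")]
print(nearest_verb(sent, hit=3))  # -> 'beaten'
```

As the paper notes, this heuristic fails for copula-plus-adjective structures and for cases where the nearest verb form is not the actual collocate, which is where the manual corrections described above come in.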
3.2. Word Embeddings

In order to capture semantic similarity between to death’s collocate verbs across time, we rely on distributed meaning representations computed by the word2vec algorithm [34]. We use the entire corpus collection introduced in Section 3. Besides the pre-processing pipeline outlined in Section 3, we applied the following additional pre-processing steps with the goal of improving the quality of the resulting embeddings: we lower-cased the corpora, applied NFKD unicode normalization, removed non-alphanumeric tokens, replaced numbers by a placeholder code, dropped punctuation, and substituted the long “s” character (ſ) with the modern-day “s”. We trained distributed representations with a size of 200 using the gensim library [43]. We employed the skip-gram objective, approximated with negative sampling and optimized using a learning rate of 0.025 over 5 epochs, discarding words with frequencies lower than 50 and using a window size of 20 tokens.

In order to validate the resulting embedding space, we ran a number of semantic similarity benchmarks, which allow us to contextualize the quality of our embeddings within the state-of-the-art. The employed benchmark datasets comprise sets of Present-day English word pairs, each of which has been manually assigned a similarity score. The evaluation proceeds by correlating these human judgments with the cosine similarities between the corresponding vector representations, using the Spearman correlation coefficient.² We compared our embedding space with (i) 200-dimensional GloVe vectors [39] trained on 6B Wikipedia tokens,³ as well as (ii) 300-dimensional word2vec vectors trained on the Google News dataset (about 100B tokens), restricting the vocabulary of the embedding spaces to the intersection across spaces and using the average word embedding vector for out-of-vocabulary words.⁴

Table 2: Word embedding benchmark results for the utilized word embedding space in comparison to off-the-shelf Present-day English spaces.

|          | MEN   | WS353 | SimLex999 | MTurk | RW    | RG65  | Mean  |
|----------|-------|-------|-----------|-------|-------|-------|-------|
| GloVe    | 0.608 | 0.399 | 0.331     | 0.513 | 0.283 | 0.736 | 0.478 |
| Word2Vec | 0.708 | 0.605 | 0.414     | 0.645 | 0.378 | 0.746 | 0.583 |
| Ours     | 0.555 | 0.504 | 0.338     | 0.481 | 0.241 | 0.731 | 0.475 |

As Table 2 shows, our embedding space generates scores comparable to the GloVe space, while lagging behind those generated by the word2vec space. Considering that our embedding space is trained on a smaller dataset and covers a large period of historical English, we take these results to validate the semantic similarity properties of the inferred word representations. As a sanity check, Table 3 shows the 10 nearest neighbours of a selection of verbs from our dataset of to death collocates based on cosine distance.

² While it is obviously not ideal to evaluate our model with respect to a Present-day English reference point, no human similarity judgements of this scale are available for historical English. In order to conduct at least some sort of sanity check, we used the off-the-shelf Present-day English spaces.
³ The embeddings are available through the following url: https://nlp.stanford.edu/projects/glove/.
⁴ We use the software package word-embedding-benchmarks [27] in order to streamline the evaluation of the embedding spaces.

Table 3: Top 10 nearest neighbours (cosine) of burn, stab, whip (physical actions) and amuse, scare, vex (mental verbs) in the collocate dataset.

|    | burn          | stab           | whip           | amuse          | scare           | vex            |
|----|---------------|----------------|----------------|----------------|-----------------|----------------|
| 1  | beat (0.57)   | strangle (0.59)| cudgel (0.69)  | delude (0.73)  | frighten (0.78) | afflict (0.72) |
| 2  | kill (0.57)   | knife (0.59)   | bludgeon (0.66)| flatter (0.63) | terrify (0.73)  | perplex (0.72) |
| 3  | consume (0.56)| bleed (0.58)   | lash (0.66)    | perplex (0.61) | startle (0.67)  | harass (0.71)  |
| 4  | scorch (0.55) | slash (0.58)   | kick (0.59)    | terrify (0.60) | worry (0.55)    | annoy (0.69)   |
| 5  | shoot (0.55)  | bang (0.56)    | cuff (0.57)    | frighten (0.60)| drive (0.54)    | oppress (0.69) |
| 6  | spoil (0.53)  | kill (0.55)    | spur (0.57)    | tickle (0.58)  | sweep (0.52)    | fret (0.67)    |
| 7  | smother (0.53)| poison (0.55)  | flog (0.56)    | harass (0.54)  | delude (0.51)   | grieve (0.64)  |
| 8  | smoke (0.53)  | bite (0.55)    | bang (0.55)    | tire (0.54)    | astonish (0.51) | terrify (0.61) |
| 9  | hunt (0.53)   | cudgel (0.54)  | goad (0.55)    | annoy (0.52)   | annoy (0.50)    | pester (0.60)  |
| 10 | hang (0.53)   | prick (0.54)   | scourge (0.54) | vex (0.51)     | amuse (0.50)    | worry (0.58)   |
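For reference, the training configuration described in this section maps almost directly onto gensim’s Word2Vec class (4.x parameter names). The negative-sampling count is not reported in the paper, so the sketch below keeps gensim’s default; `sentences` stands in for the pre-processed corpus:

```python
from gensim.models import Word2Vec

# hyper-parameters as reported above; `sentences` is assumed to be an
# iterable over pre-processed, tokenized sentences
model = Word2Vec(
    sentences=sentences,
    vector_size=200,   # embedding dimensionality
    sg=1,              # skip-gram objective
    negative=5,        # negative sampling (gensim default; count not reported)
    alpha=0.025,       # initial learning rate
    epochs=5,
    min_count=50,      # discard words with frequency < 50
    window=20,
)

# nearest-neighbour queries of the kind shown in Table 3
model.wv.most_similar("amuse", topn=10)
```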
4. Method

A basic way of quantifying the host-class expansion of to death is by examining the change in diversity in the set of attested collocate verbs over time. One such index of diversity is given by type frequency – shown in the last row of Table 1. However, while such diversification is potentially indicative of host-class expansion, changes in type frequencies (or lexical diversity) need not indicate that to death has indeed undergone semantic change; as argued in Section 2, they may equally be indicative of cultural change.

To probe into the host-class expansion of to death, Section 4.1 operationalizes the process as a change in the structure of the semantic space that the collocate verbs of to death occupy. We rely on hierarchical cluster analysis over distributed meaning representations in order to not only incorporate a notion of lexical diversity into the analysis but, crucially, also take ‘semantic diversity’ into account.

As explained in Section 2, the host-class expansion of to death also involved increased co-occurrence with verbs with progressively more positive connotations. In order to capture this process, Section 4.2 devises a way to quantify the average polarity of verbs over time using word embeddings, and statistically describe any existing ‘positivization’ process.
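Type frequency, the baseline diversity index just mentioned, amounts to counting unique collocate verbs per bin. A trivial sketch with pandas, using hypothetical column names:

```python
import pandas as pd

# one row per attestation of `to death`; column names are hypothetical
df = pd.DataFrame({
    "bin":  [1550, 1550, 1600, 1600, 1600],
    "verb": ["beat", "burn", "beat", "scare", "scare"],
})

type_freq = df.groupby("bin")["verb"].nunique()   # unique collocate verbs per bin
token_freq = df.groupby("bin")["verb"].size()     # total attestations per bin
```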
4.1. Cluster Analysis

At any given period, we inspect the semantic space delineated by the distribution of attested verbs using a hierarchical cluster analysis. A known problem with automated cluster analysis of semantic spaces is that the induced semantic clusters are not always easy to interpret. As a result, their application in subsequent steps of the research workflow may require manual fine-tuning and post-filtering [41, 40] to ensure that clusters are meaningful before any measures of interest can be computed. In contrast, the chosen procedure dispenses with manual fine-tuning and inspection of the resulting clusters. First, we identify a clustering metric that aligns with the expectations of the host-class expansion process. Secondly, we automatically find the hyper-parameter values that optimize the selected clustering metric. Finally, we treat these optimal values as statistical correlates of the host-class expansion process that we are ultimately interested in describing.

As to death develops new, non-literal meanings, we expect the semantic space defined by the verbs appearing in this construction to expand, with existing clusters of collocates becoming denser and new clusters representing novel semantic fields starting to form. The silhouette score [45] – a common clustering evaluation metric – may help determine whether this expectation holds up. More specifically, we use the optimal number of clusters based on the silhouette score as the target statistic for monitoring the process. For a word w_i assigned to cluster C_i, the silhouette score decomposes into the quantities a(w_i) – shown in Equation 1 – and b(w_i) – shown in Equation 2. By measuring the average intra-cluster distance between a word and all other words in the same cluster, a(w_i) captures the tightness of the clusters induced by a clustering algorithm. In contrast, b(w_i) measures the distance to the nearest point in a different cluster. The dataset-level aggregate of b(w_i) thus captures the overall separation between clusters.

a(w_i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, j \neq i} \mathrm{cosdist}(w_i, w_j) \quad (1)

b(w_i) = \min_{k \neq i} \sum_{j \in C_k} \mathrm{cosdist}(w_i, w_j) \quad (2)

The final silhouette score for a given instance is computed by aggregating both quantities, dividing by a normalizing factor to ensure a constant output range between -1 and 1 – as shown in Equation 3.

s(w_i) = \frac{b(w_i) - a(w_i)}{\max(a(w_i), b(w_i))} \quad (3)

One risk linked to the presented methodology is that the optimal number of clusters may increase simply because the number of unique verb types in the sample has increased (as shown in Section 3) – i.e. regardless of the semantic composition of the space representing that bin. Thus, increases in the optimal number of clusters can be due to sampling artifacts – an issue that becomes even more likely with the fat-tailed distributions that are common in linguistic data. Moreover, even in the absence of sampling artifacts, we must ensure that we are not simply measuring increases in type frequency-based diversity, which, as already argued, are not necessarily indicative of linguistic change.

In order to remedy the aforementioned issue, we employ the following bootstrap procedure. For each period, we sample 500 verbs with replacement from the multinomial distribution observed in the dataset and compute the optimal number of clusters based on the silhouette score. Repeating this process 1,000 times per period yields a dataset with 8,000 observations (i.e. for 8 periods), which we submit to statistical analysis in order to quantify the effect of time on the optimal number of clusters. Crucially, we record the total number of distinct verbs sampled in each bootstrap iteration, which allows us to statistically control for the effect of population size on the obtained optimal number of clusters. We rely on hierarchical (agglomerative) clustering using the cosine similarity and complete linkage,⁵ and optimize the number of clusters by inspecting the silhouette scores at different nodes in the induced merge tree until reaching the merge step that maximizes the silhouette score.⁶

⁵ We made these choices on the basis of a single manual scan of the interpretability of the clusters induced from the verbs in the entire dataset.
⁶ We use the reference implementations provided by the Python library scikit-learn [38].
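A compact approximation of this procedure with scikit-learn is sketched below. One simplification relative to the paper: instead of walking the induced merge tree, the sketch re-fits the clustering for each candidate number of clusters and keeps the silhouette-maximizing one, which yields the same target statistic at higher computational cost. The `metric` keyword assumes scikit-learn >= 1.2 (older versions call it `affinity`):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def optimal_n_clusters(vectors: np.ndarray, max_k: int = 40) -> int:
    """Number of clusters maximizing the silhouette score under
    cosine-distance agglomerative clustering with complete linkage."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_k, len(vectors) - 1) + 1):
        labels = AgglomerativeClustering(
            n_clusters=k, metric="cosine", linkage="complete"
        ).fit_predict(vectors)
        score = silhouette_score(vectors, labels, metric="cosine")
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def bootstrap_period(verb_tokens, wv, n_iter=1000, sample_size=500, seed=0):
    """Resample verb tokens with replacement, recording the optimal number
    of clusters and the number of distinct verb types per draw."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_iter):
        sample = rng.choice(verb_tokens, size=sample_size, replace=True)
        types = sorted(set(sample))
        vectors = np.vstack([wv[t] for t in types])
        rows.append({"n_types": len(types),
                     "k_opt": optimal_n_clusters(vectors)})
    return rows
```

Sampling tokens (rather than types) with replacement preserves the multinomial frequency distribution of the bin, so frequent verbs are drawn more often, as described above.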
4.2. Sentiment Analysis

Homing in on the increasing positivity of to death, we leverage the embedding space described in Section 3.2 in order to capture the sentiment polarity of the sampled verbs. Differences in sentiment are not straightforwardly captured by means of hierarchical clustering, as antonyms are represented by highly similar vectors. In Table 3, for instance, the positive mental verb amuse is recognized as being similar to more negative mental verbs like delude and terrify, as well as to its antonyms annoy and vex. The cluster analysis is therefore supplemented by means of sentiment scores.

A first approach to inducing word-level sentiment scores is to exploit the proximity of a given verb vector to the vectors for the words ‘good’ and ‘bad’: the closer to the vector for ‘good’, the more positive the sentiment of that verb. However, similar confounding effects from antonyms make this approach unfeasible. Indeed, in common word embedding spaces the vectors for ‘good’ and ‘bad’ tend to be located in the proximity of each other, and thus lack discriminative power for classifying words with respect to their sentiment.

In order to tackle this issue, post-hoc modifications of the embedding space such as retrofitting [15] or word embedding refinement [53] could allow us to leverage sentiment lexicons in order to ensure the desired property. In the present work, however, we dispense with the manual work that such an approach would require and resort to a second-order approach that induces sentiment scores on the basis of the proximity of verbs to a filtered list of nearest neighbours of ‘good’ and ‘bad’. By manually filtering these lists, we avoid terms that may confound the polarities, while still keeping the manual work to a minimum. More specifically, we sift through the vocabulary in ranked order by cosine similarity to “good” and “bad”, and discard confounding words until reaching a total of 20 words per polarity.⁷ For a given word w_i, we then compute its sentiment score as shown in Equation 4:

S(w_i) = \frac{1}{|N_{\mathrm{good}}|} \sum_{w_j \in N_{\mathrm{good}}} \cos(w_i, w_j) - \frac{1}{|N_{\mathrm{bad}}|} \sum_{w_j \in N_{\mathrm{bad}}} \cos(w_i, w_j) \quad (4)

where N_good and N_bad refer, respectively, to the filtered sets of nearest neighbours of ‘good’ and ‘bad’.

To test the effect of time on the polarity of to death’s collocates, we assign each verb in the dataset to the bin where it is first attested. Given that grammaticalizing structures often retain their original function, it may well be that the well-established negative use of to death vastly outnumbers and hence overshadows cases where to death has expanded to intensify new, more positive verbs. Thus, we suggest that working with the sentiment of collocate verbs that were first attested in a given bin – rather than the distribution of sentiment in each bin – captures the ongoing changes more directly and robustly.

⁷ These filtered nearest neighbours were checked in order to avoid overly specific terms with unstable sentiment polarity over time. For example, the top 5 neighbours of ‘good’ were ‘better’, ‘excellent’, ‘great’, ‘well’ and ‘best’, while the top 5 neighbours of ‘bad’ were ‘dangerous’, ‘ill’, ‘inefficient’, ‘wrong’ and ‘hard’.
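Equation 4 reduces to a difference of two mean cosine similarities. A minimal sketch, reusing the gensim vectors (wv) from Section 3.2; the anchor lists shown are only the top-5 examples from the footnote, not the full 20-word lists used in the paper:

```python
import numpy as np

def sentiment_score(word, wv, n_good, n_bad):
    """Equation 4: mean cosine similarity to the filtered neighbours of
    'good' minus mean cosine similarity to the neighbours of 'bad'."""
    def mean_cos(anchors):
        return float(np.mean([wv.similarity(word, a) for a in anchors]))
    return mean_cos(n_good) - mean_cos(n_bad)

# hand-filtered anchor lists (truncated to the examples given above)
n_good = ["better", "excellent", "great", "well", "best"]
n_bad = ["dangerous", "ill", "inefficient", "wrong", "hard"]
score = sentiment_score("amuse", model.wv, n_good, n_bad)
```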
4.3. Statistical Modeling

In order to assess the effect of time on the semantic structure of the attested verbs, as well as on the overall sentiment, we fit linear regression models regressing the target outcome – i.e. the optimal number of clusters or the sentiment score – on the time period. We use a Gaussian likelihood for both outcomes.

A further modeling choice we make is to incorporate time period as a monotonic effect – and not as, for instance, an ordinary linear predictor. This choice is motivated by the fact that diachronic processes in language structure often result in patterns that resemble s-curves [12, 6]. In these patterns, the magnitude of the predictor’s effect varies over time, a fact that cannot be described by ordinary linear predictors. A monotonic predictor shares with a linear predictor the assumption that the direction of the effect is constant – strictly positive or negative – while allowing differences in the effect over adjacent time periods.

Our implementation of the monotonic predictor follows Bürkner and Charpentier [8]. For a given predictor with n possible categories (in our case, this corresponds to 8 time bins) to be modelled as a monotonic effect, this approach introduces n-1 parameters \zeta_i such that \zeta_i \in [0, 1] and \sum_{i=1}^{n-1} \zeta_i = 1, keeping \zeta_0 fixed at 0. For a given observation of the j-th time bin, the monotonic predictor term \eta is given by Equation 5:

\eta = b \sum_{i=1}^{j} \zeta_i \quad (5)

Here, b corresponds to an ordinary linear coefficient, representing in this case the direction and size of the effect on the outcome, and the individual \zeta_i represent the normalized distances between consecutive predictor categories. The predictor term \eta is then included in the linear model in the usual way: y = a + \eta. When fitted, this kind of monotonic predictor can easily be interpreted by inspecting the values assigned to the \zeta_i parameters, since these correspond to the relative increase of each category with respect to the total increase involved by the monotonic predictor.
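The models themselves are fitted with brms in R (Section 5). Purely to illustrate the parameterization in Equation 5, here is a rough Python analogue on simulated data, assuming the PyMC library; a Dirichlet prior supplies the simplex constraint on the ζ parameters:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

rng = np.random.default_rng(1)
bins = rng.integers(0, 8, size=400)                     # time-bin index j = 0..7
y_obs = 24 + 10 * (bins >= 5) + rng.normal(0, 1, 400)   # toy s-curve outcome

with pm.Model():
    # n-1 simplex-constrained parameters zeta_1..zeta_7; zeta_0 is fixed at 0
    zeta = pm.Dirichlet("zeta", a=np.ones(7))
    b = pm.Normal("b", 0, 10)            # direction and size of the effect
    a = pm.Normal("a", 0, 20)            # intercept
    sigma = pm.HalfNormal("sigma", 5)
    # mo(j) = sum_{i<=j} zeta_i, so b * mo(j) realizes Equation 5
    mo = pt.concatenate([pt.zeros(1), pt.cumsum(zeta)])
    pm.Normal("y", mu=a + b * mo[bins], sigma=sigma, observed=y_obs)
    idata = pm.sample(1000, tune=1000)
```

After sampling, the posterior of zeta shows which transitions between adjacent bins carry most of the total change, mirroring the interpretation given above.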
5. Results

We deploy a Bayesian regression framework, which allows us to inspect the uncertainty in the statistical parameters of interest in an intuitive, probabilistic manner. We fit our models using the Hamiltonian Monte Carlo sampler provided by the stan library [9] through the R package brms [7].

5.1. Cluster Analysis

In order to test the monotonicity of the effect, we compare a linear model of the effect of time period on the optimal number of clusters – LINEAR(P) – with the monotonic effect model – MONO(P). Moreover, in order to control for the effect of the size of the sampled population on the outcome, we fit additional models including the number of unique verbs in the bootstrap sample as a predictor – LINEAR(P)+S and MONO(P)+S. We compare the four models using the Widely Applicable Information Criterion (WAIC), which estimates the plausibility of the models in terms of both predictive performance and model complexity (cf. overfitting).

The results of the comparison are shown in the top rows of Table 4. Including time period as a monotonic effect improves the predictive power of the model over the linear effect. Moreover, controlling for sample size is even more important, as evidenced by the fact that including it results in a larger improvement in WAIC than modeling period as a monotonic effect.

Table 4: Comparison of statistical models of optimal number of clusters and polarity using the WAIC criterion (lower is better). Besides absolute WAIC, we also show an estimate of the effective number of parameters (P), the difference in WAIC (WAICΔ(SE)) and the model weight (Weight), quantifying the relative value of each model with respect to the remaining models.

| Outcome  | Model       | WAIC (SE)      | P    | WAICΔ (SE)         | Weight |
|----------|-------------|----------------|------|--------------------|--------|
| Clusters | MONO(P)+S   | 50,777 (137)   | 5.73 |                    | 1.00   |
|          | LINEAR(P)+S | 50,995 (136)   | 4.30 | -218.42 (25.10)    | 0.00   |
|          | MONO(P)     | 52,211 (132)   | 4.86 | -1,434.14 (71.02)  | 0.00   |
|          | LINEAR(P)   | 55,780 (128)   | 2.75 | -5,003.14 (117.48) | 0.00   |
| Polarity | MONO(P)     | 784.82 (25.08) | 3.73 |                    | 0.79   |
|          | LINEAR(P)   | 787.52 (25.03) | 3.18 | -2.7 (1.38)        | 0.21   |

Using the most strongly predictive model – i.e. MONO(P)+S – we can visualize the (monotonic) effect of time period on the optimal number of clusters using the posterior predictive distribution.

[Figure 1: three-panel plot; x-axis: Period (1550–1900), y-axis: Optimal number of clusters; credible intervals 0.99/0.89/0.5.] Posterior predictive distribution of the optimal number of clusters by period, showing different credible intervals, while varying the sample size over 59 (left), 71 (middle) and 98 (right) items, corresponding respectively to the 10%, 50% and 90% percentiles. The visualization is based on 200 samples from the MCMC posterior draws.

Figure 1 depicts the posterior predictive distribution of the optimal number of clusters using a counter-factual triptych plot, statistically controlling for the sample size at different percentiles. Overall, we observe a clear monotonic effect, resembling an s-curve, with a leap starting in the 1750 bin. The shape of the effect remains stable across the three sample size percentiles. Due to the positive linear effect of sample size on the optimal number of clusters, the range of the outcome (i.e. the y-axis) increases across the plots in the triptych. Moreover, the distribution of uncertainty varies from plot to plot. At smaller sample sizes, the uncertainty in the predicted number of clusters is larger towards the later time bins, whereas for larger sample sizes the most uncertain predictions come from the earlier bins. This is likely due to the fact – depicted in Figure 2 – that the sample size in pre-1800 bins is always smaller than in post-1800 bins. However, by counter-factually controlling for sample size, we can observe that the statistical model predicts a constant effect shape regardless of the sample size.

[Figure 2: scatter plot; x-axis: Sample Size (60–120), y-axis: Residuals; points coloured by Period < 1800 (TRUE/FALSE).] Residuals of the model by sample size. Color highlights are used to distinguish pre-1800 and post-1800 observations. Despite the increase in sample size starting in 1800, residuals do not seem to be correlated with sample size.

5.2. Sentiment Analysis

Similarly to the experiments in Section 5.1, we now compare the effect of time period on sentiment using a linear predictor – LINEAR(P) – and a monotonic effect – MONO(P). We use the standardized average sentiment polarity of the verbs as the outcome. The results in terms of WAIC are shown in the bottom rows of Table 4. Modeling time with a monotonic effect produces an improvement over the linear predictor, although in this case the difference with respect to the linear effect model is smaller than in the cluster analysis experiments.

[Figure 3: two panels; x-axis: Period (1600–1900), y-axis: Sentiment; credible intervals 0.99/0.89/0.5.] Posterior predictive distribution of the statistical model of sentiment using time period as monotonic effect (left); posterior predictive distribution with overlaid empirical observations (right).

The left plot shown in Figure 3 does indicate a slight jump starting in the 1750 bin. However, the large credible intervals observed do not rule out a merely linear effect.
Moreover, as the plot in the right-hand side of Figure 3 shows, a considerable amount of variance in the dataset is left unexplained by the model. While statistically controlling for other predictors – such as, for example, document topic or genre – could improve the fit, the current model does show a predominantly linear upward effect of moderate size – about 1 standard deviation – of time on average sentiment.

6. Discussion

The results of the statistical analyses are in line with expectations in that the optimal number of verb clusters increases substantially over the course of the 18th century, when the meaning of to death expanded to non-literal, intensifying uses (STAGE 2). The predicted shift away from negative polarity (STAGE 3) also appears to be captured by the statistical model, albeit weakly. Still, as even in Present-day English to death is predominantly attested with negative collocates [33], the weak trend aligns well with the pathway outlined in Section 2. All in all, then, the procedure adopted here is promising for future case-driven, ‘hands-off’ investigations. With an eye on aiding future applications of the models and methods adopted in the present study, we highlight some important remaining problems.

6.1. Data sparsity and balance

While there is no shortage of historical English corpora, corpora that span all the way from the Early Modern period up to Present-day English are rare. A notable exception is the suite of the Penn-Helsinki Corpora [28], which, although wide in scope, is still a corpus collection very limited in size, and thus also in its use for the ‘data-hungry’ models that are currently employed in computational studies of semantic change.⁸ To maximize sample sizes, this study (following Margerie [33]) resorted to combining large corpora covering different time windows. An issue with this patchwork solution is that individual time bins are likely not represented by a comparable number of texts and text types, which may have consequences at later steps in the procedure. In the present case, the patchwork corpus suffered from data sparsity in the 1700-1749 bin, which in turn forced us to cap the maximum number of tokens per bin. Furthermore, because of the inconsistency with which text types are labelled across the different corpora, it is very difficult if not impossible to smoothly ensure register and genre consistency across bins. For the present case, such text type inconsistency is indeed very unfortunate: the time bin in which the host-class expansion of to death took off also appears to be the time bin in which the COHA corpus starts, which introduces newspaper and magazine texts into the sample.⁹ At the same time, some of the text collections included in the patchwork corpus (such as ECCO) may contain reprints of older texts, which may have led to an overrepresentation of older usages in certain time bins. As such, a limitation of the procedure presented here is that it devotes relatively limited attention to balancing data and/or controlling for text and text type variation across time bins. A possible solution could be to refrain from working with corpus patchworks, and turn to the Google Books Corpus (1500-2008) or other large library dumps. Yet, even then, issues of overrepresentation (and mislabelling) of texts and text types may remain [52, 37].¹⁰ In short, a substantial challenge for case-driven research is that the careful data curation it requires may lead to an impasse.

⁸ A somewhat larger corpus covering a very wide time span is the OED quotation database, estimated at 35M words [24]. Besides its modest size (and very few attestations of to death [33]), the OED quotations database is affected by balancing issues similar to those described for the Google Books Corpus.
⁹ In the 1800 bin, no tokens are included from newspaper texts, and 82 out of 800 tokens (10.25%) were found in magazine texts.
¹⁰ Additionally, even diachronic trends in balanced diachronic corpora may in a strict sense also be artefacts, as genres and registers are themselves subject to change. With respect to newspaper and magazine text, for instance, it has been shown that the changing “readerships and purposes of magazines versus newspapers result in different historical-linguistic patterns of use” [3].
Even when following strict procedural guidelines and sanity checks [49, 14], artefactual results are still possible when data is uncurated or poorly balanced (as also discussed for lexical semantic change in Hengchen et al. [20]). Furthermore, introducing such balance may not be easy (or even possible), and it may also impact sample size (which complicates the study of more infrequent phenomena).

6.2. Minimizing manual interference

The discussion of the extent to which manual interference is needed or desirable in case-driven studies of semantic change is far from trivial and, ultimately, still undecided. In the spirit of the ‘data-drivenness’ discourse in preceding work [e.g. 41, 16], the procedure presented here aimed to minimize manual filtering and annotation – but such manual interference has not been entirely absent. In collecting the collocate verbs of to death, for instance, a substantial number of cases involved structures where the collocate of interest is not the verb be but its accompanying adjective. Similarly, cases where the verb form in closest proximity to to death was not its actual collocate (e.g. we could prevent Scipio from pummelling the dreaded wizard to death, COHA 1840) were corrected manually.

Furthermore, there are various points where further ‘manual meddling’ could be considered. In many instances that were retained in the dataset, to death has neither resultative nor intensifying meaning.¹¹ Given the limited relevance of such cases and their potential effects on the output of the statistical analyses, it may be worth flagging or even excluding them from the dataset, as done in Margerie [33] and Perek [40]. Yet, such actions do involve elaborate manual annotation, and potentially introduce annotator judgments into the procedure that may diminish the ‘data-driven’ character of the study.¹²

Finally, with respect to the cluster analysis, a fully hands-off approach also implies that we trust the word embedding space to reflect meaningful semantic parameters, and that the resulting clusters capture, at least roughly, relevant properties of the underlying process. In the present case, it is reassuring to see that the narrative that emerges from our data analysis appears to align with what earlier linguistic research has proposed. However, it is not guaranteed that the verb clusters that were fed into the statistical analysis correspond with the semantic verb classes proposed in earlier research (e.g. actions of physical harm vs. mental verbs), or even with any groupings that are meaningful to humans. Additionally, while the bootstrapping procedure described in Section 4.1 renders the procedure more robust, it also makes it more difficult to examine which verbs constitute which cluster at which points in time. Given this lack of full transparency, the (as of yet unanswered) question becomes how to progress towards a method that reliably and robustly supports exploratory data analysis in cases not yet analyzed [also see 49, 20], and to what extent limiting manual involvement to an absolute minimum is warranted in specific case-driven studies.

¹¹ For instance, positive verb collocates such as love are attested earlier than expected in examples such as He swore he wou’d love me to death (EEBO, 1700), where to death most likely functions as a time adverbial (‘he swore he’d love me until death’) and not as a resultative (‘he swore he’d love me resulting in death’) or an intensifier (‘he swore he’d love me a lot’). These structures could, of course, have contributed to to death’s acquisition of intensifying meaning (loving someone until death implies loving them a lot), and hence be relevant to include. In other cases, however, the relevance of the query hit in relation to the semantic development described appears to be much less clear (e.g. the first who turns his back to death; EEBO, 1800).
¹² A possible, yet costly solution here would be to rely on multiple annotators, preferably with expertise in the historical language variety at hand [20].

7. Conclusion and Future Outlook

Drawing on the vast (and growing) body of computational research on semantic change, this study examined how computational models can be employed to track the host-class expansion of grammaticalizing constructions, such as to death. By adjusting our scope to one specific and
Additionally, while the bootstrapping procedure described in Section 4.1 renders the procedure more robust, it also makes it more difficult to examine which verbs constitute what cluster at which points in time. Given this lack of full transparency, the (as of yet unanswered) question becomes how to progress towards a method that reliably and robustly supports exploratory data analysis in cases not yet analyzed [also see 49, 20], and to what extent limiting manual involvement to an absolute minimum is warranted in specific case-driven studies. 7. Conclusion and Future Outlook Drawing on the vast (and growing) body of computational research on semantic change, this study examined how computational models can be employed to track the host class expansion of grammaticalizing constructions, such as to death. By adjusting our scope to one specific and 11 For instance, positive verb collocates such as love are attested earlier than expected in examples such as He swore he wou’d love me to death (EEBO, 1700), where to death most likely functions as a time adverbial (‘he swore he’d love me until death’) and not as a resultative (‘he swore me he’d love me resulting in death’) or an intensifier (‘he swore he’d love me a lot’). These structures could, of course, have contributed to to death’s acquisition of intensifying meaning (loving someone until death implies loving them a lot), and hence be relevant to include. In other cases, however, the relevance of the query hit in relation to the semantic development described appears to be much less clear (e.g. the first who turns his back to death; EEBO, 1800). 12 A possible, yet costly solution here would be to rely on multiple annotators, preferably with expertise in the historical language variety at hand [20]. 293 relatively complex case, the procedure we outlined caters to case-driven research, which oper- ates at a level of specificity and granularity that is not abundantly common in computational approaches to semantic change. Besides outlining the procedure, we flagged its current lim- itations and issues, which will hopefully entice further case-driven computational humanities research that will help reflect on and ultimately tackle the challenges that remain. Acknowledgments The work for this study has been made possible by the Platform Digital Infrastructure (Social Sciences and Humanities) fund (PDI-SSH). We want to thank the anonymous reviewers for their valuable suggestions, as well as Folgert Karsdorp for his advise regarding monotonic effects. References [1] R. Bamler and S. Mandt. “Dynamic word embeddings”. In: Proceedings of the 34th inter- national conference on machine learning. Ed. by D. Precup and Y. W. Teh. Vol. 70. Pro- ceedings of machine learning research. Pmlr, 2017, pp. 380–389. url: http://proceedings. mlr.press/v70/bamler17a.html. [2] A. Betti, M. Reynaert, T. Ossenkoppele, Y. Oortwijn, A. Salway, and J. Bloem. “Ex- pert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Com- putational Linguistics, 2020, pp. 6690–6702. doi: 10.18653/v1/2020.coling-main.586. [3] D. Biber and B. Gray. “Being Specific about Historical Change: The Influence of Sub- Register”. In: Journal of English linguistics 41.2 (2013), pp. 104–134. [4] S. Bird, E. Klein, and E. Loper. 
Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., 2009.
[5] Y. Bizzoni, S. Degaetano-Ortlieb, P. Fankhauser, and E. Teich. “Linguistic Variation and Change in 250 Years of English Scientific Writing: A Data-Driven Approach”. In: Frontiers in Artificial Intelligence 3 (2020), p. 73. doi: 10.3389/frai.2020.00073.
[6] R. A. Blythe and W. Croft. “S-curves and the mechanisms of propagation in language change”. In: Language 88.2 (2012), pp. 269–304. doi: 10.1353/lan.2012.0027.
[7] P. C. Bürkner. “Advanced Bayesian Multilevel Modeling with the R Package brms”. In: R Journal (2018). doi: 10.32614/rj-2018-017.
[8] P. C. Bürkner and E. Charpentier. Modeling Monotonic Effects of Ordinal Predictors in Bayesian Regression Models. 2018. doi: 10.31234/osf.io/9qkhj. url: psyarxiv.com/9qkhj.
[9] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. “Stan: A probabilistic programming language”. In: Journal of Statistical Software 76.1 (2017), pp. 1–32.
[10] D. Correia Saavedra. “Measurements of Grammaticalization: Developing a quantitative index for the study of grammatical change”. PhD dissertation. Neuchâtel & Antwerpen: l’Université de Neuchâtel & Universiteit Antwerpen, 2019.
[11] M. Del Tredici, R. Fernández, and G. Boleda. “Short-term meaning shift: A distributional exploration”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 2069–2075. doi: 10.18653/v1/N19-1210.
[12] D. Denison. “Log(ist)ic and simplistic S-curves”. In: Motives for Language Change 54 (2003), p. 70.
[13] H. Dubossarsky, S. Hengchen, N. Tahmasebi, and D. Schlechtweg. “Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019, pp. 457–470. doi: 10.18653/v1/P19-1044.
[14] H. Dubossarsky, D. Weinshall, and E. Grossman. “Outta control: Laws of semantic change and inherent biases in word representation models”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 1136–1145. doi: 10.18653/v1/D17-1118.
[15] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. “Retrofitting Word Vectors to Semantic Lexicons”. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: Association for Computational Linguistics, 2015, pp. 1606–1615. doi: 10.3115/v1/N15-1184.
[16] M. Giulianelli, M. Del Tredici, and R. Fernández. “Analysing lexical semantic change with contextualised word representations”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 3960–3973. doi: 10.18653/v1/2020.acl-main.365.
[17] E. Grave. Language Identification. 2017. url: https://fasttext.cc/blog/2017/10/02/blog-post.html.
[18] W. L. Hamilton, J. Leskovec, and D. Jurafsky. “Cultural shift or linguistic drift? Comparing two computational measures of semantic change”.
In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, 2016, pp. 2116–2121. doi: 10.18653/v1/D16-1229.
[19] W. L. Hamilton, J. Leskovec, and D. Jurafsky. “Diachronic word embeddings reveal statistical laws of semantic change”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, 2016, pp. 1489–1501. doi: 10.18653/v1/P16-1141.
[20] S. Hengchen, N. Tahmasebi, D. Schlechtweg, and H. Dubossarsky. “Challenges for computational lexical semantic change”. In: Zenodo, 2021. doi: 10.5281/zenodo.5040322.
[21] M. Hilpert and D. Correia Saavedra. “Using token-based semantic vector spaces for corpus-linguistic analyses: From practical applications to tests of theoretical claims”. In: Corpus Linguistics and Linguistic Theory 0.0 (2017). doi: 10.1515/cllt-2017-0009.
[22] N. P. Himmelmann. “Lexicalization and grammaticization: opposite or orthogonal?”. In: What Makes Grammaticalization: A Look from Its Components and Its Fringes. Ed. by W. Bisang, N. P. Himmelmann, and B. Wiemer. Berlin: Mouton de Gruyter, 2004.
[23] J. Hoeksema and D. Jo Napoli. “Just for the hell of it: A comparison of two taboo-term constructions”. In: Journal of Linguistics 44.2 (2008), pp. 347–378. doi: 10.1017/s002222670800515x.
[24] S. Hoffmann. “Using the OED quotations database as a corpus – a linguistic appraisal”. In: ICAME Journal 28 (2004), pp. 17–30.
[25] P. Hopper. “On some principles of grammaticalisation”. In: Approaches to Grammaticalization. Ed. by E. C. Traugott and B. Heine. Vol. 1. Amsterdam: John Benjamins, 1991, pp. 17–35.
[26] R. Hu, S. Li, and S. Liang. “Diachronic sense modeling with deep contextualized word embeddings: An ecological view”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019, pp. 3899–3908. doi: 10.18653/v1/P19-1379.
[27] S. Jastrzebski, D. Leśniak, and W. M. Czarnecki. “How to evaluate word embeddings? On importance of data efficiency and simple supervised tasks”. In: arXiv preprint arXiv:1702.02170 (2017).
[28] A. Kroch. Penn Parsed Corpora of Historical English. Philadelphia, 2020. url: https://www.ling.upenn.edu/hist-corpora/.
[29] A. Kutuzov, L. Øvrelid, T. Szymanski, and E. Velldal. “Diachronic word embeddings and semantic shifts: a survey”. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018, pp. 1384–1397. url: https://www.aclweb.org/anthology/C18-1117.
[30] G. Lorenz. “Really worthwhile or not really significant?: A corpus-based approach to the delexicalization and grammaticalization of intensifiers in Modern English”. In: Typological Studies in Language. Ed. by I. Wischer and G. Diewald. Vol. 49. Amsterdam: John Benjamins Publishing Company, 2002, pp. 143–161. doi: 10.1075/tsl.49.11lor.
[31] Y. Luo, D. Jurafsky, and B. Levin. “From Insanely Jealous to Insanely Delicious: Computational Models for the Semantic Bleaching of English Intensifiers”. In: Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change. Florence, Italy: Association for Computational Linguistics, 2019, pp. 1–13. doi: 10.18653/v1/W19-4701.
[32] E. Manjavacas, Á. Kádár, and M. Kestemont.
“Improving Lemmatization of Non-Standard Languages with Joint Learning”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 1493–1503. url: https://www.aclweb.org/anthology/N19-1153.
[33] H. Margerie. “Grammaticalising constructions: to death as a peripheral degree modifier”. In: Folia Linguistica Historica 32 (2011). doi: 10.1515/flih.2011.005.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean. “Efficient Estimation of Word Representations in Vector Space”. In: 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. Ed. by Y. Bengio and Y. LeCun. 2013. url: http://arxiv.org/abs/1301.3781.
[35] S. Mitra, R. Mitra, M. Riedl, C. Biemann, A. Mukherjee, and P. Goyal. “That’s sick dude!: Automatic identification of word sense change across different timescales”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics, 2014, pp. 1020–1029. doi: 10.3115/v1/P14-1096.
[36] T. Nevalainen, H. Raumolin-Brunberg, J. Keränen, M. Nevala, A. Nurmi, M. Palander-Collin, A. Taylor, S. Pintzuk, A. Warner, et al. “Parsed Corpus of Early English Correspondence (PCEEC)”. In: Oxford Text Archive Core Collection (2006).
[37] E. A. Pechenick, C. M. Danforth, and P. S. Dodds. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution”. In: PLOS ONE (2015), p. 24.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. “Scikit-learn: Machine learning in Python”. In: The Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[39] J. Pennington, R. Socher, and C. Manning. “GloVe: Global Vectors for Word Representation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162.
[40] F. Perek. “Recent change in the productivity and schematicity of the way-construction: A distributional semantic analysis”. In: Corpus Linguistics and Linguistic Theory 14.1 (2018), pp. 65–97. doi: 10.1515/cllt-2016-0014.
[41] F. Perek. “Using distributional semantics to study syntactic productivity in diachrony: A case study”. In: Linguistics 54.1 (2016). doi: 10.1515/ling-2015-0043.
[42] R. Quirk, S. Greenbaum, G. Leech, and J. Svartvik. A Comprehensive Grammar of the English Language. London: Longman, 1985.
[43] R. Rehurek and P. Sojka. “Software Framework for Topic Modelling with Large Corpora”. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks (2010), pp. 45–50.
[44] A. Rosenfeld and K. Erk. “Deep neural models of semantic shift”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 474–484. doi: 10.18653/v1/N18-1044.
[45] P. J. Rousseeuw. “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis”.
In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65.
[46] M. Rudolph and D. Blei. “Dynamic embeddings for language evolution”. In: Proceedings of the 2018 World Wide Web Conference. WWW ’18. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee, 2018, pp. 1003–1011. doi: 10.1145/3178876.3185999.
[47] E. Sagi, S. Kaufmann, and B. Clark. “Tracing semantic change with Latent Semantic Analysis”. In: Current Methods in Historical Semantics. Ed. by K. Allan and J. A. Robinson. Berlin, Boston: De Gruyter, 2011. doi: 10.1515/9783110252903.161.
[48] D. Schlechtweg, B. McGillivray, S. Hengchen, H. Dubossarsky, and N. Tahmasebi. “SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection”. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation. Barcelona (online): International Committee for Computational Linguistics, 2020, pp. 1–23. url: https://aclanthology.org/2020.semeval-1.1.
[49] P. Sommerauer and A. Fokkens. “Conceptual Change and Distributional Semantic Models: an Exploratory Study on Pitfalls and Possibilities”. In: Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change. Florence, Italy: Association for Computational Linguistics, 2019, pp. 223–233. doi: 10.18653/v1/W19-4728.
[50] K. Sun, H. Liu, and W. Xiong. “The evolutionary pattern of language in scientific writings: A case study of Philosophical Transactions of the Royal Society (1665–1869)”. In: Scientometrics 126.2 (2021), pp. 1695–1724. doi: 10.1007/s11192-020-03816-8.
[51] N. Tahmasebi, L. Borin, and A. Jatowt. “Survey of Computational Approaches to Lexical Semantic Change”. In: arXiv:1811.06278 [cs] (2019). url: http://arxiv.org/abs/1811.06278.
[52] N. Younes and U.-D. Reips. “Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms”. In: PLOS ONE (2019), p. 17.
[53] L.-C. Yu, J. Wang, K. R. Lai, and X. Zhang. “Refining Word Embeddings for Sentiment Analysis”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 534–539. doi: 10.18653/v1/D17-1056.