=Paper=
{{Paper
|id=Vol-3290/short_paper2780
|storemode=property
|title=Introducing Functional Diversity: A Novel Approach to Lexical
Diversity in (Historical) Corpora
|pdfUrl=https://ceur-ws.org/Vol-3290/short_paper2780.pdf
|volume=Vol-3290
|authors=Folgert Karsdorp,Enrique Manjavacas,Lauren Fonteyn
|dblpUrl=https://dblp.org/rec/conf/chr/KarsdorpMF22
}}
==Introducing Functional Diversity: A Novel Approach to Lexical Diversity in (Historical) Corpora==
Folgert Karsdorp (1, corresponding author), Enrique Manjavacas (2) and Lauren Fonteyn (2)

1 KNAW Meertens Institute, Amsterdam, the Netherlands

2 Leiden University, Leiden, the Netherlands

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium. Email: folgert@karsdorp.io (F. Karsdorp); e.m.a.manjavacas.arevalo@hum.leidenuniv.nl (E. Manjavacas); l.fonteyn@hum.leidenuniv.nl (L. Fonteyn). Web: https://www.karsdorp.io/ (F. Karsdorp); https://www.universiteitleiden.nl/medewerkers/enrique-manjavacas-arevalo/ (E. Manjavacas); https://www.universiteitleiden.nl/en/staffmembers/lauren-fonteyn (L. Fonteyn). ORCID: 0000-0002-5958-0551 (F. Karsdorp); 0000-0002-3942-7680 (E. Manjavacas); 0000-0001-5706-8418 (L. Fonteyn). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.

===Abstract===
The question of how we can reliably estimate the lexical diversity of a particular text (collection) has often been asked by linguists and literary scholars alike. This short paper introduces a way of operationalizing functional diversity measurements by means of token-based embeddings, and argues that functional diversity is not only a practically advantageous, but also a theoretically relevant addition to the Computational Humanities Research toolkit. By means of an experiment on the historical ARCHER corpus, we show that lexical diversity at the level of functional groups is less sensitive to orthographic variation, and provides insight into an important and often disregarded dimension of vocabulary diversity in textual data.

'''Keywords:''' lexical diversity, functional diversity, historical text, Hill numbers

===1. Introduction===
With the present paper, we wish to make a case for the practical and theoretical advantages of adopting the framework of attribute diversity, which distinguishes categorical diversity from the higher-order concept of functional diversity, into Humanities research on lexical diversity. Given two sets of unique word types, set A = {cat, dog, bird, rabbit} and set B = {cat, progesterone, remember, blue}, approaches focusing solely on categorical lexical diversity will suggest that A and B are equally diverse. However, an approach that takes the semantic distance between the items into account will also capture the higher functional-semantic or attribute diversity of set B. To help establish the latter approach in Humanities research, we propose a way of operationalizing functional diversity estimates by means of token-based embeddings.

The question of whether we can estimate lexical richness or diversity is a pertinent one in the Humanities. In Linguistics, attempts have been made to estimate the vocabulary size of a particular language [12, 11], or how many words an average speaker of a particular language knows at different ages [5, 25]. In a similar vein, researchers have also attempted to estimate (and compare) the richness of the active vocabulary of particular authors [e.g., 16, 18] or literary works across time [e.g., 23], or the 'productivity' of linguistic structures (i.e., how many different word types are used in a particular linguistic context [2, 3]) for different individuals [e.g., 24, 1] or across time [e.g., 21]. To attain these goals, researchers often resort to corpus research, using text (excerpt) collections of varying sizes with diversity measures that rely on the number of word tokens, unique word types, and/or hapax legomena (i.e., words that occur only once), such as (variations on) Mean Word Frequency (MWF) and Type-Token Ratio (TTR) [for examples, see 27], realized/potential/expanding productivity [2], or measures that originate in Shannon entropy [26].
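The counts on which these traditional measures rely can be sketched in a few lines. The snippet below is a minimal illustration only: the function name `lexical_counts` and the toy sentence are hypothetical and not taken from the paper or from ARCHER, and MWF is computed here as tokens divided by types.

```python
from collections import Counter


def lexical_counts(tokens):
    """Return the basic counts behind type-based diversity measures."""
    freqs = Counter(tokens)
    n_tokens = sum(freqs.values())
    n_types = len(freqs)
    # Hapax legomena: types that occur exactly once.
    n_hapaxes = sum(1 for count in freqs.values() if count == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapaxes": n_hapaxes,
        "ttr": n_types / n_tokens,  # Type-Token Ratio
        "mwf": n_tokens / n_types,  # Mean Word Frequency
    }


# Hypothetical toy text, used only to exercise the function.
toy = "the cat saw the dog and the bird saw the cat".split()
stats = lexical_counts(toy)
print(stats)
```

Because every quantity here is derived from exact string identity, any spelling variant of a word counts as a new type, which is the sensitivity discussed next.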
There is, however, a practical problem that arises with any measure of diversity that relies on hapaxes and/or unique types. In many digitized text corpora, the number of unique character strings cannot be equated to the number of unique words. A substantial amount of variation in how word types are represented in a corpus may be due to OCR errors (e.g., in historical texts, the long S character <ſ> is often misrecognized as another letter, which means the word type strength could be represented in a corpus as at least three different character strings: <ſtrength> and two misrecognized variants). Furthermore, some types of corpora contain texts where authors do not (consistently) adhere to (present-day) standard spelling conventions, such as corpora of (informal) language on social media or any historical corpora that pre-date the establishment of uniform spelling conventions. This introduces a dimension of variation that makes it difficult to accurately count the number of actual hapax legomena or unique types. Of course, at least some of this unwanted variation can be tackled in corpus pre-processing through (semi-)automated spelling normalisation, but this too can prove challenging given that neither OCR errors nor non-standard spelling variation are entirely or even largely systematic.

In this paper, we argue that there are substantial advantages to relying on functional diversity measures (rather than, or as a complement to, lexical diversity measures) to estimate and compare the 'lexical richness' of (collections of) text. More specifically:
* We demonstrate that functional diversity estimates are affected to a much lesser extent by spelling errors and inconsistencies than lexical diversity estimates. As such, there is a clear practical advantage to relying on functional diversity.
* We suggest that, even in corpora that are free from orthographic noise, there is a theoretical advantage to examining higher-order diversity at the level of functional groups.
We propose that a theoretically relevant distinction can be made when making claims about 'vocabulary richness' or lexical diversity by taking the semantic similarity between words into account (see footnote 1). This higher-order, functional-semantic dimension of diversity is theoretically relevant, as it helps characterize diversity in terms of depth and width, and offers a perspective on diversity that is not captured by more traditional, exclusively categorical measures.

Footnote 1: The distinction between lower-order and higher-order diversity proposed here is reminiscent of the distinction between 'productivity' and 'schematicity' in [21, 13].

===2. Measuring Diversity===
'''Functional diversity.''' For our measurements of functional diversity in (historical) corpora, we apply the framework of attribute diversity, which was originally developed in the context of ecological diversity [8, 6]. In ecology too, it is important to account not only for categorical diversity (the taxonomic model of species diversity), but also for attribute variation between and within species. After all, certain species (e.g., ducks vs. geese) are more similar to each other than others (e.g., ducks vs. sheep). This is not captured by taxonomic diversity, which treats all species as equally distant. In the framework of attribute diversity [8, 6], categorical diversity is considered a special case of functional diversity, where each type (or species) is considered its own functional group and all groups are functionally equally different. In this extreme case, each functional difference results in the definition of a new functional group, which is equivalent to a categorical type. More precisely, the threshold $\tau$ for defining a new functional group is set to the smallest pairwise distance between types. The framework allows researchers to specify functional groups at higher distinctiveness thresholds $\tau$.
$\tau$ then specifies the distance threshold beyond which types are considered equally distant and thus belong to different functional groups. As $\tau$ tends to infinity, types become functionally indistinct and belong to the same functional group.

Each type $i$ contributes to the frequency of a functional group. Let $n_i$ be the frequency of type $i$ and $a_i$ the frequency of a functional group, then $v_i(\tau)$ can be defined as the proportional contribution of type $i$ to a group for a given threshold level $\tau$. Functional diversity, then, is defined as the sum over the proportional contributions $v_i(\tau)$ of each type $i = 1, 2, \ldots, S$:

$$FD = \sum_{i=1}^{S} v_i(\tau), \tag{1}$$

where $v_i(\tau) = n_i / a_i(\tau)$. Note that when each type belongs to its own functional group, i.e., when the definitions of functional groups and types coincide, $v_i$ equals unity. In this case, $n_i = a_i(\tau)$ and thus the functional diversity is equal to the number of types $S$. When functional groups and types do not coincide, certain functional groups consist of more than one type, which in turn may belong to more than one group. To account for such many-to-many type-function relations, the abundance $a_i$ at threshold $\tau$ is computed as the number of tokens of type $i$ plus a fraction of the tokens of any other type $j$ that is functionally indistinct from type $i$:

$$a_i(\tau) = n_i + \sum_{j \neq i}^{S} \left(1 - \frac{d_{ij}(\tau)}{\tau}\right) n_j \tag{2}$$

Here, $d_{ij}(\tau)$ refers to the distance between types $i$ and $j$, which is set to $\tau$ if $d_{ij} > \tau$ and to $d_{ij}$ otherwise.

'''Functional Hill numbers.''' Eq. 1 describes the functional richness of a collection, or the number of functional groups given a distinctiveness threshold $\tau$. Richness is just one of many diversity measures, and it treats each functional group as equally important. However, certain functional groups may be more prominent than others, which is better captured by other diversity measures, like Shannon entropy or the Gini-Simpson index.
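As a small worked check of Eqs. 1 and 2, consider a hypothetical collection with only two types, with frequencies $n_1$ and $n_2$ and a distance $d_{12} \le \tau$ between them. This two-type case is an illustration added here, not an example from the original framework papers; it simply shows that functional richness moves between 1 and 2 as the two types go from identical to maximally distinct.

```latex
\begin{align*}
a_1(\tau) &= n_1 + \Bigl(1 - \frac{d_{12}}{\tau}\Bigr)\, n_2, \\
a_2(\tau) &= n_2 + \Bigl(1 - \frac{d_{12}}{\tau}\Bigr)\, n_1, \\
FD &= \frac{n_1}{a_1(\tau)} + \frac{n_2}{a_2(\tau)}. \\[4pt]
\text{If } d_{12} = \tau:\quad & a_1 = n_1,\ a_2 = n_2 \ \Rightarrow\ FD = 1 + 1 = 2. \\
\text{If } d_{12} = 0:\quad & a_1 = a_2 = n_1 + n_2 \ \Rightarrow\ FD = \frac{n_1 + n_2}{n_1 + n_2} = 1.
\end{align*}
```

Because distances above $\tau$ are truncated to $\tau$, any $d_{12} > \tau$ behaves exactly like the first case.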
To account for other aspects of diversity, Chao and colleagues [8, 6] integrate functional diversity into a mathematically unified family of diversity indices called Hill numbers [14]. Hill numbers are parameterized only by $q$, which determines the sensitivity to the relative frequency $p_i$ of variant type $i$ [14, 7, 17, 9]:

$${}^{q}D\big((p_1, \ldots, p_S)\big) = \left(\sum_{i=1}^{S} p_i^{q}\right)^{\frac{1}{1-q}} \tag{3}$$

The diversity values at certain orders $q$ correspond to well-known diversity indices. The number of unique types (also called the 'richness' of a sample) is equal to ${}^{0}D$. With $q = 0$ no weight is given to the relative frequency of the types, or, conversely, maximum weight is given to rare types. By setting $q$ to 1, the weight of each type is proportional to its relative frequency. Note, however, that ${}^{1}D$ is undefined. Yet, the limit $\lim_{q \to 1} {}^{q}D$ exists, and is equal to the exponential of Shannon entropy [cf. 7, 17, 9]. With $q > 1$, disproportionally more weight is given to more frequent types. For instance, the Hill number of order $q = 2$ is equal to the inverse of the Simpson concentration index $\sum_i p_i^2$, which expresses the probability that two random tokens are of the same type. An interesting property of Hill numbers is that all diversity indices are expressed in terms of the effective number of types: the number of equally frequent types required to obtain a particular observed diversity value. Because the indices are on the same scale, they can easily be represented in 'diversity profiles', which chart the diversity at different orders $q$. These profiles, then, can be used to characterize the evenness of a collection. Profiles with steep declines indicate a large disparity in the frequencies of the types, whereas flat profiles indicate a more even distribution among types.
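A minimal sketch of Eq. 3 may help to make the notion of a diversity profile concrete. The function name `hill_number` and the two frequency lists are hypothetical and chosen only for illustration; the order $q = 1$ is handled through the limit mentioned above, i.e., the exponential of Shannon entropy.

```python
import math


def hill_number(freqs, q):
    """Hill number of order q for a list of type frequencies (Eq. 3)."""
    total = sum(freqs)
    p = [f / total for f in freqs if f > 0]
    if abs(q - 1.0) < 1e-12:
        # q = 1 is only defined as a limit: exp of Shannon entropy.
        return math.exp(-sum(p_i * math.log(p_i) for p_i in p))
    return sum(p_i ** q for p_i in p) ** (1.0 / (1.0 - q))


# Two hypothetical collections with four types each.
even = [10, 10, 10, 10]
skewed = [37, 1, 1, 1]

for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(skewed, q))
```

For the even collection the profile is flat (four effective types at every order), whereas for the skewed collection the effective number of types drops as $q$ increases, which is the steep decline described above.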
By incorporating functional diversity into the Hill number framework, Chao, Chiu and colleagues [8, 6] show how to estimate the effective number of equally distinct functional groups at a given distinctiveness threshold $\tau$ and diversity order $q$. The 'effective number', sometimes called 'true diversity', represents the number of types in an idealized reference sample that all have the same frequency and a distance between them of at least $\tau$. Expanding on Eq. 1, the functional diversity of order $q$ is defined as follows:

$${}^{q}FD(\Delta(\tau)) = \left(\sum_{i=1}^{S} v_i(\tau) \left(\frac{a_i(\tau)}{n}\right)^{q}\right)^{\frac{1}{1-q}}, \tag{4}$$

where $n$ refers to the total number of tokens in the collection.

'''Example.''' To obtain a better intuition of what functional diversity measures entail, and specifically how the measure responds to the parameter $\tau$, we present the following example (see footnote 2). Consider these four words and their corresponding frequencies: apricot ($n_1 = 20$), pineapple ($n_2 = 15$), digital ($n_3 = 10$), information ($n_4 = 5$). For each word, Table 1 lists whether it co-occurs with any of ten context words. Each word can thus be represented as a binary context vector, which can be used to compute the distance between two words.

Footnote 2: Our example is a translation of [6] into a linguistic context.

{| class="wikitable"
|+ Table 1: Co-occurrence table supporting the example in Figure 1, which illustrates how functional diversity can be calculated at different levels of $\tau$.
! word !! boil !! data !! sugar !! pizza !! water !! hat !! tourist !! kiosk !! camera !! photo
|-
| apricot || 1 || 0 || 1 || 0 || 1 || 0 || 0 || 0 || 0 || 0
|-
| pineapple || 1 || 0 || 1 || 1 || 1 || 1 || 0 || 0 || 0 || 0
|-
| digital || 0 || 1 || 0 || 0 || 0 || 0 || 0 || 0 || 1 || 1
|-
| information || 0 || 1 || 0 || 0 || 0 || 0 || 1 || 1 || 0 || 0
|}

For example, computing the pairwise distances between all four words using the Jaccard distance yields the following distance matrix $\Delta$, with rows and columns ordered apricot, pineapple, information, digital:

$$\Delta = \begin{pmatrix} 0 & 0.4 & 1 & 1 \\ 0.4 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0.6 \\ 1 & 1 & 0.6 & 0 \end{pmatrix}$$

In Figure 1, we calculate functional diversity for different distinctiveness thresholds $\tau$.
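The definitions in Eqs. 2 and 4 can be tried out directly on the distance matrix and frequencies of this example. The following is a minimal sketch, not the authors' implementation: the function name `functional_hill` and the constants `FREQS` and `DELTA` are names introduced here, and the order $q = 1$ is deliberately rejected because it would require a separate limit.

```python
def functional_hill(dist, freqs, tau, q=0.0):
    """Functional Hill number of order q at threshold tau (Eqs. 2 and 4)."""
    if q == 1:
        raise ValueError("q = 1 requires a limit that this sketch does not implement")
    n = sum(freqs)  # total number of tokens
    size = len(freqs)
    total = 0.0
    for i in range(size):
        # Eq. 2: group abundance a_i, with distances truncated at tau.
        # The j == i term has distance 0 and therefore contributes n_i.
        a_i = sum(
            (1.0 - min(dist[i][j], tau) / tau) * freqs[j] for j in range(size)
        )
        v_i = freqs[i] / a_i  # proportional contribution of type i
        total += v_i * (a_i / n) ** q  # summand of Eq. 4
    return total ** (1.0 / (1.0 - q))


# Order follows the rows of the matrix: apricot, pineapple, information, digital.
FREQS = [20, 15, 5, 10]
DELTA = [
    [0.0, 0.4, 1.0, 1.0],
    [0.4, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.6],
    [1.0, 1.0, 0.6, 0.0],
]

# Smallest distance between two different types.
d_min = min(DELTA[i][j] for i in range(4) for j in range(4) if i != j)
richness_at_d_min = functional_hill(DELTA, FREQS, tau=d_min, q=0)
richness_at_large_tau = functional_hill(DELTA, FREQS, tau=1e9, q=0)
print(richness_at_d_min, richness_at_large_tau)
```

With $\tau$ at the smallest pairwise distance every type forms its own group and the richness equals the number of types (4), while for a very large $\tau$ the value approaches 1, in line with the limiting cases described for the framework.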
We begin in the top row with $\tau = d_{\mathrm{min}}$, which is equal to the minimum distance between different word types (i.e., intra-type distances are not considered in this example). At $d_{\mathrm{min}}$, $\tau$ equals 0.4, which means that word types with a distance of at least 0.4 between them are considered functionally equally distant. In practice, all distances greater than $\tau = 0.4$ are truncated to 0.4 in the distance matrix (cf. the matrix in Figure 1). In this scenario, each word is functionally equally distant and thus each type has a proportional contribution $v_i$ of unity to its functional group, or, in other words, each type makes up its own functional group. This is illustrated in Figure 1 with the circles whose size is proportional to their frequency. The circles do not overlap, which illustrates that they each comprise their own functional group. The functional diversity at $q = 0$, then, is $FD = 4$, which is simply the sum over the proportional contributions $v_i(\tau)$ of each type $i$ to a functional group (cf. Eq. 1).

As the threshold value $\tau$ increases, an increasing number of types becomes functionally indistinguishable. In other words, with higher values of $\tau$, functional groups consist of more types. Chiu and Chao [8, 6] suggest using Rao's quadratic entropy $Q$ for $\tau$, which is a similarity-sensitive diversity measure representing the average distance between two randomly selected instances in a collection [22]. $Q$, hereafter denoted as $d_{\mathrm{mean}}$, is expressed as:

$$Q = d_{\mathrm{mean}} = \sum_{i=1}^{S} \sum_{j=1}^{S} d_{ij}\, p_i\, p_j, \tag{5}$$

where $d_{ij}$ refers to the distance between types $i$ and $j$, and $p_i$ and $p_j$ to their relative frequencies. As shown in Figure 1, setting $\tau$ at $d_{\mathrm{mean}} = 0.54$ decreases the functional diversity to $FD = 3.58$. At the threshold of 0.54, apricot and pineapple become functionally less distinct, contributing to a shared functional group (illustrated by the overlapping circles). By contrast, at $\tau = d_{\mathrm{mean}}$, digital and information remain functionally equally distant and as such belong to their own functional group. Note that with $FD = 3.58$, the functional diversity at $\tau = d_{\mathrm{mean}}$ is larger than 3. This is because the co-occurrence profiles of apricot and pineapple are considered partially overlapping but not identical. When $\tau$ is set to the maximum distance in the distance matrix ($\tau = d_{\mathrm{max}} = 1$), however, digital and information contribute to a shared functional group. Note that when there are many different word types, $\tau = d_{\mathrm{max}}$ is less informative than $\tau = d_{\mathrm{mean}}$, because functional diversity is then often close to unity [cf. 6]. Finally, as $\tau$ tends to infinity, all words become part of the same functional group, which is expressed by a functional diversity of $FD = 1$.

Figure 1: Example of how functional diversity is operationalized. The figure, inspired by [6], shows for increasing values of $\tau$ (i.e., $d_{\mathrm{min}}$, $d_{\mathrm{mean}}$, $d_{\mathrm{max}}$, $d_{\infty}$), the corresponding truncated distance matrix $\Delta(\tau)$, an illustration of the overlap between functional groups, the proportional contribution of each word type to a functional group, and the total functional diversity.

===3. Data and pre-processing===
'''ARCHER corpus.''' For our experiments, we use ARCHER 3.2 [28], a corpus of historical English registers (3.3M words). The corpus covers a period of almost 400 years (1600–1999), and contains texts from 12 different genres or registers: advertisements, drama, fiction, sermons, journals, legal text, medicine, news, early prose, science, letters, and diaries. In terms of spelling, ARCHER 3.2 contains the original spelling of published editions normalized with VARD2 [4]. In contrast to many other historical corpora, ARCHER 3.2 is a well-balanced, cleaned (and relatively small) corpus, and hence it constitutes the ideal starting point for our experiment.

'''Simulating errors.''' To mimic different degrees of text noise, we 'pollute' each text in the clean ARCHER corpus by simulating errors.
In this simulation procedure, each token of each text is modified with a probability $p$. The modification involves replacing each letter by a random ASCII letter with probability $s$. With $s = 0.2$, a word like diversity may, for example, be replaced with diversizy. We experiment with $p \in \{0, 0.1, 0.2, 0.35, 0.5, 0.75\}$, and chart the impact of having a more distorted text on the stability of the diversity measures (see footnote 3). In all experiments $s$ is set to 0.2. With this procedure, each text is manipulated five times per $p$ value. The reported diversity measurements are computed by taking the mean diversity over these five different texts.

'''Embeddings.''' For the present study, we use token-based embeddings to obtain semantic similarity estimates between the words in a given text. These embeddings are computed on the basis of MacBERTh [20, 19], a Large Language Model that follows the architecture of BERT-base uncased [10] and is pre-trained on historical English (1450–1950) using a custom vocabulary. Token-based embeddings are expected to be more robust than type-based embeddings in the presence of noise, since they take the sentential context in which the target word appears into account. This means that they can associate (even lower-frequency) variants of the same word with each other, where the sentential context is expected to match. Moreover, thanks to the built-in adaptive tokenization approach, MacBERTh is also able to compute embeddings for words that were not seen during training, which is an invaluable feature for texts with large amounts of orthographic variation (see footnote 4).

In order to compute the type-level distance matrix between all word types in a corpus, we first compute the token embeddings of all words it contains (see footnote 5). If the same token appears multiple times in the input corpus, we compute a single embedding by averaging over the embeddings of all occurrences. Finally, we rely on the cosine distance function in order to obtain a distance value between 0 and 2 (see footnote 6).

Footnote 3: Note that texts resulting from $p = 0.75$ are perhaps less realistic than those resulting from lower values. To illustrate, the OCR error rate in Eighteenth Century Collections Online (ECCO) has, for instance, been estimated at approximately 25% [15].

Footnote 4: The purpose of this paper is to introduce the attribute diversity framework into lexical diversity research. As a first operationalization, we resorted to token-based embeddings, which is a theoretically sound choice (as these models are sensitive to the fact that words can have multiple meanings) that comes with certain practical advantages (with respect to lower-frequency and 'unseen' items). We are, however, interested in trying out other ways of operationalizing the concept of functional groups in future work. One possibility, for instance, would be to test and compare different implementations of semantic similarity, comparing type-based and token-based approaches.

===4. Results===
====4.1. Functional diversity is affected less by increased orthographic variation====
Figure 2: Relative change in the number of functional groups after modifying texts with probability $p$ with respect to their unaltered counterparts ($p = 0$). The left panel displays the results for $q = 0$ (corresponding to functional richness), and the two smaller panels on the right present results for higher diversity orders (i.e., $q \in \{1, 2\}$), which put increasing weight on the frequency of word types.

Figure 2 shows the relative change in the number of functional groups after modifying texts with a text modification probability $p$ with respect to their original, unmodified counterparts ($p = 0$). The left panel shows the values for $q = 0$ at three different thresholds of $\tau$.
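The error simulation behind Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: the function names `corrupt_token` and `pollute`, the `seed` argument, and the toy token list are hypothetical, and replacement letters are drawn from lowercase ASCII only (so a letter may occasionally be replaced by itself).

```python
import random
import string


def corrupt_token(token, s, rng):
    """Replace each letter of token by a random ASCII letter with probability s."""
    return "".join(
        rng.choice(string.ascii_lowercase) if ch.isalpha() and rng.random() < s else ch
        for ch in token
    )


def pollute(tokens, p, s=0.2, seed=0):
    """Modify each token with probability p, using letter-level rate s."""
    rng = random.Random(seed)
    return [corrupt_token(t, s, rng) if rng.random() < p else t for t in tokens]


P_VALUES = [0, 0.1, 0.2, 0.35, 0.5, 0.75]
toy_text = ["diversity", "of", "words"] * 50  # hypothetical stand-in for a corpus text

for p in P_VALUES:
    # Five corrupted copies per p value, averaged, as in the procedure described above.
    runs = [len(set(pollute(toy_text, p, seed=k))) for k in range(5)]
    print(p, sum(runs) / len(runs))
```

The printed averages are the number of unique character strings per noise level, i.e., the quantity that a purely type-based measure would treat as distinct word types.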
As expected, the number of functional entities at $\tau = d_{\mathrm{min}}$ increases more or less linearly with the probability $p$ of modifying words. Indeed, the probability of a modification yielding an orthographically unique letter combination is high, and each unique combination is taken to account for a new word type (i.e., a new functional group). The relative change in the number of functional groups is much weaker for $\tau = d_{\mathrm{mean}}$, where the number of functional groups at extreme values of $p$ is still relatively close to the number of groups at $p = 0$. Note that the same holds for $\tau = d_{\mathrm{max}}$, which also remains stable at higher noise levels. However, as explained above, with $\tau = d_{\mathrm{max}}$, estimates are often close to unity, which makes the stability of $\tau = d_{\mathrm{max}}$ less surprising. The two right panels present the same results for higher diversity orders $q$. These plots show that when more weight is given to high-frequency entities, functional diversity is also better able to cope with orthographic variation than lexical diversity at $\tau = d_{\mathrm{min}}$.

Footnote 5: Note that due to the tokenization approach of MacBERTh, input words are often split into smaller units (sub-tokens). In order to compute a single embedding in such cases, we average over the embeddings of the different sub-tokens.

Footnote 6: More specifically, the cosine distance is defined as 1 minus the cosine similarity of two given vectors, the latter being bounded between -1 and 1.

Figure 3: Box plots showing the reduction from $d_{\mathrm{min}}$ to $d_{\mathrm{mean}}$ in number of functional groups for advertisements and fiction texts.

====4.2. Functional diversity is a theoretically relevant complement to lexical diversity====
To get a firmer grip on what could be gained from integrating functional diversity estimates into discussions of lexical richness, we automatically identified text pairs with approximately the same number of unique word types ($d_{\mathrm{min}}$), but a diverging number of functional groups at $d_{\mathrm{mean}}$.
In each of these text pairs, one text is functionally less 'condensed', using the same number of unique lexical items to cover a broader functional range. A commonly occurring type of text pairing, in that respect, is that of a fiction text with a text containing a collection of advertisements, where advertisements consistently cover a smaller number of functional groups despite being as lexically diverse as the paired fiction text. The relatively strong reduction from $d_{\mathrm{min}}$ to $d_{\mathrm{mean}}$ in advertising, illustrated in Figure 3, is intuitive, as advertisements often present a list of (functionally closely related) services and/or goods (see Figure 4), resulting in a more condensed diversity that suggests depth rather than breadth. For fiction, by contrast, there is no reason to expect a similar reduction.

Figure 4: Excerpts from an advertisement collection (left; filename '1860illn_a6b') and fiction text (right; filename '1891barr_f6b') pairing with comparable estimates at $d_{\mathrm{min}}$ but diverging estimates at $d_{\mathrm{mean}}$.

Pairings of two texts from the same genre also emerged. A telling example is the pairing of Isabel Clarendon (1886), a fiction text by naturalist/realist author George Gissing, with Caprice (1917) by Ronald Firbank (see Figure 5). The excerpts in the corpus from both texts have roughly the same number of unique word types (Caprice: 1374 vs. Isabel Clarendon: 1377), but the types in Caprice, a minimalist novel that, unlike realist work, predominantly consists of dialogue and contains only limited descriptions of setting and character, cover a considerably smaller number of functional groups at $\tau = d_{\mathrm{mean}}$ (213 vs. 368). Interestingly, with 5260 word tokens, the excerpt of Isabel Clarendon has a lower TTR than the excerpt of Caprice, which comprises 3753 tokens (1377/5260 ≈ 0.26 vs. 1374/3753 ≈ 0.37). Hence, the TTRs would suggest that Caprice covers more ground in fewer words.
The functional diversity estimate, however, paints a different picture, which adds a theoretically relevant dimension to investigations into the lexical richness of texts.

Figure 5: Comparison of the Hill number profiles at $\tau = d_{\mathrm{min}}$ and $\tau = d_{\mathrm{mean}}$ for two fiction texts, Isabel Clarendon (1886) and Caprice (1917).

===5. Conclusion===
In this short paper, we introduce a way of incorporating the notion of functional diversity into lexical diversity measurements in (historical) corpora by means of token-based embeddings. Our experiment shows that considering lexical diversity at the level of functional groups has the practical advantage of being less sensitive to orthographic noise in the data, and the theoretical advantage of adding an important and often disregarded dimension (capturing depth vs. breadth) of vocabulary diversity in textual data. As such, the framework of attribute diversity commonly used in Ecology should be considered an important addition to the Computational Humanities research toolkit.

===Acknowledgments===
The training of MacBERTh has been made possible by the Platform Digital Infrastructure (Social Sciences and Humanities) fund (PDI-SSH). We thank Melvin Wevers (University of Amsterdam) for his constructive feedback.

===References===
* [1] L. Anthonissen. Individuality in Language Change. Berlin, Boston: De Gruyter Mouton, 2021. doi: 10.1515/9783110725841.
* [2] R. H. Baayen. “Corpus linguistics in morphology: Morphological productivity”. In: Corpus Linguistics: An International Handbook. Ed. by A. Lüdeling and M. Kytö. Vol. 2. Berlin, New York: De Gruyter Mouton, 2009, pp. 899–919. doi: 10.1515/9783110213881.2.899.
* [3] J. Barðdal. Productivity: Evidence from Case and Argument Structure in Icelandic. Amsterdam, Philadelphia: John Benjamins, 2008.
* [4] A. Baron and P. Rayson. “VARD2: A tool for dealing with spelling variation in historical corpora”. In: Postgraduate conference in corpus linguistics. 2008.
* [5] M. Brysbaert, M. Stevens, P. Mandera, and E. Keuleers. “How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age”. In: Frontiers in Psychology 7 (2016). doi: 10.3389/fpsyg.2016.01116.
* [6] A. Chao, C.-H. Chiu, S. Villéger, I.-F. Sun, S. Thorn, Y.-C. Lin, J.-M. Chiang, and W. B. Sherwin. “An Attribute-diversity Approach to Functional Diversity, Functional Beta Diversity, and Related (Dis)Similarity Measures”. In: Ecological Monographs 89.2 (2019). doi: 10.1002/ecm.1343.
* [7] A. Chao, N. J. Gotelli, T. C. Hsieh, E. L. Sander, K. H. Ma, R. K. Colwell, and A. M. Ellison. “Rarefaction and Extrapolation with Hill Numbers: A Framework for Sampling and Estimation in Species Diversity Studies”. In: Ecological Monographs 84.1 (2014), pp. 45–67.
* [8] C.-H. Chiu and A. Chao. “Distance-Based Functional Diversity Measures and Their Decomposition: A Framework Based on Hill Numbers”. In: PLoS ONE 9.7 (2014). Ed. by F. de Bello, e100014. doi: 10.1371/journal.pone.0100014.
* [9] A. J. Daly, J. M. Baetens, and B. De Baets. “Ecological diversity: measuring the unmeasurable”. In: Mathematics 6.7 (2018), p. 119.
* [10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.
* [11] B. Efron and R. Thisted. “Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know?” In: Biometrika 63.3 (1976), p. 435. doi: 10.2307/2335721.
* [12] A. Ellegård. “Estimating Vocabulary Size”. In: Word 16.2 (1960), pp. 219–244. doi: 10.1080/00437956.1960.11659728.
* [13] L. Fonteyn and E. Manjavacas. “Adjusting scope: a computational approach to case-driven research on semantic change”. In: Proceedings of the Workshop on Computational Humanities Research (CHR 2021). Vol. 2898. CEUR Workshop Proceedings. Amsterdam, 2021, pp. 280–298. url: http://ceur-ws.org/Vol-2989/long_paper26.pdf.
* [14] M. O. Hill. “Diversity and Evenness: A Unifying Notation and Its Consequences”. In: Ecology 54.2 (1973), pp. 427–432.
* [15] M. J. Hill and S. Hengchen. “Quantifying the Impact of Dirty OCR on Historical Text Analysis: Eighteenth Century Collections Online as a Case Study”. In: Digital Scholarship in the Humanities 34.4 (2019), pp. 825–843. doi: 10.1093/llc/fqz024.
* [16] D. L. Hoover. “Another Perspective on Vocabulary Richness”. In: Computers and the Humanities 37.2 (2003), pp. 151–178. doi: 10.1023/a:1022673822140.
* [17] L. Jost. “Entropy and diversity”. In: Oikos 113.2 (2006), pp. 363–375.
* [18] M. Kubát and J. Milička. “Vocabulary Richness Measure in Genres”. In: Journal of Quantitative Linguistics 20.4 (2013), pp. 339–349. doi: 10.1080/09296174.2013.830552.
* [19] E. Manjavacas and L. Fonteyn. “Adapting vs. Pre-training Language Models for Historical Languages”. In: Journal of Data Mining & Digital Humanities NLP4DH (2022). doi: 10.46298/jdmdh.9152.
* [20] E. Manjavacas and L. Fonteyn. “MacBERTh: Development and Evaluation of a Historically Pre-trained Language Model for English (1450-1950)”. In: Proceedings of the Workshop on NLP4DH ICON 2021. online: NLP Association of India (NLPAI), 2021.
* [21] F. Perek. “Recent change in the productivity and schematicity of the way-construction: A distributional semantic analysis”. In: Corpus Linguistics and Linguistic Theory 14.1 (2018), pp. 65–97. doi: 10.1515/cllt-2016-0014.
* [22] C. R. Rao. “Diversity and dissimilarity coefficients: a unified approach”. In: Theoretical Population Biology 21.1 (1982), pp. 24–43.
* [23] A. Riba and J. Ginebra. “Diversity of vocabulary and homogeneity of literary style”. In: Journal of Applied Statistics 33.7 (2006), pp. 729–741. doi: 10.1080/02664760600708970.
* [24] H.-J. Schmid and A. Mantlik. “Entrenchment in Historical Corpora? Reconstructing Dead Authors’ Minds from their Usage Profiles”. In: Anglia 133.4 (2015), pp. 583–623. doi: 10.1515/ang-2015-0056.
* [25] J. Segbers and S. Schroeder. “How many words do children know? A corpus-based estimation of children’s total vocabulary size”. In: Language Testing 34.3 (2017), pp. 297–320. doi: 10.1177/0265532216641152.
* [26] C. E. Shannon. “A Mathematical Theory of Communication”. In: Mobile Computing and Communications Review 5 (I 1948), p. 53.
* [27] F. J. Tweedie and R. H. Baayen. “How Variable May a Constant be? Measures of Lexical Richness in Perspective”. In: Computers and the Humanities 32.5 (1998), pp. 323–352. doi: 10.1023/a:1001749303137.
* [28] N. Yáñez-Bouza. ARCHER 3.2: A Representative Corpus of Historical English Registers. https://www.projects.alc.manchester.ac.uk/archer/wp-content/uploads/2020/06/ARCHER_poster.pdf. 2013.