=Paper=
{{Paper
|id=Vol-3290/short_paper2780
|storemode=property
|title=Introducing Functional Diversity: A Novel Approach to Lexical
    Diversity in (Historical) Corpora
|pdfUrl=https://ceur-ws.org/Vol-3290/short_paper2780.pdf
|volume=Vol-3290
|authors=Folgert Karsdorp,Enrique Manjavacas,Lauren Fonteyn
|dblpUrl=https://dblp.org/rec/conf/chr/KarsdorpMF22
}}
==Introducing Functional Diversity: A Novel Approach to Lexical
    Diversity in (Historical) Corpora==
<pdf width="1500px">https://ceur-ws.org/Vol-3290/short_paper2780.pdf</pdf>
<pre>
Introducing Functional Diversity: A Novel Approach
to Lexical Diversity in (Historical) Corpora
Folgert Karsdorp1,∗, Enrique Manjavacas2 and Lauren Fonteyn2
1 KNAW Meertens Institute, Amsterdam, the Netherlands
2 Leiden University, Leiden, the Netherlands


Abstract
The question of how we can reliably estimate the lexical diversity of a particular text (collection)
has often been asked by linguists and literary scholars alike. This short paper introduces a way of
operationalizing functional diversity measurements by means of token-based embeddings, and argues
that functional diversity is not only a practically advantageous, but also a theoretically relevant
addition to the Computational Humanities Research toolkit. By means of an experiment on the
historical ARCHER corpus, we show that lexical diversity at the level of functional groups is less
sensitive to orthographic variation, and provides insight into an important and often disregarded
dimension of vocabulary diversity in textual data.

Keywords
Lexical diversity, Functional diversity, Historical text, Hill numbers


1. Introduction
With the present paper, we wish to make a case for the practical and theoretical advantages of
adopting the framework of attribute diversity – which distinguishes categorical diversity from
the higher-order concept of functional diversity – into Humanities research on lexical diversity.
Given two sets of unique word types, set A = {cat, dog, bird, rabbit} and set B = {cat, progesterone,
remember, blue}, approaches focusing solely on categorical lexical diversity will suggest that A and B
are equally diverse. However, an approach that takes the semantic distance between the items
into account will also capture the higher functional-semantic or attribute diversity of set B. To
help establish the latter approach in Humanities research, we propose a way of operationalizing
functional diversity estimates by means of token-based embeddings.
CHR 2022: Computational Humanities Research Conference, December 12 – 14, 2022, Antwerp, Belgium
∗ Corresponding author.
Email: folgert@karsdorp.io (F. Karsdorp); e.m.a.manjavacas.arevalo@hum.leidenuniv.nl (E. Manjavacas);
l.fonteyn@hum.leidenuniv.nl (L. Fonteyn)
URL: https://www.karsdorp.io/ (F. Karsdorp);
https://www.universiteitleiden.nl/medewerkers/enrique-manjavacas-arevalo/ (E. Manjavacas);
https://www.universiteitleiden.nl/en/staffmembers/lauren-fonteyn (L. Fonteyn)
ORCID: 0000-0002-5958-0551 (F. Karsdorp); 0000-0002-3942-7680 (E. Manjavacas); 0000-0001-5706-8418 (L. Fonteyn)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   The question of whether we can estimate lexical richness or diversity is a pertinent one in the
Humanities. In Linguistics, attempts have been made to estimate the vocabulary size of a
particular language [12, 11], or how many words an average speaker of a particular language
knows at different ages [5, 25]. In a similar vein, researchers have also attempted to estimate
(and compare) the richness of the active vocabulary of particular authors [e.g., 16, 18] or literary
works across time [e.g., 23], or the ‘productivity’ of linguistic structures (i.e., how many
different word types are used in a particular linguistic context [2, 3]) for different individuals
[e.g., 24, 1] or across time [e.g., 21]. To attain these goals, researchers often resort to corpus
research, using text (excerpt) collections of varying sizes with diversity measures that rely on
the number of word tokens, unique word types, and/or hapax legomena (i.e., words that occur
only once), such as (variations on) Mean Word Frequency (MWF) and Type-Token Ratio
(TTR) [for examples, see 27], realized/potential/expanding productivity [2], or measures that
originate in Shannon entropy [26].
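
The count-based measures mentioned above can be illustrated with a minimal sketch. The function name
`lexical_counts` and the toy sentence are hypothetical and serve illustration only; MWF is taken here
as tokens per type and TTR as types per token, and the input is assumed to be a non-empty token list.

```python
from collections import Counter


def lexical_counts(tokens):
    """Return basic count-based diversity statistics for a non-empty token list."""
    freqs = Counter(tokens)
    n_tokens = sum(freqs.values())
    n_types = len(freqs)
    # Hapax legomena: types that occur exactly once.
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapaxes": hapaxes,
        "ttr": n_types / n_tokens,  # Type-Token Ratio
        "mwf": n_tokens / n_types,  # Mean Word Frequency
    }


# Hypothetical toy input; a real study would pass a tokenized corpus text.
stats = lexical_counts("the cat saw the dog and the bird".split())
print(stats)
```

All three quantities depend directly on what counts as a unique character string, which is why they
are sensitive to spelling variation.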
   There is, however, a practical problem that arises with any measure of diversity that relies on
hapaxes and/or unique types. In many digitized text corpora, the number of unique character
strings cannot be equated to the number of unique words. A substantial amount of variation
in how word types are represented in a corpus may be due to OCR errors (e.g., in historical
texts, the long S character <ſ> is often mistaken for <f> or <l>, which means the word type
strength could be represented in a corpus as at least three different character strings: <ſtrength>,
<ftrength> and <ltrength>). Furthermore, some types of corpora contain texts where authors
do not (consistently) adhere to (present-day) standard spelling conventions, such as corpora of
(informal) language on social media or any historical corpora that pre-date the establishment of
uniform spelling conventions. This introduces a dimension of variation that makes it difficult to
accurately count the number of actual hapax legomena or unique types. Of course, at least some
of this unwanted variation can be tackled in corpus pre-processing through (semi-)automated
spelling normalisation, but this too can prove challenging given that neither OCR errors nor
non-standard spelling variation are entirely or even largely systematic.
   In this paper, we argue that there are substantial advantages to relying on functional diversity
measures (rather than, or as a complement to, lexical diversity measures) to estimate and
compare the ‘lexical richness’ of (collections of) text. More specifically:

       • We demonstrate that functional diversity estimates are affected to a much lesser extent
         by spelling errors and inconsistencies than lexical diversity estimates. As such, there is
         a clear practical advantage to relying on functional diversity.
       • We suggest that, even in corpora that are free from orthographic noise, there is a
         theoretical advantage to examining higher-order diversity at the level of functional groups.
         We propose that a theoretically relevant distinction can be made when making claims
         about ‘vocabulary richness’ or lexical diversity by taking the semantic similarity between
         words into account.¹ This higher-order, functional-semantic dimension of diversity is
         theoretically relevant, as it helps characterize diversity in terms of depth and width, and
         offers a perspective on diversity that is not captured by more traditional, exclusively
         categorical measures.


¹ The distinction between lower-order and higher-order diversity proposed here is reminiscent of the distinction
  between ‘productivity’ and ‘schematicity’ in [21, 13].


2. Measuring Diversity
Functional Diversity. For our measurements of functional diversity in (historical) corpora,
we apply the framework of attribute diversity, which was originally developed in the context
of ecological diversity [8, 6]. In ecology too, it is important to not only account for categorical
diversity (the taxonomic model of species diversity), but also for attribute variation between
and within species. After all, certain species (e.g., ducks vs. geese) are more similar to each
other than others (e.g., ducks vs. sheep). This is not captured by taxonomic diversity, which
treats all species as equally distant.
   In the framework of attribute diversity [8, 6], categorical diversity is considered a special case
of functional diversity, where each type (or species) is considered its own functional group and
all groups are functionally equally different. In this extreme case, each functional difference
results in the definition of a new functional group which is equivalent to a categorical type.
More precisely, the threshold $\tau$ for defining a new functional group is set to the smallest
pairwise distance between types. The framework allows researchers to specify functional groups
at higher distinctiveness thresholds $\tau$. Here, $\tau$ specifies the distance threshold beyond which
types are considered equally distant and thus belong to different functional groups. As $\tau$ tends
to infinity, types become functionally indistinct and belong to the same functional group.
   Each type $i$ contributes to the frequency of a functional group. Let $n_i$ be the frequency of
type $i$ and $a_i$ the frequency of a functional group; then $v_i(\tau)$ can be defined as the proportional
contribution of type $i$ to a group for a given threshold level $\tau$. Functional diversity, then, is
defined as the sum over the proportional contributions $v_i(\tau)$ of each type $i = 1, 2, \ldots, S$:

\[
FD = \sum_{i=1}^{S} v_i(\tau), \tag{1}
\]

where $v_i(\tau) = n_i / a_i(\tau)$. Note that when each type belongs to its own functional group, i.e., when
the definitions of functional groups and types coincide, $v_i$ equals unity. In this case, $n_i = a_i(\tau)$
and thus the functional diversity is equal to the number of types $S$. When functional groups
and types do not coincide, certain functional groups consist of more than one type, which in
turn may belong to more than one group. To account for such many-to-many type-function
relations, the abundance $a_i$ at threshold $\tau$ is computed as the number of tokens of type $i$ plus a
fraction of the tokens of any other type $j$ that is functionally indistinct from type $i$:

\[
a_i(\tau) = n_i + \sum_{j \neq i}^{S} \left( 1 - \frac{d_{ij}(\tau)}{\tau} \right) n_j. \tag{2}
\]

Here, $d_{ij}(\tau)$ refers to the distance between types $i$ and $j$, which is set to $\tau$ if $d_{ij} > \tau$ and to $d_{ij}$
otherwise.
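
As a minimal sketch of Eqs. 1 and 2, the following function computes functional richness from type
frequencies and a pairwise distance matrix. The function name and the three-type toy data are
hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
def functional_richness(counts, dist, tau):
    """Functional richness FD (Eq. 1) at distinctiveness threshold tau > 0.

    counts: list of type frequencies n_i
    dist:   symmetric list-of-lists matrix of pairwise distances d_ij
    """
    n_types = len(counts)
    fd = 0.0
    for i in range(n_types):
        a_i = counts[i]
        for j in range(n_types):
            if j != i:
                d = min(dist[i][j], tau)           # truncate distances at tau
                a_i += (1 - d / tau) * counts[j]   # Eq. 2
        fd += counts[i] / a_i                      # v_i(tau) = n_i / a_i(tau)
    return fd


# Hypothetical toy data: two close types and one distant type.
counts = [6, 3, 1]
dist = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
print(functional_richness(counts, dist, 0.2))  # tau = smallest distance: one group per type
print(functional_richness(counts, dist, 0.5))  # the two close types partially merge
```

With `tau` set to the smallest pairwise distance, every truncated distance equals `tau`, so each
`a_i` reduces to `n_i` and the result is simply the number of types.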

Functional Hill numbers. Eq. 1 describes the functional richness of a collection, or the number
of functional groups given a distinctiveness threshold $\tau$. Richness is just one of many diversity
measures, and one that treats each functional group as equally important. However, certain
functional groups may be more prominent than others, which is better captured by other diversity
measures, like Shannon entropy or the Gini-Simpson index. To account for other aspects
of diversity, Chao and colleagues [8, 6] integrate functional diversity into a mathematically
unified family of diversity indices called Hill numbers [14]. Hill numbers are parameterized
only by $q$, which determines the sensitivity to the relative frequency $p_i$ of variant type $i$ [14, 7,
17, 9]:

\[
{}^{q}D(p_1, \ldots, p_S) = \left( \sum_{i=1}^{S} p_i^{q} \right)^{\frac{1}{1-q}} \tag{3}
\]

The diversity values at certain orders $q$ correspond to well-known diversity indices. The number
of unique types (also called the ‘richness’ of a sample) is equal to ${}^{0}D$. With $q = 0$, no weight
is given to the relative frequency of the types, or, conversely, maximum weight is given to rare
types. By setting $q$ to 1, the weight of each type is proportional to its relative frequency. Note,
however, that ${}^{1}D$ is undefined. Yet, the limit $\lim_{q \to 1} {}^{q}D$ exists, and it is equal to the exponential of
Shannon entropy [cf. 7, 17, 9]. With $q > 1$, disproportionally more weight is given to more
frequent types. For instance, the Hill number of order $q = 2$ is equal to the inverse of the
Simpson concentration index $\sum_{i} p_i^{2}$, which expresses the probability that two random tokens
are of the same type.
An interesting property of Hill numbers is that all diversity indices are expressed in terms of
the effective number of types: the number of equally frequent types required to obtain a
particular observed diversity value. Because the indices are on the same scale, they can easily be
represented in ‘diversity profiles’, which chart the diversity at different orders $q$. These profiles,
then, can be used to characterize the evenness of some collection. Profiles with steep declines
indicate a large disparity in the frequencies of the types, whereas flat profiles indicate a more
even distribution among types.
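
A short sketch of Eq. 3 may help to make the role of $q$ concrete. The function name and the example
proportions are hypothetical; the case $q = 1$ is handled through the limit, i.e., the exponential of
Shannon entropy.

```python
import math


def hill_number(props, q):
    """Hill number of order q (Eq. 3) for a list of relative frequencies."""
    props = [p for p in props if p > 0]  # absent types do not contribute
    if q == 1:
        # Limit for q -> 1: exponential of Shannon entropy.
        entropy = -sum(p * math.log(p) for p in props)
        return math.exp(entropy)
    return sum(p ** q for p in props) ** (1 / (1 - q))


# Hypothetical example: one frequent type and two rarer ones.
skewed = [0.5, 0.25, 0.25]
profile = [hill_number(skewed, q) for q in (0, 1, 2)]
print(profile)  # declines with q because the frequencies are uneven

# A perfectly even collection has a flat profile.
even = [0.25, 0.25, 0.25, 0.25]
print([hill_number(even, q) for q in (0, 1, 2)])
```

The declining values for the skewed example and the constant values for the even one correspond to
the steep and flat diversity profiles described above.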
   By incorporating functional diversity into the Hill number framework, Chao, Chiu and
colleagues [8, 6] show how to estimate the effective number of equally distinct functional groups
at a given distinctiveness threshold $\tau$ and diversity order $q$. The ‘effective number’, sometimes
called ‘true diversity’, represents the number of types in an idealized reference sample that all
have the same frequency and a distance between them of at least $\tau$. Expanding on Eq. 1, the
functional diversity of order $q$ is defined as follows:

\[
{}^{q}FD(\Delta(\tau)) = \left( \sum_{i=1}^{S} v_i(\tau) \left( \frac{a_i(\tau)}{n} \right)^{q} \right)^{\frac{1}{1-q}}, \tag{4}
\]

where $n$ refers to the total number of tokens in the collection.
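
Eq. 4 can be sketched along the same lines. The function below is an illustrative assumption rather
than the authors' code; in particular, the branch for $q = 1$ is not given in the text and is derived
here as the limit of Eq. 4 (using the fact that the terms $v_i(\tau)\, a_i(\tau)/n$ sum to one). The toy
data are hypothetical.

```python
import math


def functional_hill(counts, dist, tau, q):
    """Functional Hill number of order q (Eq. 4) at threshold tau > 0."""
    n = sum(counts)
    n_types = len(counts)
    terms = []
    for i in range(n_types):
        # Abundance of the functional group around type i (Eq. 2).
        a_i = counts[i] + sum(
            (1 - min(dist[i][j], tau) / tau) * counts[j]
            for j in range(n_types)
            if j != i
        )
        terms.append((counts[i] / a_i, a_i / n))  # (v_i, a_i / n)
    if q == 1:
        # Assumed limit for q -> 1, derived from Eq. 4.
        return math.exp(-sum(v * r * math.log(r) for v, r in terms))
    return sum(v * r ** q for v, r in terms) ** (1 / (1 - q))


# Hypothetical toy data: two close types and one distant type.
counts = [6, 3, 1]
dist = [
    [0.0, 0.2, 1.0],
    [0.2, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
for q in (0, 1, 2):
    print(q, functional_hill(counts, dist, 0.2, q), functional_hill(counts, dist, 0.5, q))
```

When `tau` equals the smallest pairwise distance, every type forms its own group and the values
coincide with the ordinary Hill numbers of the relative frequencies.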

Example. To obtain a better intuition of what functional diversity measures entail, and
specifically how the measure responds to the parameter $\tau$, we present the following example.²
Consider these four words and their corresponding frequencies: apricot ($n_1 = 20$), pineapple
($n_2 = 15$), digital ($n_3 = 10$), information ($n_4 = 5$). For each word, Table 1 lists whether
it co-occurs with any of ten context words. Each word can thus be represented as a binary
context vector, which can be used to compute the distance between two words.

Table 1
Co-occurrence table supporting the example in Figure 1, which illustrates how functional diversity can
be calculated at different levels of $\tau$.
                  boil   data   sugar   pizza   water   hat   tourist   kiosk   camera   photo
   apricot           1      0       1       0       1     0         0       0        0       0
   pineapple         1      0       1       1       1     1         0       0        0       0
   digital           0      1       0       0       0     0         0       0        1       1
   information       0      1       0       0       0     0         1       1        0       0

For example, computing the pairwise distances between all four words using the Jaccard distance
yields the following distance matrix $\Delta$ (rows and columns: apricot, pineapple, information, digital):

\[
\Delta =
\begin{pmatrix}
0   & 0.4 & 1   & 1   \\
0.4 & 0   & 1   & 1   \\
1   & 1   & 0   & 0.6 \\
1   & 1   & 0.6 & 0
\end{pmatrix}
\]

² Our example is a translation of [6] into a linguistic context.

     In Figure 1, we calculate functional diversity for different distinctiveness thresholds $\tau$. We
begin in the top row with $\tau = d_{\min}$, which is equal to the minimum distance between different
word types (i.e., intra-type distances are not considered in this example). At $d_{\min}$, $\tau$ equals
0.4, which means that word types with at least a distance of 0.4 between them are considered
functionally equally distant. This translates into truncating all distances greater than $\tau = 0.4$ to
0.4 in the distance matrix (cf. the matrix in Figure 1). In this scenario, each word is functionally
equally distant and thus each type has a proportional contribution $v_i$ of unity to its functional
group, or, in other words, each type makes up its own functional group. This is illustrated
in Figure 1 with the circles whose size is proportional to their frequency. The circles do not
overlap, which illustrates that they each comprise their own functional group. The functional
diversity at $q = 0$, then, is $FD = 4$, which is simply the sum over the proportional contributions
$v_i(\tau)$ of each type $i$ to a functional group (cf. Eq. 1).
     As the threshold value $\tau$ increases, an increasing number of types becomes functionally
indistinguishable. In other words, with higher values of $\tau$, functional groups consist of more
types. Chiu and Chao [8, 6] suggest using Rao’s quadratic entropy $Q$ for $\tau$, which is a
similarity-sensitive diversity measure representing the average distance between two randomly selected
instances in a collection [22]. $Q$, hereafter denoted as $d_{\text{mean}}$, is expressed as:

\[
Q = d_{\text{mean}} = \sum_{i=1}^{S} \sum_{j=1}^{S} d_{ij} \, p_i \, p_j, \tag{5}
\]

where $d_{ij}$ refers to the distance between types $i$ and $j$, and $p_i$ and $p_j$ to their relative frequencies.
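
For the four-word example, Eq. 5 can be evaluated directly from the frequencies and the distance
matrix given above. The function name `rao_q` is hypothetical, and the matrix is entered in the order
apricot, pineapple, digital, information (the values are unaffected by the order of the last two words).

```python
def rao_q(counts, dist):
    """Rao's quadratic entropy (Eq. 5): mean distance between two random tokens."""
    n = sum(counts)
    props = [c / n for c in counts]
    size = len(props)
    return sum(
        dist[i][j] * props[i] * props[j]
        for i in range(size)
        for j in range(size)
    )


# Frequencies and distances from the running example.
counts = [20, 15, 10, 5]  # apricot, pineapple, digital, information
dist = [
    [0.0, 0.4, 1.0, 1.0],
    [0.4, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.6],
    [1.0, 1.0, 0.6, 0.0],
]
d_mean = rao_q(counts, dist)
print(round(d_mean, 2))  # 0.54
```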
Figure 1: Example of how functional diversity is operationalized. The figure, inspired by [6], shows, for
increasing values of $\tau$ (i.e., $d_{\min}$, $d_{\text{mean}}$, $d_{\max}$, $d_{\infty}$), the corresponding truncated distance matrix
$\Delta(\tau)$, an illustration of the overlap between functional groups, the proportional contribution of
each word type to a functional group, and the total functional diversity.

    As shown in Figure 1, setting $\tau$ at $d_{\text{mean}} = 0.54$ decreases the functional diversity to
$FD = 3.58$. At the threshold of 0.54, apricot and pineapple become functionally less distinct,
contributing to a shared functional group (illustrated by the overlapping circles). By contrast,
at $\tau = d_{\text{mean}}$, digital and information remain functionally equally distant and as such belong to
their own functional group. Note that with $FD = 3.58$, the functional diversity at $\tau = d_{\text{mean}}$
is larger than 3. This is because the co-occurrence profiles of apricot and pineapple are
considered partially overlapping but not identical. When $\tau$ is set to the maximum distance in the
distance matrix ($\tau = d_{\max} = 1$), however, digital and information contribute to a shared
functional group. Note that when there are many different word types, $\tau = d_{\max}$ is less informative
than $\tau = d_{\text{mean}}$, because functional diversity is then often close to unity [cf. 6]. Finally, as $\tau$
tends to infinity, all words become part of the same functional group, which is expressed by a
functional diversity of $FD = 1$.
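
The value $FD = 3.58$ at $\tau = d_{\text{mean}} = 0.54$ can be retraced by hand from Eqs. 1 and 2. Only the
distance between apricot and pineapple (0.4) lies below the threshold; all other distances are
truncated to $\tau$ and therefore contribute nothing to the abundances. The worked arithmetic below is
a sketch with intermediate values rounded to two decimals.

```latex
\begin{align*}
a_1(0.54) &= 20 + \left(1 - \tfrac{0.4}{0.54}\right) \cdot 15 \approx 23.89, &
a_2(0.54) &= 15 + \left(1 - \tfrac{0.4}{0.54}\right) \cdot 20 \approx 20.19, \\
a_3(0.54) &= 10, &
a_4(0.54) &= 5, \\
FD &= \tfrac{20}{23.89} + \tfrac{15}{20.19} + \tfrac{10}{10} + \tfrac{5}{5}
    \approx 0.84 + 0.74 + 1 + 1 = 3.58.
\end{align*}
```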


3. Data and pre-processing
ARCHER Corpus. For our experiments, we use ARCHER 3.2 [28], a corpus of historical English
registers (3.3M words). The corpus covers a period of almost 400 years (1600-1999), and
contains texts from 12 different genres or registers: advertisements, drama, fiction, sermons,
journals, legal text, medicine, news, early prose, science, letters, and diaries. In terms of spelling,
ARCHER 3.2 contains the original spelling of published editions normalized with VARD2 [4].
In contrast to many other historical corpora, ARCHER 3.2 is a well-balanced, cleaned (and
relatively small) corpus, and hence it constitutes the ideal starting point for our experiment.

Simulating Errors. To mimic different degrees of text noise, we ‘pollute’ each text in the
clean ARCHER corpus by simulating errors. In this simulation procedure, each token of each
text is modified with a probability $p$. The modification involves replacing each letter by a
random ASCII letter with probability $s$. With $s = 0.2$, a word like diversity may, for instance, be
replaced with diversizy. We experiment with $p \in \{0, 0.1, 0.2, 0.35, 0.5, 0.75\}$, and chart the impact
of having a more distorted text on the stability of the diversity measures.³ In all experiments, $s$ is
set to 0.2. With this procedure, each text is manipulated five times per $p$ value. The reported
diversity measurements are computed by taking the mean diversity over these five different texts.
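
The noise procedure can be sketched as follows. This is an illustrative reading of the description
above, not the authors' script: the function name `pollute` is hypothetical, and it is assumed here
that replacement letters are drawn from lowercase ASCII, that non-alphabetic characters are left
untouched, and that a replacement may by chance reproduce the original letter.

```python
import random
import string


def pollute(tokens, p, s=0.2, rng=None):
    """Return a noisy copy of `tokens`.

    Each token is selected for modification with probability p; within a
    selected token, each letter is replaced by a random lowercase ASCII
    letter with probability s (assumed alphabet).
    """
    rng = rng or random.Random()
    noisy = []
    for token in tokens:
        if rng.random() < p:
            token = "".join(
                rng.choice(string.ascii_lowercase)
                if ch.isalpha() and rng.random() < s
                else ch
                for ch in token
            )
        noisy.append(token)
    return noisy


tokens = "functional diversity in historical corpora".split()
# Five noisy versions per p value, as in the procedure described above.
for p in (0, 0.1, 0.2, 0.35, 0.5, 0.75):
    versions = [pollute(tokens, p, rng=random.Random(seed)) for seed in range(5)]
    print(p, versions[0])
```

Diversity estimates would then be computed on each of the five versions and averaged per value of `p`.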

Embeddings. For the present study, we use token-based embeddings to obtain semantic similarity
estimates between the words in a given text. These embeddings are computed on the
basis of MacBERTh [20, 19], a Large Language Model that follows the architecture of BERT-base
uncased [10] and is pre-trained on historical English (1450-1950) using a custom vocabulary.
Token-based embeddings are expected to be more robust than type-based embeddings in the
presence of noise, since they take the sentential context in which the target word appears into
account. This means that they can associate (even lower-frequency) variants of the same word
with each other wherever the sentential context is expected to match. Moreover, thanks to its
built-in adaptive tokenization approach, MacBERTh is also able to compute embeddings for
words that were not seen during training, which is an invaluable feature for texts with large
amounts of orthographic variation.⁴
³ Note that texts resulting from $p = 0.75$ are perhaps less realistic than those resulting from lower values. To
  illustrate, the OCR error rate in Eighteenth Century Collections Online (ECCO) has been estimated at
  approximately 25% [15].
⁴ The purpose of this paper is to introduce the attribute diversity framework into lexical diversity research. As a
  first operationalization, we resorted to token-based embeddings, which is a theoretically sound choice (as these
  models are sensitive to the fact that words can have multiple meanings) that comes with certain practical
  advantages (with respect to lower-frequency and ‘unseen’ items). We are, however, interested in trying out other
  ways of operationalizing the concept of functional groups in future work. One possibility, for instance, would be
  to test and compare different implementations of semantic similarity, comparing type-based and token-based
  approaches.

Figure 2: Relative change in the number of functional groups after modifying texts with probability
$p$ with respect to their unaltered counterparts ($p = 0$). The left panel displays the results for $q = 0$
(corresponding to functional richness), and the two smaller panels on the right present results for higher
diversity orders (i.e., $q \in \{1, 2\}$), which put increasing weight on the frequency of word types.

   In order to compute the type-level distance matrix between all word types in a corpus, we
first compute the token embeddings of all words it contains.⁵ If the same word appears multiple
times in the input corpus, we compute a single embedding by averaging over the embeddings
of all occurrences. Finally, we rely on the cosine distance function in order to obtain a
distance value between 0 and 2.⁶
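
The aggregation steps described above (averaging sub-token vectors per occurrence, averaging
occurrences per word type, and computing cosine distances) can be sketched without reference to any
particular model. The function names and the two-dimensional toy vectors below are hypothetical; in
practice the vectors would come from a contextual language model such as MacBERTh.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 minus cosine similarity; ranges from 0 to 2."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def type_distance_matrix(occurrences):
    """occurrences: list of (word, sub_token_vectors), one entry per token occurrence."""
    by_type = defaultdict(list)
    for word, sub_vectors in occurrences:
        by_type[word].append(mean_vector(sub_vectors))  # average over sub-tokens
    types = sorted(by_type)
    type_vectors = [mean_vector(by_type[t]) for t in types]  # average over occurrences
    matrix = [[cosine_distance(u, v) for v in type_vectors] for u in type_vectors]
    return types, matrix


# Hypothetical toy embeddings (two dimensions for readability).
occurrences = [
    ("cat", [[1.0, 0.0]]),
    ("cat", [[0.0, 1.0]]),
    ("dog", [[1.0, 1.0], [3.0, 3.0]]),  # a word split into two sub-tokens
    ("tax", [[-1.0, -1.0]]),
]
types, matrix = type_distance_matrix(occurrences)
print(types)
print(matrix[0])
```

The resulting matrix can serve as the distance input for the functional diversity measures of Section 2.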


⁵ Note that due to the tokenization approach of MacBERTh, input words are often split into smaller units
  (sub-tokens). In order to compute a single embedding in such cases, we average over the embeddings of the
  different sub-tokens.
⁶ More specifically, the cosine distance is defined as 1 minus the cosine similarity of two given vectors, the latter
  being bounded between -1 and 1.

4. Results
4.1. Functional diversity is affected less by increased orthographic variation
Figure 2 shows the relative change in the number of functional groups after modifying texts
with a text modification probability $p$ with respect to their original, unmodified counterparts
($p = 0$). The left panel shows the values for $q = 0$ at three different thresholds of $\tau$. As expected,
the number of functional entities at $\tau = d_{\min}$ increases more or less
linearly with the probability $p$ of modifying words. Indeed, the probability of a modification
yielding an orthographically unique letter combination is high, and each unique combination
is taken to account for a new word type (i.e., a new functional group). The relative change
in the number of functional groups is much less strong for $\tau = d_{\text{mean}}$, where the number of
functional groups at extreme values of $p$ is still relatively close to the number of groups at $p = 0$.
Note that the same holds for $\tau = d_{\max}$, which also remains stable with larger values of noise.
However, as explained above, with $\tau = d_{\max}$, estimates are often close to unity, which makes
the stability of $\tau = d_{\max}$ less surprising. The two right panels present the same results for
higher diversity orders $q$. These plots show that when more weight is given to high-frequency
entities, functional diversity is also better able to cope with orthographic variation than lexical
diversity at $\tau = d_{\min}$.

Figure 3: Box plots showing the reduction from $d_{\min}$ to $d_{\text{mean}}$ in number of functional groups for
advertisements and fiction texts.

4.2. Functional diversity is a theoretically relevant complement to lexical
     diversity
To get a firmer grip on what could be gained from integrating functional diversity estimates
into discussions of lexical richness, we automatically identified text pairs with approximately
the same number of unique word types ($d_{\min}$), but a diverging number of functional groups at
$d_{\text{mean}}$. In each of these text pairs, one text is functionally less ‘condensed’, using the same
number of unique lexical items to cover a broader functional range. A commonly occurring type
of text pairing, in that respect, is that of a fiction text with a text containing a collection of
advertisements, where advertisements consistently cover a smaller number of functional groups
despite being as lexically diverse as the paired fiction text. The relatively strong reduction from
$d_{\min}$ to $d_{\text{mean}}$ in advertising, illustrated in Figure 3, is intuitive, as advertisements often present
a list of (functionally closely related) services and/or goods (see Figure 4), resulting in a more
condensed diversity that suggests depth rather than breadth. For fiction, by contrast, there is
no reason to expect a similar reduction.
Figure 4: Excerpts from an advertisement collection (left; filename ‘1860illn_a6b’) and fiction text
(right; filename ‘1891barr_f6b’) pairing with comparable estimates at $d_{\min}$ but diverging estimates at
$d_{\text{mean}}$.

   Pairings of two texts from the same genre also emerged. A telling example is the pairing of
Isabel Clarendon (1886), a fiction text by naturalist/realist author George Gissing, with Caprice
(1917) by Ronald Firbank (see Figure 5). The excerpts in the corpus from both texts have roughly
the same number of unique word types (Caprice: 1374 vs. Isabel Clarendon: 1377), but the types
in Caprice, a minimalist novel that, unlike realist work, predominantly consists of dialogue
and contains only limited descriptions of setting and character, cover a considerably smaller
number of functional groups at $\tau = d_{\text{mean}}$ (213 vs. 368). Interestingly, with 5260 word tokens,
the excerpt of Isabel Clarendon has a lower TTR than the excerpt of Caprice, which comprises
3753 tokens. Hence, the TTRs would suggest that Caprice covers more ground in fewer words.
The functional diversity estimate, however, paints a different picture, which adds a theoretically
relevant dimension to investigations into the lexical richness of texts.
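
The TTR comparison can be retraced from the type and token counts reported above, taking TTR as the
number of unique word types divided by the number of word tokens (values rounded to three decimals):

```latex
\begin{align*}
\mathrm{TTR}_{\text{Caprice}} &= \frac{1374}{3753} \approx 0.366, &
\mathrm{TTR}_{\text{Isabel Clarendon}} &= \frac{1377}{5260} \approx 0.262.
\end{align*}
```

The excerpt with the higher TTR (Caprice) is thus the one with the smaller number of functional groups
at $\tau = d_{\text{mean}}$ (213 vs. 368).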


Figure 5: Comparison of the Hill number profiles at $\tau = d_{\min}$ and $\tau = d_{\text{mean}}$ for two fiction texts, Isabel
Clarendon (1886) and Caprice (1917).


5. Conclusion
In this short paper, we introduce a way of incorporating the notion of functional diversity into
lexical diversity measurements in (historical) corpora by means of token-based embeddings.
Our experiment shows that considering lexical diversity at the level of functional groups has
the practical advantage of being less sensitive to orthographic noise in the data, and the
theoretical advantage of adding an important and often disregarded dimension (capturing depth vs.
breadth) of vocabulary diversity in textual data. As such, the framework of attribute diversity
commonly used in Ecology should be considered an important addition to the Computational
Humanities research toolkit.


Acknowledgments
The training of MacBERTh has been made possible by the Platform Digital Infrastructure (Social
Sciences and Humanities) fund (PDI-SSH). We thank Melvin Wevers (University of Amsterdam)
for his constructive feedback.


References
 [1] L. Anthonissen. Individuality in Language Change. Berlin, Boston: De Gruyter Mouton,
     2021. doi: doi:10.1515/9783110725841.
 [2] R. H. Baayen. “Corpus linguistics in morphology: Morphological productivity”. In: Cor-
     pus Linguistics: An International Handbook. Ed. by A. Lüdeling and M. Kytö. Vol. 2. Berlin,
     New York: De Gruyter Mouton, 2009, pp. 899–919. doi: doi:10.1515/9783110213881.2.899.
 [3] J. Barðdal. Productivity: Evidence from Case and Argument Structure in Icelandic. Amster-
     dam, Philadelphia: John Benjamins, 2008.


                                                   124
 [4] A. Baron and P. Rayson. “VARD2: A tool for dealing with spelling variation in historical
     corpora”. In: Postgraduate conference in corpus linguistics. 2008.
 [5] M. Brysbaert, M. Stevens, P. Mandera, and E. Keuleers. “How Many Words Do We Know?
     Practical Estimates of Vocabulary Size Dependent on Word De昀椀nition, the Degree of
     Language Input and the Participant’s Age”. In: Frontiers in Psychology 7 (2016). doi: 10.3
     389/fpsyg.2016.01116.
 [6] A. Chao, C.-H. Chiu, S. Villéger, I.-F. Sun, S. Thorn, Y.-C. Lin, J.-M. Chiang, and W. B.
     Sherwin. “An Attribute-diversity Approach to Functional Diversity, Functional Beta Di-
     versity, and Related (Dis)Similarity Measures”. In: Ecological Monographs 89.2 (2019). doi:
     10.1002/ecm.1343.
 [7] A. Chao, N. J. Gotelli, T. C. Hsieh, E. L. Sander, K. H. Ma, R. K. Colwell, and A. M. Elli-
     son. “Rarefaction and Extrapolation with Hill Numbers: A Framework for Sampling and
     Estimation in Species Diversity Studies”. In: Ecological Monographs 84.1 (2014), pp. 45–
     67.
 [8] C.-H. Chiu and A. Chao. “Distance-Based Functional Diversity Measures and Their De-
     composition: A Framework Based on Hill Numbers”. In: PLoS ONE 9.7 (2014). Ed. by F.
     de Bello, e100014. doi: 10.1371/journal.pone.0100014.
 [9] A. J. Daly, J. M. Baetens, and B. De Baets. “Ecological diversity: measuring the unmea-
     surable”. In: Mathematics 6.7 (2018), p. 119.
[10]   J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirec-
       tional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference
       of the North American Chapter of the Association for Computational Linguistics: Human
       Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Asso-
       ciation for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423.
[11]   B. Efron and R. Thisted. “Estimating the Number of Unseen Species: How Many Words
       Did Shakespeare Know?” In: Biometrika 63.3 (1976), p. 435. doi: 10.2307/2335721.
[12]   A. Ellegård. “Estimating Vocabulary Size”. In: Word 16.2 (1960), pp. 219–244.
       doi: 10.1080/00437956.1960.11659728.
[13]   L. Fonteyn and E. Manjavacas. “Adjusting scope: a computational approach to
       case-driven research on semantic change”. In: Proceedings of the Workshop on Computational
       Humanities Research (CHR 2021). Vol. 2989. CEUR Workshop Proceedings. Amsterdam,
       2021, pp. 280–298. url: http://ceur-ws.org/Vol-2989/long_paper26.pdf.
[14]   M. O. Hill. “Diversity and Evenness: A Unifying Notation and Its Consequences”. In:
       Ecology 54.2 (1973), pp. 427–432.
[15]   M. J. Hill and S. Hengchen. “Quantifying the Impact of Dirty OCR on Historical Text
       Analysis: Eighteenth Century Collections Online as a Case Study”. In: Digital Scholarship
       in the Humanities 34.4 (2019), pp. 825–843. doi: 10.1093/llc/fqz024.
[16]   D. L. Hoover. “Another Perspective on Vocabulary Richness”. In: Computers and the Hu-
       manities 37.2 (2003), pp. 151–178. doi: 10.1023/a:1022673822140.
[17]   L. Jost. “Entropy and diversity”. In: Oikos 113.2 (2006), pp. 363–375.


[18]   M. Kubát and J. Milička. “Vocabulary Richness Measure in Genres”. In: Journal of Quan-
       titative Linguistics 20.4 (2013), pp. 339–349. doi: 10.1080/09296174.2013.830552.
[19]   E. Manjavacas and L. Fonteyn. “Adapting vs. Pre-training Language Models for Historical
       Languages”. In: Journal of Data Mining & Digital Humanities NLP4DH (2022).
       doi: 10.46298/jdmdh.9152.
[20]   E. Manjavacas and L. Fonteyn. “MacBERTh: Development and Evaluation of a Histori-
       cally Pre-trained Language Model for English (1450-1950)”. In: Proceedings of the Work-
       shop on NLP4DH ICON 2021. online: NLP Association of India (NLPAI), 2021.
[21]   F. Perek. “Recent change in the productivity and schematicity of the way-construction: A
       distributional semantic analysis”. In: Corpus Linguistics and Linguistic Theory 14.1 (2018),
       pp. 65–97. doi: 10.1515/cllt-2016-0014.
[22]   C. R. Rao. “Diversity and dissimilarity coefficients: a unified approach”. In: Theoretical
       Population Biology 21.1 (1982), pp. 24–43.
[23]   A. Riba and J. Ginebra. “Diversity of vocabulary and homogeneity of literary style”. In:
       Journal of Applied Statistics 33.7 (2006), pp. 729–741. doi: 10.1080/02664760600708970.
[24]   H.-J. Schmid and A. Mantlik. “Entrenchment in Historical Corpora? Reconstructing Dead
       Authors’ Minds from their Usage Profiles”. In: Anglia 133.4 (2015), pp. 583–623.
       doi: 10.1515/ang-2015-0056.
[25]   J. Segbers and S. Schroeder. “How many words do children know? A corpus-based esti-
       mation of children’s total vocabulary size”. In: Language Testing 34.3 (2017), pp. 297–320.
       doi: 10.1177/0265532216641152.
[26]   C. E. Shannon. “A Mathematical Theory of Communication”. In: Mobile Computing and
       Communications Review 5 (I 1948), p. 53.
[27]   F. J. Tweedie and R. H. Baayen. “How Variable May a Constant be? Measures of Lexical
       Richness in Perspective”. In: Computers and the Humanities 32.5 (1998), pp. 323–352. doi:
       10.1023/a:1001749303137.
[28]   N. Yáñez-Bouza. ARCHER 3.2: A Representative Corpus of Historical English Registers.
       https://www.projects.alc.manchester.ac.uk/archer/wp-content/uploads/2020/06/ARCHER_poster.pdf.
       2013.



</pre>