1. Introduction

MakeItSample: a Python Library for Generating Typological Language Samples Based on the Diversity Value Metric

Luca Brigada Villa

Dipartimento di Studi Umanistici, Università di Pavia, Piazza del Lino, 2 - 27100 - Pavia, Italy

2025

This paper presents makeitsample, a Python library for generating typological language samples based on the diversity value (DV) metric. The library handles the construction of hierarchical language family trees from a set of CSV files, the calculation of diversity values for each node in the trees, and the selection of languages based on their weight within the tree. The library aims to ease the process of creating typological language samples by providing an automated, scalable, and reproducible solution.

Keywords: typology, sampling, diversity value, language family tree, typological databases

1. Introduction

Linguistic typology is the study of structural patterns and variation across the world’s languages [1, 2]. Since there are over 7,000 known languages [3], full coverage of linguistic diversity in typological studies is unfeasible. Instead, researchers rely on language samples — subsets of languages selected to represent the world’s linguistic diversity as accurately as possible [4, 5]. However, the way these samples are constructed greatly impacts the validity of typological generalizations, as biased sampling can distort conclusions about universal tendencies and linguistic variation [6].

Several sampling strategies have been developed to improve representativeness in typological studies. Random sampling is a straightforward method, but it risks including many closely related languages, reducing genealogical and areal diversity [5, 7]. Stratified sampling mitigates this issue by ensuring balanced representation across language families and geographic regions [8], yet defining appropriate strata remains a challenge. For instance, genealogical classification varies between databases such as Glottolog [3] and Ethnologue [9], leading to inconsistencies in sampling.

Another approach is diversity-based sampling, which prioritizes structurally diverse languages rather than simply ensuring equal representation across language families or regions [6]. This method focuses on maximizing linguistic variation within a sample, making it particularly useful for detecting cross-linguistic patterns [10]. While promising, current implementations of diversity-based sampling often lack computational automation and clear reproducibility, limiting their practical application.

Despite efforts to refine sampling methods, typological research remains susceptible to several biases [11]:

• Bibliographic bias: since typological studies rely on existing descriptions, well-documented languages are favored over lesser-described or endangered languages [12]. In addition, the quality of the descriptions may affect the results of the typological analysis, as some grammars may have been written with a specific theoretical framework in mind, or may have been written in the past and not updated to reflect current linguistic theories.

• Genetic bias: samples may be unbalanced due to the overrepresentation of some language families, leading to an underestimation of linguistic diversity [4, 7].

• Areal bias: some geographic regions (e.g., Europe) are disproportionately represented in typological databases compared to highly diverse but underdocumented areas such as New Guinea and the Amazon [13, 14].

• Typological bias: this bias occurs when a sample contains a disproportionate number of languages with similar typological features, leading to overgeneralizations about linguistic universals [6]. For example, if a sample contains a large number of SVO languages, researchers may conclude that SVO is the most common word order across languages, or that a feature associated with this order (e.g., adjective-noun order) is the most common across languages, even if this is not the case. This bias can also occur when researchers focus on a specific typological feature (e.g., case marking) and select languages that exhibit that feature.

• Cultural bias: this bias occurs when language samples underrepresent the world’s cultural and linguistic diversity. It relates to the idea of linguistic relativity, the notion that language can influence how people think and perceive the world [15, 16]. While early theories assumed a strong, deterministic link, more recent research treats the connection between language and thought as testable. For instance, Lucy [17] showed that speakers of languages with obligatory number marking perceive and categorize objects differently than speakers of classifier languages, illustrating how grammatical structures can reflect cultural patterns.

2. Methodology

In this section, I describe the methodology behind the diversity value (DV) metric and the sampling algorithm. I first introduce the family tree representation used to model genetic relationships between languages (Section 2.1). Then, I explain how DVs are calculated for each language family and subgroup (Section 2.2). Finally, I detail the sampling algorithm that selects languages based on their weight within the tree (Section 2.3).

2.1. The Family Tree Representation

2.2. Calculating the Diversity Value (DV)

The diversity value (DV) metric quantifies the structural diversity of a language family or subgroup based on the topological properties of its family tree. This metric was first introduced by Rijkhoff and Bakker [18] and later refined by Bakker [11] as a way to maximize the typological diversity of languages in a sample. The calculation involves the following steps:

1. Breadth-First Search (BFS): starting from a given node for which we want to calculate the DV (henceforth “root”), perform a BFS to determine the level of each node in the tree. The level of a node is the number of edges from the root to that node.

2. Level Counts: calculate the number of nodes at each level. This helps in understanding the distribution of nodes across different levels of the tree.

3. Contributions Calculation: for each level, calculate the contribution to the DV. The contribution of a level is determined by the number of nodes at that level and their distance from the starting node. The contributions are accumulated as we move from the root to the leaves of the tree. The contribution of level $l$ can be calculated as:

$$C_l = C_{l-1} + (N_l - N_{l-1}) \times \frac{L - (l - 1)}{L}$$

where $C_{l-1}$ is the contribution of the level above (with the contribution of the root level set to 0), $N_l$ is the number of nodes at level $l$, $N_{l-1}$ is the number of nodes at the level above, and $L$ is the maximum number of levels in the forest. If we are calculating the DV for the root of the family tree, then $L$ is the maximum number of levels in any tree in the forest. If we are calculating the DV for a subgroup, then $L$ is the maximum number of levels in the sibling trees of the tree rooted at the subgroup (including the subgroup tree).

Sometimes, family trees are shaped like the left-hand tree in Figure 2, in which a branch stops at a certain level without reaching the bottom of the tree (see the group 1 branch in Figure 2). If we applied the previous formula, we would get a negative factor while calculating the contribution of the bottom level, since $N_l$ would be lower than $N_{l-1}$. To avoid this, we add a number of pseudo-nodes to the tree (x nodes in Figure 2), so that the number of nodes at each level is always greater than or equal to the number of nodes at the level above. This is done by adding a number of pseudo-nodes equal to the difference between the number of nodes at the level above and the number of nodes at the current level. The pseudo-nodes are not included in the final sample, but they are necessary to ensure that the contributions are calculated correctly. The pseudo-nodes are added only to the levels that are not the last level of the tree. This way, we can ensure that the contributions are always positive and that the DV is calculated correctly.

4. Mean of Contributions: the DV is the mean of the contributions calculated in the previous step. This average value represents the structural diversity of the language family or subgroup. The DV can be expressed as:

$$DV = \frac{1}{D} \sum_{l=1}^{D} C_l$$

where $D$ is the depth of the tree rooted at the node for which we are calculating the DV, and $C_l$ is the contribution of level $l$.

For language isolates, the DV is set arbitrarily to 1 (as suggested by Rijkhoff and Bakker [19]), in order to avoid assigning a value of 0 to these languages and to ensure that they get the chance to be selected in the sampling algorithm.

By following these steps, we can compute the DV for any node in the family tree (except for nodes representing languages, which are not structurally diverse in the tree). The DV metric provides a principled way to quantify the typological diversity of languages and guide the selection process in the sampling algorithm.

As an example, let us consider the forest in Figure 2 and suppose that we want to calculate the DV of family 1. The first step is to define $L$, i.e. the maximum number of levels under the root node in the forest. In this case, $L = 3$. Then, we proceed to calculate the contributions of each level. For the first level, i.e. the one including group 1 and group 2, we have $N_1 = 2$ and $N_0 = 1$. $C_0$ is set to 0, so we have:

$$C_1 = 0 + (2 - 1) \times \frac{3 - (1 - 1)}{3} = 0 + (1 \times 1) = 1.$$
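The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's own code: it stores the tree as a plain parent-to-children dictionary rather than a networkx graph, the function names are hypothetical, pseudo-nodes are modelled simply by padding each level's count up to the count of the level above (one reading of the description), and the mean is taken over the levels below the root.

```python
from collections import Counter, deque


def level_counts(children, root):
    """Steps 1-2: breadth-first search from `root`, counting nodes per level."""
    counts = Counter()
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        counts[level] += 1
        for child in children.get(node, []):
            queue.append((child, level + 1))
    return counts


def level_contributions(children, root, max_levels):
    """Step 3: C_l = C_{l-1} + (N_l - N_{l-1}) * (L - (l - 1)) / L."""
    counts = level_counts(children, root)
    depth = max(counts)
    contributions = []
    prev_c, prev_n = 0.0, counts[0]  # C_0 = 0, N_0 = 1 (the root itself)
    for level in range(1, depth + 1):
        # Assumption: pseudo-nodes pad a level up to the size of the level above.
        n = max(counts[level], prev_n)
        c = prev_c + (n - prev_n) * (max_levels - (level - 1)) / max_levels
        contributions.append(c)
        prev_c, prev_n = c, n
    return contributions


def diversity_value(children, root, max_levels):
    """Step 4: mean of the level contributions; childless nodes get DV = 1."""
    contributions = level_contributions(children, root, max_levels)
    if not contributions:
        return 1.0  # convention for isolates described in the text
    return sum(contributions) / len(contributions)


# Only the first level of the worked example: a root with two groups, L = 3.
toy = {"family 1": ["group 1", "group 2"]}
print(level_contributions(toy, "family 1", max_levels=3))  # [1.0]
```

The printed contribution reproduces the value of $C_1$ obtained in the worked example; deeper levels would need the full shape of the tree in Figure 2, which is not reproduced here.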

This algorithm can be applied to any node in the family tree. If we want to calculate the DV of a subgroup, we can simply set $L$ to the maximum number of levels in the sibling trees of the tree rooted at the subgroup (including the subgroup tree). For example, if we want to calculate the DV of group 1, we can set $L = 2$ (since the maximum number of levels in the sibling trees is 2). Then, we can calculate the contributions as before, without considering the pseudo-nodes. The full calculation of the DV of this node and all the other nodes in the forest is not shown here for the sake of brevity, but it can be found in Appendix A.

2.3. The Sampling Algorithm

3. Implementation

In this section, I describe the implementation of makeitsample, outlining the dependencies it utilizes (Section 3.1), and the two modules of the library: language_family_tree (Section 3.2) and forest (Section 3.3). I also provide an overview of the command-line interface (Section 3.4) and the structure of the input data (Section 3.4.1).

3.1. Libraries

The modules rely mainly on two libraries: pandas [22, 23] for data manipulation and networkx [24] for graph representation and algorithms. The pandas library is used to read the input data and construct the family tree. The networkx library is used to represent the family tree as a graph and perform graph-based operations such as BFS traversal and DV calculation.

3.2. The language_family_tree Module

The language_family_tree module is responsible for constructing the family trees from the input data. It reads the CSV files and creates a hierarchical structure representing the genetic relationships between languages. It consists of a class called LanguageFamilyTree that inherits from the networkx.DiGraph class. This class represents the family tree as a directed graph, where each node corresponds to a language family, subgroup or language, and edges represent parent-child relationships. The class provides methods for building the tree from a CSV input (formatted as described in Section 3.4.1), for exporting the tree to a JSON or CSV file, for converting it to a dictionary, for calculating the diversity values of the nodes and for selecting a certain number of languages from the tree according to the sampling algorithm described in Section 2.3.

When importing the data, a function of LanguageFamilyTree refines the structure of the tree in order to avoid structures that the sampling algorithm could not process. This occurs when a subgroup contains both languages and other subgroups as children. To address this, an additional level is introduced in the tree to separate the languages from the subgroups. This is achieved by creating new nodes that become parents to each language and children to the node that was previously their parent, as shown in Figure 5. This ensures the structure remains a tree, allowing the sampling algorithm to function correctly.

3.3. The forest Module

The forest module is responsible for managing multiple family trees and performing operations on them. It consists of a class called Forest that inherits from the list class. This class represents a collection of family trees and provides methods for reading a set of CSV files representing family trees from a directory, adding new LanguageFamilyTree objects to the forest, exporting the forest to a set of JSON or CSV files, calculating the diversity values of the trees in the forest, and selecting languages from the forest according to the sampling algorithm.

3.4. Command-Line Interface

The command-line interface (CLI) of makeitsample is designed to be user-friendly and allows users to easily run the sampling pipeline from the command line. To run the pipeline, users can use the following command:

    makeitsample [-h] [-n N] [-i INPUT] [-o OUTPUT] [-f {csv,json}] [-s SAMPLENAME] [-r RANDOM_SEED]

where N is the sample size, INPUT is the input directory containing the CSV files, OUTPUT is the output directory where the sample will be saved, -f is the output format (csv or json), SAMPLENAME is the name of the sample file, and RANDOM_SEED is the random seed for reproducibility.

3.4.1. Structure of the Input Data

In order to run makeitsample, the input data must be in CSV format (as in the example in Table 1 in Appendix B). The CSV files (one for each language family) should contain:

• id: a column for the unique identifier of the language (e.g., ISO code), of the family or the group;
• name: a column storing the name of the language, of the family or the group;
• parent_id: a column storing the id of the parent node in the family;
• type: a column storing the type of the node (the only allowed values for this column are family, group or language).

The user can also add other columns with additional information about the languages, families or groups. makeitsample will ignore these columns when constructing the family tree, but they will be included in the output file.

4. Conclusions

In this paper, I presented makeitsample, a Python package that aims to ease the generation of typological language samples based on the diversity value (DV) metric. I presented the modules of the library and the command-line interface, which allow users to construct a set of hierarchical language family trees, to calculate diversity values for each node, and to select languages based on their weight within the tree. By automating the sampling process and accounting for linguistic diversity, the library and the command-line interface provide a principled and scalable solution to generating language samples for typological studies, helping researchers create more representative samples and reduce genealogical biases in their analyses.

The library is designed to be flexible and extensible, allowing researchers to adapt it to their specific needs and incorporate additional sampling strategies or metrics. Although user-friendly, the library is still in its early stages and requires some knowledge of Python, or at least some familiarity with the command line, to be used effectively. This might be a limitation for some users, and the plan is to create a web interface to make it more accessible to a wider audience.

References

[1] B. Comrie, Language Universals and Linguistic Typology: Syntax and Morphology, University of Chicago Press, Chicago, 1989.
[2] W. Croft, Typology and Universals, Cambridge Textbooks in Linguistics, 2 ed., Cambridge University Press, Cambridge, 2002.
[3] H. Hammarström, R. Forkel, M. Haspelmath, S. Bank, Glottolog 5.1, 2024. URL: http://glottolog.org. doi:10.5281/zenodo.14006617.
[4] M. S. Dryer, Large linguistic areas and language sampling, Studies in Language. International Journal sponsored by the Foundation “Foundations of Language” 13 (1989) 257–292. URL: https://www.jbe-platform.com/content/journals/10.1075/sl.13.2.03dry. doi:https://doi.org/10.1075/sl.13.2.03dry.
[5] R. D. Perkins, Statistical techniques for determining language sample size, Studies in Language. International Journal sponsored by the Foundation “Foundations of Language” 13 (1989) 293–315. URL: https://www.jbe-platform.com/content/journals/10.1075/sl.13.2.04per. doi:https://doi.org/10.1075/sl.13.2.04per.
[6] B. Bickel, Distributional typology: Statistical inquiries into the dynamics of linguistic diversity, in: B. Heine, H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis, 2nd ed., Oxford University Press, Oxford, 2015, pp. 901–923.
[7] J. Nichols, Linguistic Diversity in Space and Time, University of Chicago Press, Chicago, 1999.
[8] M. S. Dryer, The Greenbergian word order correlations, Language: Journal of the Linguistic Society of America 68 (1992) 81–138.
[9] D. M. Eberhard, G. F. Simons, C. D. Fennig, Ethnologue: Languages of the World, 28th ed., SIL International, Dallas, Texas, 2025. URL: http://www.ethnologue.com.
[10] M. A. Cysouw, Quantitative methods in typology, in: R. Köhler, G. Altmann, R. G. Piotrowski (Eds.), Quantitative Linguistics: An International Handbook, De Gruyter, Berlin; New York, 2005, pp. 554–578.
[11] D. Bakker, Language sampling, in: J. J. Song (Ed.), The Oxford Handbook of Linguistic Typology, Oxford University Press, Oxford, UK, 2010, pp. 100–128. URL: https://doi.org/10.1093/oxfordhb/9780199281251.013.0007. doi:10.1093/oxfordhb/9780199281251.013.0007, online edition published on Oxford Academic, 18 Sept. 2012.
[12] N. Evans, S. C. Levinson, The myth of language universals: Language diversity and its importance for cognitive science, Behavioral and Brain Sciences 32 (2009) 429–448. URL: https://doi.org/10.1017/S0140525X0999094X. doi:10.1017/S0140525X0999094X.
[13] B. Bickel, Typology in the 21st century: Major current developments, Linguistic Typology 11 (2007) 239–251. URL: https://doi.org/10.1515/LINGTY.2007.018. doi:10.1515/LINGTY.2007.018.
[14] T. Güldemann, The Languages and Linguistics of Africa, De Gruyter Mouton, Berlin; Boston, 2018. URL: https://doi.org/10.1515/9783110421668. doi:10.1515/9783110421668.
[15] E. Sapir, Selected Writings in Language, Culture, and Personality, University of California Press, Berkeley, CA, 1949.
[16] B. L. Whorf, Language, Thought, and Reality: Se

A. Full Calculation of the DV for the Example in Figure 2

tree 1

$L = 1$ (sibling of group 5)

• $C_0 = 0$ (contribution of the root level)

Node count:
• $N_0 = 1$ (number of nodes at level 0)
• $N_1 = 1$ (number of nodes at level 1)

It behaves like a language isolate, so we set $DV = 1$.

B. Example of input CSV file

id            name
Afro-Asiatic  Afro-Asiatic
36            Berber
1793          Awjila-Sokna
1063          Eastern
1064          Siwa
37            Northern
1704          Atlas
gnc           Guanche
auj           Awjilah
swn           Sawknah
siz           Siwi
cnu           Chenoua
jbe           Judeo-Berber
shi           Tachelhit
tzm           "Tamazight, Central Atlas"
zgh           "Tamazight, Standard Moroccan"
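As an illustration of how a file with the four required columns (id, name, parent_id, type) might be processed, the sketch below reads such a CSV into a parent-to-children mapping and then applies the refinement described in Section 3.2, where each language that shares a parent with subgroups receives its own intermediate node. It is a hedged example only: the library itself relies on pandas and networkx, whereas this sketch uses the standard csv module; the function names, the `_group` suffix for new nodes, and the toy rows (including their parent links) are all hypothetical and do not come from the library or from the table above.

```python
import csv
import io

ALLOWED_TYPES = {"family", "group", "language"}

# Hypothetical toy rows: ids, names and parent links are invented for illustration.
TOY_CSV = """id,name,parent_id,type
fam,Toy family,,family
g1,Group one,fam,group
g2,Group two,g1,group
aaa,Language A,g1,language
bbb,Language B,g2,language
"""


def read_family(handle):
    """Return (nodes, children) from a CSV with id, name, parent_id and type columns."""
    nodes, children = {}, {}
    for row in csv.DictReader(handle):
        if row["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unexpected node type: {row['type']!r}")
        nodes[row["id"]] = row  # any extra columns stay attached to the row
        children.setdefault(row["id"], [])
        if row["parent_id"]:
            children.setdefault(row["parent_id"], []).append(row["id"])
    return nodes, children


def separate_mixed_children(nodes, children):
    """Wrap each language in a new node when its parent also has subgroups."""
    for parent in list(children):
        kids = children[parent]
        kinds = {nodes[kid]["type"] for kid in kids}
        if not {"group", "language"} <= kinds:
            continue  # this parent does not mix languages and subgroups
        for i, kid in enumerate(kids):
            if nodes[kid]["type"] == "language":
                wrapper = f"{kid}_group"  # hypothetical naming scheme
                nodes[wrapper] = {"id": wrapper, "name": nodes[kid]["name"],
                                  "parent_id": parent, "type": "group"}
                nodes[kid]["parent_id"] = wrapper
                children[wrapper] = [kid]
                kids[i] = wrapper
    return nodes, children


nodes, children = separate_mixed_children(*read_family(io.StringIO(TOY_CSV)))
print(children["g1"])  # ['g2', 'aaa_group']
```

After the refinement, every child of `g1` is a group, so languages only ever appear below nodes whose children are all languages.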

Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Grammar and spelling check and Citation management. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s