<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MakeItSample: a Python Library for Generating Typological Language Samples Based on the Diversity Value Metric</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Brigada Villa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Studi Umanistici, Università di Pavia</institution>
          ,
          <addr-line>Piazza del Lino, 2 - 27100 - Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents makeitsample, a Python library for generating typological language samples based on the diversity value (DV) metric. The library handles the construction of hierarchical language family trees from a set of CSV files, the calculation of diversity values for each node in the trees, and the selection of languages based on their weight within the tree. The library aims to ease the process of creating typological language samples by providing an automated, scalable, and reproducible solution.</p>
      </abstract>
      <kwd-group>
        <kwd>typology</kwd>
        <kwd>sampling</kwd>
        <kwd>diversity value</kwd>
        <kwd>language family tree</kwd>
        <kwd>typological databases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Linguistic typology is the study of structural patterns
and variation across the world’s languages [1, 2]. Since
there are over 7,000 known languages [3], full coverage
of linguistic diversity in typological studies is unfeasible.
Instead, researchers rely on language samples — subsets
of languages selected to represent the world’s linguistic
diversity as accurately as possible [4, 5]. However, the
way these samples are constructed greatly impacts the
validity of typological generalizations, as biased sampling
can distort conclusions about universal tendencies and
linguistic variation [6].</p>
      <p>Several sampling strategies have been developed to
improve representativeness in typological studies. Random
sampling is a straightforward method, but it risks
including many closely related languages, reducing
genealogical and areal diversity [5, 7]. Stratified sampling mitigates
this issue by ensuring balanced representation across
language families and geographic regions [8], yet defining
appropriate strata remains a challenge. For instance,
genealogical classification varies between databases such
as Glottolog [3] and Ethnologue [9], leading to
inconsistencies in sampling.</p>
      <p>Another approach is diversity-based sampling, which
prioritizes structurally diverse languages rather than
simply ensuring equal representation across language
families or regions [6]. This method focuses on maximizing
linguistic variation within a sample, making it
particularly useful for detecting cross-linguistic patterns [10].
While promising, current implementations of
diversity-based sampling often lack computational automation and
clear reproducibility, limiting their practical application.</p>
      <p>Despite efforts to refine sampling methods, typological
research remains susceptible to several biases [11]:
• Bibliographic bias: since typological studies rely
on existing descriptions, well-documented
languages are favored over lesser-described or
endangered languages [12]. In addition to this, the
quality of the descriptions may affect the results
of the typological analysis, as some grammars
may have been written with a specific
theoretical framework in mind, or been written in the
past and not updated to reflect current linguistic
theories.
• Genetic bias: samples may be unbalanced due to
the overrepresentation of some language
families, leading to an underestimation of linguistic
diversity [4, 7].
• Areal bias: some geographic regions (e.g., Europe)
are disproportionately represented in typological
databases compared to highly diverse but
underdocumented areas such as New Guinea and the
Amazon [13, 14].
• Typological bias: this bias occurs when a
sample contains a disproportionate number of
languages with similar typological features, leading
to overgeneralizations about linguistic universals
[6]. For example, if a sample contains a large
number of SVO languages, it may lead researchers to
conclude that SVO is the most common word
order across languages or that a feature associated
with this order (e.g. adjective-noun order) is the
most common across languages, even if this is
not the case. This bias can also occur when
researchers focus on a specific typological feature
(e.g., case marking) and select languages that
exhibit that feature.
• Cultural bias: this bias occurs when language
samples underrepresent the world’s cultural and
linguistic diversity. It relates to the idea of
linguistic relativity—the notion that language can
influence how people think and perceive the world
[15, 16]. While early theories assumed a strong,
deterministic link, more recent research treats
the connection between language and thought
as testable. For instance, Lucy [17] showed that
speakers of languages with obligatory number
marking perceive and categorize objects
differently than speakers of classifier languages,
illustrating how grammatical structures can reflect
cultural patterns.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In this section, I describe the methodology behind the
diversity value (DV) metric and the sampling algorithm.
I first introduce the family tree representation used to
model genetic relationships between languages (Section
2.1). Then, I explain how DVs are calculated for each
language family and subgroup (Section 2.2). Finally, I detail
the sampling algorithm that selects languages based on
their weight within the tree (Section 2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. The Family Tree Representation</title>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Calculating the Diversity Value (DV)</title>
        <p>The diversity value (DV) metric quantifies the structural
diversity of a language family or subgroup based on the
topological properties of its family tree. This metric was
first introduced by Rijkhoff and Bakker [18] and later
refined by Bakker [11] as a way to maximize the
typological diversity of languages in a sample. The calculation
involves the following steps:
1. Breadth-First Search (BFS): starting from the node for which we want to calculate the DV (henceforth “root”), perform a BFS to determine the level of each node in the tree. The level of a node is the number of edges from the root to that node.
2. Level Counts: calculate the number of nodes at each level. This shows how the nodes are distributed across the different levels of the tree.
3. Contributions Calculation: for each level, calculate the contribution to the DV. The contribution of a level is determined by the number of nodes at that level and their distance from the starting node. The contributions are accumulated as we move from the root to the leaves of the tree. The contribution $C_k$ of level $k$ can be calculated as:
$$C_k = C_{k-1} + (N_k - N_{k-1}) \times \frac{L - (k - 1)}{L}$$
where $C_{k-1}$ is the contribution of the level above (with the contribution of the root level set to 0), $N_k$ is the number of nodes at level $k$, $N_{k-1}$ is the number of nodes at the level above, and $L$ is the maximum number of levels in the forest. If we are calculating the DV for the root of the family tree, then $L$ is the maximum number of levels in any tree in the forest. If we are calculating the DV for a subgroup, then $L$ is the maximum number of levels in the sibling trees of the tree rooted at the subgroup (including the subgroup tree).
Sometimes, family trees are shaped like the left-hand tree in Figure 2, in which a branch of the tree stops at a certain level without reaching the bottom of the tree (see the group 1 branch in Figure 2). If we applied the previous formula, we would get a negative factor while calculating the contribution of the bottom level, since $N_k$ would be lower than $N_{k-1}$. To avoid this, we add a number of pseudo-nodes to the tree (x nodes in Figure 2), so that the number of nodes at each level is always greater than or equal to the number of nodes at the level above. This is done by adding a number of pseudo-nodes equal to the difference between the number of nodes at the level above and the number of nodes at the current level. The pseudo-nodes are not included in the final sample, but they are necessary to ensure that the contributions are calculated correctly. The pseudo-nodes are added only to the levels that are not the last level of the tree. This way, we can ensure that the contributions are always positive and that the DV is calculated correctly.
4. Mean of Contributions: the DV is the mean of the contributions calculated in the previous step. This average value represents the structural diversity of the language family or subgroup. The DV can be expressed as:
$$\mathrm{DV} = \frac{1}{D} \sum_{k=1}^{D} C_k$$
where $D$ is the depth of the tree rooted at the node for which we are calculating the DV, and $C_k$ is the contribution of level $k$.
For language isolates, the DV is set arbitrarily to 1 (as suggested by Rijkhoff and Bakker [19]), in order to avoid assigning a value of 0 to these languages and to ensure that they get the chance to be selected in the sampling algorithm.</p>
        <p>By following these steps, we can compute the DV for any node in the family tree (except for nodes representing languages, which are not structurally diverse in the tree). The DV metric provides a principled way to quantify the typological diversity of languages and guide the selection process in the sampling algorithm.</p>
        <p>As an example, let us consider the forest in Figure 2 and suppose that we want to calculate the DV of family 1. The first step is to define $L$, i.e. the maximum number of levels under the root node in the forest. In this case, $L = 3$. Then, we proceed to calculate the contributions of each level. For the first level, i.e. the one including group 1 and group 2, we have $N_1 = 2$ and $N_0 = 1$. $C_0$ is set to 0, so we have:
$$C_1 = C_0 + (2 - 1) \times \frac{3 - (1 - 1)}{3} = 0 + (1 \times 1) = 1.$$</p>
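        <p>The four steps above can be sketched in a few lines of plain Python. This is only an illustrative reading of the procedure, not the code of makeitsample: the function names are hypothetical, the tree is a plain dictionary of children, pseudo-nodes are modelled by never letting a level count fall below the count of the level above, the mean is taken over the levels below the root, and a node with no branching at all is treated like an isolate.</p>
        <preformat>
```python
from collections import deque


def level_counts(children, root):
    """Steps 1-2: BFS from root, returning [N_0, N_1, ...] (nodes per level)."""
    counts = []
    queue = deque([(root, 0)])
    while queue:
        node, level = queue.popleft()
        if level == len(counts):
            counts.append(0)
        counts[level] += 1
        for child in children.get(node, []):
            queue.append((child, level + 1))
    return counts


def contributions(counts, max_levels):
    """Step 3: C_k = C_{k-1} + (N_k - N_{k-1}) * (L - (k - 1)) / L, with C_0 = 0."""
    padded = [counts[0]]
    for n in counts[1:]:
        # Assumed pseudo-node rule: a level never holds fewer nodes than the one above.
        padded.append(max(n, padded[-1]))
    result, previous = [], 0.0
    for k in range(1, len(padded)):
        factor = (max_levels - (k - 1)) / max_levels
        previous = previous + (padded[k] - padded[k - 1]) * factor
        result.append(previous)
    return result


def diversity_value(counts, max_levels):
    """Step 4: mean of the contributions; 1 when the node never branches."""
    if all(n == 1 for n in counts):
        return 1.0  # isolate convention described in the text
    contribs = contributions(counts, max_levels)
    return sum(contribs) / len(contribs)


# Hypothetical toy tree (not the forest of Figure 2).
toy = {"family": ["g1", "g2"], "g1": ["a"], "g2": ["b", "c"]}
counts = level_counts(toy, "family")
print(counts, contributions(counts, 2), diversity_value(counts, 2))
# prints: [1, 2, 3] [1.0, 1.5] 1.25
```
        </preformat>
        <p>Calling contributions([1, 2], 3) reproduces the first step of the worked example, returning a single contribution of 1.</p>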
      </sec>
      <sec id="sec-2-3">
        <title>2.3. The Sampling Algorithm</title>
        <p>This algorithm can be applied to any node in the family tree. If we want to calculate the DV of a subgroup, we can simply set $L$ to the maximum number of levels in the sibling trees of the tree rooted at the subgroup (including the subgroup tree). For example, if we want to calculate the DV of group 1, we can set $L = 2$ (since the maximum number of levels in the sibling trees is 2). Then, we can calculate the contributions as before, without considering the pseudo-nodes. The full calculation of the DV of this node and all the other nodes in the forest is not shown here for the sake of brevity, but it can be found in Appendix A.</p>
        <p>The modules rely mainly on two libraries: pandas [22, 23] for data manipulation and networkx [24] for graph representation and algorithms. The pandas library is used to read the input data and construct the family tree. The networkx library is used to represent the family tree as a graph and perform graph-based operations such as BFS traversal and DV calculation.</p>
        <sec>
          <title>3.2. The language_family_tree Module</title>
          <p>The language_family_tree module is responsible for constructing the family trees from the input data. It reads the CSV files and creates a hierarchical structure representing the genetic relationships between languages.</p>
        </sec>
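        <p>As a rough illustration of how a hierarchy can be derived from tabular input, the sketch below builds a parent-to-children map from CSV text and computes the number of levels under a node, including the value of $L$ shared by a set of sibling subtrees. It deliberately uses only the standard library as a stand-in for the pandas and networkx objects mentioned above. The function names, the sample rows, the use of an empty parent_id to mark a root, and the choice to count levels as edges below a node are all assumptions made for the example.</p>
        <preformat>
```python
import csv
import io

# Column names follow Section 3.4.1; every row below is invented.
SAMPLE = '''id,name,parent_id,type
fam,Toy family,,family
g1,Group 1,fam,group
g2,Group 2,fam,group
a,Language A,g1,language
g3,Group 3,g2,group
b,"Language B, Northern",g3,language
c,"Language B, Southern",g3,language
'''


def build_children(text):
    """Return (children, roots): a parent-to-children map and the ids without a parent."""
    children, roots = {}, []
    for row in csv.DictReader(io.StringIO(text)):
        if row["type"] not in {"family", "group", "language"}:
            raise ValueError(f"invalid type: {row['type']}")
        children.setdefault(row["id"], [])
        if row["parent_id"]:
            children.setdefault(row["parent_id"], []).append(row["id"])
        else:
            roots.append(row["id"])  # assumption: an empty parent_id marks a root
    return children, roots


def levels_below(children, node):
    """Number of levels under node (0 for a leaf)."""
    return max((1 + levels_below(children, c) for c in children[node]), default=0)


def max_levels_among_siblings(children, parent):
    """L for any child of parent: the deepest subtree among the siblings."""
    return max(levels_below(children, child) for child in children[parent])


children, roots = build_children(SAMPLE)
print(roots, levels_below(children, "fam"), max_levels_among_siblings(children, "fam"))
# prints: ['fam'] 3 2
```
        </preformat>
        <p>In this toy tree, Group 1 has one level below it and Group 2 has two, so both subgroups would share $L = 2$ when their DV is computed.</p>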
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation</title>
      <p>In this section, I describe the implementation of
makeitsample, outlining the dependencies it uses
(Section 3.1) and the two modules of the library:
language_family_tree (Section 3.2) and forest
(Section 3.3). I also provide an overview of the
command-line interface (Section 3.4) and the structure of the input
data (Section 3.4.1).</p>
      <sec id="sec-3-1">
        <title>3.1. Libraries</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. The forest Module</title>
        <p>The forest module is responsible for managing multiple family trees and performing operations on them. It consists of a class called Forest that inherits from the list class. This class represents a collection of family trees and provides methods for reading a set of CSV files representing family trees from a directory, adding new LanguageFamilyTree objects to the forest, exporting the forest to a set of JSON or CSV files, calculating the diversity values of the trees in the forest, and selecting languages from the forest according to the sampling algorithm.</p>
        <p>The language_family_tree module consists of a class called LanguageFamilyTree, which inherits from the networkx.DiGraph class. This class represents the family tree as a directed graph, where each node corresponds to a language family, subgroup or language, and edges represent parent-child relationships. The class provides methods for building the tree from a CSV input (formatted as described in Section 3.4.1), for exporting the tree to a JSON or CSV file, for converting it to a dictionary, for calculating the diversity values of the nodes, and for selecting a certain number of languages from the tree according to the sampling algorithm described in Section 2.3.</p>
        <p>When importing the data, a function of LanguageFamilyTree refines the structure of the tree in order to avoid structures that the sampling algorithm would be unable to process.</p>
        <sec>
          <title>3.4. Command-Line Interface</title>
          <p>The command-line interface (CLI) of makeitsample is designed to be user-friendly and allows users to easily run the sampling pipeline from the command line. To run the pipeline, users can use the following command:</p>
          <preformat>makeitsample [-h] [-n N] [-i INPUT] [-o OUTPUT] [-f {csv,json}] [-s SAMPLENAME] [-r RANDOM_SEED]</preformat>
          <p>where N is the sample size, INPUT is the input directory containing the CSV files, OUTPUT is the output directory where the sample will be saved, -f is the output format (csv or json), SAMPLENAME is the name of the sample file, and RANDOM_SEED is the random seed for reproducibility.</p>
        </sec>
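        <p>To make the usage line more concrete, the following sketch shows how an interface with the same short options could be declared with Python's argparse module. It is not the parser shipped with makeitsample: the integer types for the sample size and the seed, the absence of defaults and long option names, the help strings, and the directory names in the example call are assumptions made for illustration.</p>
        <preformat>
```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the short options of the documented usage line."""
    parser = argparse.ArgumentParser(prog="makeitsample")
    parser.add_argument("-n", type=int, metavar="N", help="sample size")
    parser.add_argument("-i", metavar="INPUT", help="input directory containing the CSV files")
    parser.add_argument("-o", metavar="OUTPUT", help="output directory for the sample")
    parser.add_argument("-f", choices=["csv", "json"], help="output format")
    parser.add_argument("-s", metavar="SAMPLENAME", help="name of the sample file")
    parser.add_argument("-r", type=int, metavar="RANDOM_SEED", help="random seed")
    return parser


# Example invocation with invented directory names.
args = build_parser().parse_args(
    ["-n", "50", "-i", "families", "-o", "out", "-f", "json", "-r", "42"]
)
print(args.n, args.i, args.o, args.f, args.s, args.r)
# prints: 50 families out json None 42
```
        </preformat>
        <p>Fixing the random seed, as in the example call, is what makes a sample reproducible across runs.</p>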
        <p>In a LanguageFamilyTree, structures that the sampling algorithm cannot process occur when a subgroup contains both languages and other subgroups as children. To address this, an additional level is introduced in the tree to separate the languages from the subgroups. This is achieved by creating new nodes that become parents to each language and children to the node that was previously their parent, as shown in Figure 5. This ensures the structure remains a tree, allowing the sampling algorithm to function correctly.</p>
        <sec>
          <title>3.4.1. Structure of the Input Data</title>
          <p>In order to run makeitsample, the input data must be in CSV format (as in the example in Table 1 in Appendix B). The CSV files (one for each language family) should contain:
• id: a column for the unique identifier of the language (e.g., ISO code), of the family or the group;
• name: a column storing the name of the language, of the family or the group;
• parent_id: a column storing the id of the parent node in the family;
• type: a column storing the type of the node (the only allowed values for this column are family, group or language).</p>
          <p>The user can also add other columns with additional information about the languages, families or groups. makeitsample will ignore these columns when constructing the family tree, but they will be included in the output file.</p>
        </sec>
        <sec>
          <title>4. Conclusions</title>
          <p>In this paper, I presented makeitsample, a Python package that aims to ease the generation of typological language samples based on the diversity value (DV) metric. I presented the modules of the library and the command-line interface, which make it possible to construct a set of hierarchical language family trees, to calculate diversity values for each node, and to select languages based on their weight within the tree. By automating the sampling process and accounting for linguistic diversity, the library and the command-line interface provide a principled and scalable solution to generating language samples for typological studies, helping researchers create more representative samples and reduce genealogical biases in their analyses.</p>
          <p>The library is designed to be flexible and extensible, allowing researchers to adapt it to their specific needs and incorporate additional sampling strategies or metrics. Although user-friendly, the library is still in its early stages and requires some knowledge of Python to be used effectively, or at least some familiarity with the command line. This might be a limitation for some users, and the plan is to create a web interface to make it more accessible to a wider audience.</p>
        </sec>
        <sec>
          <title>References</title>
          <p>[1] B. Comrie, Language Universals and Linguistic Typology: Syntax and Morphology, University of Chicago Press, Chicago, 1989.</p>
          <p>[2] W. Croft, Typology and Universals, Cambridge Textbooks in Linguistics, 2 ed., Cambridge University Press, Cambridge, 2002.</p>
          <p>[3] H. Hammarström, R. Forkel, M. Haspelmath, S. Bank, Glottolog 5.1, 2024. URL: http://glottolog.org. doi:10.5281/zenodo.14006617.</p>
          <p>[4] M. S. Dryer, Large linguistic areas and language sampling, Studies in Language. International Journal sponsored by the Foundation “Foundations of Language” 13 (1989) 257–292. URL: https://www.jbe-platform.com/content/journals/10.1075/sl.13.2.03dry. doi:https://doi.org/10.1075/sl.13.2.03dry.</p>
          <p>[5] R. D. Perkins, Statistical techniques for determining language sample size, Studies in Language. International Journal sponsored by the Foundation “Foundations of Language” 13 (1989) 293–315. URL: https://www.jbe-platform.com/content/journals/10.1075/sl.13.2.04per. doi:https://doi.org/10.1075/sl.13.2.04per.</p>
          <p>[6] B. Bickel, Distributional typology: Statistical inquiries into the dynamics of linguistic diversity, in: B. Heine, H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis, 2nd ed., Oxford University Press, Oxford, 2015, pp. 901–923.</p>
          <p>[7] J. Nichols, Linguistic Diversity in Space and Time, University of Chicago Press, Chicago, 1999.</p>
          <p>[8] M. S. Dryer, The greenbergian word order correlations, Language: Journal of the Linguistic Society of America 68 (1992) 81–138.</p>
          <p>[9] D. M. Eberhard, G. F. Simons, C. D. Fennig, Ethnologue: Languages of the World, 28th ed., SIL International, Dallas, Texas, 2025. URL: http://www.ethnologue.com.</p>
          <p>[10] M. A. Cysouw, Quantitative methods in typology, in: R. Köhler, G. Altmann, R. G. Piotrowski (Eds.), Quantitative Linguistics: An International Handbook, De Gruyter, Berlin; New York, 2005, pp. 554–578.</p>
          <p>[11] D. Bakker, Language sampling, in: J. J. Song (Ed.), The Oxford Handbook of Linguistic Typology, Oxford University Press, Oxford, UK, 2010, pp. 100–128. URL: https://doi.org/10.1093/oxfordhb/9780199281251.013.0007. doi:10.1093/oxfordhb/9780199281251.013.0007, online edition published on Oxford Academic, 18 Sept. 2012.</p>
          <p>[12] N. Evans, S. C. Levinson, The myth of language universals: Language diversity and its importance for cognitive science, Behavioral and Brain Sciences 32 (2009) 429–448. URL: https://doi.org/10.1017/S0140525X0999094X. doi:10.1017/S0140525X0999094X.</p>
          <p>[13] B. Bickel, Typology in the 21st century: Major current developments, Linguistic Typology 11 (2007) 239–251. URL: https://doi.org/10.1515/LINGTY.2007.018. doi:10.1515/LINGTY.2007.018.</p>
          <p>[14] T. Güldemann, The Languages and Linguistics of Africa, De Gruyter Mouton, Berlin; Boston, 2018. URL: https://doi.org/10.1515/9783110421668. doi:10.1515/9783110421668.</p>
          <p>[15] E. Sapir, Selected Writings in Language, Culture, and Personality, University of California Press, Berkeley, CA, 1949.</p>
          <p>[16] B. L. Whorf, Language, Thought, and Reality: Se</p>
        </sec>
        <sec>
          <title>A. Full Calculation of the DV for the Example in Figure 2</title>
          <p>tree 1
• $C_0 = 0$ (contribution of the root level)
$L = 1$ (sibling of group 5)</p>
          <p>Node count:
• $N_0 = 1$ (number of nodes at level 0)
• $N_1 = 1$ (number of nodes at level 1)</p>
          <p>It behaves like a language isolate, so we set $\mathrm{DV} = 1$.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>B. Example of input CSV file</title>
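      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>id</th><th>name</th></tr>
          </thead>
          <tbody>
            <tr><td>Afro-Asiatic</td><td>Afro-Asiatic</td></tr>
            <tr><td>36</td><td>Berber</td></tr>
            <tr><td>1793</td><td>Awjila-Sokna</td></tr>
            <tr><td>1063</td><td>Eastern</td></tr>
            <tr><td>1064</td><td>Siwa</td></tr>
            <tr><td>37</td><td>Northern</td></tr>
            <tr><td>1704</td><td>Atlas</td></tr>
            <tr><td>gnc</td><td>Guanche</td></tr>
            <tr><td>auj</td><td>Awjilah</td></tr>
            <tr><td>swn</td><td>Sawknah</td></tr>
            <tr><td>siz</td><td>Siwi</td></tr>
            <tr><td>cnu</td><td>Chenoua</td></tr>
            <tr><td>jbe</td><td>Judeo-Berber</td></tr>
            <tr><td>shi</td><td>Tachelhit</td></tr>
            <tr><td>tzm</td><td>"Tamazight, Central Atlas"</td></tr>
            <tr><td>zgh</td><td>"Tamazight, Standard Moroccan"</td></tr>
          </tbody>
        </table>
      </table-wrap>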
      <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: Grammar
and spelling check and Citation management. After using these tool(s)/service(s), the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>