<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Cultural Analytics</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.22148/16.019</article-id>
      <title-group>
        <article-title>How Exactly does Literary Content Depend on Genre? A Case Study of Animals in Children's Literature</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>INALCO</institution>
          ,
          <addr-line>Paris</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Russian Literature (Pushkin House)</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>2</volume>
      <issue>2018</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The content of literary fiction at least partly depends on literary tradition. This dependence is attested quantitatively in the association of genre with lexical statistical patterns. This short paper is a step toward formal modeling of the content-moderating processes associated with literary genres. The idea is to explain the prevalence of particular lemmas in a literary text by the genre-dependent accessibility of the semantic category during the creative process. Data on animals mentioned in various sub-genres of a corpus of Russian children's literature is used as an empirical case. Vocabulary growth models are applied to infer genre-related differences in the overall diversity of animal vocabularies. A constrained topic model is employed to infer preferences for particular animal lemmas displayed by various genres. Results demonstrate the models' potential to infer genre-related content preferences in the context of high variance and data imbalance.</p>
      </abstract>
      <kwd-group>
        <kwd>computational thematics</kwd>
        <kwd>genre</kwd>
        <kwd>vocabulary growth model</kwd>
        <kwd>children's literature</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Computational methods have made the content of literary fiction a practical target for systematic
exploration. All kinds of phenomena are being counted in corpora of fictional texts, including
natural and material objects, body parts, emotions, etc., with inferences for either literary or
cultural history [
        <xref ref-type="bibr" rid="ref12 ref15">5, 18, 12</xref>
        ]. This body of work could benefit from a more explicit recognition
of the dual source of literary content: literary tradition (internal) and social reality
mediated by the author’s experience (external) [
        <xref ref-type="bibr" rid="ref9">19</xref>
        ]. It is crucial for inferences on cultural dynamics
drawn from literary data to acknowledge and to measure the influence of literary tradition on fictional
content.
      </p>
      <p>
        There is evidence that some aspects of literary tradition captured by the vague notion of genre
leave a discernible signal in the distribution of content words. For instance, predictive
models using only frequent lexical features are able to discriminate between literary genres with
decent accuracy [
        <xref ref-type="bibr" rid="ref11 ref13 ref3">17, 11, 14, 3, 15</xref>
        ]. Such models could be reverse-engineered to look for the
most informative lexical features. Yet interpreting these features without understanding the
mechanics of how genre-related constraints on content translate into lexical distribution
phenomena is a risky business. Briefly, quantitative studies of literary content are in need of a
more formalized theory.
      </p>
      <p>This short paper is a step toward formal modeling of the content-moderating processes associated
with literary genres. To reduce complexity, I focus on a narrow subject: animals in children’s
literature. The presence of animals in books for children is evidently supported by literary
tradition, not only as characters, but more generally as pedagogical material [4, 13]. It is also
reasonable to expect variation by sub-genre in the prominence and selection of animals;
compare, e.g., fairy tales and teen detective stories. Hence I start from the premise that in this case
the influence of literary tradition is considerable, is associated with genre, and thus could be
measured. The goal is to devise the simplest yet justifiable generative models that represent
the content of a literary work as a result of choices made during the creative process conditional
on genre.</p>
      <p>The models suggested in this paper are employed to make two types of measurement in a
corpus of Russian literature for children and young adults. First, a quantitative estimation of
the effect of sub-genre on the number of distinct animals mentioned in a text (animal diversity).
I suggest basing this estimate on a vocabulary growth model to effectively control for the highly
variable text length. Second, an estimation of the preference for mentioning certain animal species
in each sub-genre. This task is addressed with the help of a specialized topic model.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Models</title>
      <p>To quantify the relative weight of literary tradition vis-à-vis external factors in the prevalence
of a concept, one needs to put heterogeneous internal and external influences on a unified
numeric scale. To this end I suggest employing the notion of cognitive accessibility of a concept
to the author in the process of writing. Accessibility can be operationalized as the probability
that a concept will be mentioned at least once in a certain work given all the predictive factors.
Accessibility of a concept during the creative process is not directly observable, at least not for
past literature. But it can be measured a posteriori at the population level by observing lexical
frequency. Then all the literary and social factors can be seen as distal causes that exert their
influence on literary content through increased or decreased accessibility of some concepts.
Accessibility provides a convenient conceptual basis for the following models since it offers
both a generative interpretation and a measurement scale for the data on lexical prevalence.</p>
      <sec id="sec-2-1">
        <title>2.1. Vocabulary growth model</title>
        <p>The first objective is to estimate the relative accessibility of animals in general in various
sub-genres. The task is complicated by the fact that the distribution of text lengths varies widely
between genres. From a modeling perspective the task is to predict the length of the list of
different animals mentioned in a text. Since such a list is technically just a part of the text’s
vocabulary, a general vocabulary growth model can be applied to it. The most basic model that
relates vocabulary size to text length is known as Heaps’ law [7]. It reflects the fact that
the vocabulary size of a text in natural language is unbounded, but the growth rate diminishes with
text length. Since the model is targeted at only a share of the total vocabulary, I slightly
modify the interpretation of the coefficients in the Heaps’ formula
V = k n^β
(1)
where V is the number of animals mentioned (vocabulary size), n is text length in tokens,
and β (typically 0.4 ≤ β ≤ 0.6) and k are coefficients that control the growth rate. This model allows
us to account for the length of a text in a principled way.</p>
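The decelerating growth implied by formula (1) can be illustrated with a minimal sketch. The coefficient values k = 1.5 and β = 0.5 below are illustrative assumptions, not values estimated in the paper.

```python
# Heaps'-law sketch: expected number of distinct animal lemmas as a
# function of text length, V = k * n^beta. The defaults for k and beta
# are illustrative assumptions only.
def heaps_expected_vocab(n_tokens: float, k: float = 1.5, beta: float = 0.5) -> float:
    """Expected vocabulary size V = k * n^beta for a text of n_tokens tokens."""
    return k * n_tokens ** beta

# Growth is unbounded but decelerating: doubling the text length
# multiplies the expected vocabulary by 2^beta, not by 2.
ratio = heaps_expected_vocab(20_000) / heaps_expected_vocab(10_000)
print(ratio)  # 2 ** 0.5, about 1.414
```

With β below 1, the marginal gain in expected vocabulary shrinks as the text grows, which is exactly why raw counts of distinct animals cannot be compared across texts of different lengths.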
        <p>In the experiments below I explore two ways to incorporate the effect of genre into this
model. The most obvious move is to allow either the coefficient k or the exponent β to vary by
genre. Higher values of the coefficients would indicate higher accessibility of a lexical category
in the genre. Evidently, genre (as a proxy for literary tradition) is only partly responsible for the
lexical content of a work, and much genre-internal variation remains to be explained by other
factors. The external factors that span all the aspects of the author's socialization and linguistic experience
can be accounted for by letting either k or β vary by author. However, this
solution entails the assumption that authors employ animal vocabulary to a similar degree in
all their texts.</p>
        <p>The alternative view is to (simplistically) assume that the observed list of animal mentions
comes from either of two processes: (a) a low-intensity background process, in which the number
of animals mentioned grows only slowly with text length; or (b) a high-intensity foreground
process, leading to a higher number of animal mentions for a text of similar length. In other
words, animals may or may not be a relevant topic for a given text. Then each genre could be
represented as a mixture of texts, each one coming from one of the two processes. In the latter model the
genres would differ by the estimated share of texts produced by the background and
foreground processes. Formally,
V = π_g λ_1 + (1 − π_g) λ_2
(2)
where π_g is a genre-specific share of the background process, and λ_1 and λ_2 stand for the intensities of the
background and foreground processes, respectively.</p>
        <p>For details on the formal definition of the statistical models used for the experiments, priors, and
model selection, see appendix A.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Genre-topic model</title>
        <p>The second objective is to estimate the relative accessibility of certain animal species in
various sub-genres. For this case, the list of different animals mentioned in a text is treated as a
document. The theoretical assumption is that items in this list are drawn from two sources:
the influence of the literary tradition (genre) on the one side, and the author's external experience
on the other. A topic model is the most common formal generative model to describe the
composition of a word list drawn from various sources. An advantage of a topic model is that it
implicitly accounts for text length.</p>
        <p>The two-sources assumption can be translated into a highly constrained topic model where
each document is composed of just two topics: one topic specific to the genre of the text, and
another topic to model external influences. As a result, each genre has its own “topic”.
Probabilities of words in these topics reflect the preference (higher accessibility) for an animal associated
with a particular genre. While genre-specific topics are meant to capture the influence of
literary tradition, for simplicity and to make estimation possible I reduce all external factors to a
single common topic.</p>
        <p>The generative story for this model runs as follows.
1. For each document d: draw a proportion θ_d of the genre-specific topic for this document.
2. For each word in the document:
a) with probability θ_d, draw a word from the genre-specific topic φ_g;
b) with probability 1 − θ_d, draw a word from the general topic φ_0.</p>
        <p>The model has two hyperparameters: a prior for the genre-topic proportion in a document,
θ ∼ Beta(3, 3), and a Dirichlet prior for the distribution of word probabilities in each topic,
φ ∼ Dirichlet(α_1, …, α_V) with α_i = 0.8.</p>
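The two-topic generative story can be sketched as a small forward simulation. The vocabularies and topic weights below are hypothetical toy values, not quantities from the corpus.

```python
import random

# Sketch of the constrained genre-topic generative story: each document
# mixes exactly two topics, its genre's topic and one shared background
# topic. Vocabularies and weights here are toy assumptions.
def generate_document(genre_topic: dict, background_topic: dict,
                      n_words: int, theta: float, rng: random.Random) -> list:
    """Draw n_words lemmas; each comes from the genre topic with
    probability theta, otherwise from the background topic."""
    doc = []
    for _ in range(n_words):
        topic = genre_topic if rng.random() < theta else background_topic
        lemmas, weights = zip(*topic.items())
        doc.append(rng.choices(lemmas, weights=weights, k=1)[0])
    return doc

rng = random.Random(0)
fairy_tale = {"wolf": 0.5, "fox": 0.3, "hare": 0.2}  # hypothetical genre topic
background = {"dog": 0.6, "cat": 0.4}                # hypothetical shared topic
doc = generate_document(fairy_tale, background, 10, theta=0.7, rng=rng)
```

Inference then runs this story in reverse: given documents and their genre labels, estimate θ per document and the word probabilities of each topic.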
        <p>Such a model can be seen as a highly constrained variant of the well-known LDA topic model
[1]. Unlike in LDA, the topical composition of a document is not a parameter to be estimated
by the model, but is always a mix of two topics pre-defined by the genre of the text. The only
document-level parameter the model is left to estimate is the proportion of the genre topic. The
probabilities of words in a topic are estimated in the same way as in LDA.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data and measurements</title>
      <p>The data for the analysis come from the Detcorpus, a corpus of Russian prose for children and
young adults written between 1900 and 2020 [9]. All the texts in the corpus are provided with
a list of genre tags as a part of their metadata. The genres considered in the present analysis
do not form a neat typology. The major genres that span the whole corpus include fairy tale,
science fiction, and realism, the last one generic, standing for all prose without specific genre
attributes. The other group is formed by formulaic genres that have appeared on the market since the
1990s: detective stories, fantasy, horror, and romance books for teens. Animal stories, a
well-recognizable sub-genre of prose for children, are included as a separate category due to their specific
focus on animals. For each work, genre tags were reduced to one single label from the above
list. Genre labels are regarded as a proxy for those aspects of literary tradition that supposedly
have a sufficiently strong and stable effect to be detected in the distribution of animal mentions.
In total, the data comprise 2994 works ranging from 100 to 300000 words in length. See details
on the data composition and genre and author distribution in appendix C.</p>
      <p>
        To identify the occurrences of animals in texts I constructed a dictionary using all Russian
names and aliases for animal taxa in Wikidata. In contrast to previous work that aimed to
measure biodiversity in literature [
        <xref ref-type="bibr" rid="ref10">6, 10</xref>
        ], taxa names are not reduced to nouns, and when a
taxon name is a multi-word expression it is matched as a sequence of lemmas. Dictionary-based
methods are notoriously plagued by false positive matches due to homonymy. The problem
is quite severe with animal names, as metaphor is heavily used as a semantic device in this
lexical category. To achieve satisfactory precision, I manually compiled an extensive stoplist
(405 items) of the lemmas that are less likely to refer to animals in this particular corpus. As a
result, of 20811 lemmas in the dictionary, 1906 were matched in the corpus. The accuracy of
the method was evaluated on a sample of 50 random 500-word excerpts (precision 0.97, recall
0.81, F1 0.88). Evaluation indicated that Wikidata systematically underrepresents female forms
of animal names, names for cubs, and various derivative forms, especially diminutives.
      </p>
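The matching step described above (multi-word taxa matched as lemma sequences, with a stoplist for homonymy-prone items) can be sketched as follows. The function name, toy dictionary, and stoplist entries are hypothetical; the actual dictionary is the Wikidata-derived one described in the text.

```python
# Sketch of dictionary-based animal matching (assumed reconstruction):
# taxa names are matched as contiguous sequences of lemmas, and a manual
# stoplist filters single lemmas prone to homonymous/metaphorical use.
def match_animals(lemmas: list, dictionary: set, stoplist: set) -> set:
    """Return dictionary entries (tuples of lemmas) found as contiguous
    subsequences of the lemmatized text."""
    found = set()
    max_len = max(len(entry) for entry in dictionary)
    for i in range(len(lemmas)):
        for width in range(1, max_len + 1):
            candidate = tuple(lemmas[i:i + width])
            if candidate in dictionary:
                # Apply the stoplist only to single-lemma matches.
                if len(candidate) == 1 and candidate[0] in stoplist:
                    continue
                found.add(candidate)
    return found

dictionary = {("hare",), ("polar", "bear"), ("mouse",)}  # toy dictionary
stoplist = {"mouse"}  # e.g. 'mouse' as a computer device, not an animal
text = ["the", "polar", "bear", "saw", "a", "mouse", "and", "a", "hare"]
matches = match_animals(text, dictionary, stoplist)
```

Each match records a lemma tuple, so a multi-word taxon counts as one dictionary hit rather than several unrelated nouns.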
      <p>For modeling, each work is reduced to the list of all distinct animal lemmas mentioned in the
text (each lemma is present only once). Since the focus of the analysis is on lexical phenomena,
I chose to count lemmas, not species. It should be recognized that the relationship of animal
nominations to biological taxa can be highly ambiguous, and identification of the specific taxon
meant in a text presents a separate problem. For the genre-topic model, all the works by the same
author in a certain genre are joined into a single document to avoid bias induced by better-represented
authors in the corpus. To simplify topic inference, only the lemmas that occurred
in 5 works or more were retained. All statistical inference was Bayesian and performed with
the help of the Stan Hamiltonian Monte Carlo sampler. See [8] for the data and code used for the
analysis.</p>
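The preprocessing steps above (distinct lemmas per work, author-genre pooling, a minimum document frequency of 5) can be sketched in a few lines. The function and data shapes are assumptions made for illustration; the published code referenced in [8] is authoritative.

```python
from collections import Counter, defaultdict

# Sketch of the preprocessing pipeline: each work reduces to its set of
# distinct animal lemmas; works by the same author in the same genre are
# merged into one document; lemmas found in fewer than min_works works
# are dropped before topic inference.
def build_documents(works, min_works: int = 5) -> dict:
    """works: iterable of (author, genre, animal_lemma_list) triples."""
    doc_freq = Counter()
    per_work_sets = []
    for author, genre, lemmas in works:
        distinct = set(lemmas)           # each lemma counted once per work
        per_work_sets.append((author, genre, distinct))
        doc_freq.update(distinct)        # document frequency across works
    kept = {lemma for lemma, n in doc_freq.items() if n >= min_works}
    docs = defaultdict(set)
    for author, genre, distinct in per_work_sets:
        docs[(author, genre)] |= distinct & kept
    return dict(docs)
```

Pooling by (author, genre) pair means a prolific author contributes one document per genre, which is what prevents well-represented authors from dominating the genre topics.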
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In the first experiment the vocabulary growth models defined above were employed to quantify
genre differences in the expected number of animals mentioned (diversity of animal
vocabulary). The results are displayed in fig. 1. The left panel shows the animal vocabulary growth
rate parameter as estimated by the model that assumes that in the Heaps’ formula k varies
by author and β varies by genre. The right panel displays the percentage of animal-rich texts for
each genre as estimated by the mixture model. Growth rate parameters k and β in the mixture
model are fixed for the rich and small animal vocabulary clusters and do not depend on author or
genre.</p>
      <p>Both models infer similar genre differences. Animal stories and, to a lesser degree, genres
with a fantastic element (fairy tales, fantasy, sci-fi) have larger animal vocabularies on average
(or, alternatively, a larger share of texts with rich animal vocabularies) in comparison to realism
as a reference point. A slightly more surprising conclusion is that formulaic genres (detective,
horror) use a narrower animal vocabulary, with the lowest result attained by teen romance novels.</p>
      <p>The vocabulary growth rate model indicates that more variance in animal vocabulary
size is associated with authors than with genres. The model predicts that in a typical novella
(50,000 words) an author with an average interest in animals will mention 10 more animal
lemmas, on average, in an animal story, 3 more in a fairy tale, and 6 fewer in a romance, all in
comparison with realism. At the same time, the predicted difference between the author with
the highest interest in animals (Nikolai Sladkov) and one with the lowest (Anatoly Aleksin) for
a realist novella of the same length would be 197 animal lemmas, on average.</p>
      <p>Since the mixture model estimates the probability that each text belongs to either the rich or the small
animal vocabulary group, it can be seen as a model-based clustering of texts. This allows
for finer comparison of otherwise similar works that differ in the density of animal mentions.
Many authors consistently appear in one of the clusters; for instance, Vitaliy Bianki, a canonical
author of animal stories, is invariably a high-scorer. But even texts of the same genre and size
and by the same author may fall into different groups. Short stories from the same book by
Andrei Platonov classified as fairy tales provide a vivid example. One is “Why did the geese
become motley”, 626 words, 1 animal species (geese), low-animal cluster. The second is “A
grateful hare”, 643 words, 11 species, high-animal cluster. In the second story, a hare helps the
protagonist by calling other animals to bring food, which effectively generates an enumeration
of species.</p>
      <p>For the second experiment a genre-topic model was trained. The parameters estimated by
the model include the probabilities of each lemma in each genre (genre-specific topics) and
a background probability for each lemma. The probability distribution of lemmas in the
background topic is very close to the overall frequency distribution of lemmas in the data
(Jensen-Shannon divergence 0.02). A high probability of a lemma in a genre topic means that this animal
is likely to be mentioned by a larger number of authors writing in this genre (is more accessible
given the genre). A summary of the genre topics is presented in fig. 2. For generality, instead
of presenting lemmas I group animals into larger categories and provide the number of lemmas in
each category for the top-20 lemmas in a topic. Top lists for each topic are built using a balanced
FREX metric, which combines the probability of a lemma in a topic with the exclusivity of the lemma to
this topic in contrast to other topics.</p>
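A FREX-style score of the kind used for the top lists can be sketched as a harmonic mean of a lemma's within-topic probability rank and its exclusivity rank. The exact formula the paper uses is not given here, so the empirical-CDF form below (modeled on the definition popularized by the STM literature, with a balanced weight w = 0.5) is an assumption for illustration.

```python
# Sketch of a balanced FREX-style score (assumed form): harmonic mean of
# the ECDF of a lemma's within-topic probability and the ECDF of its
# exclusivity to that topic, with balance weight w.
def frex(prob_by_topic: dict, topic: str, lemma: str, w: float = 0.5) -> float:
    """prob_by_topic: {topic: {lemma: P(lemma | topic)}}."""
    probs = prob_by_topic[topic]
    # Exclusivity: share of this lemma's probability mass held by the topic.
    excl = {l: prob_by_topic[topic][l] /
               sum(prob_by_topic[t][l] for t in prob_by_topic)
            for l in probs}

    def ecdf(value: float, values: list) -> float:
        return sum(v <= value for v in values) / len(values)

    f = ecdf(probs[lemma], list(probs.values()))   # frequency rank
    e = ecdf(excl[lemma], list(excl.values()))     # exclusivity rank
    return 1.0 / (w / f + (1 - w) / e)
```

A lemma that is both probable in a genre topic and rare elsewhere scores high; a corpus-wide frequent lemma is pulled down by its low exclusivity rank.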
      <p>There are a few notable tendencies made apparent by the genre-topic model. Animal
stories are distinctive for their focus on forest animals and the most diverse set of bird species,
primarily wildfowl. This may be contextualized with the note that the most prolific authors in
this genre (e.g., Bianki, Prishvin) were passionate hunters. Birds also have a prominent place in
other genres, including fairy tales, but the set of species is quite different (various owls, crow, tit).
In contrast to animal stories, realism is defined by its focus on farm animals (the horse being the
most “realistic”) and quite numerous species of edible fish. Perhaps not surprisingly, an interest
in snakes is characteristic of teen horror fiction. Science fiction is much more focused on sea
animals along with extinct species (often dinosaurs). Pets (primarily cats and dogs) take a very
prominent place in detective stories. All animals not native to the northern temperate zone are grouped
under the label ‘exotic’. For instance, the lion (‘king of animals’) and the tiger are typical of fairy tales.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The models introduced in this short paper aim to ground computational analysis of literary
content (or computational thematics, as suggested by Sobchuk and Šeļa in [15]) in
categories relevant to literary production. While the models are simplistic, they operate on the level
of a whole literary work, which allows us to relate them to aspects of the creative process. The
animal diversity model was able to detect genre-associated differences in the accessibility of
animal vocabularies while controlling for author and text length. This result corroborates
the supposed effect of literary tradition for this particular data. The mixture model that
distinguishes between rich and scarce animal vocabularies also proved to be a useful tool for locating
diversity-generating tropes, as shown in the Platonov example in section 4.</p>
      <p>
        I see an important advantage of the vocabulary growth models employed here in their ability
to estimate parameters even for very short texts on a par with longer ones. This feature is
specifically relevant for any diversity measurements made on a lexical basis, say, biodiversity
as represented in literature. The reason is that text length is the strongest predictor of
vocabulary size regardless of other factors [16]. For this reason, previous work on literary
biodiversity [
        <xref ref-type="bibr" rid="ref10">6, 10</xref>
        ] had to resort to a minimum text length threshold, which is equivalent
to implicitly stratifying by text length without a proper theoretical justification.
      </p>
      <p>In comparison to linear prevalence models (Poisson regression), vocabulary growth models
may offer more fine-grained estimates. For instance, given a point estimate of an author’s
propensity to mention animals, both the Heaps’-based model suggested in this article and a
similarly structured Poisson model (see appendix B for details) produce rather similar posterior
inferences. But in the case of the Poisson model, the parameter estimate uncertainty grows quickly
with text length, unlike the Heaps’ model (fig. 3).</p>
      <p>An important limitation of vocabulary growth models based on the Heaps’ formula in
comparison to linear models is that various predictors cannot be so easily incorporated into
the model. While the coefficients k and β provide two points at which to stratify by predictor variables,
their effects on the outcome are not symmetric. Moreover, adding more than two predictors
(for instance, as factors of the same coefficient) runs into an identification problem in the context of Bayesian
inference. Reparameterizing the model or switching to another basic vocabulary growth model
may be required to tackle this problem.</p>
      <p>
        The genre-topic model that captures preferences for specific animal lemmas conceptually
describes the animal profile of a genre as a deviation from the corpus-wide distribution of
animal frequencies. This is structurally analogous to a popular idea in stylometry that is
behind the Burrows’ Delta [2] and was shown to work for genre classification as well [
        <xref ref-type="bibr" rid="ref4">14, 15</xref>
        ].
The advantage of the Bayesian genre-topic model in comparison to simpler measures of lexical
distinctiveness is that it provides estimates of the uncertainty of the parameters. Highly
uncertain estimates for the proportion of the genre-related topic in a document indicate that with
this particular model and data it is not possible to tell to what extent the animal vocabulary of
a given author is defined by the literary tradition or by external factors. Nevertheless, the
model was able to detect relatively minor genre-related modulations in the frequency of certain
animal species in the presence of a strong signal of a general linguistic or literary background.
      </p>
    </sec>
    <sec id="sec-6">
      <title>A. The definition of vocabulary growth models</title>
      <p>The central idea is to define the expected size of the animal vocabulary in a given text with
the help of the Heaps’ formula V = k n^β. To adapt to the fact that vocabulary size is a
natural number, the expected value predicted by the Heaps’ model can be treated as a parameter
(expected value) for a Poisson distribution. To test the hypothesis that there is an effect of
literary tradition on accessibility of the animal vocabulary associated with the sub-genre of
children’s literature, one needs to stratify animal vocabulary growth rates by genre. Alternatively,
inter-genre variance in animal vocabularies may be explained away by external factors
(individual author characteristics). Heaps’ formula offers two options to account for genre/author
variation: either coefficient k or β could vary by genre or by author.</p>
      <p>To select the final model I tested all logically possible combinations of the k and β coefficients
associated with either genre or author. The best performing model was selected by evaluating
the model’s predictive ability for the animal vocabulary data in children’s literature with the
help of the WAIC criterion. The model comparison summary is presented in table 1. For the
animal vocabulary data, coefficient β turns out to be more effective in capturing data variance
in comparison to k. It works this way both for author-based variance and genre-based variance.
Including author as a factor in a formula always results in a much better fit. Whenever author
is present, adding genre results in a relatively small (but non-null) model improvement. This
improvement is more pronounced if genre is taken into account via the more effective coefficient.
The formal definition of the selected model follows below in (3); the rest of the models had
similar structure and priors. All models employed partial pooling on author/genre coefficients.
V ∼ Poisson(λ)
λ = k_a n^{β_g}
(3)
where V is the animal vocabulary size, λ stands for the expected Poisson rate for a text, and n
is the length of a text in thousands of tokens. External factors that influence accessibility of
the animal category are captured by k_a, which reflects the interest in animals of each individual author.
Internal (literary) factors are captured by the exponent β_g, which varies by genre. The
distributions of both author and genre coefficients are defined by higher-order priors.</p>
      <p>The mixture model represents vocabulary size as the result of either a low-intensity
background process or a high-intensity foreground process, mixed in a genre-specific proportion
π_g. The expected vocabulary size for both processes is defined by the same Heaps’
formula. The formal definition of the model is given in (4).</p>
      <p>V ∼ π_g Poisson(λ_1) + (1 − π_g) Poisson(λ_2)
λ_1 = k_1 n^β
λ_2 = k_2 n^β
k ∼ Log-Normal(1, 0.7)
π ∼ Beta(5, 5)
logit(π_g) = …
(4)
where λ_1 and λ_2 stand for the expected animal vocabulary size for the background and
the foreground processes, respectively. Similarly, k_1 and k_2 denote accessibility coefficients for
the two processes.</p>
    </sec>
    <sec id="sec-7">
      <title>B. Poisson prevalence model</title>
      <p>To provide a comparison with the suggested vocabulary growth model, a more traditional
Poisson generalized linear model for lexical prevalence was defined and applied to the same data.
The model is designed to maximally fit the structure of the Heaps-based vocabulary growth
model. Partial pooling of the author and genre coefficients is employed as well. To optimize
inference, a non-centered model parameterization was used. The logarithm of text length is taken
into account as an exposure parameter. The formal model definition is as follows:
log(λ) = ᾱ + k_a + β_g + log n
where ᾱ is the global intercept, and k_a and β_g are the author and genre effects.</p>
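The log-link structure with text length as an exposure term can be sketched directly. The effect values below are illustrative assumptions, not estimates from the corpus.

```python
import math

# Sketch of a Poisson GLM rate with a log link and an exposure term:
# author and genre effects are additive on the log scale, and log(n)
# enters with a fixed coefficient of 1 (the exposure convention).
def poisson_rate(author_effect: float, genre_effect: float,
                 n_thousand_tokens: float, intercept: float = 0.0) -> float:
    """Expected count lambda = exp(intercept + author + genre) * n."""
    return math.exp(intercept + author_effect + genre_effect
                    + math.log(n_thousand_tokens))
```

The exposure convention makes the expected count exactly proportional to text length (doubling n doubles the rate), which is precisely the assumption the Heaps-based model relaxes via the exponent β.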
    </sec>
    <sec id="sec-8">
      <title>C. Corpus details</title>
      <p>Texts in the Detcorpus data come with a list of genre tags assigned based on bibliographic
and contextual data. For the present analysis, lists of genre tags were run through a
simplification procedure to arrive at a single label for each text. In cases where there are several genre tags for a
text, only one of them is retained. Some secondary sub-genre tags are omitted as a result, for
instance, “school novel”. If a list of genres contains the fairy tale tag, the text was always regarded
as a fairy tale. Otherwise, the first tag in the list is regarded as primary and
retained. Several genres with very sparse representation in the corpus were omitted from the
analysis (adventure, biography).</p>
      <p>The data contain texts written by 917 authors, with the majority of them (89%) represented
in a single genre only. See table 2 for details on the author and genre distribution.</p>
      <p>As a result, one can see that the composition of the dataset in terms of genres is not in any
way a balanced sample. Genre preferences changed with time, and the corpus sample is
also somewhat imbalanced diachronically, with some decades represented better than others.
The distribution of genres by decade is shown in fig. 4. Some genres are represented better
than others, with realism (a “default” genre assigned to texts without a specific genre affiliation)
spanning 53% of the works included in the corpus.

genre         n      %
realism       1588   53.0
skazka        450    15.0
detective     349    11.7
scifi         205    6.8
animalistic   164    5.5
love                 3.8
horror               2.8
fantasy              1.4

ngenres       n      %
1             813    88.7
2             85     9.3
3             15     1.6
4             2      0.2
5             1      0.1
6             1      0.1</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          “Latent Dirichlet Allocation”.
          <source>In: Journal of Machine Learning Research 3</source>
          .
          Jan
          (
          <year>2003</year>
          ), pp.
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Burrows</surname>
          </string-name>
          .
          <article-title>“'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship”</article-title>
          .
          <source>In: Literary and Linguistic Computing 17.3</source>
          (
          <year>2002</year>
          ), pp.
          <fpage>267</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Calvo Tello</surname>
          </string-name>
          .
          <article-title>The Novel in the Spanish Silver Age</article-title>
          . Wetzlar: Bielefeld University Press,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>
          [4]
          <string-name><given-names>T.</given-names> <surname>Cosslett</surname></string-name>.
          <source>Talking Animals in British Children's Fiction, 1786-1914</source>.
          New York: Routledge, <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>
          [5]
          <string-name><given-names>R.</given-names> <surname>Heuser</surname></string-name> and
          <string-name><given-names>L.</given-names> <surname>Le-Khac</surname></string-name>.
          <article-title>A Quantitative Literary History of 2,958 Nineteenth-Century British Novels: The Semantic Cohort Method</article-title>.
          <source>Pamphlets of the Stanford Literary Lab 4</source>.
          <year>2012</year>. url: http://litlab.stanford.edu/LiteraryLabPamphlet4.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>
          [6]
          <string-name><given-names>L.</given-names> <surname>Langer</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Burghardt</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Borgards</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Böhning-Gaese</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Seppelt</surname></string-name>, and
          <string-name><given-names>C.</given-names> <surname>Wirth</surname></string-name>.
          “<article-title>The Rise and Fall of Biodiversity in Literature: A Comprehensive Quantification of Historical Changes in the Use of Vernacular Labels for Biological Taxa in Western Creative Literature</article-title>”.
          In: <source>People and Nature</source> <volume>3</volume>.5
          (<year>2021</year>), pp. <fpage>1093</fpage>-<lpage>1109</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>
          [7]
          <string-name><given-names>D. C.</given-names> <surname>van Leijenhorst</surname></string-name> and
          <string-name><given-names>T. P.</given-names> <surname>Van der Weide</surname></string-name>.
          “<article-title>A Formal Derivation of Heaps' Law</article-title>”.
          In: <source>Information Sciences 170.2-4</source>
          (<year>2005</year>), pp. <fpage>263</fpage>-<lpage>272</lpage>.
          doi: 10.1016/j.ins.2004.03.006.
        </mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>
          [8]
          <string-name><given-names>K.</given-names> <surname>Maslinsky</surname></string-name>.
          <source>Replication Data for: How Exactly does Literary Content Depend on Genre? A Case Study of Animals in Children's Literature. Repository of Open Data on Russian Literature and Folklore</source>.
          Version V1. <year>2023</year>. doi: 10.31860/openlit-2023.10-R005.
        </mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>
          [9]
          <string-name><given-names>K.</given-names> <surname>Maslinsky</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Lekarevich</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Aleinik</surname></string-name>.
          <source>Corpus of Russian Prose for Children and Young Adults. Repository of Open Data on Russian Literature and Folklore</source>.
          Version V2. <year>2021</year>. doi: 10.31860/openlit-2021.4-C001.
        </mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>
          [10]
          <string-name><given-names>A.</given-names> <surname>Piper</surname></string-name>.
          “<article-title>Biodiversity is not Declining in Fiction</article-title>”.
          In: <source>Journal of Cultural Analytics 7.3</source>
          (<year>2022</year>). doi: 10.22148/001c.38739.
        </mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>
          [11]
          <string-name><given-names>A.</given-names> <surname>Piper</surname></string-name>.
          “<article-title>Fictionality</article-title>”.
          In: <source>Journal of Cultural Analytics 2.2</source>
          (<year>2016</year>). doi: 10.22148/16.011.
        </mixed-citation>
      </ref>
<ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A.</given-names> <surname>Piper</surname></string-name> and
          <string-name><given-names>S.</given-names> <surname>Bagga</surname></string-name>.
          “<article-title>A Quantitative Study of Fictional Things</article-title>”.
          In: <source>Proceedings of the Computational Humanities Research Conference. Antwerp, Belgium</source>,
          <year>2022</year>, pp. <fpage>268</fpage>-<lpage>279</lpage>.
          url: https://ceur-ws.org/Vol-3290/long_paper1576.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12b">
        <mixed-citation>
          [13]
          <string-name><given-names>H.</given-names> <surname>Ritvo</surname></string-name>.
          “<article-title>Learning from Animals: Natural History for Children in the Eighteenth and Nineteenth Centuries</article-title>”.
          In: <source>Children's Literature 13.1</source>
          (<year>1985</year>), pp. <fpage>72</fpage>-<lpage>93</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>
          [14]
          <string-name><given-names>A.</given-names> <surname>Sharma</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Shang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Singhal</surname></string-name>, and
          <string-name><given-names>T.</given-names> <surname>Underwood</surname></string-name>.
          “<article-title>The rise and fall of genre differentiation in English-language fiction</article-title>”.
          In: <source>DH2020 (ADHO) Proceedings. Amsterdam</source>,
          <year>2020</year>, pp. <fpage>97</fpage>-<lpage>114</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Underwood</surname></string-name>.
          <source>Distant Horizons: Digital Evidence and Literary Change</source>.
          Chicago: University of Chicago Press, <year>2019</year>.
          Chap. 2, pp. <fpage>34</fpage>-<lpage>67</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
<mixed-citation>
          [18]
          <string-name><given-names>T.</given-names> <surname>Underwood</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Bamman</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Lee</surname></string-name>.
          “<article-title>The Transformation of Gender in English-Language Fiction</article-title>”.
          In: <source>Journal of Cultural Analytics 2.2</source>
          (<year>2018</year>). doi: 10.22148/16.019.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>