=Paper= {{Paper |id=Vol-3290/short_paper5533 |storemode=property |title=What Do We Talk About When We Talk About Topic? |pdfUrl=https://ceur-ws.org/Vol-3290/short_paper5533.pdf |volume=Vol-3290 |authors=Joris J. van Zundert,Marijn Koolen,Julia Neugarten,Peter Boot,Willem van Hage,Ole Mussmann |dblpUrl=https://dblp.org/rec/conf/chr/ZundertKNBHM22 }} ==What Do We Talk About When We Talk About Topic?== https://ceur-ws.org/Vol-3290/short_paper5533.pdf
What Do We Talk About When We Talk About Topic?
Joris J. van Zundert1,2 , Marijn Koolen1,2 , Julia Neugarten3 , Peter Boot1 ,
Willem van Hage4 and Ole Mussmann4
1
  KNAW Huygens Institute, Amsterdam, the Netherlands
2
  DHLab, KNAW Humanities Cluster, Amsterdam, the Netherlands
3
  Radboud University Nijmegen, Nijmegen, the Netherlands
4
  eScience Center, Amsterdam, the Netherlands


                                         Abstract
                                         We apply Top2Vec to a corpus of 10,921 novels in the Dutch language. For the purposes of our research
                                         we want to understand if our topic model may serve as a proxy for genre. We 昀椀nd that topics are
                                         extremely narrowly related to an existing genre classi昀椀cation historically created by publishers. Inter-
                                         estingly we also 昀椀nd that, notwithstanding careful vocabulary 昀椀ltering as suggested by prior research,
                                         various other signals, such as author signal, stubbornly remain.

                                         Keywords
                                         literary 昀椀ction, novels, computational literary studies, topic models, top2vec,




1. Introduction
This short paper presents the preliminary results of topic-modelling 10,000+ contemporary
novels in the Dutch language, published between 2009 and 2019. The purpose of this paper
is to understand how topics yielded by topic modelling relate to genre in this corpus. This
is an important step for the ultimate aim of the project that this paper relates to, which is to
understand how reader impact is distributed across genre and topic.
   While the results of topic models are o昀琀en taken at face value as semantically meaningful,
such assumptions risk mistaking artefacts of a speci昀椀c corpus and its structure for content-
level literary topics. Only a small part of the topics resulting from our analysis relate to con-
tent words that share clearly recognizable subject matter (including words relating to football,
medicine or music). Many other topics group together language use, historical period, or geo-
graphical location. In literary studies, such groupings do not constitute the topic or theme of a
novel. We conclude that contextual information from novels in a corpus, such as the location,
historical period, or language community in which they are situated, and even author signal
co-shape topics. Additionally, we 昀椀nd that the topics we 昀椀nd are very strongly correlated to

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium
£ joris.van.zundert@huygens.knaw.nl (J. J. v. Zundert); marijn.koolen@gmail.com (M. Koolen);
julia.neugarten@ru.nl (J. Neugarten); peter.boot@huygens.knaw.nl (P. Boot); w.vanhage@esciencecenter.nl
(W. v. Hage); o.mussmann@esciencecenter.nl (O. Mussmann)
ç https://jorisvanzundert.net/ (J. J. v. Zundert); https://marijnkoolen.com/ (M. Koolen)
ȉ 0000-0003-3862-7602 (J. J. v. Zundert); 0000-0002-0301-2029 (M. Koolen); 0000-0003-3314-9445 (J. Neugarten);
0000-0002-7399-3539 (P. Boot); 0000-0002-6478-3003 (W. v. Hage); 0000-0002-3803-4287 (O. Mussmann)
                                       © 2022 Copyright for this paper by its authors.
                                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)




                                                                                                       398
genres, like horror, romance, crime, etc. Our analysis invites re昀氀ection on the usefulness of
topic modelling as a tool for computational literary studies (CLS).
  Because of copyright, we cannot share the texts of the novels on which the topic models
are based. However, we can and will eventually share the intermediate results and our code
through the GitHub repository of the project.1


2. Topic modelling and literature
Various technologies have been used in CLS and other 昀椀elds to establish the semantic value
of topics. In CLS, for example, researchers have used frequent words [20], (some variety of)
keywords [3, 19], top-down methods such as the UCREL Semantic Analysis System [18], and
systematic manual procedures [1].
   A commonly applied technique is topic modelling, which algorithmically identi昀椀es groups
of words that tend to co-occur in a large collection of documents [10]. Although introduced in
a non-昀椀ction context, this has been applied to 昀椀ction (e.g. [6, 14, 22, 15, 12, 7, 11, 25]). Topic
modelling has also been applied to the discourse of literary studies [9] and to online literature
reviews [28].
   Some previous research in CLS suggests that topic modelling literary works leads to seman-
tically cohesive topics and an overarching understanding of the subject matter represented in
those works [2]. In Macroanalysis [13], Jockers used “Themes” as the title for his chapter about
topic modelling. This title suggests that topic modelling is a technique for unearthing “the”
theme of a novel, viz. “a salient abstract idea that emerges from a literary work’s treatment of
its subject-matter” (the 昀椀rst meaning of “theme” de昀椀ned in [5]). Schröter and Du [23] suggest
that “sujet” is similar to literary topic. Lundy [15] uses topic modelling on a corpus of ca. 1,000
recent popular U.S. novels. He creates one set of general topics, analyzing entire novels at a
time. Additionally, he creates a set of more speci昀椀c topics using individual sentences from
these books. The focus of the study is on how these topics are distributed over genres. Jautze
et al. [12] topic model 400 recent Dutch works of 昀椀ction, and use the resulting topics to predict
a reader response variable (perceived literariness). Others (e.g. [11]) argue that topic models
do not generate semantically coherent descriptions of topics meaningful to CLS.
   Of particular interest here are [22], where it is argued that “strong genre signals exist […] on
the levels of function words, content words and syntactic structure” but that they “also exist
on the level of theme or topic”; [25] arguing that topics relate to structural rather than content
elements of text; and [24] where it is shown that topics strongly correlate to meta-textual
features such as author and genre.


3. Problem
There is no hard scienti昀椀c consensus on what textual features constitute genre, and as, for
instance, [29] argues there may be good reason to question canonical genre classi昀椀cations.
Topics from topic models on the other hand are notoriously hard to clearly relate to literary

1
    https://github.com/impact-and-fiction




                                               399
constructs or categories [22, 25, 24]. If we are interested in the relation between quanti昀椀able
textual features and readers’ preferences, our question becomes how bottom up topics from a
topic model relate to given genre metadata.
   We operationalise genre categories using Dutch NUR-codes (Nederlandse Uniforme Rubrieksin-
deling or Dutch Uniform Categories classi昀椀cation). These were introduced in 2002 as a market
monitoring instrument and succeed the comparable NUGI codes that have been in use for the
same reason since 1987. NUR is a practical marketing instrument devised and applied by pub-
lishers. It is largely ignored in Dutch literary culture, and it goes largely unnoticed by readers
who mostly encounter it because bookshops tend to sort and arrange their product range ac-
cording to the system [26, 27]. NUR can be regarded as a rough approximation of the concept
of genre as it is understood by booksellers and readers.
   The examination of the correlation between genre (NUR) and topic can be broken down
into several sub-questions. How is a topic related to speci昀椀c NUR codes? What is the topic
distribution of a NUR code? What is the NUR distribution of the books most associated with a
topic? As a 昀椀rst step, we determine how strongly topics are associated with NUR codes. Ideally,
such associations are unrelated to topics that emerge from our model as the artefacts of corpus
features like author signal, translation, and corpus composition as these are irrelevant with
regard to canonical genre categories.
   We make the results of our topic modelling concrete and useful to CLS in two ways: by
examining the correlations between the topics detected in our corpus and one possible op-
erationalisation of genre, and by re昀氀ecting on the usefulness of topic modelling for literary
analysis in light of our results.


4. Method
4.1. Data
Courtesy of an agreement with the Dutch national library and seven Dutch publishing houses
(representing multiple publishers) we have access to the full text of 10,921 Dutch-language
novels, published between 2009 and 2019 in the Netherlands. Table 1 lists a number of general
statistics for the corpus used in this research.

Table 1
Corpus statistics
                                                                 Per book
             Element                 Number      Min     Max    Median    Mean      Std.dev
             Novels                     10,921     1        1         1         1         1
             �㕊 �㕖�㕛�㕑�㕜�㕤�㕠5000       153,553     1      104      14.1      11.0      11.3
             Paragraphs             24,356,023     1    21355    2230.2    1668.0    2000.9
             Sentences             104,511,706     1    80140    9569.8    7213.0    8322.8
             Words                 931,220,543     1   655744   85268.8   66577.0   71191.9

   The collection is based on the EPUBs deposited by publishers at the national library. There-
fore, it is a subset of all books published in the Netherlands in this period. The collection is




                                                       400
skewed towards more recent books (see Figure 2), primarily because more books are now being
made available as ebooks, not because of an increase in publications. The corpus consists of
both Dutch novels and at least 2,199 translated novels, mainly from English, German, French
and various Scandinavian languages, as well as smaller representations of languages such as
Spanish, Italian, and Japanese.




Figure 1: Distribution of book length in number of words for 10,921 Dutch novels




Figure 2: Number of books in the corpus by publication year.


   Figure 1 shows the distribution of book lengths in number of words per book. There is
one sharp peak around 50,000 words and a lower, less pronounced peak around 90,000 words,
roughly coinciding with the conventional lengths of novellas (~80-120 pages) and “full” novels
(~300-500 pages). A small number of books are very short. Of the 10,921 novels, 216 novels (2%)
are shorter than 1,000 words, and 463 novels (4%) are between 1,000 and 10,000 words. Some of
these may be picture books, children’s books, or collections of poetry. Some unusually short
books are regular-sized novels for which the text extraction step did not work properly. These
are mostly EPUB “incunables”. Apparently publishers needed to get used to the EPUB format
and so early EPUBs o昀琀en have poor 昀椀le and content structure.




                                               401
Figure 3: Distribution of books over NUR genre classes


4.2. Preprocessing
Based on existing research [12, 25], we pre-process the novel texts to remove person names and
use lemmas rather than full word forms. We tokenise and parse all novels using SpaCy 3.32
and remove all word tokens that are part of person entities. SpaCy inserts underscores in lem-
mas for certain compound words, but sometimes fails to split the words correctly. E.g. for the
Dutch word “boekhandel” (en: book shop), the SpaCy lemma is almost always “boek_handel”,
but sometimes SpaCy assigns the lemma “boekh_andel”. Therefore, we post-process the lem-
mas by removing underscores. Based on a sample of the 1,000 most common lemmas contain-
ing underscores and the variants containing underscore in a di昀昀erent position, we found that
removing the underscores rarely con昀氀ates words with di昀昀erent meanings.
   We remove common words based on their document frequency, as such words tend to ap-
pear in the majority of topics and thus have no discriminating e昀昀ect between topics. We also
remove lemmas that occur in very few books. These tend to include many speci昀椀c names
(many character names are not recognised by SpaCy as person named entities), and very rare
vocabulary used by only one or a few authors. This means that we remove lemmas that occur
in fewer than 1% or more then 10% of books (fewer than 103 books or more than 1,030 books).
This leaves a vocabulary of 36,927 lemmas.
   This setup represents a trade-o昀昀 between corpus coverage and the ability to generate di昀昀er-
entiating topics. For future evaluation we aim to vary this bandwidth to gauge the stability of

2
    See https://spacy.io




                                               402
the topic models inferred (see also 6).

4.3. Segmentation
Next, we need to choose a unit of measure. Based on prior research We test two di昀昀erent unit
sizes: the whole novel as a document, and documents constructed from joining a sequence of
paragraphs in a novel into segments containing at least 5,000 words. Using whole documents
yields rather few topics (95) to relate to 731 existing NUR codes, while the latter choice results
in more and smaller documents, and 1,182 topics. Further details and analysis of these di昀昀erent
unit are discussed in Appendix A.

4.4. Topic modelling
Given the number and size of documents (or segments) in a corpus, it is di昀케cult to decide the
minimum number of relevant topics. [15] used LDA3 on a set of 1,136 novels and chose 60 as
the optimal number of topics based on the fraction of topics they could meaningfully interpret.
[12] used LDA on lemmatised 1,000-word segments of 401 Dutch novels and set the number
of topics to 50. [25] segmented novels into 300 to 500 word segments, and used a pointwise
mutual information (PMI) and a cosine based coherence measure to observe that “as a rule of
thumb [...] the number of topics should lie between 100 and 150” [25, p.66].
   In this research we applied Top2Vec [4], which in recent studies has compared favorably
to LDA and other techniques [21], [8]. Models generated using LDA or PLSA4 generate a
pre-determined number topics as distributions of words. This o昀琀en means that topically unin-
formative words have high probabilities in the topics, since they make up a large proportion
of all text. In Top2Vec, joint document and word embeddings are learned in a 300-dimensional
space, which is projected onto a low-dimensional space using UMAP [17], a昀琀er which HDB-
SCAN [16] is used to detect dense document clusters, which determine the number of topics.
This ensures that the words nearest a topic vector best describe the topic and its surrounding
documents [4, p.3].


5. Results
In this 昀椀rst analysis we use the full corpus, lemmatise tokens, drop persons names (PER accord-
ing to SpaCy), and cull lemmas appearing in less than 1% or over 10% of all novels. Top2Vec
was used in its ’fastlearn’ mode with 8 workers and a standard multilingual universal sentence
encoder.5 When using full novels as unit of measure, this resulted in 95 topics. For 5,000 token
windows, Top2Vec generated 1,182 topics.
   Labeling the topics that result from topic modelling is a complicated and subjective process
which we believe requires annotation and an assessment of inter-annotator agreement. Such


3
  Latent Dirichlet allocation, a common approach to topic modelling, see https://en.wikipedia.org/wiki/Latent_Dir
  ichlet_allocation
4
  See https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
5
  https://tfhub.dev/google/universal-sentence-encoder-multilingual/3




                                                      403
assessment falls outside the scope of the current paper. For this reason, we have chosen to
number topics here, rather than give them a semantically meaningful label.




Figure 4: Heatmap of books most associated with a topic (x-axis) and their distribution across NUR
codes (y-axis), given entire novels as document unit for topic modelling




Figure 5: Heatmap of books most associated with a topic (x-axis) and their distribution across NUR
codes (y-axis), given text segments of 5,000 tokens for topic modelling


  Each book has a nearest topic neighbour, which Top2Vec uses to determine the size of a
topic. This way, each topic has a size, expressed as the number of books for which that topic




                                               404
is the nearest neighbour. This allows us to investigate how strongly topics are associated with
the NUR codes: we look at the distribution of NUR codes as the percentage of a topic’s most
associated books that are assigned that NUR code. For instance, the 昀椀rst topic is the nearest
topic for 1,655 of the novels, and of these 1,643 (99%) are labelled with NUR code 343 Romance.
This topic is therefore strongly associated with a single NUR code. We investigate the associ-
ation of topics and NUR code using a heatmap (see 昀椀gure 4). The darker a cell, the stronger
the topic (x-axis) is associated with a genre (y-axis). Many topics strongly associate with one
or two NUR codes. One observation we derive from this is that topic as inferred by Top2Vec
is strongly associated with publishers’ choices for NUR genre code. This observation can be
further corroborated if we use UMAP to create a visualisation of topic clusters where we color
each vector according to the NUR it most closely associates with (see 昀椀gure 6).




Figure 6: UMAP reduction of topic clusters, topics colored by NUR.


   As we have seen in the previous section, the number of topics increases if the unit of measure
decreases. If we topic model at the level of whole novels we 昀椀nd only 95 topics. With 5,000
word segments, the resulting model has 1,182 topics, and the association between topic and
NUR code is even stronger (see Figure 5). Both the size of the segments (roughly comparable
to the size of chapters) and the number of topics found, coincide more closely with intuitions
about how literary topics may function. That is, from a literary criticism point of view we
would expect topics to be bound more narrowly to chapters or paragraphs rather than to a
book as a whole, while 95 topics would seem a tiny set of topics to cover a corpus of over
10,000 novels.


6. Discussion
What becomes clear from the results as depicted in Figures 4 and 5, is that topics generated by
Top2Vec are extremely narrowly related to genre, with many topics almost exclusively related
to one genre. Furthermore, topics that are strongly associated with the genres “Dutch liter-
ary novel” and “Translated literary novel” turn out to contain a large number of geographical




                                               405
Figure 7: UMAP reduction of topic clusters, topics colored by 100 most prolific authors, others in grey.


indicators (cf. appendix B).
   Our results con昀椀rm 昀椀ndings from [22, 25, 24]. In all this means that topics as generated
by Top2Vec across our corpus will be an adequate proxy for genre in the course of our project.
Thus our current result can be summarised as “when we talk about topic modelling we actually
talk about genre”.
   Like the results of [22, 25, 24], our results give pause to consider that topics generated
through topic modelling techniques are much more related to signals of genre than to semantic
昀椀elds that literary researchers would consider topical and relevant. Similarly we consider that
although geography can be topical for a novel, geography related signals seem much stronger
than their relevance for literary analysis would warrant. Most salient is the observation that,
even though we followed [25] in carefully removing function words and author speci昀椀c vocab-
ulary, we still 昀椀nd that topics strongly coincide with author if we recolor 昀椀gure 6 according to
author (see 昀椀gure 7). On the one hand this may mean that authors keep within genre, on the
other it means it is still hard to decide what topic model topics relay to us.
   So far, we have only looked at one topic modelling technique (Top2Vec) and two segment/document
sizes. Additionally, our 昀椀ndings are limited because we used only Dutch NUR coding as a genre
target. For now, we have also disregarded the skewed makeup of our corpus, in which the ro-
mance genre is severely over-represented (cf. 昀椀gure 3). We still need to evaluate the e昀昀ects
of using a di昀昀erent corpus balancing, di昀昀erent document sizes, isolating subgenres, di昀昀erent
topic modelling techniques such as classic LDA, and di昀昀erent genre labels.
   In our current corpus, topic turns out to be strongly associated with genre as labeled by




                                                 406
Dutch publishers. Our next step will be to determine the distribution of topics across di昀昀erent
NUR genres. A昀琀er that we aim to gauge how features of reader reviews relate to the topics we
found.


References
 [1] M. J. Adler. The Great Ideas: A Syntopicon of Great Books of the Western World. Vol. 2.
     Encyclopaedia Britannica, 1952.
 [2] M. Algee-Hewitt, R. Heuser, and F. Moretti. Stanford Literary Lab Pamphlet 10: On Para-
     graphs. Scale, Themes, and Narrative Form. 2015. url: https://litlab.stanford.edu/Literar
     yLabPamphlet10.pdf.
 [3] D. Allington. “Customer Reviews of ‘Highbrow’ Literature: A Comparative Reception
     Study of The Inheritance of Loss and The White Tiger”. In: American Journal of Cultural
     Sociology 9.2 (2021), pp. 242–268.
 [4] D. Angelov. “Top2Vec: Distributed Representations of Topics”. In: arXiv preprint arXiv:2008.09470
     (2020).
 [5] C. Baldrick. The Oxford Dictionary of Literary Terms [online]. Oxford University Press,
     2015.
 [6] K. Bode. “”Man people woman life” - “Creek sheep cattle horses”: In昀氀uence, Distinction,
     and Literary Traditions”. In: A World of Fiction: Digital Collections and the Future of Lit-
     erary History. University of Michigan Press, 2019, pp. 157–197.
 [7] R. S. Buurma. “The Fictionality of Topic Modeling: Machine Reading Anthony Trollope’s
     Barsetshire Series”. In: Big Data & Society 2.2 (2015), p. 2053951715610591.
 [8] R. Egger and J. Yu. “A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and
     BERTopic to Demystify Twitter Posts.” In: Frontiers in sociology 7 (2022), p. 886498. doi:
     10.3389/fsoc.2022.886498.
 [9] A. Goldstone and T. Underwood. “The Quiet Transformations of Literary Studies: What
     Thirteen Thousand Scholars Could Tell Us”. In: New Literary History 45.3 (2014), pp. 359–
     384.
[10]   A. Goldstone and T. Underwood. What Can Topic Models of PMLA Teach Us About the
       History of Literary Scholarship? 2012. url: https://tedunderwood.com/2012/12/14/what-
       can-topic-models-of-pmla-teach-us-about-the-history-of-literary-scholarship/.
[11]   R. Heuser and L. Le-Khac. A Quantitative Literary History of 2,958 Nineteenth-Century
       British Novels: The Semantic Cohort Method, Literary Lab Pamphlet 4. 2018.
[12]   K. Jautze, A. van Cranenburgh, C. Koolen, et al. “Topic Modeling Literary Quality”. In:
       Dh. 2016, pp. 233–237.
[13]   M. L. Jockers. Macroanalysis: Digital Methods and Literary History. University of Illinois
       Press, 2013.
[14]   M. L. Jockers and D. Mimno. “Signi昀椀cant Themes in 19th-Century Literature”. In: Poetics
       41.6 (2013), pp. 750–769.




                                              407
[15]   M. Lundy. “Text Mining Contemporary Popular Fiction: Natural Language Processing-
       Derived Themes Across Over 1,000 New York Times Bestsellers and Genre Fiction Nov-
       els”. PhD thesis. University of South Carolina, 2020.
[16]   L. McInnes, J. Healy, and S. Astels. “HDBscan: Hierarchical Density Based Clustering”.
       In: Journal of Open Source So昀琀ware 2.11 (2017), p. 205.
[17]   L. McInnes, J. Healy, N. Saul, and L. Großberger. “UMAP: Uniform Manifold Approxima-
       tion and Projection”. In: Journal of Open Source So昀琀ware 3.29 (2018), p. 861.
[18]   D. McIntyre and D. Archer. “A Corpus-based Approach to Mind Style”. In: (2010).
[19]   J. Misset. “Replete with instruction and rational amusement”?: Unexpected Features in the
       Register of British Didactic Novels, 1778–1814. 2022.
[20]   F. Pianzola, S. Rebora, and G. Lauer. “Wattpad as a Resource for Literary Studies. Quanti-
       tative and Qualitative Examples of the Importance of Digital Social Reading and Readers’
       Comments in the Margins”. In: PloS one 15.1 (2020), e0226708.
[21]   E. Saral and R. G. Alhama. “A Topic Modeling Study of the COVID-19 Impact in an Online
       Eating Disorder Community in Reddit”. In: Tilburg: Tilburg University, 2022. url: https:
       //clin2022.uvt.nl/a-topic%20-modeling-study-of-the-covid-19-impact-in-an-online-eati
       ng-disorder-community-in-reddit/.
[22]   C. Schöch. “Topic Modeling Genre: An Exploration of French Classical and Enlighten-
       ment Drama.” In: DHQ: Digital Humanities Quarterly 11.2 (2017).
[23]   J. Schröter and K. Du. “Validating Topic Modeling as a Method of Analyzing Sujet and
       Theme.” In: Journal of Computational Literary Studies 1 (2022).
[24]   L. Thompson and D. Mimno. “Authorless Topic Models: Biasing Models Away from
       Known Structure”. In: Proceedings of the 27th International Conference on Computational
       Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018,
       pp. 3903–3914. url: https://aclanthology.org/C18-1329.
[25]   I. Uglanova and E. Gius. “The Order of Things. A Study on Topic Modelling of Literary
       Texts.” In: Chr 18-20 (2020), p. 2020.
[26]   K. Van Rees, S. Janssen, and M. Verboord. “Classi昀椀catie in het culturele en literaire veld
       1975-2000: Diversi昀椀catie en nivellering van grenzen tussen culturele genres”. In: Produc-
       tie van literatuur. Het literaire veld in Nederland 1800-2000. Ed. by G. Dorleijn and K. Van
       Rees. Nijmegen: Vantilt, 2006, pp. 239–283.
[27]   “Nur”. In: Algemeen letterkundig lexicon. Ed. by G. Vis, P. Verkruijss, H. Van Gorp, D. De-
       labastita, G. Van Bork, L. Bernaerts, F. Willaert, E. Op de Beek, and N. Geerdink. Digitale
       Bibliotheek voor de Nederlandse Letteren, 2012. url: https://www.dbnl.org/tekst/dela0
       12alge01%5C%5F01/dela012alge01%5C%5F01%5C%5F01441.php.
[28]   M. Walsh and M. Antoniak. “The Goodreads “Classics”: A Computational Study of Read-
       ers, Amazon, and Crowdsourced Amateur Criticism”. In: Journal of Cultural Analytics 4
       (2021), pp. 243–287.




                                               408
[29]   M. Wilkens. “Genre, Computation, and the Varieties of Twentieth-Century U.S. Fiction”.
       In: CA Journal of Cultural Analytics 2.2 (2016). doi: 10.22148/16.009. url: https://cultura
       lanalytics.org/article/11065.


A. Unit of Measure
A.1. Segmentation
From the perspective of literary studies, it is illogical to bind topic to the full text of a novel. A
novel likely touches on a multitude of topics, so a division into chapters, sections, paragraphs
or even sentences might yield more useful topics. However, segmenting whole novels into
minimum-sized windows has consequences for the co-occurrence of words and the number of
topics that will be detected. By segmenting a whole novel, the words within one segment no
longer co-occur with the words in another segment from the same novel. Therefore, the word
co-occurrence matrix becomes more sparse, so topic modelling algorithms identify more and
smaller dense clusters, resulting in more topics.
   We investigate the impact of segmenting on the co-occurrence of words by analysing how
the number of co-occurring pairs of lemmas increases as we iterate over novels, using either
the whole novels as document boundaries, or segments constructed from joining a sequence of
paragraphs in a novel into segments containing at least 5,000 words. Figure 8 shows the result.
The X-axis shows the number of lemmas (a昀琀er 昀椀ltering out the most and least frequent lemmas
as described above) and the Y-axis shows how many distinct co-occurring pairs of lemmas are
found.
   The di昀昀erence between segmenting and not segmenting results generates a clear picture. At
the level of whole novels, well over 196 million co-occurrence pairs are indexed a昀琀er having
seen 300,000 lemmas (corresponding to about 250 novels). For the segmented novels, at the




Figure 8: The di昀昀erence between full novel and a window size of 5,000 tokens on the number of co-
occurring lemmas




                                                409
same 300,000 lemma point, there are only 18.3 million pairs. That is a full order or magnitude
less. Regarding a choice for unit of measure, this means that we have to carefully investigate
the trade-o昀昀 that exists between the size of documents fed to a topic modelling algorithm (i.e.
whole novels or smaller segments) and the number of topics returned.


B. Topic examples
The following are examples from topics, generated by Top2Vec at the level of whole novels,
that show a high concentration of coherent geographical lemmas. Lemmas strongly related to
a coherent geographical location are in bold.

Topic number: 4
nochtans, komaan, stilaan, gsm, parking, job, vooraleer, miserie, kot, brussels, antwerps, euh,
proper, gsmnummer, goesting, bijgevolg, verdict, plezant, antwerpen, vlaams, voormiddag,
speurder, voordien, zaventem, oostende, 昀氀ik, ontgoocheling, zogezegde, deurgat, gent, au-
tosnelweg, voorhand, pv, gents, crapuul, evolueren, zijt, knokke, vlaming, sukkelaar, meche-
len, recupereren, meermaals, contacteren, evident, nonkel, allez, klasseren, rijkswacht, schelde

Topic number: 5
vondelpark, schiphol, leidseplein, bitterbal, shag, know, like, lullen, please, gadverdamme,
it, amsterdams, snot, never, grachtenpand, can, kroket, amsterdam, goor, spuug, there,
veegt, kut, ie, see, bh, amstelveen, 昀椀etspad, plee, geilheid, wc, gezeik, rouwkaart, quote,
almere, sure, tje, fucking, is, ehm, hilversum, only, randstad, drop, lacherig, koninginnedag,
too, zeiken, opschuden, poep

Topic number: 11
zo, polder, hbs, jenever, haarlem, zoldering, stationsplein, ballpoint, tramhalte, schevenings,
verveloos, scheveningen, vondelpark, shag, rotterdam, waartussen, grammofoon, zand-
voort, rijksdaalder, arnhem, hongerwinter, jongensboek, ijsselmeer, gymnasium, schemer,
schoolschri昀琀, ijl, schemren, brokkelig, sto昀樀as, amstel, klomup, vitrage, bakeliet, sigarenwinkel,
schrijfmachine, vergelen, bovenhuis, rui, plantsoen, brilleglas, groningen, celluloid, windstil,
trapper, leidseplein, vooroorlogs, veraf, allengs, wassenaar

Topic number: 15
stockholm, zweden, kronen, kopenhagen, oslo, noorwegen, denemarken, deens, noors,
midzomer, 昀椀ns, zweed, vooronderzoek, line, verhoren, ordner, politieacademie, lichtkegel, le-
gitimatie, volvo, 昀椀nland, strafregister, rechercheteam, 昀樀ord, kelderruimte, nor, wide, poli-
tiemen, hoofdbureau, moordzaken, geweldsdelict, villawijk, scandinavisch, messing, goddomme,
avondkrant, rijkweg, smeriss, opsporing, bewakingscamera, scandinavie, freule, pagekapsel,
nova, zomerhuis, oostzee, sankt, thomas, moordonderzoek, sterfgeval




                                               410