<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Literary text mining using verb feature clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marianna Dilai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Dilai</string-name>
          <email>iryna.dilay@lnu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ivan Franko National University of Lviv</institution>
          ,
          <addr-line>Universytetska Street, 1, Lviv, 79000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Stepan Bandera Street, 12, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>MoDaST-2024: 6th International Workshop on Modern Data Science Technologies</institution>
          ,
          <addr-line>May, 31 - June, 1, 2024, Lviv- Shatsk</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this study, we explore text clustering techniques applied to a corpus of works by the renowned Canadian postmodernist writer Margaret Atwood. Leveraging unsupervised machine learning methods, we investigate the thematic affinities within her literary legacy. Our approach involves employing document n-gram embeddings and bag-of-words clustering algorithms to analyze the structural similarities among the works. Additionally, we introduce a novel feature-based clustering model focusing on verbs, essential elements in English sentence structure and meaning formation. We assess the performance of verb-centered clustering through experimentation and evaluation, including the use of logistic regression classifiers and Rand index calculation. Ultimately, our findings shed light on the prevailing topics and thematic patterns permeating the author's diverse literary oeuvre, offering insights for computational text mining methodologies and literary analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>text mining</kwd>
        <kwd>machine learning</kwd>
        <kwd>clustering</kwd>
        <kwd>topic modeling</kwd>
        <kwd>logistic regression</kwd>
        <kwd>prediction</kwd>
        <kwd>verb</kwd>
        <kwd>distant reading</kwd>
        <kwd>literary text</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Text is one of the most common and sophisticated types of data. Text mining as a type of
large-scale data mining has become a promising direction in data science, combining NLP,
machine learning, and information retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Transforming the unstructured text into a
structured one and revealing hidden meaningful structural patterns enables the discovery
of high-quality insights. Text mining is aimed at capturing key concepts, topics, trends, and
latent relationships in vast collections of text material.
      </p>
      <p>
        Recently, more and more attention has been paid to the computer-aided study of literary
text. Computational methods are being applied to analyze large libraries of literary data.
The most prominent of them is the method of ‘distant reading’ proposed by Franco
Moretti [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as a computer-assisted mode of reading that provides abstraction and the
reader’s/researcher’s detachment in viewing, processing, and interpreting the text.
Normally, distant reading encompasses topic modeling, stylometry, and network analysis.
Modern distant reading resorts to machine learning methods and techniques to produce
robust computational findings and predictions.
      </p>
      <p>This study aims to harness literary text mining techniques by testing a feature clustering
model capable of eliciting the prevalent thematic patterns and measuring the affinity among
the literary works of the postmodern Canadian writer Margaret Atwood.</p>
      <p>The tasks set here are as follows:</p>
      <list list-type="bullet">
        <list-item><p>To create and process a corpus of works by M. Atwood.</p></list-item>
        <list-item><p>To elaborate an efficient text analysis model that can be used for text classification predictions.</p></list-item>
        <list-item><p>To test the performance of verb-based text clustering, assuming that verbs constitute meaningful and informative text structure features.</p></list-item>
        <list-item><p>To account for the thematic patterns and topics generated by the model.</p></list-item>
      </list>
      <p>Viewing fiction as information and applying text mining can shed light on sociocultural
topics not immediately clear to readers/researchers. The new mode of reading, where the
AI performs the function of an active participant, rather than a tool, opens new prospects
for the literary revolution and requires boosting machine learning capabilities applied to
literary text data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>By synthesizing existing research and scholarship on text mining with a focus on verb
feature clustering, we seek to show the current state of the field and justify our methodology
choice.</p>
      <p>
        Literary text mining, a burgeoning field at the intersection of computational linguistics
and literary studies, offers possibilities for uncovering hidden linguistic patterns within
literary texts. As a result, it has been addressed by several modern researchers [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3-5</xref>
        ].
Authorship attribution methods were especially popular at the beginning of the 21st
century [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Now, automatic text mining reveals genre-dependent features and is used for
topic modeling [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5-7</xref>
        ] and cataloging documents [
        <xref ref-type="bibr" rid="ref8 ref9">8-9</xref>
        ], author’s stylometry, and, most
recently, distant reading [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. F. Moretti initially applied the distant reading approach, relying
on computational techniques to represent the perspective of the "great unread", a massive
collection of literary texts, and strove to quantitatively analyze the titles of the
novels and classify them [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Clustering algorithms take center stage as they play a crucial role in organizing text
documents into meaningful clusters. From the classic K-means to the more nuanced
hierarchical clustering, each algorithm has shown its efficacy in revealing thematic
structures. Agglomerative clustering is a type of hierarchical clustering that follows a
bottom-up approach. Normally, Euclidean distance and the Ward linkage method are used
to measure the distance between the clusters.</p>
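      <p>For illustration, a minimal Python sketch of such bottom-up clustering with Ward linkage over Euclidean distances could use SciPy; the 16×300 document vectors below are random placeholders, not the study's data:</p>
      <preformat>
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder document vectors; in practice these would be BoW or embedding features.
doc_vectors = np.random.rand(16, 300)

# Ward linkage merges clusters bottom-up, minimizing the increase in
# within-cluster variance; it implies Euclidean distance between observations.
Z = linkage(doc_vectors, method="ward")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
      </preformat>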
      <p>
        In order to reduce dimensionality and visualize the data in a low-dimensional space,
the t-distributed stochastic neighbor embedding (t-SNE) method, an unsupervised
machine learning technique, is used [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
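      <p>A minimal sketch of such a projection with scikit-learn (an assumption for illustration; the study itself uses the Orange t-SNE widget) could look as follows:</p>
      <preformat>
import numpy as np
from sklearn.manifold import TSNE

# Placeholder 300-dimensional vectors for 16 documents.
doc_vectors = np.random.rand(16, 300)

# Perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(doc_vectors)

# coords has shape (16, 2) and can be scatter-plotted as a document map.
print(coords.shape)
      </preformat>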
      <p>
        While preprocessing is aimed at cleaning unstructured text, e.g., eliminating stop words,
feature selection converts the text into structured data. One of the common ways of feature
selection is a bag of words where words are represented as vectors. For instance, M. Short
used this method to leverage machine learning and information extraction to assign subject
headings to dime novels [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The first model to perform classification and predict labels for fiction test sets was tested
in 2013 using Weka 3 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Another popular machine learning tool used for text data mining
is a Java application for classification called MALLET [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It uses Latent Dirichlet Allocation
(LDA) for topic modeling and was, in particular, leveraged to analyze contemporary popular
fiction, over 1,000 New York Times bestsellers, and genre fiction novels [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The vector space model relies on a bag of words or n-grams in the text. Feature selection
is believed to reduce redundancy in the representation of text data and save computational
time. N-grams are often selected as features that successfully represent text for stylistic
purposes [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Nonetheless, other meaningful features can be leveraged to facilitate
automatic text mining and aptly classify meaningful information.
      </p>
      <p>The major problems facing researchers nowadays pertain to the need for dimensionality
reduction and boosting the performance of machine learning algorithms applied to text
mining.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and materials</title>
      <p>
        Topic modeling is viewed as a type of ‘soft’ clustering of text documents [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. It belongs to
unsupervised machine learning techniques and is based on the structural similarity (or
difference) in data. Statistical measures are applied to calculate the distance between the
documents. In our study, we rely on the vectors obtained for each document by aggregation
of n-gram embeddings. Documents close to one another have similar embeddings and can
be considered semantically (thematically) related.
      </p>
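      <p>The aggregation idea can be illustrated with the following minimal sketch, where embed is a hypothetical lookup standing in for a pre-trained fastText model:</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
vocab_vectors = {}  # hypothetical embedding table

def embed(ngram: str) -> np.ndarray:
    # Stand-in for a fastText lookup; returns a fixed random 300-d vector per n-gram.
    if ngram not in vocab_vectors:
        vocab_vectors[ngram] = rng.standard_normal(300)
    return vocab_vectors[ngram]

def document_vector(ngrams: list[str]) -> np.ndarray:
    # One vector per document: the mean of its n-gram embeddings (Mean aggregator).
    return np.mean([embed(g) for g in ngrams], axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = document_vector(["the handmaid", "handmaid 's", "'s tale"])
d2 = document_vector(["the testaments", "testaments sequel"])
print(cosine_similarity(d1, d2))  # higher values suggest thematic affinity
      </preformat>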
      <p>Though n-gram embeddings are generally accepted as a reliable text clustering
technique, we also test and compare other clustering techniques here. In particular, we
presume that text clustering can be done based on the most informative features. As far as
content structuring is concerned, such features can be all verbs (verb forms) in the text.
Given the pivotal role of the verb in the sentence as the predicate, the low likelihood of
its omission, the high semantic load of lexical verbs, and their central role in the
narrative as plot building blocks, we assume that verbs contribute significantly to
uncovering the content and theme of the text. As a result, they can serve as meaningful
features for document clustering.</p>
      <p>
        The automated data processing is performed here by applying Orange, a machine
learning and data mining suite for data analysis through Python scripting [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It was
developed by the Laboratory of Bioinformatics, Faculty of Computer Science, University of
Ljubljana. We use Orange 3.36.2, which contains several text mining widgets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>The material of the research is a corpus of 19 books (documents) by the prominent
Canadian postmodernist writer Margaret Atwood. It includes the novels “The Edible
Woman” (1969), “Surfacing” (1972), “Lady Oracle” (1976), “The Handmaid’s Tale” (1985),
“Cat’s Eye” (1988), “Alias Grace” (1996), “The Blind Assassin” (2000), “Oryx and Crake”
(2003), “The Year of the Flood” (2009), “MaddAddam” (2013), “The Testaments” (2019), “The
Heart Goes Last” (2015), a novella “The Penelopiad” (2005), the collections of short fiction
“Murder in the Dark” (1983), “Bluebeard’s Egg and Other Stories” (1983), “Wilderness Tips”
(1991), “Good Bones” (1992), “The Tent” (2007), and the non-fiction “Payback” (2008).
The total size of the corpus is 294,606 tokens.</p>
      <p>In order to check the thematic affinity between the novels and pinpoint the principal
topics in the literary legacy of M. Atwood by utilizing computational procedures, some
assumptions can be made. First, we predict that the works written consecutively within a
certain period of literary activity have more similarities. Besides, the parts of the
“MaddAddam” trilogy (“Oryx and Crake” (2003), “The Year of the Flood” (2009), and
“MaddAddam” (2013)) presumably share the same thematic scope and characters, as do “The
Handmaid’s Tale” and its sequel “The Testaments”. On a broader scale, the topics raised by
the author are related to feminism, ecology, ecofeminism, self-identity, dystopian societies,
and power and oppression. However, their distribution and salience across the works of the
author differ.</p>
      <p>The applied text mining model splits the corpus of texts into two datasets: a training
dataset (245,084 tokens; 23,863 types) – a corpus of 16 works by M. Atwood – and a testing
dataset consisting of three works (49,522 tokens; 12,211 types). The training text dataset
with unlabeled data is used to conduct unsupervised document clustering. The testing text
dataset includes the later works of the author, which might combine a variety of the
previous topics and/or can be hard to classify: “The Penelopiad” (2005), “MaddAddam”
(2013), the last book of the trilogy, and “The Blind Assassin” (2000), which stands out in the
literary legacy due to its complex narrative structure (novel-within-a-novel), historical
context, and intricate plot construction.</p>
      <p>The experiment consists of the following interrelated stages:</p>
      <list list-type="order">
        <list-item><p>Text preprocessing (performed by Orange to eliminate the noise).</p></list-item>
        <list-item><p>Clustering of the training dataset (16 documents) is implemented by applying document n-gram-based embedding.</p></list-item>
        <list-item><p>The results are compared to the clustering of the training dataset by applying a bag of words.</p></list-item>
        <list-item><p>Visualizations are provided in the form of dendrograms and visualization maps.</p></list-item>
        <list-item><p>The word clouds representing the clusters visualize the prominence of the most frequent lexical items in the texts.</p></list-item>
        <list-item><p>The clusters are analyzed in terms of the topic affinity based on the common semantic content.</p></list-item>
        <list-item><p>A logistic regression classifier is applied.</p></list-item>
        <list-item><p>Predictions for the testing dataset are made based on the trained model.</p></list-item>
      </list>
      <p>This text mining model is enhanced by selecting the most informative features and thus
saving computational time by reducing the size of data. Though our dataset is not of a
particularly large size, testing a feature-based document clustering model can be beneficial
for larger datasets.</p>
      <p>The feature tested in this study is the verb as a pivotal element of the English sentence
structure and meaning construction. Thus, we have repeated the same stages of the analysis
with a reduced, verb-filtered dataset, compared the results with the results obtained from
the bag of words-based clustering, and assessed the performance of the verb-centered
clustering model. For part-of-speech tagging, a Penn Treebank POS tagger has been used.</p>
      <p>The performance of the verb-centered model is measured by calculating the Rand index.
Finally, we draw conclusions about the prevailing topics generated by the model and
interpret the findings.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>The experimental part consists in elaborating a text mining model that relies on the
stepwise combination of unsupervised machine learning methods, such as clustering, and
supervised machine learning methods, such as logistic regression, as well as training and
testing a model for predictions.</p>
      <sec id="sec-4-1">
        <title>4.1. Text analysis model</title>
        <p>The corpus of texts (16 novels) was loaded into the model using the Import Documents
widget. The Corpus Viewer widget allows us to see all the texts in the corpus. An important
initial stage of the analysis is text preprocessing, which is done with the help of the
Preprocess Text widget. It runs transformation (lowercase), tokenization (Regexp),
normalization (lemmatization by Lemmagen Lemmatizer), and filtering (stopwords,
numbers).</p>
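        <p>The same pipeline can be approximated outside Orange, e.g., with NLTK; this is a hedged sketch, not the widget's implementation, and WordNet's lemmatizer is substituted here for the Lemmagen lemmatizer that Orange uses:</p>
        <preformat>
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

tokenizer = RegexpTokenizer(r"\w+")
lemmatizer = WordNetLemmatizer()
stops = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    tokens = tokenizer.tokenize(text.lower())           # transformation + tokenization
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # normalization
    # filtering: stopwords and numbers
    return [t for t in lemmas if t not in stops and not t.isdigit()]

print(preprocess("She was knitting 3 scarves in the evenings."))
        </preformat>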
        <p>The clustering implementation via Document Embedding (1) and Bag of Words (2), together
with the overall workflow, is depicted in Figure 1.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1. Document embedding</title>
        <p>
          The Document Embedding widget embeds documents from the corpus into vector space by
using pre-trained fastText models [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The Document Embedding parses n-grams of each
document in the corpus, obtains embedding for each n-gram using the pre-trained model,
and obtains one vector for each document by aggregation of n-gram embeddings using
Mean aggregator. In the Data Table, we can see 300 features for 16 instances. These features
(vectors) are compared to find similar documents. We visualize documents on the map
using two-dimensional data projection with t-SNE. The t-SNE widget plots the data with the
t-distributed stochastic neighbor embedding method. t-SNE is a dimensionality reduction
technique, similar to MDS, where points are mapped to 2-D space by their probability
distribution (parameters: Exaggeration 3, PCA components 20). The fastText embedding is shown in
Figure 2. Documents close to each other on the t-SNE map have similar embeddings and are
viewed as semantically related. The content of the clusters on the map can be explored in
the Word Cloud. The clusters and their word clouds are presented in Figures 3-5.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.1.2. Clustering by using Bag of Words</title>
        <p>We turn the text into a numerical representation, counting how many times each word
occurs in the text. This approach is called Bag of Words. Figure 6 shows Bag of Words
outputs in Data Table (23,862 features for 16 texts, term frequencies for each document).</p>
        <p>Then we connect the Bag of Words to the Distances widget (cosine distance, which is
1 − cosine similarity) and Hierarchical Clustering. The results of clustering are given in Figure 7.</p>
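        <p>A minimal scikit-learn/SciPy sketch of this Bag of Words, cosine distance, and hierarchical clustering chain (with placeholder texts standing in for the corpus) might read:</p>
        <preformat>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Placeholder texts standing in for the 16 works of the training corpus.
documents = ["she knitted and cooked", "he wrote and drew", "they shopped and dressed"]

counts = CountVectorizer().fit_transform(documents)  # term frequencies per document
dist = cosine_distances(counts)                      # cosine distance = 1 - cosine similarity

# linkage expects a condensed distance matrix; average linkage is one common choice.
Z = linkage(squareform(dist, checks=False), method="average")
dendrogram(Z)                                        # renders the cluster tree
        </preformat>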
        <p>The next stage lies in extracting all verbs from the classified texts and eliciting the most
common key verbs. The results are provided in Figures 8-9.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.2. Predicting a text type</title>
        <p>In order to predict the type of unclassified texts by the same author, we build a new analysis
model by importing the corpus of classified texts (C1 and C2) (Figure 10 shows the analysis
flow on the canvas). Using Logistic Regression, we construct the model to predict the class
of texts (classification algorithm with ridge (L2) regularization). The Nomogram visualizes
the Logistic Regression classifier. It offers an insight into the structure of the training data
and the effects of the attributes on the class probabilities. Apart from visualization of the
classifier, the widget offers interactive support for predicting class probabilities.
Continuous attributes can be plotted in 2D for the selected target class C1 (Figure 11). The
Nomogram displays the top words important for the classifier, which most contribute to the
prediction.</p>
        <p>After that, we import unclassified texts (“The Penelopiad”, “MaddAddam”, “The Blind
Assassin”) using Import Documents (1) widget on the canvas (Figure 12). We connect it to
the Predictions widget and Logistic Regression. Predictions widget receives a dataset and
predictors (predictive models). It shows the probabilities and final decisions of predictive
models. The output of the widget is another dataset, where predictions are appended as
new meta-attributes. The result can be observed in a Data Table (Figure 13).</p>
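        <p>The training and prediction steps just described can be sketched with scikit-learn as follows; the toy texts and cluster labels are assumptions for illustration, not the paper's data, and Orange's Logistic Regression widget likewise applies ridge (L2) regularization:</p>
        <preformat>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled texts; labels come from the earlier clustering stage.
train_texts = ["she knitted and cooked", "he wrote and drew", "they shopped and dressed"]
train_labels = ["C1", "C2", "C1"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Logistic regression with ridge (L2) regularization.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, train_labels)

# Predict class probabilities for an unclassified text.
X_new = vectorizer.transform(["she dressed for dinner"])
print(dict(zip(clf.classes_, clf.predict_proba(X_new)[0])))
        </preformat>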
      </sec>
      <sec id="sec-4-5">
        <title>4.3. Verb feature clustering model</title>
        <p>Finally, we apply filter-based feature clustering of texts. Feature selection as a
preprocessing stage is aimed at reducing the size of data and saving computational time.
Choosing informative features, especially without relying on specific machine learning
algorithms, can be problematic. We intend to test how the model will perform if such
features are all verbs in the texts.</p>
        <p>Verbs are viewed as essential features of content construction and, presumably, topic
modeling. The question is whether it is enough to use the list of verbs from the texts to
obtain the same thematic affinity clusters as with the complete texts; in other words, how
informative are verbs in the text, and whether the same classification accuracy can be
achieved as with the methods described above. Filter-based selection is applied to extract
all the verbs from the corpus of texts and create verb datasets.</p>
        <p>Figure 14 shows the model for finding similar texts filtered by verbs (Treebank POS
tagger). With the help of the Select Columns widget, we selected all the verbs (_V) and
ignored words of other POS in the Bag of Words (Figure 15). Then we repeat the procedure
for hierarchical clustering using cosine distance.</p>
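        <p>The verb-filtering step can be approximated with NLTK (a sketch under the assumption that its default tagger, which also emits Penn Treebank tags, is an acceptable stand-in): keeping tags that start with VB retains all verb forms (VB, VBD, VBG, VBN, VBP, VBZ).</p>
        <preformat>
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_verbs(text: str) -> list[str]:
    # Penn Treebank verb tags all begin with "VB".
    tokens = nltk.word_tokenize(text)
    return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("VB")]

print(extract_verbs("She was knitting while he wrote letters and laughed."))
# e.g. ['was', 'knitting', 'wrote', 'laughed']
        </preformat>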
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The results of the verb feature clustering described above do not fully coincide with the
previously completed text clustering (Figures 16-17).</p>
      <p>The performance of the verb feature-based document clustering model has been
calculated by measuring the Rand index. We compare its performance against the
performance achieved by the bag of words clustering. The Rand index is a common measure
of similarity between clusterings. In our case, a Rand index of 0.688 is achieved, which
indicates a reasonable level of agreement between the two clusterings.</p>
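      <p>For illustration, the Rand index (the fraction of point pairs on which two clusterings agree) can be computed with scikit-learn over hypothetical cluster labels:</p>
      <preformat>
from sklearn.metrics import rand_score

# Hypothetical cluster assignments for the 16 training documents.
bow_labels  = [1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2]
verb_labels = [1, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2]

# Fraction of document pairs on which the two clusterings agree.
print(rand_score(bow_labels, verb_labels))
      </preformat>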
      <p>To evaluate the performance of the model, it is also worthwhile to test its predictions
of the classification of new texts. The workflow for the predictions for unclassified texts
filtered by verbs is shown in Figure 18. Predictions for unclassified texts filtered by verbs
(“The Penelopiad”, “MaddAddam”, “The Blind Assassin”) are provided in Figure 19.</p>
      <p>The results of the classification are the same as the results obtained for unfiltered texts.
The model shows high performance when predicting the class of the document, referring
all three works from the testing dataset to the second cluster, which is consistent with the
bag of words clustering.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The applied text analysis model relies on document embedding and bag of words clustering,
further used for class predictions. An attempt has been made to test the feature clustering
based on verb filtering.</p>
      <p>Literary text mining using verb feature clustering appears to be a plausible technique for
reducing data dimensionality and saving computational time. The procedure applied above
can be extrapolated to other types of text as well. The efficiency of verbs as meaningful
features in the studied literary texts can be explained by the genre characteristics of the
texts. In their majority, these texts are narratives and have a linear structure of alternating
processes signaled by verbs. The variation of verbs and their saturation in the text is also
higher in fiction.</p>
      <p>The patterns identified with verb feature clustering are quite meaningful. For example,
in Cluster 1 of the resultant verb clustering (Figure 17), the lexical verbs to knit, to cry, to
fuck, to shop, to dress, to cook, etc. allude to female or gender-related themes and motifs,
while the verbs to pay, to keep, to save, to drink, to lead, to push, to gather indicate the social
issues in the texts from this cluster. Cluster 2 gives more prominence to the creation motifs:
to write, to draw, to color, etc. While verbs denoting inner perception and mental states, such
as to consider, to suppose, to see, prevail in Cluster 1, in Cluster 2 we come across verbs
denoting the external expression of feelings and emotions: to laugh, to cry, to call, to shout,
etc.</p>
      <p>The prevailing verb forms in the clusters were _VBG (gerund/present participle) and _VBD (past tense).</p>
      <p>The Rand index has been calculated to measure the agreement between the two main
clusterings. A score of 0.688 has been achieved. Though the model still requires testing and
improvements, the result is quite satisfactory, taking into account that the lower-level
clusters within the hierarchical clustering are better classified. It is worth mentioning that
both the bag of words model and the verb feature model yield the same results: they correctly
classify the novels which constitute the trilogy “MaddAddam” and refer “The Handmaid’s
Tale” and its sequel “The Testaments” to the same (mini)cluster.</p>
      <p>The model can be augmented by introducing verb arguments, but this already leads us
to the level of a clause, which can be a promising direction in the future.</p>
      <p>The most significant result of the conducted text mining is that the trained model has
been successfully tested for the prediction of text classification. Both the bag of words model
and the verb feature model referred “The Penelopiad”, “MaddAddam”, “The Blind Assassin”
to the second cluster.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>The applied feature engineering lies in defining, selecting and testing verbs as informative
clustering features. It encompasses text preprocessing, feature generation, verb feature
selection, pattern discovery, and evaluation. The clusters have been identified with
hierarchical clustering algorithms.</p>
      <p>Verb clusters uncover insights about the analyzed texts. Patterns emerging from clusters
testify to the recurring themes. The applied visualization techniques, such as scatter plots,
dendrograms, and word clouds, show the intricate relationships between the clusters and,
respectively, between the texts analyzed.</p>
      <p>The conducted literary text mining with a focus on verb feature clustering shows a good
performance (Rand index = 0.688), though it still needs training on a bigger dataset and
other texts. It sheds light on the linguostylistic characteristics of literary works and can be
viewed as a plausible distant reading technique.</p>
      <p>As a prospect for further research, we see the application of multilingual
Sentence-BERT model embeddings to establish semantic textual similarity; since verbs are the
core elements of any sentence structure, this can provide further insights into the
informativeness of verbs as computationally efficient topic modeling features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Atkinson-Abutridy, Text Analytics: An Introduction to the Science and Applications of Unstructured Information Analysis, Chapman &amp; Hall, 2022.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] F. Moretti, Distant Reading, Verso, London, 2015.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. L. Jockers, D. Mimno, Significant themes in 19th-century literature, Poetics 41(6) (2013) 750-769. doi:10.1016/j.poetic.2013.08.005.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Lundy, Text Mining Contemporary Popular Fiction: Natural Language Processing-Derived Themes Across Over 1,000 New York Times Bestsellers and Genre Fiction Novels, Master's thesis, University of South Carolina, Columbia, SC, US, 2020.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] M. Short, Text Mining and Subject Analysis for Fiction; or, Using Machine Learning and Information Extraction to Assign Subject Headings to Dime Novels, Cataloging &amp; Classification Quarterly 57(5) (2019) 315-336. doi:10.1080/01639374.2019.1653413.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] K. van Dalen-Oskam, The Riddle of Literary Quality: A Computational Approach, Amsterdam University Press, Amsterdam, the Netherlands, 2023. doi:10.1515/9789048558155.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. S. Purves, O. Koblet, B. Adams, Analysing Environmental Narratives Computationally, in: R. S. Purves, O. Koblet, B. Adams (Eds.), Unlocking Environmental Narratives: Towards Understanding Human Environment Interactions through Computational Text Analysis, Ubiquity Press, London, UK, 2022, pp. 43-84. doi:10.5334/bcs.c.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. O. Cain, Using topic modeling to enhance access to library digital collections, Journal of Web Librarianship 10(3) (2016) 210-225. doi:10.1080/19322909.2016.1193455.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. L. Neatrour, E. Callaway, R. Cummings, Kindles, card catalogs, and the future of libraries: a collaborative digital humanities project, Digital Library Perspectives 34(3) (2018) 162-187. doi:10.1108/dlp-02-2018-0004.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] C. Wang, X. Ma, Text mining, in: Encyclopedia of Mathematical Geosciences, Springer International Publishing, Cham, 2023, pp. 1535-1537.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Weka 3: Machine Learning Software in Java, 2013. URL: https://www.cs.waikato.ac.nz/ml/weka/</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. K. McCallum, MALLET: a machine learning for language toolkit, 2002. URL: http://mallet.cs.umass.edu/</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] T. Georgieva-Trifonova, M. Duraku, Research on N-grams feature selection methods for text classification, IOP Conference Series: Materials Science and Engineering, volume 1031, 2021, 012048. doi:10.1088/1757-899X/1031/1/012048.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Demsar, T. Curk, A. Erjavec, C. Gorup, T. Hocevar, M. Milutinovic, M. Mozina, M. Polajnar, M. Toplak, A. Staric, M. Stajdohar, L. Umek, L. Zagar, J. Zbontar, M. Zitnik, B. Zupan, Orange: Data Mining Toolbox in Python, Journal of Machine Learning Research 14 (2013) 2349-2353.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Orange Data Mining, 2015. URL: https://orangedatamining.com/docs/</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018, pp. 3483-3487.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>