<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Information Retrieval Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1017/CCOL0521333555</article-id>
      <title-group>
        <article-title>Automatic Classification of Literary Epochs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irina Rabaev</string-name>
          <email>irinar@ac.sce.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marina Litvak</string-name>
          <email>marinal@c.sce.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Younkin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Campos</string-name>
          <email>ricardo.campos@ubi.pt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alípio Mário Jorge</string-name>
          <email>amjorge@fc.up.pt</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Jatowt</string-name>
          <email>adam.jatowt@uibk.ac.at</email>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Text Classification, Implicit Information Retrieval, Implicit Temporal Context Retrieval</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shamoon College of Engineering</institution>
          ,
          <addr-line>Beer Sheva</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Beira Interior, INESC TEC, Ci2 - Smart Cities Research Center - Polytechnic Institute of Tomar</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Innsbruck</institution>
          ,
          <addr-line>Innsbruck</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>20</volume>
      <issue>2017</issue>
      <fpage>367</fpage>
      <lpage>376</lpage>
      <abstract>
        <p>This paper describes the shared task on Automatic Classification of Literary Epochs (CoLiE), held as part of the 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT'23) at SIGIR 2023. The competition aimed to enhance the capabilities of large-scale analysis and cross-comparative studies of literary texts by automating their classification into the respective epochs. We believe that the competition contributed to the field of information retrieval by introducing the first large benchmark dataset for this task, together with the first results of various methods applied to it. This paper presents the details of the contest, the dataset used, the evaluation procedure, and an overview of the participating methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Automatic epoch classification in the context of literary texts can be viewed as a form of implicit
temporal information retrieval. Literature reflects the language styles, grammatical variations,
thoughts, emotions, and perspectives of different times. The classification of literary texts into
their respective epochs involves extracting implicit temporal information embedded in the
language [1], enabling the retrieval of the historical context and characteristics unique to each
literary period.</p>
      <p>Literature can be classified by movements, genres, or periods. In this competition, we focused
on the division of literature into different periods, also known as epochs. According to different academic
sources, some epochs are well-defined, while others may overlap [2, 3, 4], which is often a point
of contention between scholars. One possible way to categorize literature by epochs from 1700
to the present day is as follows:</p>
      <list list-type="order">
        <list-item><p>Romanticism (1798-1837) [5]: Romanticism focused on individualism and emphasized
emotion over reason, along with imagination, freedom of form, and the natural world.</p></list-item>
        <list-item><p>Victorian Literature (1837-1901) [6]: Named after Queen Victoria’s reign, the literature of this period tended to depict
daily life and focused on realism, social reform, and a growing interest in science and
technology. The novel became the leading literary genre during this period.</p></list-item>
        <list-item><p>Modernism (1900-1945) [7]: Literature during this period often employed blended writing
elements, experimentation with form and language, nonlinear plots, and introspection.</p></list-item>
        <list-item><p>Postmodernism (1945-2000) [8]: Postmodernism is characterized by self-reflexivity,
unreliable narrators, unrealistic and impossible narratives, parody, dark humor, and irony.</p></list-item>
        <list-item><p>Our days (from 2000): Contemporary literature reflects technological advances and
globalization, questions conventions, and often breaks traditional writing rules.</p></list-item>
      </list>
      <p>Every literary epoch is characterized by its voices, themes, and styles. In recognizing and
understanding these epochs, we can acquire a more profound insight into the progression of
human thought throughout history and the extensive range of human experiences and creativity.
This motivated us to conduct the CoLiE task, which is, to the best of our knowledge, the first
to be held on the automatic classification of text into five literary epochs. The main goal is
to advance the field of implicit temporal information retrieval from a text and to compare the
performance of different models and systems on a new dataset.</p>
      <p>This paper describes the contest details. Section 2 provides an overview of the task and a
description of the dataset. Section 3 presents a summary of participating systems, followed by
Section 4, which presents results and discussions. Section 5 draws conclusions and proposes
future directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description and Dataset</title>
      <p>
        The task on Automatic Classification of Literary Epochs (CoLiE) aimed at automatically
identifying the literary epoch of a given text from its writing style. The five epochs are: (1) Romanticism
(1798-1837), (2) Victorian Literature (1837-1901), (3) Modernism (1900-1945), (4) Postmodernism
(1945-2000), and (5) Our days (from 2000). In this section, we describe the dataset and the format
of the competition.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>In this competition, we introduce “BookSCE”, a new large-scale dataset of books, mostly
published over the last three centuries. BookSCE is built upon the online book repository Project
Gutenberg: Free eBooks, which focuses on literature and other written works. The books in
BookSCE were annotated with labels that include the book’s metadata and author-related
information, such as name, residence, age, and publication date. Some labels were automatically
extracted from the Project Gutenberg site. When specific information was not present in
the Project Gutenberg database, we tried to retrieve it automatically from other sources, e.g.,
from the PDF file itself, Wikipedia, and Wikibooks. To verify the automatic annotation, we
performed manual label validation on a random sample of the dataset. Because this competition aimed
at automatic epoch classification, we used only a subset of BookSCE with a verified year label
converted to the corresponding epoch. The dataset for the CoLiE task consists of around 11K
books from the literary epochs described at the beginning of Section 2. Each book is split into
multiple consecutive, disjoint 1000-word chunks. Each chunk is provided as a text file. The
dataset is divided into training, validation, and testing sets while preserving the epoch ratio in
each set. Table 1 summarizes the BookSCE subset compiled for the CoLiE task.</p>
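        <p>The chunking step can be illustrated with a minimal sketch. The function name, the whitespace tokenization, and the choice to keep a shorter final chunk are assumptions made for illustration; the text above does not specify how words were delimited or how a trailing remainder was handled.</p>
        <preformat>
```python
def split_into_chunks(text, chunk_size=1000):
    """Split text into consecutive, non-overlapping chunks of chunk_size words.

    Assumption: words are whitespace-delimited and a shorter trailing
    chunk is kept rather than dropped.
    """
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


# Toy "book" of 2500 synthetic words.
book = " ".join(f"w{i}" for i in range(2500))
chunks = split_into_chunks(book)
print([len(chunk.split()) for chunk in chunks])
```
</preformat>
        <p>Because the chunks are disjoint and consecutive, joining them with single spaces reproduces the whitespace-normalized book text.</p>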
        <p>The training and validation sets were released at the beginning of the competition. The test
set was released (without labels) a week before the competition’s deadline.</p>
        <p>The whole dataset with the corresponding ground-truth labels for the train and validation sets
can be downloaded from https://www.kaggle.com/competitions/colie/data. Our decision not to
publish the ground truth for the test set is primarily due to our plans to organize future editions
of the competition. Compilation of a new test set, including its collection and annotation, is
very time- and labor-consuming.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Competition Format</title>
        <p>The competition was hosted on Kaggle (https://www.kaggle.com/competitions/colie/), a popular
online platform for data science competitions. Kaggle provides a robust
infrastructure for competition management, ensuring a smooth and efficient contest experience
for organizers and participants alike. Every Kaggle competition has a public and a private
leaderboard. Competition hosts split the test dataset into two parts, using one part for the
public leaderboard and the other for the private leaderboard; in this competition, these were 60% and 40% of the test set,
respectively. Participants do not know which samples are public and which are
private. The public leaderboard is visible to the participants while the competition is running. The
private leaderboard is kept secret until after the competition deadline and is used for determining
the final rankings. Therefore, the rankings on the public leaderboard are not necessarily the
same as those on the private leaderboard.</p>
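        <p>The evaluation was based on average accuracy:
          <disp-formula><tex-math>\mathrm{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}</tex-math></disp-formula>
        </p>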
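        <p>Read as the fraction of test chunks whose predicted epoch matches the true one (the standard definition of accuracy, assumed here), the metric can be sketched as follows. The function name and the label strings are illustrative and are not taken from the competition files.</p>
        <preformat>
```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for four chunks; three of four predictions match.
y_true = ["Romanticism", "Victorian", "Modernism", "Victorian"]
y_pred = ["Romanticism", "Victorian", "Victorian", "Victorian"]
score = accuracy(y_true, y_pred)
print(score)
```
</preformat>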
        <p>In addition, participants were required to provide a short description of their methods together
with the confusion matrix for the validation set.</p>
        <p>The input to the classifier is a 1000-word chunk of a book in text format, and its output
is a single value (the epoch). The submission file must contain two columns: the first holds the
file name (chunk ID) in the test set, and the second holds its epoch label. For convenience,
the participants were provided with the “sample_submission.csv” file as an example of the
submission format.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Participating Teams</title>
      <p>Seven teams participated in the contest, six of which agreed to share their
identities and provide a brief overview of their methodology. Below we summarize the participating
methods. Readers interested in more details should contact the representatives of the
teams.</p>
      <p>WebSty. Submitted by Tomasz Walkowiak, CLARIN-PL, Wroclaw University of Science and
Technology, Poland. Each text was vectorized using TF-IDF weights scaled to z-scores, computed over the 5,000
most common words in the training set. For classification, a
multilayer perceptron (MLP) was employed. The network consisted of 5,000 input neurons, two
hidden layers (with 1,000 and 500 neurons, respectively), and an output layer (with 5 neurons).
ReLU was used as the activation function in the hidden layers, while SoftMax was applied
in the final layer. The dataset includes the book identifier for each text, in the first column of
the provided data, so texts from the same book can be grouped.
As an entire book consists of a sequence of texts belonging to the same literary period, the team
decided to improve recognition efficiency by leveraging this information [9]. To achieve this,
they adopted a sequence classification method proposed in [10], which utilizes logits from the
neural classifier trained for classifying individual texts. The logit is the raw output of the final
layer before applying the SoftMax activation function to convert it into probabilities. The logits
are calculated as the weighted sum of the outputs from the last hidden layer plus the
biases. The sequential classification of texts <inline-formula><tex-math>x_i</tex-math></inline-formula> from the same book sums the
logits of each class <inline-formula><tex-math>c</tex-math></inline-formula> over these texts and selects the class with the largest sum:
        <disp-formula><tex-math>\arg\max_{c} \sum_{i} \mathrm{logit}_{c}(x_i)</tex-math></disp-formula>
      </p>
      <p>The selected class is assigned to all texts <inline-formula><tex-math>x_i</tex-math></inline-formula> from the same book.</p>
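      <p>The book-level decision rule described above (summing per-class logits over all texts of a book and taking the class with the largest sum) can be sketched in a few lines. This is an illustration only, not the team's implementation: the function name, the book identifiers, and the logit values are made up, and three classes are used instead of five for brevity.</p>
      <preformat>
```python
def classify_books(book_ids, chunk_logits):
    """Sum per-class logits over each book's chunks and return the arg-max class."""
    totals = {}
    for book_id, logits in zip(book_ids, chunk_logits):
        running = totals.setdefault(book_id, [0.0] * len(logits))
        for class_index, value in enumerate(logits):
            running[class_index] += value
    # For each book, pick the class index with the largest summed logit.
    return {
        book_id: max(range(len(summed)), key=summed.__getitem__)
        for book_id, summed in totals.items()
    }


# Hypothetical logits for four chunks from two books.
book_ids = ["b1", "b1", "b1", "b2"]
chunk_logits = [
    [2.0, 0.5, -1.0],
    [-0.5, 1.0, 0.0],  # on its own, this chunk would be assigned class 1
    [1.5, 0.2, -0.3],
    [0.1, 0.2, 3.0],
]
book_class = classify_books(book_ids, chunk_logits)
# The book-level class is copied back to every chunk of that book.
chunk_predictions = [book_class[book_id] for book_id in book_ids]
print(chunk_predictions)
```
</preformat>
      <p>In this toy input the second chunk of book b1 favors class 1 in isolation, but the summed logits of the whole book favor class 0, so all three of its chunks receive class 0.</p>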
      <p>Back to the ... Past. Submitted by Pietro Maldini, an independent participant. From each
file provided, stop words and punctuation were filtered out, and only an initial portion of the
remaining words was kept. This dataset with reduced dimensions was used to train a Deep Neural
Network using Keras. The documents were first vectorized and then fed to
an Embedding layer to obtain a representation for each word. This representation was passed
through a Bidirectional GRU layer, then through a Dropout layer, a Dense layer, another Dropout
layer, and a final Output layer. The network was trained using the AdamW optimizer with a
SparseCategoricalCrossentropy loss function. The model predicted a literary epoch for each
fragment of a book. The predictions for the fragments of a book were then combined and used to
predict the label of the book.</p>
      <p>Behrooz Qiassi. Submitted by Behrooz Qiassi, an independent participant. This method uses
feature extraction followed by classification. The TF-IDF vectorizer was employed for feature
extraction and the Logistic Regression model was used as the classifier.</p>
      <p>
        AMXingu. Submitted by Daniel Quintão de Moraes, Giuseppe Vicente Batista, and Gustavo
Pádua Beato, Instituto Tecnológico de Aeronáutica - ITA. The model consists of a three-step
pipeline: (1) TF-IDF with sublinear term frequency [11]; (2) TruncatedSVD (truncated Singular
Value Decomposition) with 128 components, a variant of SVD that works on sparse matrices and is also known as
Latent Semantic Analysis [11, 12]; and (3) an XGBoost classifier with a learning rate of 0.05 [13].
Words with a relative document frequency above 0.7 or an absolute
document frequency below 2 were excluded from the vocabulary in order to avoid stop words
and unimportant words, respectively.
      </p>
      <p>Although the dataset (as well as the expected submission format) had originally been split
into chunks, the participants concatenated all chunks belonging to the same book before
classification. Accordingly, they made validation-set and test-set predictions per book and replicated
each prediction across all of the book’s chunks before submission. The team motivated this step by the fact
that a book belongs to a single literary epoch, although some models may benefit from chunk
splitting (e.g., deep learning methods with limited input dimension).</p>
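      <p>Two ideas from the AMXingu description, filtering the vocabulary by document frequency and sublinear term-frequency scaling, can be illustrated with a small standard-library sketch. It is not the team's code: the function names and the toy documents are invented, sublinear scaling is assumed to mean replacing a raw count tf by 1 + ln(tf), and the IDF weighting, TruncatedSVD, and XGBoost stages are omitted.</p>
      <preformat>
```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs, max_df=0.7, min_df=2):
    """Keep terms found in at least min_df documents and in at most a
    max_df fraction of all documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and max_df >= df / n_docs
    )


def sublinear_tf(tokens, vocabulary):
    """Map each vocabulary term in tokens to 1 + ln(raw count)."""
    counts = Counter(token for token in tokens if token in vocabulary)
    return {term: 1.0 + math.log(count) for term, count in counts.items()}


# Toy corpus: "the" occurs in every document, "sea" in only one.
docs = [
    "the sea the storm the storm".split(),
    "the storm and the ship".split(),
    "the quiet garden".split(),
    "a ship in the garden".split(),
]
vocab = build_vocabulary(docs)
weights = sublinear_tf(docs[0], set(vocab))
print(vocab)
print(weights)
```
</preformat>
      <p>Here “the” is dropped as too frequent and single-document words such as “sea” are dropped as too rare, while the repeated word “storm” in the first document receives the weight 1 + ln 2 rather than 2.</p>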
      <p>
        Sorbonne University. Submitted by Iglika Nikolova-Stoupak, Kyoto University, Gaël Lejeune,
Sorbonne University, and Eva Lacroix, Sorbonne University. The team used a sample of 50,000
entries (while keeping the balance between the 5 labels) as training data and the whole validation
set as validation data. The pipeline of the best system consists of the following steps: (1) cleaning of
the textual data (including removal of capitalization and of symbols other than common punctuation);
(2) application of the TF-IDF vectorizer from Python’s sklearn library to the textual data with
the following settings: char_wb analyzer with n-gram range (5,6); and (3) training a Logistic
Regression model (with the following settings: penalty “l2”, C “1”, solver “lbfgs”).
      </p>
      <p>
        Debajyoti Mazumder. Submitted by Debajyoti Mazumder, the Department of Data Science
and Engineering, Indian Institute of Science Education and Research Bhopal, India. The
pretrained RoBERTa model from Hugging Face (https://huggingface.co/roberta-base) was used. The pooler output of the pretrained
model was taken, and a linear layer was stacked on top of it for classification. Only the last
layer of RoBERTa-base [14] was trained, and the remaining layers were frozen. The maximum sequence
length was set to 500. A learning rate of 2e-4 was used with the AdamW [15] optimizer, and a weighted cross-entropy
loss was used to mitigate the class imbalance in this large dataset. The class
weights were set according to the distribution of classes. A stepwise learning rate scheduler
with gamma=0.95 was used, and the model was trained with an early stopping strategy with
patience=3.
      </p>
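      <p>The class weighting and stepwise learning-rate decay mentioned in the last description can be illustrated as follows. The exact weighting formula and the scheduler step size are not given above, so this sketch assumes inverse-frequency weights and a decay applied once per epoch; only the initial learning rate 2e-4 and gamma=0.95 come from the description, and the function names and label counts are invented.</p>
      <preformat>
```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate after `epoch` epochs, multiplied by gamma every step_size epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Hypothetical, imbalanced label sample (counts are made up).
labels = ["Victorian"] * 6 + ["Modernism"] * 3 + ["Our days"] * 1
class_weights = inverse_frequency_weights(labels)
learning_rates = [step_lr(2e-4, 0.95, epoch) for epoch in range(3)]
print(class_weights)
print(learning_rates)
```
</preformat>
      <p>Under this assumed scheme, the rarest class receives the largest weight, so its errors contribute more to the weighted cross-entropy loss.</p>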
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>
        We received a total of 71 submissions from seven different teams. Each team chose the two best
submissions that counted toward the final rank. For comparison purposes, we implemented a
very simple baseline, logistic regression applied to normalized count vectors, which achieved
an accuracy of 0.566. A summary of the overall rankings of the submitted methods is provided in
Table 2. Table 3 shows the confusion matrices on the validation set. The classification accuracy
of the submitted methods ranges between 65 and 79 percent. The best results were achieved by the ‘WebSty’ (first place)
and ‘Back to the ... Past’ (second place) teams. As can be observed, text classification approaches
using traditional classifiers and shallow neural networks on top of classic text representations (such
as TF-IDF) outperformed classification with a pretrained language model (such as RoBERTa). This
outcome is notable, given that language models are reported to outperform traditional
approaches in most IR tasks. Moreover, it can be observed from Table 3(f) that the method that applied
RoBERTa obtained the best results for the ‘Our days’ category, which can be explained by the fact
that RoBERTa was pretrained on modern texts. Also, the representation of documents with
TF-IDF vectors seems to be a better option than counting word appearances, as the results of the
baseline and of two other systems that used the same classifier (Logistic Regression) demonstrate.
An additional observation from all confusion matrices is that underrepresented categories have
many more misclassifications than the ‘Modernism’ and ‘Victorian’ categories, which constitute
the majority of the dataset. We also noticed that none of the classifiers distinguished well
between adjacent categories. This may be explained by the way literature has evolved: the
changeover between the writing styles representing different epochs was gradual and took
place over a long period of time.
      </p>
      <p>Footnote 2: After the end of the competition, the team noted that their submission used misspelled labels, which degraded their
score. The team decided to rerun the model with the correct labels and reported that their scores went up by 3%.</p>
      <p>Table 3: Confusion matrices on the validation set. The classes are: (1) Romanticism, (2) Victorian, (3) Modernism,
(4) Postmodernism, and (5) Our days. Rows: true labels; columns: predicted labels.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This competition provided an opportunity to compare and evaluate the effectiveness of different
classification methods in categorizing texts into their respective literary epochs. By
evaluating the performance of various algorithms and techniques, researchers can determine
which methods are most effective in achieving accurate and reliable results. Several interesting
observations, some more expected than others, were reported in this study. In the future,
we intend to organize new editions of this competition with new tasks related to literary
classification and implicit information retrieval.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>M. Litvak and I. Rabaev were supported by the Internal SCE grant - Excellence Research Track
B, no. EX/06-B-Y22/T1/D3/Yr1. Ricardo Campos and Alípio Jorge were financed by National
Funds through the FCT - Fundação para a Ciência e a Tecnologia, I.P. (Portuguese Foundation
for Science and Technology) within the project StorySense, with reference 2022.09312.PTDC.
The authors would like to thank Milana Michaeli for her assistance in manually checking the
BookSCE labels.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Jorge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <article-title>Identifying top relevant dates for implicit</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>