
Information Retrieval Journal

10.1017/CCOL0521333555

Automatic Classification of Literary Epochs

Irina Rabaev

irinar@ac.sce.ac.il 0

Marina Litvak

marinal@c.sce.ac.il 0

Vladimir Younkin

Ricardo Campos

ricardo.campos@ubi.pt 1

Alípio Mário Jorge

amjorge@fc.up.pt 2

Adam Jatowt

adam.jatowt@uibk.ac.at

Text Classification, Implicit Information Retrieval, Implicit Temporal Context Retrieval

0 Shamoon College of Engineering, Beer Sheva, Israel

1 University of Beira Interior, INESC TEC, Ci2 - Smart Cities Research Center - Polytechnic Institute of Tomar, Portugal

2 University of Innsbruck, Innsbruck, Austria

2022

20 2017 367 376

This paper describes the shared task on Automatic Classification of Literary Epochs (CoLiE), a part of the 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT'23) held at SIGIR 2023. The competition aimed to enhance the capabilities of large-scale analysis and cross-comparative studies of literary texts by automating their classification into the respective epochs. We believe that the competition contributed to the field of information retrieval by releasing the first large benchmark dataset for this task and reporting the first results of various methods applied to it. This paper presents the details of the contest, the dataset used, the evaluation procedure, and an overview of participating methods.

1. Introduction

Automatic epoch classification in the context of literary texts can be viewed as a form of implicit temporal information retrieval. Literature reflects the language styles, grammatical variations, thoughts, emotions, and perspectives of different times. The classification of literary texts into their respective epochs involves extracting implicit temporal information embedded in the language [1], enabling the retrieval of the historical context and characteristics unique to each literary period.

Literature can be classified by movements, genres, or periods. In this competition, we focused on the division of literature into different periods, a.k.a. epochs. According to different academic sources, some epochs are well-defined, while others may overlap [2, 3, 4], which is often a point of contention between scholars. One possible way to categorize literature by epochs from 1700 to the present day is as follows:

1. Romanticism (1798-1837) [5]: Romanticism emphasized individualism, emotion over reason, imagination, freedom of form, and the natural world.
2. Victorian Literature (1837-1901) [6]: Named after Queen Victoria’s reign, this literature tended to depict daily life and focused on realism, social reform, and a growing interest in science and technology. The novel became the leading literary genre during this period.
3. Modernism (1900-1945) [7]: Literature during this period often employed blended writing elements, experimentation with form and language, nonlinear plots, and introspection.
4. Postmodernism (1945-2000) [8]: Postmodernism is characterized by self-reflexivity, unreliable narrators, unrealistic and impossible narratives, parody, dark humor, and irony.
5. Our days (from 2000): Contemporary literature reflects technological advances and globalization, questions conventions, and often breaks traditional writing rules.

Every literary epoch is characterized by its voices, themes, and styles. In recognizing and understanding these epochs, we can acquire a more profound insight into the progression of human thought throughout history and the extensive range of human experiences and creativity. This motivated us to conduct the CoLiE task, which is, to the best of our knowledge, the first to be held on the automatic classification of text into five literary epochs. The main goal is to advance the field of implicit temporal information retrieval from text and to compare the performance of different models and systems on a new dataset.

This paper describes the contest details. Section 2 provides an overview of the task and a description of the dataset. Section 3 presents a summary of participating systems, followed by Section 4, which presents results and discussions. Section 5 draws conclusions and proposes future directions.

2. Task Description and Dataset

The task on Automatic Classification of Literary Epochs (CoLiE) aimed at automatically identifying, from its writing style, which of the following literary epochs a given text belongs to: (1) Romanticism (1798-1837), (2) Victorian Literature (1837-1901), (3) Modernism (1900-1945), (4) Postmodernism (1945-2000), and (5) Our days (from 2000). In this section, we describe the dataset and the format of the competition.

2.1. Dataset

In this competition, we introduce “BookSCE”, a new large-scale dataset of books, mostly published over the last three centuries. BookSCE is built upon the online book repository Project Gutenberg: Free eBooks, which focuses on literature and other written works. The books in BookSCE were annotated with labels that include the book’s metadata and author-related information, such as name, residence, age, and publication date. Some labels were automatically extracted from the Project Gutenberg site. When the specific information was not present in the Project Gutenberg database, we tried to retrieve it automatically from other sources, e.g., from the PDF file itself, Wikipedia, and Wikibooks. To verify the automatic annotation, we performed manual label validation on a random sample of the dataset. Because this competition aimed at automatic epoch classification, we used only a subset of BookSCE with a verified year label, which was converted to the corresponding epoch. The dataset for the CoLiE task consists of around 11K books from the literary epochs described at the beginning of Section 2. Each book is split into multiple consecutive, disjoint 1000-word chunks. Each chunk is provided as a text file. The dataset is divided into training, validation, and testing sets while preserving the ratio of epochs in each set. Table 1 summarizes the BookSCE subset compiled for the CoLiE task.
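The two preprocessing steps mentioned above, converting a year label to an epoch and splitting a book into 1000-word chunks, can be illustrated with a minimal sketch. It is not the organizers' code. Two points are assumptions made only for illustration: a boundary year shared by two epochs (e.g., 1900 or 1901) is assigned to the later epoch, and a final chunk shorter than 1000 words is kept. The function names are hypothetical.

```python
# Epoch boundaries as listed in the task description. Some ranges share
# boundary years (e.g., Victorian 1837-1901 and Modernism 1900-1945),
# so a tie-break rule is needed.
EPOCHS = [
    ("Romanticism", 1798, 1837),
    ("Victorian", 1837, 1901),
    ("Modernism", 1900, 1945),
    ("Postmodernism", 1945, 2000),
    ("Our days", 2000, None),
]


def year_to_epoch(year):
    """Map a publication year to an epoch label, or None if out of range.

    Assumption: a year covered by two epochs goes to the later one.
    """
    match = None
    for name, start, end in EPOCHS:
        if year >= start and (end is None or year <= end):
            match = name  # a later matching epoch overwrites an earlier one
    return match


def split_into_chunks(text, chunk_size=1000):
    """Split text into consecutive, disjoint chunks of chunk_size words.

    Assumption: the last chunk may be shorter than chunk_size.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

With these assumptions, a 2,500-word book yields three chunks of 1000, 1000, and 500 words, all of which inherit the epoch of the book's year label.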

The training and validation sets were released at the beginning of the competition. The test set was released (without labels) a week before the competition’s deadline.

The whole dataset with the corresponding ground-truth labels for the train and validation sets can be downloaded from https://www.kaggle.com/competitions/colie/data. Our decision not to publish the ground truth for the test set is primarily due to our plans to organize future editions of the competition. Compilation of a new test set, including its collection and annotation, is very time- and labor-consuming.

2.2. The Competition Format

The competition was hosted on Kaggle (https://www.kaggle.com/competitions/colie/), a popular online platform for data science competitions. Kaggle provides a robust infrastructure for competition management, ensuring a smooth and efficient contest experience for organizers and participants alike. Every Kaggle competition has a public and a private leaderboard. Competition hosts split the test dataset into two parts, using one part for the public leaderboard and the other for the private leaderboard; for this competition, the parts were 60% and 40% of the test set, respectively. Participants are unaware of which samples are public or private. The public leaderboard is visible to the participants while the competition is running. The private leaderboard is kept secret until after the competition deadline and is used for determining the final rankings. Therefore, the rankings on the public leaderboard are not necessarily the same as those on the private leaderboard.

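The evaluation was based on average accuracy: $\text{Accuracy} = \frac{\text{number of correctly classified chunks}}{\text{total number of chunks}}$.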

In addition, participants were required to provide a short description of their methods together with the confusion matrix for the validation set.
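For readers who want to reproduce the evaluation on the validation set, the following sketch computes accuracy as the fraction of correctly labelled chunks, together with a confusion matrix whose rows are true labels and whose columns are predicted labels. It is an illustrative implementation, not the scoring code used by Kaggle, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix m where m[i][j] counts items of true class labels[i]
    that were predicted as class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix
```

The diagonal of the matrix holds the correctly classified chunks, so its sum divided by the total number of chunks equals the accuracy.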

The input to the classifier is a 1000-word chunk of a book in text format, and its output is a single value (epoch). The submission file must contain two columns: the first holds the file name (chunk ID) in the test set, and the second holds its epoch label. For convenience, the participants were provided with the "sample_submission.csv" file as an example of the submission format.
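A two-column submission file of this kind could be produced as sketched below. The header names `id` and `epoch` are placeholders chosen for illustration; the actual column names should be copied from sample_submission.csv. The function name is likewise hypothetical.

```python
import csv
import io


def write_submission(predictions, out):
    """Write chunk-level predictions as a two-column CSV.

    predictions: dict mapping chunk file name -> epoch label.
    out: a writable text file object.
    """
    writer = csv.writer(out)
    # Hypothetical header; use the one from sample_submission.csv instead.
    writer.writerow(["id", "epoch"])
    for chunk_id in sorted(predictions):
        writer.writerow([chunk_id, predictions[chunk_id]])


# Usage: write to an in-memory buffer (a real run would open a file
# with open(path, "w", newline="")).
buffer = io.StringIO()
write_submission({"chunk_b.txt": 3, "chunk_a.txt": 1}, buffer)
```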

3. Participating Teams

Seven teams enthusiastically participated in the contest, six of which agreed to share their identities and briefly overview their methodology. Below we present a summary of the participating methods. Readers who are interested in more details should contact the representatives of the teams.

WebSty. Submitted by Tomasz Walkowiak, CLARIN-PL, Wroclaw University of Science and Technology, Poland. Each text was vectorized using TF-IDF weights scaled to z-scores, computed over the 5,000 most common words in the training set. For classification, a multilayer perceptron (MLP) was employed. The network consisted of 5,000 input neurons, two hidden layers (with 1,000 and 500 neurons, respectively), and an output layer (with 5 neurons). ReLU was used as the activation function in the hidden layers, while SoftMax was applied in the final layer. The dataset includes the book identifier for each text, in the first column of the provided data, so texts from the same book can be grouped. As an entire book consists of a sequence of texts belonging to the same literary period, the team decided to improve recognition efficiency by leveraging this information [9]. To achieve this, they adopted a sequence classification method proposed in [10], which utilizes logits from the neural classifier trained for classifying individual texts. The logit is the raw output of the final layer before applying the SoftMax activation function to convert it into probabilities. The logits are calculated by combining the weighted sum of the outputs from the last hidden layer with biases. The sequential classification of the texts $x_1, \dots, x_n$ from the same book sums their logit vectors $z(x_i)$ and selects the class with the largest total: $\hat{c} = \arg\max_{c} \sum_{i=1}^{n} z_c(x_i)$

The selected class is assigned to all texts from the same book.
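The book-level decision rule described for WebSty, summing the per-class logits over all texts of a book, taking the class with the largest sum, and assigning it to every text of that book, can be sketched as follows. This is an illustration of the rule only, not the team's implementation. The input format (a list of `(book_id, chunk_id, logits)` triples) and the function name are assumptions.

```python
from collections import defaultdict


def classify_books_by_logit_sum(chunk_logits):
    """Assign one class per book by summing chunk-level logits.

    chunk_logits: iterable of (book_id, chunk_id, logits), where logits is
    a list of raw per-class scores (before SoftMax) for one chunk.
    Returns a dict mapping chunk_id -> predicted class index.
    """
    sums = {}
    members = defaultdict(list)
    for book_id, chunk_id, logits in chunk_logits:
        if book_id not in sums:
            sums[book_id] = [0.0] * len(logits)
        sums[book_id] = [s + z for s, z in zip(sums[book_id], logits)]
        members[book_id].append(chunk_id)

    predictions = {}
    for book_id, total in sums.items():
        best_class = max(range(len(total)), key=total.__getitem__)
        for chunk_id in members[book_id]:
            predictions[chunk_id] = best_class  # same class for the whole book
    return predictions
```

Note that a chunk whose own logits favour a different class is overruled by the book-level sum, which is the intended effect of the aggregation.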

Back to the ... Past. Submitted by Pietro Maldini, an independent participant. From each file provided, stop words and punctuation were filtered out, and only an initial portion of the remaining words was kept. This dataset with reduced dimensions was used to train a Deep Neural Network using Keras. First, the documents were vectorized; then they were fed to an Embedding layer to obtain a representation for each word. This representation was passed through a Bidirectional GRU layer, then through a Dropout layer, a Dense layer, another Dropout layer, and a final Output layer. The network was trained using the AdamW optimizer with a SparseCategoricalCrossentropy loss function. The model predicted a literary epoch for each fragment of a book. The predictions for the fragments of a book were then combined and used to predict the label of the book.
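The description does not state how the fragment-level predictions were combined. One simple possibility, shown here purely as an assumption and not as the team's actual rule, is a majority vote over the fragment labels of a book. The function name is hypothetical.

```python
from collections import Counter


def majority_vote(fragment_labels):
    """Return the most frequent label among a book's fragment predictions.

    Assumption: majority voting stands in for the unspecified combination
    step. Ties are resolved by Counter.most_common ordering.
    """
    if not fragment_labels:
        raise ValueError("at least one fragment prediction is required")
    return Counter(fragment_labels).most_common(1)[0][0]
```

Other combinations, such as averaging the predicted class probabilities across fragments, would fit the same description.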

Behrooz Qiassi. Submitted by Behrooz Qiassi, an independent participant. This method uses feature extraction followed by classification. The TF-IDF vectorizer was employed for feature extraction and the Logistic Regression model was used as the classifier.

AMXingu. Submitted by Daniel Quintão de Moraes, Giuseppe Vicente Batista, and Gustavo Pádua Beato, Instituto Tecnológico de Aeronáutica - ITA. The model consists of a three-step pipeline as follows: (1) TF-IDF with sublinear term frequency [11]; (2) TruncatedSVD (Singular Value Decomposition) with 128 components, which is a sparse version of SVD also known as Latent Semantic Analysis [11, 12]; and (3) an XGBoost classifier with a learning rate of 0.05 [13]. Words with a relative document frequency above 0.7 and words with an absolute document frequency below 2 were excluded from the vocabulary in order to avoid stop words and unimportant words, respectively.
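The first step of this pipeline, vocabulary pruning by document frequency followed by sublinear TF-IDF weighting, can be sketched in plain Python. The weighting below uses $1 + \ln(\text{tf})$ for term frequency, a smoothed inverse document frequency $\ln\frac{1+n}{1+\text{df}} + 1$, and L2 normalization. These choices follow common library defaults and are assumptions, since the description specifies only the sublinear term frequency and the two thresholds. The function names are hypothetical, and the SVD and XGBoost steps are not shown.

```python
import math
from collections import Counter


def build_vocabulary(docs, max_df=0.7, min_df=2):
    """Return {term: document frequency} for the terms that are kept.

    docs: list of token lists. A term is dropped if it occurs in more than
    max_df * len(docs) documents or in fewer than min_df documents.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {
        term: count
        for term, count in sorted(df.items())
        if min_df <= count <= max_df * n_docs
    }


def sublinear_tfidf(doc, vocab_df, n_docs):
    """Return an L2-normalized {term: weight} vector for one document."""
    counts = Counter(term for term in doc if term in vocab_df)
    weights = {
        # Sublinear tf: 1 + ln(tf); smoothed idf: ln((1 + n) / (1 + df)) + 1.
        term: (1.0 + math.log(tf))
        * (math.log((1 + n_docs) / (1 + vocab_df[term])) + 1.0)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}
```

In a toy corpus of four documents where "the" occurs in all of them, it exceeds the 0.7 threshold and is removed, while a word that occurs in a single document falls below the minimum of 2 and is removed as well.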

Although the dataset (as well as the expected submission format) had originally been split into chunks, the participants concatenated all chunks belonging to the same book before classification. Accordingly, they made validation and test set predictions per book and replicated them for all of the book’s chunks before submission. The team motivated this step by the fact that a book belongs to a single literary epoch, although some models may benefit from chunk splitting (e.g., deep learning methods with limited input dimension).

Sorbonne University. Submitted by Iglika Nikolova-Stoupak, Kyoto University, Gaël Lejeune, Sorbonne University, and Eva Lacroix, Sorbonne University. The team used a sample of 50,000 entries (while keeping the balance between the 5 labels) as training data and the whole validation set as validation data. The pipeline of the best system consists of the following: (1) cleaning of the textual data (including removal of capitalization and of symbols other than common punctuation); (2) application of the TF-IDF vectorizer from Python’s sklearn library to the textual data, with the char_wb analyzer and an n-gram range of (5, 6); and (3) training of a Logistic Regression model (with the following settings: penalty “l2”, C “1”, solver “lbfgs”).

Debajyoti Mazumder. Submitted by Debajyoti Mazumder, the Department of Data Science and Engineering, Indian Institute of Science Education and Research Bhopal, India. The pretrained RoBERTa-base model [14] from Hugging Face (https://huggingface.co/roberta-base) was used. The pooler output of the pretrained model was taken, and a linear layer was stacked on top of it for classification. Only the last layer of RoBERTa-base was trained, and the remaining layers were frozen. The maximum sequence length was set to 500. The model was trained with a learning rate of 2e-4 and the AdamW optimizer [15], using a weighted cross-entropy loss to mitigate the class imbalance in this large dataset. The class weights were set according to the distribution of classes. A stepwise learning rate scheduler with gamma=0.95 was used, and training followed an early stopping strategy with patience=3.
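To make the last two training choices concrete, the sketch below shows a per-sample class-weighted cross-entropy and a stepwise learning rate decay. It is a framework-free illustration, not the team's code. The inverse-frequency weighting formula and the decay step of one epoch are assumptions, since the description states only that weights follow the class distribution and that gamma=0.95. All function names are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-class weights n / (k * count_c): rarer classes get larger weights.

    Assumption: this is one common way to derive weights from the class
    distribution; the exact formula used by the team is not given.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy of one sample, scaled by the weight of its true class."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return weights[target] * (log_sum_exp - logits[target])


def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate after `epoch` epochs of stepwise decay by `gamma`.

    Assumption: the rate decays once every `step_size` epochs.
    """
    return initial_lr * gamma ** (epoch // step_size)
```

With an initial rate of 2e-4 and gamma=0.95, the rate after two decay steps under these assumptions is 2e-4 × 0.95² = 1.805e-4.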

4. Results and Discussion

We received a total of 71 submissions from seven different teams. Each team chose the two best submissions that counted toward the final rank. For comparison purposes, we implemented a very simple baseline, logistic regression applied to normalized count vectors, which achieved an accuracy of 0.566. A summary of the overall rankings of the submitted methods is provided in Table 2. Table 3 shows the confusion matrices on the validation set. The classification accuracy ranges between 65 and 79 percent. The best results were achieved by the ’WebSty’ (first place) and ’Back to the ... Past’ (second place) teams. As can be observed, text classification approaches using traditional classifiers and shallow neural networks on top of classic text representations (such as TF-IDF) outperformed classification with a pretrained language model (such as RoBERTa). This outcome is very interesting, given that language models are reported to outperform traditional approaches in most IR tasks. Moreover, it can be observed from Table 3(f) that the method that applied RoBERTa obtained the best results for the ’Our days’ category, which can be explained by the fact that RoBERTa was pretrained on modern texts. Also, representing documents with TF-IDF vectors seems to be a better option than counting word appearances, as demonstrated by the results of the baseline and of two other systems that used the same classifier (Logistic Regression). An additional observation from all confusion matrices is that underrepresented categories have many more misclassifications than the ’Modernism’ and ’Victorian’ categories, which constitute the majority of the dataset. We also noticed that none of the classifiers distinguished well between adjacent categories. This may be explained by the way literature has evolved: the changeover between the writing styles representing different epochs was gradual and took place over a long period of time.
2 After the end of the competition, the team noted that their submission used misspelled labels, which degraded their score. The team decided to rerun the model with the correct labels and reported that their scores went up by 3%.

Table 3: Confusion matrices on the validation set. The classes are: (1) Romanticism, (2) Victorian, (3) Modernism, (4) Postmodernism, and (5) Our days. Rows: true labels; columns: predicted labels.

5. Conclusion

This competition provided an opportunity to compare and evaluate the effectiveness of different classification methods in accurately categorizing texts into their respective literary epochs. By evaluating the performance of various algorithms and techniques, researchers can determine which methods are most effective in achieving accurate and reliable results. Several interesting observations, some more expected and some less, were reported in this study. In the future, we intend to organize new editions of this competition with new tasks related to literary classification and implicit information retrieval.

Acknowledgments

M. Litvak and I. Rabaev were supported by the Internal SCE grant - Excellence Research Track B, no. EX/06-B-Y22/T1/D3/Yr1. Ricardo Campos and Alípio Jorge were financed by National Funds through the FCT - Fundação para a Ciência e a Tecnologia, I.P. (Portuguese Foundation for Science and Technology) within the project StorySense, with reference 2022.09312.PTDC. The authors would like to thank Milana Michaeli for her assistance in manually checking the BookSCE labels.

[1] Campos, Dias, A. M. Jorge, Nunes, Identifying top relevant dates for implicit