Detecting Sequential Genre Change in
Eighteenth-Century Texts
Jinbin Zhang1 , Yann Ciarán Ryan3 , Iiro Rastas4 , Filip Ginter4 , Mikko Tolonen3 and
Rohit Babbar1
1 Aalto University, Finland
3 University of Helsinki, Finland
4 TurkuNLP, University of Turku, Finland


                                         Abstract
Machine classification of historical books into genres is a common task for NLP-based classifiers and has
a number of applications, from literary analysis to information retrieval. However, it is not a straight-
forward task, as genre labels can be ambiguous and subject to temporal change, and moreover many
books consist of mixed or miscellaneous genres. In this paper we describe a work-in-progress method
by which genre predictions can be used to determine longer sequences of genre change within books,
which we test with visualisations of some hand-picked texts. We apply state-of-the-art methods to
the task, including a BERT-based transformer and a character-level Perceiver model, both pre-trained on
a large collection of eighteenth-century works (ECCO), using a new set of hand-annotated documents
created to reflect historical divisions. Results show that both models perform significantly better than a
linear baseline, particularly when ECCO-BERT is combined with tf-idf features, though for this task the
character-level model provides no obvious advantage. Initial evaluation of the genre sequence method
shows it may in the future be useful in determining and dividing the multiple genres of miscellaneous
and hybrid historical texts.

                                         Keywords
BERT, text classification, genre change, ECCO, Perceiver




1. Introduction
Thinking about the large-scale development of early modern public discourse through the use of
structured data is an exciting opportunity, as Moretti established some time ago. [19]
Besides the use of already available bibliographic data for “distant reading”, a useful further
element is to use unstructured textual databases as source material for the creation of new
structured data on fields that are currently poorly covered. [15] One such classification field
is genre. Readily available genre information is often sporadic, but the opportunities to use it
– especially when we consider that many documents are composed of several sequential genres –
can open a new window onto the development of public discourse. With better structured data,
we will be able to study the systematization of particular genres in a new manner and take a
fresh look at authorship and the relevance of publisher networks.
CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium
Email: jinbin.zhang@aalto.fi (J. Zhang); yann.ryan@helsinki.fi (Y. C. Ryan); iitara@utu.fi (I. Rastas); figint@utu.fi (F. Ginter); mikko.tolonen@helsinki.fi (M. Tolonen); rohit.babbar@aalto.fi (R. Babbar)
Web: yann-ryan.github.io (Y. C. Ryan)
ORCID: 0000-0001-8186-8677 (J. Zhang); 0000-0003-1878-4838 (Y. C. Ryan); 0000-0003-2892-8911 (M. Tolonen); 0000-0002-3787-8971 (R. Babbar)
                                       © 2022 Copyright for this paper by its authors.
                                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   Much work in literary history and the history of the book has relied on the analysis of
generic categories (for examples see [20, 33, 34, 35, 2, 19]). Computational genre classification
is a complex problem. Two key reasons are that genre divisions change over time, and not
every book can be unambiguously assigned a single genre label. Existing methods for genre
detection often assume each text or pre-defined chunk such as a chapter or section can be
classified as a single genre or a distribution of genre probabilities [7, 38, 6], which does not
reflect the reality of many eighteenth-century texts. One important exception to this is the
page-level classification of Underwood et al. [36], subsequently used to detect sequences of
genre using a hidden Markov model. [37]
   This paper describes a number of improvements to existing methods: first, rather than re-
lying on existing modern or broad classification systems, we use a newly-created training set
of documents, with a custom-designed, domain-specific taxonomy which attempts to balance
pragmatism with capturing meaningful and fine-grained eighteenth-century organisational
categories. Second, we use a BERT transformer model which has been specifically trained on
eighteenth-century texts, which performs significantly better than base BERT, and third, we
propose a method by which we hope this fine-grained classification can be used to represent
books as sequences and combinations of genres.
   We report on and compare results from a number of classifiers: a document-level classifier
that uses only one BERT input segment for each document (ECCO-BERT-Seq), a classifier for
text chunks, which can also be aggregated at the document level (ECCO-BERT-Chunk), and a
character-level Perceiver model using the same input as ECCO-BERT-Seq.1 The BERT model
[11] has achieved great improvements on various modern-language datasets in comparison
to previous deep learning methods. Recently, there have also been models pre-trained on
historical corpora in different languages [21, 16, 39], and pre-trained language models have
also been used in the historical domain, for tasks such as predicting the publication year [21],
named entity recognition [16, 13, 1, 27] and emotion analysis. [26, 25] We also face challenges
from OCR errors [10, 29] when using pre-trained models on historical data.


2. The ECCO Dataset
The data used both for model training and for predictions comes from Eighteenth Century Col-
lections Online (ECCO). ECCO is a set of 180,000 digitised documents originally published in
the eighteenth century, created by the software and education company Gale. [5] These digi-
tised images have been converted into readable text data using Optical Character Recognition
(OCR). Despite its size, a recent study comparing ECCO to the English Short Title Catalogue
(ESTC) has highlighted significant gaps and imbalances [32], and the ESTC itself is known to
be incomplete. [22] These attributes, and their impact on several downstream tasks, have been
covered in detail in previous papers [30, 8, 14] and are only briefly outlined here. First, the
distribution of documents in ECCO is uneven and skewed towards the end of the century;
second, the OCR contains significant noise and errors. Additionally, not all texts are in the
English language, and many are reprints of works published in earlier centuries. The former
have been excluded but the latter are retained for our training and test data. Despite these
caveats, ECCO is the largest and most complete source we have for eighteenth-century text
data. Though it has its own institutional history and biases, it is complete enough that it is
not limited to the more ‘important’ or ‘literary’ genres, nor focused solely on canonical works.
Its data and digitised images are used extensively, forming the basis of many scholarly
enquiries and research questions. [31]

1 In this paper the words ’book’ and ’document’ have distinct meanings. ’Book’ is used to denote an edition of a
  physical book, for example ’there are over 400,000 books listed in the English Short Title Catalogue’. ’Document’,
  by contrast, is reserved for a single text document as used as data for the classification method and other tasks.
  Not all documents in the ECCO data map to a single book, and vice versa.






3. Data Annotation
Key to the work leading up to this paper was the creation of a usable training set of documents
annotated with genre labels. We began with a sample set of book records and a set of preliminary
genre labels. These books were then labelled by two annotators with domain expertise. At
this stage, we revisited the labels and made some adjustments to those which had particu-
larly low inter-annotator agreement. Once the set of genre labels had been finalised, we anno-
tated a large set (5,672 individual works, which correspond to 37,574 known editions, of which
30,119 correspond to ECCO documents) with genre information. After this second round, we
again checked for inter-annotator agreement, coming to a consensus following a discussion
of each disagreement. The eventual 43 fine-grained categories were then collapsed into main
categories for some of the classification tasks. These book labels were then mapped to the
equivalent ECCO document IDs. The final set of labels is given in appendix A.
   Existing categorical distinctions were either too broad (for example fiction and non-fiction)
or too fine-grained (for example the many historical literary divisions, particularly poetic) for
our needs. Our categories attempt to reflect the divisions as found in contemporary sources
such as catalogues. [17] Additionally, they are closely related to the divisions used by modern
domain experts writing on the history of the book, for example the chapters of the highly-
regarded edited collection Books and their Readers in Eighteenth-Century England, which con-
tains chapters organised along similar divisions to our own. [23, 24] We note that other recent
attempts to categorise eighteenth-century book genres use a similar system of division. [18]
The selection is intended to provide useful genre categorisation for scholarly inquiry into book
history and book production. The selection was also pragmatic, with the aim of ending up with
a manageable number of genres, for example so that each class had enough data for the training
and test sets. They were also made with particular questions in mind, which we hoped would
help us to analyse works of Scottish Enlightenment thought, for instance helping to distinguish
patterns within scientific or philosophical publishing.




4. Method
In this section, we introduce the pre-trained ECCO-BERT model, fine-tuning models and base-
lines.2 We denote the training dataset as $\{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i$ is the book and $Y_i$ is the genre
of $X_i$. Our goal is to learn a function $f(X_i)$ to predict the genre of book $X_i$, or the genre of a
chunk within book $X_i$.

4.1. Multi-granular Classification with ECCO-BERT
ECCO-BERT [21] is a pre-trained language model trained on the ECCO dataset; its configu-
ration is the same as the bert-base-cased model [11] except for the vocabulary size. The model
is pre-trained with a masked language modelling task, as well as a next sentence prediction
task. The fine-tuned ECCO-BERT consists of two parts: the transformer encoder, and a linear
layer on top of the mean-pooled output of the encoder, which scores the different genres. The
Transformer architecture on which the model is based accepts inputs only up to a relatively
short maximum length; in the ECCO-BERT case the standard maximum of 512 input tokens
applies. Inputs longer than this maximum need to be split into chunks.
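To make this setup concrete, the following is a minimal sketch (not the authors' released code) of such a classifier in PyTorch, assuming the ECCO-BERT checkpoint named in footnote 2; the linear head and mean pooling follow the description above.

import torch
from transformers import AutoModel

class GenreClassifier(torch.nn.Module):
    """Sketch: transformer encoder plus a linear layer over the
    mean-pooled encoder output, as described in Section 4.1."""
    def __init__(self, n_genres, name="TurkuNLP/eccobert-base-cased-v1"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, n_genres)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
        return self.head(pooled)                               # one score per genre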




Figure 1: The inference process of ECCO-BERT-Chunk. The document is split into non-overlapping
chunks; the model scores each chunk, and the chunk probabilities are averaged to give the final prediction.


   Because we want the training and prediction of the model to take into account the full infor-
mation of the document, a document is split into chunks of 510 tokens each for training and
prediction, since the maximum input size of ECCO-BERT is 512 tokens (510 input tokens plus
2 special tokens expected by the model). For training, we assume that each chunk has the same
genre as the document, and the model is trained on the resulting (chunk, label) pairs. During
inference, we first split the document into chunks; the fine-tuned model then scores each chunk,
and the predicted genre probability of the document is the average of all chunks' probabilities.
The inference process is shown in Figure 1. We call this model ECCO-BERT-Chunk. For
comparison, we also train a model conditioned only on the first 510 sub-words of the document
as input, which is denoted ECCO-BERT-Seq. A minimal sketch of the chunked inference is
given below.
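The sketch assumes the checkpoint from footnote 2 and an already fine-tuned classification head; the head loaded below is untrained and purely illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("TurkuNLP/eccobert-base-cased-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "TurkuNLP/eccobert-base-cased-v1", num_labels=10)  # 10 main categories

def predict_document(text, chunk_len=510):
    # split the token sequence into non-overlapping 510-token chunks
    ids = tok(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    probs = []
    with torch.no_grad():
        for chunk in chunks:
            inp = torch.tensor([[tok.cls_token_id] + chunk + [tok.sep_token_id]])
            probs.append(torch.softmax(model(inp).logits, dim=-1))
    # document prediction = average of the per-chunk probabilities
    return torch.cat(probs).mean(dim=0)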
   Although the ECCO-BERT-Chunk model considers all chunks when making its final judgment,
its prediction process is very slow, since a book often contains many chunks. At the same time,
the much faster ECCO-BERT-Seq is conditioned only on the first 510 sub-words, so it might
lose important information from other parts of the book. To address this, we trained a linear
model by concatenating the tf-idf features of the full text with the pooled output of the
fine-tuned ECCO-BERT-Seq. The input can be denoted as $[\Phi_{tfidf}(X_i), \Phi_{ECCO\text{-}BERT}(X_i[:510])]$,
where the $\Phi$ are the tf-idf vectorizer and the transformer encoder respectively. We call this
model ECCO-BERT-tfidf; all results are shown in Table 1.
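The feature construction can be sketched as follows (a rough illustration, not the released implementation; it reuses the GenreClassifier and tokenizer from the sketches above, and the function name is ours):

import numpy as np
import torch

def ecco_bert_tfidf_features(text, vectorizer, classifier, tok, max_len=512):
    """Concatenate tf-idf of the full text with the mean-pooled encoder
    features of the first 510 sub-words, as in Section 4.1."""
    tfidf_vec = vectorizer.transform([text]).toarray()[0]       # 500,000-d
    enc = tok(text, truncation=True, max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        hidden = classifier.encoder(**enc).last_hidden_state
        mask = enc["attention_mask"].unsqueeze(-1).float()
        pooled = ((hidden * mask).sum(1) / mask.sum(1))[0].numpy()  # 768-d
    return np.concatenate([tfidf_vec, pooled])  # input to the linear layer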
2 The model implementation is available at https://github.com/HPC-HD/ECCO-genre-classification. The original
  ECCO-BERT model has been released and is available at https://huggingface.co/TurkuNLP/eccobert-base-cased-v1





4.2. Baseline Models
We adopt two baseline models for comparison. The input of the linear model is the tf-idf
features of the full document; the model contains only a linear layer, whose fan-out is the
number of main or sub-categories. The second baseline is bert-base-cased, released by [11],
which we fine-tune directly on our training data. A rough sketch of the linear baseline follows.
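This sketch uses scikit-learn's SGD-trained logistic regression as a stand-in; the paper's own setup (PyTorch, SGD with momentum, cross entropy) is described in Section 5.1, and the toy texts below are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# placeholder documents and labels, for illustration only
train_texts = ["a sermon preached at st pauls",
               "an essay on the balance of trade"]
train_labels = ["Religion", "Scientific Improvement"]

vectorizer = TfidfVectorizer(max_features=500_000)   # as in Section 5.1
X = vectorizer.fit_transform(train_texts)
clf = SGDClassifier(loss="log_loss")                 # linear layer + cross entropy
clf.fit(X, train_labels)
print(clf.predict(vectorizer.transform(["a discourse concerning prayer"])))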


5. Results
There are 30,119 documents annotated by experts. Of these, 6,024 documents were randomly
selected and split into development and test datasets of 3,012 documents each. The labels
comprise 10 main categories and 43 sub-categories, presented in Appendix A.1.

5.1. Experimental Details
The sequence length of all BERT models is set to 512. For fine-tuning the ECCO-BERT-Seq
and bert-base-cased models, we adopt only the first 510 sub-words of each document as
input. These models are trained for 100 epochs on 1 NVIDIA V100. ECCO-BERT-Chunk is
fine-tuned on 4 NVIDIA A100 GPUs; the main-category model and the sub-category model
were trained for 21 and 20 epochs respectively, using an early-stopping strategy.
   The loss function of the linear model is cross entropy. We train for 200 epochs
with SGD with momentum [28] and a batch size of 32. The number of tf-idf features is 500,000.
   The ECCO-BERT-tfidf models are trained for 220 epochs with SGD with momentum. The
feature extractors are the encoder of the fine-tuned ECCO-BERT-Seq and the vectorizer of the
linear baseline. To push the model to make more use of the tf-idf features, we mask the features
from ECCO-BERT-Seq for the first 200 epochs, as sketched below. The number of tf-idf features
is 500,000; the dimension of the features extracted from ECCO-BERT-Seq is 768.
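A toy sketch of this staged training (dimensions reduced for illustration; the real feature sizes are 500,000 and 768, and the batches here are synthetic stand-ins):

import torch

n_tfidf, n_bert, n_classes = 1000, 768, 43   # 500,000 tf-idf features in the paper
linear = torch.nn.Linear(n_tfidf + n_bert, n_classes)
opt = torch.optim.SGD(linear.parameters(), lr=0.1, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

# synthetic stand-in for the real (tfidf, bert, label) batches
batches = [(torch.rand(32, n_tfidf), torch.rand(32, n_bert),
            torch.randint(0, n_classes, (32,))) for _ in range(4)]

for epoch in range(220):
    for tfidf_x, bert_x, y in batches:
        if epoch < 200:                        # first 200 epochs: mask the
            bert_x = torch.zeros_like(bert_x)  # ECCO-BERT-Seq features
        loss = loss_fn(linear(torch.cat([tfidf_x, bert_x], dim=-1)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()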
   In addition to the primary ECCO-BERT model, we also trained the Perceiver IO model [9]
on the same data as the BERT models. Perceiver is a Transformer model that decouples input
size from overall model size and allows the model to scale linearly with the size of the input
as well as with model depth. Perceiver IO generalizes Perceiver further by allowing for arbitrary
outputs. Due to their linear scaling characteristics, the Perceiver models make it practical to use
character-level input data, which could result in a model that is more robust against character-
level OCR artefacts in the ECCO dataset. Testing this property is our main motivation for using
Perceiver IO on this task. We pre-trained Perceiver on the ECCO data for 1 million steps with
an effective batch size of 768. Training is done similarly to ECCO-BERT, except that the next
sentence prediction task is not used. Fine-tuning for the genre classification task is also similar
to the BERT models, except that unfiltered, byte-level data is used as model input.
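As a hedged illustration of byte-level fine-tuning, the sketch below uses the public deepmind/language-perceiver checkpoint from the transformers library as a stand-in, since the ECCO-pretrained Perceiver checkpoint is not named in the paper; the input string is a made-up example.

from transformers import PerceiverTokenizer, PerceiverForSequenceClassification

tok = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForSequenceClassification.from_pretrained(
    "deepmind/language-perceiver", num_labels=43)  # 43 sub-categories

# the tokenizer maps raw text to UTF-8 bytes, so OCR artefacts pass
# through unfiltered rather than being lost to a sub-word vocabulary
enc = tok("AN INQUIRY INTO THE Nature and Caufes OF THE WEALTH OF NATIONS",
          padding="max_length", truncation=True, return_tensors="pt")
logits = model(inputs=enc.input_ids, attention_mask=enc.attention_mask).logits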




Table 1
Performance comparison for predicting categories with fine-tuned ECCO-BERT models and baselines

                                   Main categories                     Sub categories
  Type                     Development (acc)      Test (acc)   Development (acc)      Test (acc)
  linear_model                          0.9303       0.9333                 0.8828       0.8904
  bert-base-cased                       0.9316       0.9363                 0.9011       0.9041
  ECCO-Perceiver-Seq                    0.9595       0.9555                 0.9280       0.9329
  ECCO-BERT-Seq                         0.9562       0.9602                 0.9333       0.9416
  ECCO-BERT-Chunk                       0.9622       0.9651                 0.9346       0.9419
  ECCO-BERT-tfidf                       0.9645       0.9688                 0.9442       0.9485


5.2. Genre Model Performance
We report the models’ accuracy for main categories and sub-categories in Table 1. The con-
fusion matrix of ECCO-BERT-tfidf is shown in Figure 2. There is a significant gap between the
fine-tuned bert-base-cased model and the models based on ECCO-BERT, since bert-base-cased
is pre-trained on a modern-language corpus, was not exposed to OCR noise during pre-training,
and the language has naturally evolved between eighteenth-century and present-day English.
Although ECCO-BERT-Seq is conditioned only on the first 510 tokens of the document, its
results are competitive with ECCO-BERT-Chunk and ECCO-BERT-tfidf, which consider the full
document. As shown in Table 1, ECCO-BERT-tfidf performs best, since it combines the
transformer features with the tf-idf of the full document. ECCO-BERT-tfidf is also much faster
than ECCO-BERT-Chunk, because extracting tf-idf features is much faster than transformer
inference.
   Of particular note is the improvement of all ECCO-BERT models over base BERT and the
linear model when looking at the more fine-grained categories. Somewhat disappointingly,
the fine-tuned Perceiver IO models do not perform better than the BERT-based models on this
task in our evaluation. This would indicate that the OCR noise does not interfere with the
genre detection task enough to degrade the performance of the BERT-based models.

5.3. Document-level Evaluation and Prediction Results
Here we report the document-level evaluation results for the main categories. The confusion
matrix in Figure 2 shows that the precision of the literature category is the highest, while
education is the lowest. We also use the ECCO-BERT-tfidf model to predict genres for the
unlabeled ECCO data and obtain model-predicted genre distributions. There are 177,494
unlabeled documents in total. The breakdown of predicted categories is shown in Figure 3.
As our label taxonomy is custom-made, there is no ground truth for the entirety of ECCO
against which to fully evaluate the accuracy of the predictions. However, the predictions
roughly match our expectations: previous analyses of the ESTC, using the existing Dewey
Decimal System labels, have found that the most common subject category is religion. [4]




Figure 2: ECCO-BERT-tfidf confusion matrix. Rows are predictions and columns are true labels,
both in the order Arts, Scientific Improvement, Literature, Education, History, Law, Sales
Catalogues, Politics, Philosophy, Religion:

Arts                    376   0   6  0   1   0  0  0   0   0
Scientific Improvement    0 274   2  2   3   0  1  1   1   0
Literature               10   3 918  2   7   1  3  0   4   5
Education                 0   2   0 78   0   0  0  0   1   0
History                   0   3   3  1 378   0  0  2   2   2
Law                       0   0   0  0   0 199  0  0   0   1
Sales Catalogues          0   0   2  0   0   0 94  0   0   0
Politics                  0   1   0  0   2   1  0 72   0   0
Philosophy                0   0   0  1   0   0  0  2 167   3
Religion                  1   0   4  2   1   2  1  2   0 362

Figure 3: The predicted main categories’ distribution (pie chart; Religion, at 23.04%, and
Literature, at 18.98%, take the largest shares).


6. Fine-grained analysis with ECCO-BERT-Seq
6.1. Sequential Genre Change
As well as using ECCO-BERT-Seq to generate document-level predictions from averaged
values, we can use the individual chunk predictions directly. Here we propose a method that
uses this chunk-level detection to find places within documents where the change from
one genre to another is significant and sustained. Because the predicted genre generally os-
cillates significantly from one individual chunk to the next, we needed a method that captures
only sustained changes, ignoring shorter breaks within a ’run’ of the same genre. To do this,
we used the Kleinberg algorithm for detecting ’bursts’ of activity in time-series data. [12] This
uses a hidden Markov process to probabilistically determine when a subsequent event will
occur; when events occur more rapidly and for sustained periods relative to this expectation,
they are labelled bursts. The bursts were detected using the R bursts package [3], which
implements the Kleinberg algorithm. A hedged Python re-implementation is sketched below
for illustration.
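This is a minimal reading aid rather than the code used in the paper (which relies on the R bursts package); it implements Viterbi decoding of Kleinberg's burst automaton under the standard parameterisation.

import numpy as np

def kleinberg_states(offsets, s=2.0, gamma=1.0, k=2):
    """Viterbi decoding of Kleinberg's burst automaton [12].
    offsets: sorted event times (here, chunk indices at which a genre
    is the top prediction). Returns one automaton state per inter-event
    gap; runs of state >= 1 correspond to 'bursts'."""
    offsets = np.sort(np.asarray(offsets, dtype=float))
    gaps = np.diff(offsets)
    n = len(gaps)
    if n == 0 or gaps.sum() == 0:
        return [0] * n
    alphas = (n / gaps.sum()) * s ** np.arange(k + 1)       # state firing rates
    emit = lambda j, g: alphas[j] * g - np.log(alphas[j])   # -log exp. density
    trans = lambda i, j: (j - i) * gamma * np.log(n) if j > i else 0.0
    cost = np.full(k + 1, np.inf)
    cost[0] = 0.0                                           # start in state 0
    back = []
    for g in gaps:
        cand = np.array([[cost[i] + trans(i, j) for i in range(k + 1)]
                         for j in range(k + 1)])
        ptr = cand.argmin(axis=1)
        cost = cand.min(axis=1) + np.array([emit(j, g) for j in range(k + 1)])
        back.append(ptr)
    states = [int(np.argmin(cost))]                         # backtrack
    for ptr in reversed(back[1:]):
        states.append(int(ptr[states[-1]]))
    return states[::-1]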
   To adapt this method, the most probable prediction for each chunk within each document
was treated as an event in a time series for the Kleinberg algorithm. We calculated sections for
main and sub-categories separately. The method allows for ’fuzzy’ and overlapping sections of
genres. Additionally, we experimented with retaining only highly probable classifications,
which helped to further filter out noise. There are drawbacks: because the burst method looks
for change rather than simply all clusters of events, currently not all sections are detected if
most of the text is of a single genre.
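A hypothetical glue function, turning per-chunk genre probabilities into per-genre event offsets (keeping only confident top predictions, as in Figure 4) and feeding them to the sketch above:

import numpy as np

def genre_bursts(chunk_probs, genre_names, min_prob=0.5):
    """chunk_probs: array of shape (n_chunks, n_genres)."""
    chunk_probs = np.asarray(chunk_probs)
    top = chunk_probs.argmax(axis=1)     # most probable genre per chunk
    conf = chunk_probs.max(axis=1)
    bursts = {}
    for gi, name in enumerate(genre_names):
        offsets = np.where((top == gi) & (conf >= min_prob))[0]
        if len(offsets) > 2:
            # kleinberg_states is the sketch above (or the R bursts package)
            bursts[name] = kleinberg_states(offsets)
    return bursts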
   To give some examples, we take some exemplary texts and calculate genre bursts. To vi-
sualise the changes in genre, top genre predictions (over .5 probability) are charted as a scat-
terplot in the paragraph sequence, coloured by genre. Burst start and end points are overlaid
as coloured areas. As the method looks for periods of change rather than absolute values, it
ignores the main category of the book (which is detected by the document-level method suc-
cessfully anyway) and in most cases highlights sustained excerpts where the detected genre is
different to the dominant one. Here, we see that David Hume’s Political Discourses (Figure 4,
A) contains discrete sections on economics (categorised as scientific improvement), philosophy
(a section on the balance of power), history (a section on ’ancient nations’) and finally law (a
chapter on the idea of the commonwealth). Wealth of Nations (Figure 4, B) begins with a section
on labour and society categorised here as philosophy, with smaller sections on law (a discussion
of a specific statute) and in the education genre. Most of the book is not classified as its dom-
inant genre (economics and trade, under the higher-level category scientific improvement) as
it does not involve change. In Villiers’ Miscellaneous Works (Figure 4, C) the method detects a
large number of overlapping genre changes. Finally, Robinson Crusoe (Figure 4, D) is also mostly
without detected bursts, but of note is a section of religious genre, corresponding to a section
in the plot where Crusoe is ill and has prophetic dreams.

[Figure 4 shows four panels of chunk-sequence scatterplots, one per work: (A) Hume’s Political
Discourses, (B) the Wealth of Nations, (C) Villiers’ Miscellaneous Works, (D) Robinson Crusoe.]

Figure 4: Sections of genre bursts calculated by the Kleinberg algorithm. Points are prediction
probabilities, filtered to .5 or greater. Coloured shaded areas are sections of genre bursts. Because
this method looks for change rather than absolute values, it ignores the main genre of the book in
the first two cases (politics, and economics).




7. Discussion and Conclusion
In this paper we aimed to describe a process for detecting sections of fuzzy and overlapping
genre excerpts within individual editions. The results show that at the level of fine-grained
divisions (43 sub-categories), a model which combines the tf-idf features of the full document
with the features of a fine-tuned ECCO-BERT model performs significantly better than the
baselines, suggesting such models may be particularly useful for this kind of task. That the
BERT model performed so well on fine-grained categories is significant because existing
methods for studying genre have generally used very broad divisions (such as fiction and
non-fiction). The kinds of questions we are interested in require more fine-grained categories,
for example tracking the rise of medical textbooks among certain publishers. This kind of
sequencing also has other potential uses, for example document retrieval. On the present task,
we did not observe any improvement from the Perceiver model, which we specifically included
to test a character-level model capable of accounting for OCR artefacts. At present, we think
this is due to a combination of two factors. Firstly, the base performance on the task is around
95% accuracy, leaving very little headroom for improvement with more advanced models.
Secondly, the task is by its nature a document-level task, and the good performance of the
linear baseline demonstrates that enough information is present in the data even without
explicitly accounting for OCR errors. It is therefore possible that the advantages of character-
based models such as the Perceiver will be demonstrated on tasks where the correct modelling
of individual word occurrences in their context plays a more significant role; these would
include various text tagging and information retrieval tasks.
   In future work we hope to further develop the sequencing method and to investigate the
genres in their own right, for instance looking at the sequence patterns of individual authors,
the relationship between intra-book diversity and the success of particular authors or publish-
ers, and the co-occurrence of genres.


References
 [1] B. Baptiste, B. Favre, J. Auguste, and C. Henriot. “Transferring Modern Named Entity
     Recognition to the Historical Domain: How to Take the Step?” In: Workshop on Natural
     Language Processing for Digital Humanities (NLP4DH). 2021.
 [2] B. M. Benedict. “The Paradox of the Anthology: Collecting and Différence in Eighteenth-
     Century Britain”. In: New Literary History 34.2 (2003), pp. 231–256. url: http://www.jstor.org/stable/20057778.
 [3] J. Binder. bursts: Markov Model for Bursty Behavior in Streams. 2022. url: https://CRAN.R-project.org/package=bursts.
 [4] J. Feather. “British Publishing in the Eighteenth Century: a preliminary subject analysis”.
     In: The Library s6-VIII.1 (1986), pp. 32–46. doi: 10.1093/library/s6-VIII.1.32. url: https://doi.org/10.1093/library/s6-VIII.1.32.
 [5] Gale. Eighteenth Century Collections Online. url: https://www.gale.com/intl/primary-sources/eighteenth-century-collections-online.
 [6] A. Goyal and V. Prem Prakash. “Statistical and Deep Learning Approaches for Literary
     Genre Classification”. In: Advances in Data and Information Sciences. Ed. by S. Tiwari,
     M. C. Trivedi, M. L. Kolhe, K. Mishra, and B. K. Singh. Vol. 318. Singapore: Springer
     Singapore, 2022, pp. 297–305. doi: 10.1007/978-981-16-5689-7_26. url: https://link.springer.com/10.1007/978-981-16-5689-7_26.




 [7] S. Gupta, M. Agarwal, and S. Jain. “Automated Genre Classification of Books Using Ma-
     chine Learning and Natural Language Processing”. In: 2019 9th International Conference
     on Cloud Computing, Data Science & Engineering (Confluence). Noida, India: IEEE, 2019,
     pp. 269–272. doi: 10.1109/confluence.2019.8776935. url: https://ieeexplore.ieee.org/document/8776935/.
 [8] M. J. Hill and S. Hengchen. “Quantifying the impact of dirty OCR on historical text anal-
     ysis: Eighteenth Century Collections Online as a case study”. In: Digital Scholarship in
     the Humanities 34.4 (2019), pp. 825–843. doi: 10.1093/llc/fqz024. url: https://academic.oup.com/dsh/article/34/4/825/5476122.
 [9] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D.
     Zoran, A. Brock, E. Shelhamer, O. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals,
     and J. Carreira. Perceiver IO: A General Architecture for Structured Inputs & Outputs. 2021.
     doi: 10.48550/arxiv.2107.14795. url: https://arxiv.org/abs/2107.14795.
[10]   M. Jiang, Y. Hu, G. Worthey, R. C. Dubnicek, T. Underwood, and J. S. Downie. “Impact of
       OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts”. In:
       CHR. 2021, pp. 266–279.
[11]   J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional
       Transformers for Language Understanding”. In: Proceedings of NAACL-HLT. 2019, pp. 4171–4186.
[12]   J. Kleinberg. “Bursty and Hierarchical Structure in Streams”. In: Data Mining and Knowl-
       edge Discovery 7.4 (2003), pp. 373–397. doi: 10.1023/a:1024940629314. url: https://doi.org/10.1023/A:1024940629314.
[13]   K. Labusch, P. Kulturbesitz, C. Neudecker, and D. Zellhöfer. “BERT for named entity
       recognition in contemporary and historical German”. In: Proceedings of the 15th confer-
       ence on natural language processing. 2019, pp. 9–11.
[14]   L. Lahti, E. Mäkelä, and M. Tolonen. “Quantifying Bias and Uncertainty in Historical
       Data Collections with Probabilistic Programming”. In: (2020). url: https://helda.helsinki.fi/handle/10138/327728.
[15]   L. Lahti, J. Marjanen, H. Roivainen, and M. Tolonen. “Bibliographic Data Science and the
       History of the Book (c. 1500–1800)”. In: Cataloging & Classification Quarterly 57.1 (2019),
       pp. 5–23. doi: 10.1080/01639374.2018.1543747. url: https://doi.org/10.1080/01639374.2018.1543747.
[16]   E. Manjavacas and L. Fonteyn. “Adapting vs. Pre-training Language Models for Historical
       Languages”. In: Journal of Data Mining & Digital Humanities NLP4DH (2022). doi: 10.46298/jdmdh.9152.
       url: https://jdmdh.episciences.org/9690.
[17]   J. Manson. A catalogue of the entire and genuine library and prints of Robert Salusbury
       Gotton, Esq. F.A.S. [electronic resource]: Comprehending an extensive and valuable collec-
       tion of books of coins, medals and antiquities, with a few fine missals and other manuscripts
       on vellum, which, with some other select parcels of books lately purchased, are now on sale
       for ready money, at the price printed in the catalogue, and on the first leaf of each book,
       By John Manson, bookseller, No 5, Duke’s-Court, St. Martin’s-Lane, where catalogues (Price
       6d) may be had. London, 1789. [2], 102 pages.




[18]   D. Mazella, C. Willan, D. Bishop, E. Stravoski, W. Barta, and M. James. ““All the modes
       of story”: Genre and the Gendering of Authorship in the Year 1771”. In: ABO: Interactive
       Journal for Women in the Arts, 1640-1830 12.1 (2022). doi: 10.5038/2157-7129.12.1.1256.
       url: https://digitalcommons.usf.edu/abo/vol12/iss1/10.
[19]   F. Moretti. Distant reading. London; New York: Verso, 2013.
[20]   M. Poovey. “Mary Wollstonecraft: The Gender of Genres in Late Eighteenth-Century
       England”. In: NOVEL: A Forum on Fiction 15.2 (1982), pp. 111–126. url: http://www.jstor.org/stable/1345219.
[21]   I. Rastas, Y. C. Ryan, I. L. I. Tiihonen, M. Qaraei, L. Repo, R. Babbar, E. Mäkelä, M. Tolonen,
       and F. Ginter. “Explainable Publication Year Prediction of Eighteenth Century Texts with
       the BERT Model”. In: Proceedings of the 3rd Workshop on Computational Approaches to
       Historical Language Change. The Association for Computational Linguistics. 2022.
[22]   J. Raven. The business of books: booksellers and the English book trade, 1450-1850. New
       Haven: Yale University Press, 2007.
[23]   I. Rivers, ed. Books and their readers in eighteenth century England. Leicester: Leicester
       Univ. Press [u.a.], 1982.
[24]   I. Rivers, ed. Books and their readers in eighteenth-century England: new essays. London
       New York: Leicester University Press, 2001.
[25]   T. Schmidt, K. Dennerlein, and C. Wolff. “Emotion Classification in German Plays with
       Transformer-based Language Models Pretrained on Historical and Contemporary Lan-
       guage”. In: Association for Computational Linguistics. 2021.
[26]   T. Schmidt, K. Dennerlein, and C. Wolff. “Using Deep Learning for Emotion Analysis of
       18th and 19th Century German Plays”. In: (2021).
[27]   S. Schweter and L. März. “Triple E - Effective Ensembling of Embeddings and Language
       Models for NER of Historical German”. In: CLEF (Working Notes). 2020.
[28]   I. Sutskever, J. Martens, G. Dahl, and G. Hinton. “On the importance of initialization
       and momentum in deep learning”. In: International conference on machine learning. PMLR.
       2013, pp. 1139–1147.
[29]   K. Todorov and G. Colavizza. “An Assessment of the Impact of OCR Noise on Language
       Models”. In: arXiv preprint arXiv:2202.00470 (2022).
[30]   M. Tolonen, E. Mäkelä, A. Ijaz, and L. Lahti. “Corpus Linguistics and Eighteenth Century
       Collections Online (ECCO)”. In: Research in Corpus Linguistics 9.1 (2021), pp. 19–34. doi:
       10.32714/ricl.09.01.03. url: https://ricl.aelinco.es/index.php/ricl/article/view/161.
[31]   M. Tolonen, E. Mäkelä, A. Ijaz, and L. Lahti. “Corpus Linguistics and Eighteenth Century
       Collections Online (ECCO)”. In: Research in Corpus Linguistics 9.1 (2021), pp. 19–34. doi:
       10.32714/ricl.09.01.03. url: https://ricl.aelinco.es/index.php/ricl/article/view/161.
[32]   M. Tolonen, E. Mäkelä, and L. Lahti. “The Anatomy of Eighteenth Century Collections
       Online (ECCO)”. In: Eighteenth-Century Studies 56.1 (2022), pp. 95–123.




[33]   T. Underwood. Distant horizons: digital evidence and literary change. Chicago: The Uni-
       versity of Chicago Press, 2019.
[34]   T. Underwood. “Genre Theory and Historicism”. In: Journal of Cultural Analytics 2.2
       (2016). doi: 10.22148/16.008. url: https://culturalanalytics.org/article/11063.
[35]   T. Underwood. “The Life Cycles of Genres”. In: Journal of Cultural Analytics 2.2 (2016).
       doi: 10.22148/16.005. url: https://culturalanalytics.org/article/11061.
[36]   T. Underwood. “Understanding Genre in a Collection of a Million Volumes, Interim Re-
       port”. In: (2014). doi: 10.6084/m9.figshare.1281251.v1. url: https://figshare.com/articles/journal_contribution/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.
[37]   T. Underwood, M. L. Black, L. Auvil, and B. Capitanu. Mapping Mutable Genres in Struc-
       turally Complex Volumes. 2013. doi: 10.1109/BigData.2013.6691676. url: http://arxiv.org/abs/1309.3323.
[38]   J. Worsham and J. Kalita. “Genre Identification and the Compositional Effect of Genre
       in Literature”. In: Proceedings of the 27th International Conference on Computational Lin-
       guistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018,
       pp. 1963–1973. url: https://aclanthology.org/C18-1167.
[39]   H. Yoo, J. Jin, J. Son, J. Bak, K. Cho, and A. Oh. “HUE: Pretrained Model and Dataset for
       Understanding Hanja Documents of Ancient Korea”. In: Findings of the Association for
       Computational Linguistics: NAACL 2022. Seattle, United States: Association for Compu-
       tational Linguistics, 2022, pp. 1832–1844. url: https://aclanthology.org/2022.findings-naacl.140.


A. Appendix
A.1. The main categories and sub-categories




Table 2
The main categories and their sub-categories
           Main categories          Sub-categories
                                    Fine Arts and Aesthetics
           Arts                     Music, hymns, songs
                                    Theatre, plays, opera
                                    Advice literature
                                    General Education
           Education                Recipe Books
                                    Hobbies & Games
                                    Instructional books
           History                  Biographical History
                                    General History
                                    Acts, proclamations
                                    Appeals
           Law                      Collected bills
                                    Legal essays
                                    Proclamations
                                    Trial accounts
                                    Classics
                                    Collected Works
                                    Criticism
                                    Drama
           Literature               Novels
                                    Other fiction
                                    Periodicals
                                    Poetry
                                    Travel
                                    Human understanding, metaphysics
           Philosophy               Moral Philosophy
                                    Political philosophy
                                    Political essays
           Politics                 Intelligence
                                    Parliamentary speeches
           Religion                 Sermons
                                    Theology
           Sales Catalogues         Sales catalogues, almanacs, directories etcetera
                                    Agriculture and animal husbandry
                                    Economics and trade
                                    Geography, cartography, astronomy and navigation
                                    Languages
           Scientific Improvement   Mathematics
                                    Medicine and anatomy
                                    Natural history
                                    Natural philosophy
                                    Practical trades, mechanics, engineering



