Detecting Sequential Genre Change in Eighteenth-Century Texts

Jinbin Zhang¹, Yann Ciarán Ryan³, Iiro Rastas⁴, Filip Ginter⁴, Mikko Tolonen³ and Rohit Babbar¹

¹ Aalto University, Finland
³ University of Helsinki, Finland
⁴ TurkuNLP, University of Turku, Finland

Abstract

Machine classification of historical books into genres is a common task for NLP-based classifiers and has a number of applications, from literary analysis to information retrieval. It is not, however, a straightforward task: genre labels can be ambiguous and subject to temporal change, and moreover many books consist of mixed or miscellaneous genres. In this paper we describe a work-in-progress method by which genre predictions can be used to determine longer sequences of genre change within books, which we test with visualisations of some hand-picked texts. We apply state-of-the-art methods to the task, including a BERT-based transformer and a character-level Perceiver model, both pre-trained on a large collection of eighteenth-century works (ECCO), using a new set of hand-annotated documents created to reflect historical divisions. Results show that both models perform significantly better than a linear baseline, particularly when ECCO-BERT is combined with tf-idf features, though for this task the character-level model provides no obvious advantage. Initial evaluation of the genre sequence method shows it may in the future be useful in determining and dividing the multiple genres of miscellaneous and hybrid historical texts.

Keywords

BERT, text classification, genre change, ECCO, Perceiver

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium

1. Introduction

Thinking about the large-scale development of early modern public discourse through the use of structured data is an exciting opportunity, as Moretti established some time ago. [19] Besides the use of already available bibliographic data for "distant reading", a useful further element is to use unstructured textual databases as source material for the creation of new structured data for fields where such data is currently scarce. [15] One such classification field is genre. Readily available genre information is often sporadic, but the opportunities to use it, especially when we consider that many documents are composed of several sequential genres, can open a new window onto the development of public discourse. With better structured data, we will be able to study the systematization of particular genres in a new manner and take a fresh look at authorship and the relevance of publisher networks. Much work in literary history and the history of the book has relied on the analysis of generic categories (for examples see [20, 33, 34, 35, 2, 19]). Computational genre classification is a complex problem.
Two key reasons are that genre divisions change over time, and that not every book can be unambiguously assigned a single genre label. Existing methods for genre detection often assume that each text, or each pre-defined chunk such as a chapter or section, can be classified as a single genre or a distribution of genre probabilities [7, 38, 6], which does not reflect the reality of many eighteenth-century texts. One important exception to this is the page-level classification of Underwood et al. [36], subsequently used to detect sequences of genre using a hidden Markov model. [37]

This paper describes a number of improvements to existing methods. First, rather than relying on existing modern or broad classification systems, we use a newly-created training set of documents, with a custom-designed, domain-specific taxonomy which attempts to balance pragmatism with capturing meaningful and fine-grained eighteenth-century organisational categories. Second, we use a BERT transformer model which has been specifically trained on eighteenth-century texts, and which performs significantly better than base BERT. Third, we propose a method by which we hope this fine-grained classification can be used to represent books as sequences and combinations of genres. We report on and compare results from a number of classifiers: a document-level classifier that uses only one BERT input segment for each document (ECCO-BERT-Seq), a classifier for text chunks, which can also be aggregated at the document level (ECCO-BERT-Chunk), and a character-level Perceiver model using the same input as ECCO-BERT-Seq.¹

[¹ In this paper the words 'book' and 'document' have distinct meanings. 'Book' is used to denote an edition of a physical book, for example 'there are over 400,000 books listed in the English Short Title Catalogue'. 'Document', by contrast, is reserved for a single text document as used as data for the classification method and other tasks. Not all documents in the ECCO data map to a single book, and vice versa.]

The BERT model [11] has achieved great improvements on various modern-language datasets in comparison to previous deep learning methods. Recently, there have also been models pre-trained on historical corpora of different languages [21, 16, 39], and pre-trained language models have been applied in the historical domain to tasks such as publication year prediction [21], named entity recognition [16, 13, 1, 27] and emotion analysis. [26, 25] We also face some challenges from OCR recognition errors [10, 29] when using pre-trained models on historical data.

2. The ECCO Dataset

The data used both for model training and for predictions comes from Eighteenth Century Collections Online (ECCO). ECCO is a set of 180,000 digitised documents published originally in the eighteenth century, created by the software and education company Gale. [5] These digitised images have been converted into readable text data using Optical Character Recognition (OCR). Despite its size, a recent study comparing ECCO to the English Short Title Catalogue (ESTC) has highlighted significant gaps and imbalances [32], and the ESTC itself is known to be incomplete. [22] These attributes, and their impact on several downstream tasks, have been covered in detail in previous papers [30, 8, 14] and are only briefly outlined here. First, the distribution of documents in ECCO is uneven and skewed towards the end of the century, and second, the OCR contains significant noise and errors.
Additionally, not all texts are in the English language, and many are reprints of works published in earlier centuries. The former have been excluded, but the latter are retained for our training and test data. Despite these caveats, ECCO is the largest and most complete source we have for eighteenth-century text data. Though it has its own institutional history and biases, it is complete enough that it contains not only the more 'important' or 'literary' genres, and it is not focused solely on canonical works. Its data and digitised images are used extensively, forming the basis of many scholarly enquiries and research questions. [31]

3. Data Annotation

Key to the work leading up to this paper was the creation of a usable training set of documents annotated with genre labels. We began with a sample set of book records and a set of preliminary genre labels. These books were then labelled by two annotators with domain expertise. At this stage, we revisited the labels and made some adjustments to those which had particularly low inter-annotator agreement. Once the set of genre labels had been finalised, we annotated a large set (5,672 individual works, which correspond to 37,574 known editions, of which 30,119 correspond to ECCO documents) with genre information. After this second round, we again checked for inter-annotator agreement, coming to a consensus following a discussion of each disagreement. The eventual 43 fine-grained categories were then collapsed into main categories for some of the classification tasks. These book labels were then mapped to the equivalent ECCO document IDs. The final set of labels is given in Appendix A.

Existing categorical distinctions were either too broad (for example fiction and non-fiction) or too fine-grained (for example the many historical literary divisions, particularly poetic) for our needs. Our categories attempt to reflect the divisions found in contemporary sources such as catalogues. [17] Additionally, they are closely related to the divisions used by modern domain experts writing on the history of the book, for example the highly-regarded edited collection Books and their Readers in Eighteenth-Century England, which contains chapters organised along similar divisions to our own. [23, 24] We note that other recent attempts to categorise eighteenth-century book genres use a similar system of division. [18] The selection is intended to provide useful genre categorisation for scholarly inquiry into book history and book production. The selection was also pragmatic, with the aim of ending up with a manageable number of genres, for example so that each class had enough data for the training and test sets. The categories were also made with particular questions in mind, which we hoped would help us to analyse works of Scottish Enlightenment thought, for instance helping to distinguish patterns within scientific or philosophical publishing.

4. Method

In this section, we introduce the pre-trained ECCO-BERT model, the fine-tuned models and the baselines.² We denote the training dataset as $\{(X_i, Y_i)\}_{i=1}^{N}$, where $X_i$ is the book and $Y_i$ is the genre of $X_i$. Our goal is to learn a function $f(X_i)$ to predict the genre of book $X_i$, or the genre of a chunk of book $X_i$.

[² The model implementation is available at https://github.com/HPC-HD/ECCO-genre-classification. The original ECCO-BERT model has been released and is available at https://huggingface.co/TurkuNLP/eccobert-base-cased-v1.]

4.1. Multi-granular Classification with ECCO-BERT

ECCO-BERT [21] is a pre-trained language model trained on the ECCO dataset, with the same configuration as the bert-base-cased model [11] except for the vocabulary size.
The model is pre-trained with a masked language modelling task as well as a next-sentence prediction task. The fine-tuned ECCO-BERT consists of two parts: the transformer encoder, and a linear layer on top of the mean-pooled output of the encoder, which scores the different genres. The Transformer architecture on which the model is based accepts inputs only up to a relatively short maximum length; in the ECCO-BERT case the standard maximum of 512 input tokens applies. Inputs longer than this maximum need to be split into chunks.

[Figure 1: The inference process of ECCO-BERT-Chunk. The document is split into chunks without overlap. The model scores each chunk and averages the probabilities of the chunks as the final prediction.]

Because we want the training and prediction of the model to take into account the full information of the document, a document is split into chunks of 510 tokens each for training and prediction, since the maximum input size of ECCO-BERT is 512 tokens (510 input tokens plus the 2 special tokens expected by the model). For training, we assume that each chunk has the same genre as the document, and the model is trained on the resulting (chunk, label) pairs. During inference, we first split the document into chunks; the fine-tuned model then scores each chunk, and the predicted genre probability of the document is the average of all chunks' probabilities. The inference process is shown in Figure 1. We call this model ECCO-BERT-Chunk. For comparison, we also train a model conditioned only on the first 510 sub-words of the document as input, which is denoted ECCO-BERT-Seq.
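To make the chunking and averaging concrete, the sketch below implements the ECCO-BERT-Chunk inference procedure against the released ECCO-BERT checkpoint. It is a minimal illustration rather than the paper's code (which is linked in footnote 2): the classification head loaded here is freshly initialised rather than fine-tuned, `num_labels=10` assumes the main-category setup, and the stock Hugging Face head classifies from the [CLS] position whereas the paper describes mean pooling.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Released pre-trained checkpoint; a fine-tuned classification head is assumed.
MODEL = "TurkuNLP/eccobert-base-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=10)
model.eval()

@torch.no_grad()
def predict_document(text, chunk_len=510):
    """ECCO-BERT-Chunk inference: score every 510-token chunk, average the probabilities."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)]
    probs = []
    for chunk in chunks:
        # Re-attach the 2 special tokens the model expects: [CLS] chunk [SEP].
        inp = torch.tensor([[tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]])
        probs.append(model(input_ids=inp).logits.softmax(dim=-1))
    return torch.cat(probs).mean(dim=0)  # document-level genre distribution
```

In these terms, ECCO-BERT-Seq corresponds to scoring `chunks[0]` alone, which is why it is much faster but sees only the opening of the book.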
Although the ECCO-BERT-Chunk model considers all chunks to make the final judgment, its prediction process is very slow, since a book often contains many chunks. At the same time, the much faster ECCO-BERT-Seq is conditioned only on the first 510 sub-words, so it may lose important information from other parts of the book. To address this, we trained a linear model on the concatenation of the tf-idf features of the full text with the pooled output of the fine-tuned ECCO-BERT-Seq. The input can be denoted as $[\Phi_{\mathrm{tfidf}}(X_i), \Phi_{\mathrm{ECCO\text{-}BERT}}(X_i[{:}510])]$, where the $\Phi$ denote the tf-idf vectorizer and the transformer encoder respectively. We call this model ECCO-BERT-tfidf; all results are shown in Table 1.

4.2. Baseline Models

We adopt two baseline models for comparison. The input of the linear model is the tf-idf features of the full document; the model contains only a linear layer, whose fan-out is the number of main or sub-categories. The bert-base-cased model was released by [11]; we fine-tuned it directly on our training data.

5. Results

30,119 documents were annotated by experts. 6,024 documents were randomly selected and split into development and test datasets of 3,012 documents each. The labels comprise 10 main categories and 43 sub-categories; the genre labels are presented in Appendix A.1.

5.1. Experimental Details

The sequence length of all BERT models is set to 512. For fine-tuning the ECCO-BERT-Seq and bert-base-cased models, we adopt only the first 510 sub-words of the document as input. These models are trained for 100 epochs on 1 NVIDIA V100. ECCO-BERT-Chunk is fine-tuned on 4 NVIDIA A100 GPUs; the main-category model and the sub-category model were trained for 21 and 20 epochs respectively, using an early-stopping strategy.

The loss function of the linear model is cross-entropy. We train it for 200 epochs with SGD with momentum [28] and a batch size of 32. The ECCO-BERT-tfidf models are trained for 220 epochs with SGD with momentum. Their feature extractors are the encoder of the fine-tuned ECCO-BERT-Seq and the vectorizer of the linear baseline. To make the model rely more on the tf-idf features, we mask the features from ECCO-BERT-Seq for the first 200 epochs. The number of tf-idf features is 500,000, and the dimension of the features extracted from ECCO-BERT-Seq is 768.
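The following sketch shows one plausible shape for the ECCO-BERT-tfidf classifier and its two-phase schedule. The dimensions (500,000 tf-idf features, 768-dimensional BERT features, 220 epochs with the BERT part masked for the first 200) come from the text above; the class name, learning rate and data loader are our own assumptions, and feature extraction is taken as already done.

```python
import torch
import torch.nn as nn

N_TFIDF, BERT_DIM, N_CLASSES = 500_000, 768, 43  # dimensions from the paper

class EccoBertTfidf(nn.Module):
    """Linear scorer over [tf-idf(full document) ; pooled ECCO-BERT-Seq(first 510 sub-words)]."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(N_TFIDF + BERT_DIM, N_CLASSES)

    def forward(self, tfidf_feats, bert_feats, mask_bert=False):
        if mask_bert:  # early epochs: hide the BERT features so tf-idf is learned first
            bert_feats = torch.zeros_like(bert_feats)
        return self.linear(torch.cat([tfidf_feats, bert_feats], dim=-1))

def train(model, loader, epochs=220, mask_until=200, lr=0.01):
    # loader yields (tfidf_feats, bert_feats, labels) batches of size 32;
    # the learning rate is an assumption, the paper specifies only momentum SGD.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        for tfidf_feats, bert_feats, labels in loader:
            loss = criterion(model(tfidf_feats, bert_feats, mask_bert=epoch < mask_until), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```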
In addition to the primary ECCO-BERT model, we also trained the Perceiver IO model [9] on the same data as the BERT models. Perceiver is a Transformer model that decouples input size from overall model size, allowing the model to scale linearly with the size of the input as well as with model depth. Perceiver IO generalizes Perceiver further by allowing for arbitrary outputs. Due to their linear scaling characteristics, the Perceiver models make it practical to use character-level input data, which could result in a model that is more robust against character-level OCR artefacts in the ECCO dataset. Testing this property is our main motivation for using Perceiver IO on this task. We pre-trained Perceiver on the ECCO data for 1 million steps with an effective batch size of 768. Training is done similarly to ECCO-BERT, except that the next-sentence prediction task is not used. Fine-tuning for the genre classification task is also similar to the BERT models, except that unfiltered, byte-level data is used as model input.
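The ECCO-pre-trained Perceiver checkpoint is not, to our knowledge, publicly released, so the sketch below substitutes the generic deepmind/language-perceiver checkpoint purely to illustrate the byte-level pipeline: raw UTF-8 bytes go in directly, so there is no word-level vocabulary for OCR artefacts to fall outside of.

```python
from transformers import PerceiverTokenizer, PerceiverForSequenceClassification

# Stand-in checkpoint; the paper's model was pre-trained on ECCO instead.
NAME = "deepmind/language-perceiver"
tokenizer = PerceiverTokenizer.from_pretrained(NAME)  # byte-level: no subword vocabulary
model = PerceiverForSequenceClassification.from_pretrained(NAME, num_labels=43)

# Unfiltered text, OCR noise and all, is encoded directly as UTF-8 bytes.
enc = tokenizer("CHAP. I. Of the Divifion of Labour ...", padding="max_length",
                truncation=True, max_length=2048, return_tensors="pt")
logits = model(inputs=enc.input_ids, attention_mask=enc.attention_mask).logits
```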
Table 1: Performance comparison for predicting categories with fine-tuned ECCO-BERT models and baselines (all figures are accuracy)

                      Main categories            Sub-categories
Type                  Development   Test         Development   Test
linear_model          0.9303        0.9333       0.8828        0.8904
bert-base-cased       0.9316        0.9363       0.9011        0.9041
ECCO-Perceiver-Seq    0.9595        0.9555       0.9280        0.9329
ECCO-BERT-Seq         0.9562        0.9602       0.9333        0.9416
ECCO-BERT-Chunk       0.9622        0.9651       0.9346        0.9419
ECCO-BERT-tfidf       0.9645        0.9688       0.9442        0.9485

5.2. Genre Model Performance

We report the models' accuracy for main categories and sub-categories in Table 1. The confusion matrix of ECCO-BERT-tfidf is shown in Figure 2. There is a significant gap between the fine-tuned bert-base-cased model and the models based on ECCO-BERT: the bert-base-cased model is pre-trained on a modern-language corpus, was not exposed to OCR noise during pre-training, and the language has naturally evolved between eighteenth-century and present-day English. Although ECCO-BERT-Seq is conditioned only on the first 510 tokens of the document, its results are competitive with ECCO-BERT-Chunk and ECCO-BERT-tfidf, which consider the full document. As shown in Table 1, ECCO-BERT-tfidf performs best, since it combines the transformer features with the tf-idf of the full document. ECCO-BERT-tfidf is also much faster than ECCO-BERT-Chunk, because extracting tf-idf features is much faster than transformer inference. Of particular note is the performance of all ECCO-BERT models over base BERT and the linear model on the more fine-grained categories. Somewhat disappointingly, the fine-tuned Perceiver IO models do not perform better than the BERT-based models on this task in our evaluation. This would indicate that the OCR noise does not interfere with the genre detection task enough to degrade the performance of BERT-based models.

5.3. Document-level Evaluation and Prediction Results

Here we report on the evaluation of the document-level results for the main categories. The confusion matrix in Figure 2 shows that the precision of the literature category is the highest, while education is the lowest. We also use the ECCO-BERT-tfidf model to predict the unlabelled ECCO data and obtain model-predicted genre distributions. There are 177,494 unlabelled documents in total. The breakdown of predicted categories is shown in Figure 3. As our label taxonomy is custom-made, there is no ground truth for the entirety of ECCO against which to fully evaluate the accuracy of the predictions. However, the predictions roughly match our expectations: previous analyses of the ESTC, using the existing Dewey Decimal System labels, have found that the most common subject category is religion. [4]

[Figure 2: ECCO-BERT-tfidf confusion matrix (rows: predictions; columns: true labels). Figure 3: The predicted main categories' distribution.]

6. Fine-grained Analysis with ECCO-BERT-Seq

6.1. Sequential Genre Change

As well as using ECCO-BERT-Seq to generate document-level predictions from averaged values, we can use the individual chunk predictions directly. Here we propose a method that uses this chunk-level detection to find places within documents where the change from one genre to another is significant and sustained. Because the predicted genre generally oscillates significantly from one chunk to the next, we needed a method to capture only sustained changes, ignoring shorter breaks within a 'run' of the same genre. To do this, we used the Kleinberg algorithm for detecting 'bursts' of activity in time-series data. [12] This uses a hidden Markov process to probabilistically determine when a subsequent event will occur; when events occur more rapidly and for sustained periods relative to this determination, they are labelled bursts. The bursts were computed using the R bursts package [3], which implements the Kleinberg algorithm.

To adapt this method, the most probable prediction for each chunk within each document was treated as a time-series data point for the Kleinberg algorithm, as shown in the sketch below. We calculated sections for main and sub-categories separately. The method allows for 'fuzzy' and overlapping sections of genres. Additionally, we experimented with retaining only highly probable classifications, which helped to further filter out noise. There are drawbacks: because the burst method looks for change rather than simply all clusters of events, currently not all sections are detected if most of the text is of a single genre.
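The detection itself was run with the R bursts package; to make the adaptation concrete, here is our own minimal Python re-implementation of the two-state variant of Kleinberg's model (a sketch, not the paper's code). The chunk indices at which a given genre is the top prediction are treated as event times, and a Viterbi pass over a baseline-rate state and a faster 'burst' state marks runs of closely spaced events as bursts.

```python
import numpy as np

def kleinberg_bursts(offsets, s=2.0, gamma=1.0):
    """Two-state Kleinberg burst detection (after [12], simplified).

    offsets: sorted event times, here the indices of chunks whose top
    prediction is the genre of interest. Returns (start, end) intervals.
    """
    offsets = np.asarray(offsets, dtype=float)
    gaps = np.diff(offsets)
    n = len(gaps)
    if n == 0 or offsets[-1] == offsets[0]:
        return []
    lam0 = n / (offsets[-1] - offsets[0])   # baseline event rate
    lam = np.array([lam0, s * lam0])        # burst state fires s times faster
    trans = gamma * np.log(n)               # cost of entering the burst state

    cost = np.array([0.0, np.inf])          # start in the baseline state
    back = []
    for x in gaps:                          # Viterbi over the two states
        emit = lam * x - np.log(lam)        # -log exponential density of the gap
        new_cost, ptr = np.empty(2), np.empty(2, dtype=int)
        for j in range(2):
            c = [cost[i] + (trans if j > i else 0.0) for i in range(2)]
            ptr[j] = int(np.argmin(c))
            new_cost[j] = min(c) + emit[j]
        back.append(ptr)
        cost = new_cost

    states = [int(np.argmin(cost))]         # trace back the optimal state sequence
    for ptr in reversed(back):
        states.append(ptr[states[-1]])
    states = states[::-1][1:]               # states[i] = state during gap i

    bursts, start = [], None                # collapse bursty gaps into intervals
    for i, st in enumerate(states):
        if st == 1 and start is None:
            start = offsets[i]
        elif st == 0 and start is not None:
            bursts.append((start, offsets[i]))
            start = None
    if start is not None:
        bursts.append((start, offsets[-1]))
    return bursts

# e.g. a genre that wins scattered chunks but clusters around chunks 40-45:
print(kleinberg_bursts([3, 9, 17, 40, 41, 42, 43, 44, 45, 60, 75]))
```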
To give some examples, we take some exemplary texts and calculate genre bursts. To visualise the changes in genre, top genre predictions (over .5 probability) are charted as a scatterplot in chunk sequence, coloured by genre. Burst start and end points are overlaid as coloured areas. As the method looks for periods of change rather than absolute values, it ignores the main category of the book (which is in any case detected successfully by the document-level method) and in most cases highlights sustained excerpts where the detected genre differs from the dominant one. Here, we see that David Hume's Political Discourses (Figure 4, A) contains discrete sections on economics (categorised as scientific improvement), philosophy (a section on the balance of power), history (a section on 'ancient nations') and finally law (a chapter on the idea of the commonwealth). Wealth of Nations (Figure 4, B) begins with a section on labour and society categorised here as philosophy, with smaller sections on law (a discussion of a specific statute) and in the education genre. Most of the book is not classified as its dominant genre (economics and trade, under the higher-level category scientific improvement) as it does not involve change. Villiers's Miscellaneous Works (Figure 4, C) yields a large number of overlapping genre changes. Finally, Robinson Crusoe (Figure 4, D) is also mostly without detected bursts, but of note is a section of religious genre, corresponding to a section of the plot where Crusoe is ill and has prophetic dreams.

[Figure 4: Sections of genre bursts calculated by the Kleinberg algorithm (panels A to D). Points are prediction probabilities, filtered to .5 or greater. Coloured shaded areas are sections of genre bursts. Because this method looks for change rather than absolute values, it ignores the main genre of the book in the first two cases (politics, and economics).]

7. Discussion and Conclusion

In this paper we have described a process for detecting sections of fuzzy and overlapping genre excerpts within individual editions. The results show that at the level of fine-grained divisions (43 sub-categories), a model which combines the tf-idf features of the full document with the features of a fine-tuned ECCO-BERT model performs significantly better than the baselines, suggesting such models may be particularly useful for these tasks. That the BERT model performed so well on fine-grained categories is significant because existing methods for studying genre have generally used very broad divisions (such as fiction and non-fiction). The kinds of questions we are interested in require more fine-grained categories, for example looking at the rise of medical textbooks among certain publishers. This kind of sequencing also has other potential uses, for example document retrieval.

On the present task, we did not observe any improvement from the Perceiver model, which we specifically included to test a character-level model capable of accounting for OCR artefacts. At present, we think this is due to a combination of two factors. Firstly, the base performance on the task is around 95% accuracy, leaving only very little headroom for improvement with more advanced models. Secondly, the task is by its nature a document-level task, and the good performance of the linear baseline demonstrates that enough information is present in the data even without explicitly accounting for OCR errors. It is therefore possible that the advantages of character-based models such as the Perceiver will be demonstrated on tasks where the correct modelling of individual word occurrences in their context plays a more significant role. These would include various text tagging and information retrieval tasks.

In our future work we hope to further develop the sequencing method and to investigate the genres in their own right, for instance looking at the sequence patterns of individual authors, the relationship between intra-book diversity and the success of particular authors or publishers, and understanding co-occurrence between genres.

References

[1] B. Baptiste, B. Favre, J. Auguste, and C. Henriot. "Transferring Modern Named Entity Recognition to the Historical Domain: How to Take the Step?" In: Workshop on Natural Language Processing for Digital Humanities (NLP4DH). 2021.
[2] B. M. Benedict. "The Paradox of the Anthology: Collecting and Différence in Eighteenth-Century Britain". In: New Literary History 34.2 (2003), pp. 231–256. url: http://www.jstor.org/stable/20057778.
[3] J. Binder. bursts: Markov Model for Bursty Behavior in Streams. 2022. url: https://CRAN.R-project.org/package=bursts.
[4] J. Feather. "British Publishing in the Eighteenth Century: a preliminary subject analysis". In: The Library s6-VIII.1 (1986), pp. 32–46. doi: 10.1093/library/s6-VIII.1.32. url: https://doi.org/10.1093/library/s6-VIII.1.32.
[5] Gale. Eighteenth Century Collections Online. url: https://www.gale.com/intl/primary-sources/eighteenth-century-collections-online.
[6] A. Goyal and V. Prem Prakash. "Statistical and Deep Learning Approaches for Literary Genre Classification". In: Advances in Data and Information Sciences. Ed. by S. Tiwari, M. C. Trivedi, M. L. Kolhe, K. Mishra, and B. K. Singh. Vol. 318. Singapore: Springer Singapore, 2022, pp. 297–305. doi: 10.1007/978-981-16-5689-7_26. url: https://link.springer.com/10.1007/978-981-16-5689-7_26.
[7] S. Gupta, M. Agarwal, and S. Jain. "Automated Genre Classification of Books Using Machine Learning and Natural Language Processing". In: 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence). Noida, India: IEEE, 2019, pp. 269–272. doi: 10.1109/confluence.2019.8776935. url: https://ieeexplore.ieee.org/document/8776935/.
[8] M. J. Hill and S. Hengchen. "Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study". In: Digital Scholarship in the Humanities 34.4 (2019), pp. 825–843. doi: 10.1093/llc/fqz024. url: https://academic.oup.com/dsh/article/34/4/825/5476122.
[9] A. Jaegle, S. Borgeaud, J.-B. Alayrac, C. Doersch, C. Ionescu, D. Ding, S. Koppula, D. Zoran, A. Brock, E. Shelhamer, O. Hénaff, M. M. Botvinick, A. Zisserman, O. Vinyals, and J. Carreira. Perceiver IO: A General Architecture for Structured Inputs & Outputs. 2021. doi: 10.48550/arxiv.2107.14795. url: https://arxiv.org/abs/2107.14795.
[10] M. Jiang, Y. Hu, G. Worthey, R. C. Dubnicek, T. Underwood, and J. S. Downie. "Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts". In: CHR. 2021, pp. 266–279.
[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
In: Proceedings of NAACL-HLT. 2019, pp. 4171–4186.
[12] J. Kleinberg. "Bursty and Hierarchical Structure in Streams". In: Data Mining and Knowledge Discovery 7.4 (2003), pp. 373–397. doi: 10.1023/a:1024940629314. url: https://doi.org/10.1023/A:1024940629314.
[13] K. Labusch, P. Kulturbesitz, C. Neudecker, and D. Zellhöfer. "BERT for named entity recognition in contemporary and historical German". In: Proceedings of the 15th Conference on Natural Language Processing. 2019, pp. 9–11.
[14] L. Lahti, E. Mäkelä, and M. Tolonen. "Quantifying Bias and Uncertainty in Historical Data Collections with Probabilistic Programming". In: (2020). url: https://helda.helsinki.fi/handle/10138/327728.
[15] L. Lahti, J. Marjanen, H. Roivainen, and M. Tolonen. "Bibliographic Data Science and the History of the Book (c. 1500–1800)". In: Cataloging & Classification Quarterly 57.1 (2019), pp. 5–23. doi: 10.1080/01639374.2018.1543747. url: https://doi.org/10.1080/01639374.2018.1543747.
[16] E. Manjavacas and L. Fonteyn. "Adapting vs. Pre-training Language Models for Historical Languages". In: Journal of Data Mining & Digital Humanities NLP4DH (2022). doi: 10.46298/jdmdh.9152. url: https://jdmdh.episciences.org/9690.
[17] J. Manson. A catalogue of the entire and genuine library and prints of Robert Salusbury Cotton, Esq. F.A.S. [electronic resource]: Comprehending an extensive and valuable collection of books of coins, medals and antiquities, with a few fine missals and other manuscripts on vellum, which, with some other select parcels of books lately purchased, are now on sale for ready money, at the price printed in the catalogue, and on the first leaf of each book, By John Manson, bookseller, No 5, Duke's-Court, St. Martin's-Lane, where catalogues (Price 6d) may be had. London, 1789. [2], 102 pages.
[18] D. Mazella, C. Willan, D. Bishop, E. Stravoski, W. Barta, and M. James. ""All the modes of story": Genre and the Gendering of Authorship in the Year 1771". In: ABO: Interactive Journal for Women in the Arts, 1640-1830 12.1 (2022). doi: 10.5038/2157-7129.12.1.1256. url: https://digitalcommons.usf.edu/abo/vol12/iss1/10.
[19] F. Moretti. Distant Reading. London; New York: Verso, 2013.
[20] M. Poovey. "Mary Wollstonecraft: The Gender of Genres in Late Eighteenth-Century England". In: NOVEL: A Forum on Fiction 15.2 (1982), pp. 111–126. url: http://www.jstor.org/stable/1345219.
[21] I. Rastas, Y. C. Ryan, I. L. I. Tiihonen, M. Qaraei, L. Repo, R. Babbar, E. Mäkelä, M. Tolonen, and F. Ginter. "Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model". In: Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics, 2022.
[22] J. Raven. The Business of Books: Booksellers and the English Book Trade, 1450-1850. New Haven: Yale University Press, 2007.
[23] I. Rivers, ed. Books and their Readers in Eighteenth-Century England. Leicester: Leicester University Press, 1982.
[24] I. Rivers, ed. Books and their Readers in Eighteenth-Century England: New Essays. London; New York: Leicester University Press, 2001.
[25] T. Schmidt, K. Dennerlein, and C. Wolff. "Emotion Classification in German Plays with Transformer-based Language Models Pretrained on Historical and Contemporary Language". In: Association for Computational Linguistics. 2021.
[26] T. Schmidt, K. Dennerlein, and C. Wolff. "Using Deep Learning for Emotion Analysis of 18th and 19th Century German Plays". In: (2021).
[27] S.
Schweter and L. März. "Triple E - Effective Ensembling of Embeddings and Language Models for NER of Historical German". In: CLEF (Working Notes). 2020.
[28] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. "On the importance of initialization and momentum in deep learning". In: International Conference on Machine Learning. PMLR, 2013, pp. 1139–1147.
[29] K. Todorov and G. Colavizza. "An Assessment of the Impact of OCR Noise on Language Models". In: arXiv preprint arXiv:2202.00470 (2022).
[30] M. Tolonen, E. Mäkelä, A. Ijaz, and L. Lahti. "Corpus Linguistics and Eighteenth Century Collections Online (ECCO)". In: Research in Corpus Linguistics 9.1 (2021), pp. 19–34. doi: 10.32714/ricl.09.01.03. url: https://ricl.aelinco.es/index.php/ricl/article/view/161.
[31] M. Tolonen, E. Mäkelä, A. Ijaz, and L. Lahti. "Corpus Linguistics and Eighteenth Century Collections Online (ECCO)". In: Research in Corpus Linguistics 9.1 (2021), pp. 19–34. doi: 10.32714/ricl.09.01.03. url: https://ricl.aelinco.es/index.php/ricl/article/view/161.
[32] M. Tolonen, E. Mäkelä, and L. Lahti. "The Anatomy of Eighteenth Century Collections Online (ECCO)". In: Eighteenth-Century Studies 56.1 (2022), pp. 95–123.
[33] T. Underwood. Distant Horizons: Digital Evidence and Literary Change. Chicago: The University of Chicago Press, 2019.
[34] T. Underwood. "Genre Theory and Historicism". In: Journal of Cultural Analytics 2.2 (2016). doi: 10.22148/16.008. url: https://culturalanalytics.org/article/11063.
[35] T. Underwood. "The Life Cycles of Genres". In: Journal of Cultural Analytics 2.2 (2016). doi: 10.22148/16.005. url: https://culturalanalytics.org/article/11061.
[36] T. Underwood. "Understanding Genre in a Collection of a Million Volumes, Interim Report". In: (2014). doi: 10.6084/m9.figshare.1281251.v1. url: https://figshare.com/articles/journal_contribution/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.
[37] T. Underwood, M. L. Black, L. Auvil, and B. Capitanu. Mapping Mutable Genres in Structurally Complex Volumes. 2013. doi: 10.1109/BigData.2013.6691676. url: http://arxiv.org/abs/1309.3323.
[38] J. Worsham and J. Kalita. "Genre Identification and the Compositional Effect of Genre in Literature". In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, 2018, pp. 1963–1973. url: https://aclanthology.org/C18-1167.
[39] H. Yoo, J. Jin, J. Son, J. Bak, K. Cho, and A. Oh. "HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea". In: Findings of the Association for Computational Linguistics: NAACL 2022. Seattle, United States: Association for Computational Linguistics, 2022, pp. 1832–1844. url: https://aclanthology.org/2022.findings-naacl.140.

A. Appendix
A.1. The main categories and sub-categories

Table 2: The main categories and sub-categories

Main categories          Sub-categories
Arts                     Fine Arts and Aesthetics; Music, hymns, songs; Theatre, plays, opera
Education                Advice literature; General Education; Recipe Books; Hobbies & Games; Instructional books
History                  Biographical History; General History
Law                      Acts, proclamations; Appeals; Collected bills; Legal essays; Proclamations; Trial accounts
Literature               Classics; Collected Works; Criticism; Drama; Novels; Other fiction; Periodicals; Poetry; Travel
Philosophy               Human understanding, metaphysics; Moral Philosophy; Political philosophy
Politics                 Political essays; Intelligence; Parliamentary speeches
Religion                 Sermons; Theology
Sales Catalogues         Sales catalogues, almanacs, directories etcetera
Scientific Improvement   Agriculture and animal husbandry; Economics and trade; Geography, cartography, astronomy and navigation; Languages; Mathematics; Medicine and anatomy; Natural history; Natural philosophy; Practical trades, mechanics, engineering