<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ming</forename><surname>Jiang</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Yuerong</forename><surname>Hu</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Glen</forename><surname>Worthey</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ryan</forename><forename type="middle">C</forename><surname>Dubnicek</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ted</forename><surname>Underwood</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Stephen</forename><surname>Downie</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">University of Illinois</orgName>
								<address>
									<settlement>Urbana-Champaign</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0FBD0CFA81A156F1739416997D9FAF6C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T19:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Optical Character Recognition</term>
					<term>BERT Resilience</term>
					<term>Word Embeddings</term>
					<term>Text Analysis</term>
					<term>Parallel Corpora</term>
					<term>HathiTrust</term>
					<term>Digital Humanities</term>
					<term>Digital Libraries</term>
					<term>Data Curation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Digital humanities (DH) scholars have been increasingly interested in using BERT for document representation in computational text analysis. However, most word embeddings, including BERT embeddings, have been developed using "clean" corpora, while DH research is usually based on digitized texts with optical character recognition (OCR) errors. Will the errors introduced by the digitization process reduce BERT's performance and distort the research findings? To shed light on the impact of OCR quality on BERT models, we conducted an empirical study on the resilience of BERT embeddings (pre-trained and fine-tuned) to OCR errors by measuring BERT's ability to enable classification of book excerpts by subject domain. We developed specialized parallel corpora for this task consisting of matching pairs of OCR'd text (19,049 volumes) and "clean" re-keyed text (4,660 volumes) from English-language books in six domains published from 1780 to 1993. This study is the first to systematically quantify OCR impact on contextualized word embedding techniques with a use case of OCR'd book datasets curated by digital libraries (DL). Experimental results show that pre-trained BERT is less robust when used on OCR'd texts; however, fine-tuning pre-trained BERT on OCR'd texts significantly improves its resilience to OCR noise in classification tasks, as measured by changes in classifier performance. These findings should assist DH scholars who are interested in using BERT for scholarly purposes.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The accessibility of ever-growing digitized textual curations in digital libraries (DL) and the rapid development of natural language processing (NLP) techniques have opened up a variety of new research opportunities to humanities scholars for computational text analysis <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. In recent years, BERT (Bidirectional Encoder Representations from Transformers) has been widely used as a fundamental text representation tool in text-based computing, for it focuses on encoding the contextual meaning of words into a vector space <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b23">24]</ref>. There are two main reasons for its popularity. First, in encoding word tokens rather than word types (i.e., distinct words), BERT is helpful in identifying the correct meaning of a homonym within its context (e.g., bank in "river bank" and "savings bank"). Second, BERT can leverage the general linguistic knowledge it has learned from a massive, high-resource corpus such as Wikipedia to serve specialized and lower-resource downstream tasks, such as movie review sentiment classification <ref type="bibr" target="#b0">[1]</ref>. [CHR 2021: Computational Humanities Research Conference, November 17-19, 2021, Amsterdam, The Netherlands. mjiang17@illinois.edu (M. Jiang); yuerong2@illinois.edu (Y. Hu); gworthey@illinois.edu (G. Worthey); rdubnic2@illinois.edu (R.C. Dubnicek); tunder@illinois.edu (T. Underwood); jdownie@illinois.edu (J.S. Downie). ORCID: 0000-0002-3604-166X (M. Jiang); 0000-0001-8375-9108 (Y. Hu); 0000-0003-2785-0040 (G. Worthey); 0000-0001-7153-7030 (R.C. Dubnicek); 0000-0001-8960-1846 (T. Underwood); 0000-0001-9784-5090 (J.S. Downie).]
So far, BERT has produced promising improvements in both (1) fundamental text analysis, e.g., text segmentation <ref type="bibr" target="#b0">[1]</ref>, named entity recognition <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b15">16]</ref>, and post-OCR correction <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b19">20]</ref>; and (2) specific research topics, e.g., historical analysis of semantic change in lexical/grammatical constructions <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b8">9]</ref>, literary genre analysis <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b3">4]</ref>, literary event detection <ref type="bibr" target="#b24">[25]</ref>, and computational narrative intelligence <ref type="bibr" target="#b22">[23]</ref>.</p><p>Digital humanities (DH) scholars working with computational analysis have been increasingly interested in using this technique for their research on digitized texts. However, a majority of large DL text curations and other historical text collections are machine-transcribed and include varying degrees of optical character recognition (OCR) noise. Such noise might decrease the generally impressive performance of BERT because it was originally developed on born-digital texts without OCR errors <ref type="bibr" target="#b6">[7]</ref>. Even though existing OCR systems have significantly improved through advances in AI techniques (e.g., image recognition) and persistent efforts of digital curators (e.g., the Library of Congress, HathiTrust Digital Library), OCR noise can hardly ever be completely eliminated given its ubiquity, its uneven distribution, and the heterogeneous nature of its source texts. Meanwhile, advanced NLP techniques like BERT are generally limited in their transparency and interpretability, a limitation that becomes even more problematic when processing OCR'd texts <ref type="bibr" target="#b16">[17]</ref>.
Such uncertainty might reduce the credibility of digital humanities research when applying BERT-based computations to OCR'd texts for further analysis.</p><p>Therefore, we believe BERT's performance on OCR'd texts is an important problem to look into. This study aims to empirically investigate this problem with three research questions:</p><p>(1) Would the original BERT model <ref type="bibr" target="#b6">[7]</ref> (pre-trained on Wikipedia and free Web books) work as well with OCR'd texts containing noise? (2) If we fine-tune the pre-trained BERT using a corpus with a certain amount of OCR noise, would this result in any improvements for processing OCR'd texts in downstream tasks? and (3) What are the quantifiable impacts of OCR quality on both pre-trained and fine-tuned BERT models?</p><p>To shed light on the interaction between OCR'd texts and BERT, we focused on measuring the ability of BERT to encode digitized texts' semantics and comparing the performance of BERT encoding on clean (i.e., re-keyed) versus OCR'd texts. The texts we used were book excerpts generated from ∼4,000 pairs of book volumes selected from a parallel corpus of digital English-language books, with 4,660 human-proofread "clean" volumes from Project Gutenberg (Gutenberg) and their matching pairs of 19,049 OCR'd volumes from HathiTrust Digital Library (HathiTrust) <ref type="bibr" target="#b11">[12]</ref>. Books in this corpus, published from 1780 to 1993, cover six subject domains. We chose subject domain classification as the downstream application for BERT in order to quantify its encoding performance, because document classification in general is a popular application for digital humanists studying subject, genre, authorship, and many other features of their texts <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b26">27]</ref>.
Specifically, we investigated both the generic embedding obtained from the pre-trained BERT model and the domain-adapted embedding by fine-tuning the pre-trained BERT on the downstream training corpus (i.e., either clean or noisy).</p><p>The remainder of this paper is organized as follows. In section 2, we review related work on BERT and OCR'd texts. In section 3, we provide detailed information about the parallel book dataset that we created and leveraged, and how we built the book excerpt corpora needed for our experiments. In section 4, we describe our research design and workflow. We also give explanations for the specific decisions made and methods adopted. In section 5, we present our experimental results and findings. Finally, in section 6, we discuss our conclusions and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>BERT used in existing work for digital history and literary studies generally plays a text preprocessing role by encoding text information into vectors for further computation. Popular research topics in this field mainly focus on the diachronic analysis of literary texts <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b29">30]</ref> and narrative understanding <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b22">23]</ref>. Regarding data sources, commonly used corpora typically come from Project Gutenberg <ref type="bibr" target="#b24">[25]</ref>, the Corpus of Historical American English <ref type="bibr" target="#b8">[9]</ref>, and OCR'd text collections organized in DL <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b17">18]</ref>. Although BERT has shown its power in representing clean texts, some empirical studies <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b5">6]</ref> have observed a drop in its performance when processing digitized texts containing OCR errors. Inspired by these findings, we are interested in advancing the understanding of BERT's applicability to noisy OCR'd texts.</p><p>Based on a literature review on OCR noise analysis, common error types include character misidentification (e.g., "inserted"→"insorted"), broken words (e.g., un-rejoined hyphenated words "talking"→"talk-ing"), incorrectly joined words (e.g., "the belief"→"thebelief"), and meaningless symbols (e.g., OCR attempts to recognize hand-written marginalia) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b7">8]</ref>.
Given the various patterns and random distribution of OCR noise, even state-of-the-art techniques for OCR correction cannot completely filter the OCR noise out.</p><p>Prior work on the impact of uncorrected OCR'd texts on other NLP tasks can be divided into two groups: (1) those quantifying impact by measuring the performance differences of a set of popular NLP techniques applied on a parallel corpus consisting of OCR'd and clean texts <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b4">5]</ref>; and (2) those analyzing OCR impact by interviewing scholar-users for their feedback on the use of digital archives and NLP techniques for computational textual analysis <ref type="bibr" target="#b28">[29]</ref>. Popular NLP tasks adopted in existing studies include tokenization, sentence segmentation, named entity recognition, dependency parsing, topic modeling, information retrieval, text classification, collocation, and authorial attribution <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b4">5]</ref>. Most studies show that OCR errors lead to a consistent negative influence on NLP tasks, even for some tasks that have been considered "solved" (e.g., sentence segmentation) <ref type="bibr" target="#b25">[26]</ref>. In this research, we extend prior work by studying the impact of OCR quality on BERT-based text representations, where we particularly explore BERT's ability to encode the intrinsic semantic features of OCR-impacted texts in comparison with its encoding of parallel clean texts.</p></div>
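The error taxonomy above can be made concrete with a small noise-injection sketch. This is illustrative only: the substitution rule, error rates, and function name are hypothetical, not drawn from the paper or from any real OCR engine.

```python
import random

def inject_ocr_noise(text, seed=0):
    # Illustrative simulation of the four OCR error types discussed
    # above; all rates and substitutions are made-up examples.
    rng = random.Random(seed)
    # 1. character misidentification, e.g. "inserted" -> "insorted"
    text = text.replace("er", "or", 1)
    noisy = []
    for w in text.split():
        r = rng.random()
        if r > 0.9 and len(w) > 5:
            # 2. broken word: un-rejoined hyphenation, "talking" -> "talk-ing"
            mid = len(w) // 2
            noisy.append(w[:mid] + "-" + w[mid:])
        elif r > 0.8 and noisy:
            # 3. incorrectly joined words: "the belief" -> "thebelief"
            noisy[-1] = noisy[-1] + w
        elif r > 0.75:
            # 4. meaningless symbols, e.g. misread marginalia
            noisy.append(w + "^*")
        else:
            noisy.append(w)
    return " ".join(noisy)

print(inject_ocr_noise("the belief that talking machines were inserted here"))
```

Injecting such synthetic noise into clean text is one common way to study model robustness when a true parallel corpus is unavailable; this paper instead uses real OCR output paired with re-keyed text.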
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data and Corpora Preparation</head><p>The source data for this study is a parallel corpus of English monographs <ref type="bibr" target="#b11">[12]</ref> collected from two real-world digital libraries: (1) Gutenberg for a human-proofread "clean" corpus; and (2) HathiTrust for an OCR'd "noisy" corpus. This corpus has a total of 4,660 Gutenberg volumes in 6 domains (i.e., fiction, social science, agriculture, medicine, business, world war history), each of which is matched with several different copies (4 on average) of the same work held in HathiTrust.</p><p>Since classification is a supervised learning task, we started by preparing three parallel data splits from the raw corpus for training, validation, and testing, respectively. Considering the many-to-one matching relationship between HathiTrust and Gutenberg volumes, and in order to keep the clean and OCR'd versions of each data split aligned by volume while avoiding volume duplication within the clean splits, we first split the Gutenberg data by randomly selecting 10% of the 4,660 Gutenberg volumes for validation (465 volumes), 10% for testing (467 volumes), and the rest for training. Then we randomly picked one paired HathiTrust copy of each Gutenberg volume to build the corresponding training, validation, and testing splits of OCR'd texts.</p><p>Following <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b20">21]</ref>, we note that, in addition to text quality, data distribution and downstream corpus size also influence an embedding's encoding ability, especially for the fine-tuned BERT embedding.
Taking these two variables into consideration, we modified the original parallel training split by resampling the data into three types of parallel training corpora: (1) a small balanced corpus (SB) containing 1000 books with an equal number of books per genre; (2) a small unbalanced corpus (SU) containing 1000 books with a different number of books per genre; and (3) a large unbalanced corpus (LU) containing 3000 books with a different number of books per genre. Table <ref type="table" target="#tab_0">1</ref> shows the details of each type of training corpus. Given the highly skewed data distribution in the original parallel corpus (e.g., fiction volumes comprise 88%) <ref type="bibr" target="#b11">[12]</ref>, our unbalanced corpora were generated by slightly smoothing the distribution with the exponentially smoothed weighting method <ref type="bibr" target="#b9">[10]</ref>, where we empirically set the smoothing factor to 0.3.</p><p>There are two main challenges in the encoding of book content by BERT. First, book-length texts and the computational cost of BERT models make it expensive to encode each volume's full text. Moreover, BERT models are restricted to processing at most 512 tokens at a time, which limits their encoding abilities on long sentences. To address these issues, we followed prior work <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref> by parsing the full content per volume into a set of word sequences with at most n tokens and randomly sampled k continuous word sequences as a text chunk to feed into BERT. Referring to prior studies' parameter settings and our own hardware computing constraints, we set n = 128 and k = 15 (∼1920 tokens per chunk).
Recent studies on subject domain and genre classification <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref> show that book chunks should be sufficient for predicting an entire book's subject, and with this premise, we decided to focus on parallel book excerpts for our study. Although this method could not process complete volumes, the random sampling strategy is helpful in augmenting the book content to be trained or tested as much as possible, which compensates for the limits on text length.</p><p>To make each classifier's predictions on clean versus OCR'd test sets comparable, the sampled text chunks from each pair of test volumes were aligned by an existing text alignment algorithm <ref type="bibr" target="#b32">[33]</ref>. We manually examined a random sample of chunk pairs to ensure alignment accuracy. Furthermore, for a statistical significance test of the classification results, we grouped all the sampled chunk pairs into a set of parallel testing folds. In the end, our parallel testing corpus consists of 20 parallel testing folds, where each parallel fold contains one unique pair of text chunks extracted from a pair of Gutenberg and HathiTrust volumes (20 × 467 = 9340 parallel examples in total).</p></div>
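The excerpt-construction step described above (word sequences of at most n = 128 tokens, k = 15 contiguous sequences per chunk) can be sketched as follows. Whitespace tokenization and the function name are our simplifying assumptions; the paper does not specify its tokenizer at this stage.

```python
import random

N_TOKENS = 128   # max tokens per word sequence (from the paper)
K_SEQS = 15      # contiguous sequences per chunk (~1920 tokens)

def sample_chunk(volume_text, n=N_TOKENS, k=K_SEQS, rng=random):
    # Split the volume into word sequences of at most n tokens,
    # then sample k contiguous sequences as one excerpt/chunk.
    tokens = volume_text.split()
    seqs = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    if len(seqs) >= k:
        start = rng.randrange(len(seqs) - k + 1)
        window = seqs[start:start + k]
    else:
        window = seqs  # short volume: take everything available
    return [tok for seq in window for tok in seq]

chunk = sample_chunk("word " * 5000)
print(len(chunk))  # at most n * k = 1920 tokens
```

Because the start position is drawn at random, repeated sampling from the same volume yields different excerpts, which is what lets the method augment the trainable book content despite the length limit.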
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Research Design and Workflow</head><p>The primary goal of this study is to analyze the performance of BERT embeddings in encoding book excerpts into n D-dimensional (D=768) token vectors for book domain classification based on the parallel clean and OCR'd texts. We measured and compared BERT embeddings' encoding ability in different classifiers using macro-averaged precision (P), recall (R), and F1 score (F1). Considering the potential influence of experimental settings on BERT embeddings' performance, we analyzed the classification outcomes based on the model settings and data characteristics, respectively. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the overall research workflow.</p></div>
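Macro-averaged precision, recall, and F1 can be computed from scratch as below; this is equivalent in spirit to scikit-learn's classification_report with macro averaging, and the toy labels in the usage example are hypothetical.

```python
def macro_prf1(y_true, y_pred):
    # Per-class precision/recall/F1, then an unweighted average
    # across classes (macro averaging), so small domains count as
    # much as large ones.
    labels = sorted(set(y_true) | set(y_pred))
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec)
        rs.append(rec)
        fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

y_true = ["fiction", "fiction", "medicine", "business"]
y_pred = ["fiction", "medicine", "medicine", "fiction"]
p, r, f = macro_prf1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.389
```

Macro averaging is the natural choice here given the skewed domain distribution noted in section 3: a micro-averaged score would be dominated by fiction.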
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Domain Classifier Construction</head><p>With the encoded BERT token representations per excerpt, we first generate a single chunk-level feature vector by averaging token vectors, a standard practice in prior work <ref type="bibr" target="#b21">[22]</ref>. The detailed implementation of model training is as follows. We used the Adam optimizer <ref type="bibr" target="#b14">[15]</ref> to train all classification models with 20 epochs<ref type="foot" target="#foot_0">1</ref>. As to the learning rate, for pre-trained BERT-based classifiers, we set this parameter as 2.0e-3 for the Gutenberg corpus and 2.5e-3 for the HathiTrust corpus respectively, while for fine-tuned classifiers, we set both to 2.5e-5. We tuned this parameter empirically, selecting the value that gave the best classifier performance on the validation set. The batch size was set as 40 (book excerpts) for all the models.</p></div>
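The pooling step and training configuration above can be sketched minimally. The pure-Python mean pooling stands in for the actual tensor operation, and the config dict simply records the hyperparameters stated in the text; all names are ours.

```python
def chunk_feature(token_vectors):
    # Average the D-dimensional (D=768 in the paper) BERT token
    # vectors of one excerpt into a single chunk-level feature vector.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Hyperparameters as reported in section 4.1 (dict layout is ours).
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "epochs": 20,
    "batch_size": 40,  # book excerpts per batch
    "lr": {
        ("pre-trained", "gutenberg"): 2.0e-3,
        ("pre-trained", "hathitrust"): 2.5e-3,
        ("fine-tuned", "gutenberg"): 2.5e-5,
        ("fine-tuned", "hathitrust"): 2.5e-5,
    },
}

print(chunk_feature([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Note the two-orders-of-magnitude gap between the pre-trained and fine-tuned learning rates: when the encoder itself is being updated, a much smaller step size is needed to avoid destroying the pre-trained weights.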
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Analysis of BERT Encoding on Clean Versus OCR'd Texts</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Model-based measurement</head><p>Based on the classification results of 12 generated classifiers on our parallel testing corpus, we analyzed the relations among BERT embedding types (i.e., pre-trained or fine-tuned BERT), the source of training and testing data, and the sampling strategy of training corpora by pairwise comparison of any two of three variables. Our goals were: (1) finding the optimal BERT embedding with the highest resilience against OCR errors; and (2) identifying the optimal sampling strategy for building the training corpus that most significantly improves the BERT embedding performance.</p><p>Given that the above analysis primarily focused on the comparison of BERT-based classifiers' overall performance, we further proposed a fine-grained investigation of BERT embeddings' resilience to OCR errors regarding the amount of noise. To conduct this investigation, we first prepared three subsets of OCR'd testing data containing different amounts of OCR errors. The level of OCR noise was measured by the character-level error rate (CER) based on the comparison of each OCR'd book excerpt with its paired clean text. After sorting the OCR'd excerpts by their CER in ascending order, we sampled 1500 excerpts each from the top, middle, and bottom of the ranked list to form the low-, medium-, and high-noise testing subsets. Figure 2 displays the distribution of CER in each testing subset, where the average CER per subset is around 0.40, 0.54, and 0.65, respectively. We then evaluated each classifier's predictive performance on each subset. Note that, in this analysis, we only considered those classifiers trained on the corpus with the identified optimal sampling strategy.
To further examine the resilience of BERT embeddings to changes in the source of the downstream training corpus, rather than exploring each classifier's results individually, we measured, for each type of BERT embedding, the divergence between the classification results of the classifier trained on clean texts and the one trained on OCR'd texts.</p></div>
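CER as used above can be implemented as a normalized Levenshtein edit distance between an OCR'd excerpt and its aligned clean text. Normalizing by the clean text's length is one common convention and is our assumption here, since the paper does not spell out its exact formula.

```python
def cer(ocr_text, clean_text):
    # Character error rate: Levenshtein distance between the OCR'd
    # and clean strings, divided by the clean string's length.
    # Standard dynamic-programming edit distance, O(len(a)*len(b)).
    a, b = ocr_text, clean_text
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost  # substitution/match
                           ))
        prev = cur
    return prev[-1] / len(b)

print(cer("th8 b00k", "the book"))  # 3 substitutions / 8 characters = 0.375
```

Under this definition the reported subset averages of 0.40, 0.54, and 0.65 correspond to roughly 4 to 6.5 character edits per 10 clean characters, which conveys how heavily degraded even the "low-noise" subset is.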
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Content-based measurement</head><p>Although each book in the raw parallel corpus was assigned to a single subject domain tag, given the diversity of content-based characteristics (e.g., topics, genres, narrative styles) inherent in a book-sized text and its randomly sampled excerpts, it is possible that the input data itself might make it difficult for a BERT-based classifier to identify the annotated domain tag. Moreover, whether and how such challenges with OCR'd texts differ from those with clean texts is uncertain. For instance, if all BERT-based classifiers fail to classify either clean or OCR'd excerpts of the same book correctly, one potential reason for this result could be that the original book includes more than one subject. In contrast, if all classification models work well on the clean texts only, it is likely that OCR noise is resulting in different predictions. To address these concerns, we started by exploring semantic associations among misclassified domains by visualizing the confusion matrix of each classifier. To further capture book excerpts' individual features for understanding their influence on classification, we then grouped the predictions made per classifier on individual excerpts by book, to measure the consistency of classifiers' prediction accuracy at the book level. This measurement is based on calculating the number of testing excerpts of the same book that were assigned to the correct domain across different classifiers on average. Given the quantitative outcomes, we sampled some cases with poor prediction accuracy, and explored potential reasons for misclassification by close reading of the book content.</p></div>
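The book-level consistency measure can be sketched as follows. The input layout (classifier name mapped to per-excerpt triples) and the function name are assumptions for illustration, not the paper's actual data structures.

```python
from collections import defaultdict

def book_level_accuracy(predictions):
    # `predictions` maps classifier name to a list of
    # (book_id, true_domain, predicted_domain) triples, one per test
    # excerpt. Returns each book's excerpt accuracy pooled across
    # all classifiers, i.e. the cross-classifier consistency measure.
    per_book = defaultdict(list)
    for clf, rows in predictions.items():
        for book_id, true_dom, pred_dom in rows:
            per_book[book_id].append(1.0 if pred_dom == true_dom else 0.0)
    return {b: sum(hits) / len(hits) for b, hits in per_book.items()}

preds = {
    "pretrained_G": [("b1", "fiction", "fiction"), ("b2", "medicine", "social")],
    "finetuned_H":  [("b1", "fiction", "fiction"), ("b2", "medicine", "medicine")],
}
print(book_level_accuracy(preds))  # {'b1': 1.0, 'b2': 0.5}
```

Books whose pooled accuracy stays low in both the clean and OCR'd versions are the candidates for the close-reading step, since their errors likely stem from content rather than OCR noise.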
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Outcomes and Findings</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Resilience of BERT embeddings</head><p>Table <ref type="table" target="#tab_3">2</ref> provides an overview of the classification results grouped by (1) source of training and testing data (Gutenberg or HathiTrust); (2) sampling strategy of parallel training corpus (small-balanced, small-unbalanced and large-unbalanced); and (3) type of BERT embedding (pre-trained or fine-tuned). Overall, we observe that classifiers built with fine-tuned BERT outperformed those built with pre-trained BERT by 20% (F1 score) based on the balanced training corpora and 10% (F1 score) based on the unbalanced training corpora. This result indicates that the fine-tuning process, intended to adapt the generic BERT embedding space to fit a specific text corpus (either clean or OCR'd), will substantially improve the encoding ability of BERT for digitized literary texts even with the distortion of OCR noise.</p><p>Regarding the influence of training sampling strategies on BERT encoding, in general, unbalanced corpora were more helpful in training classifiers than balanced corpora, which suggests that excessive artificial intervention in the training data distribution can indeed hurt BERT's encoding ability. Table <ref type="table" target="#tab_4">3</ref> further shows the paired t-test scores of the statistical difference in performance between any two comparable classifiers that differ only in either size or data distribution of the training corpus. Note that the differences between any two compared classifiers' performances over 20 testing folds follow an approximately normal distribution based on the Shapiro-Wilk test. According to the results, pre-trained BERT-based classifiers are all sensitive to both size and data distribution in the training corpus (p-value &lt; 0.05 at least). However, the increase in size of the OCR'd training corpus has no significant impact on the fine-tuned BERT embedding.
This observation may be understood as a positive signal to humanities scholars that a small training corpus is enough to achieve optimal performance of fine-tuned BERT when working with OCR'd texts. Comparatively, training corpus size (t-test score from -0.71 to 3.32 where p-value &lt; 0.01 at most) is less influential on BERT embeddings' performance than is training data distribution (t-test score from 2.05 to 15.54 where the majority of p-values &lt; 0.001).</p><p>Similar to the analysis of training sampling strategies, we compared classifiers' performance with respect to the source of training data. Table <ref type="table" target="#tab_5">4</ref> shows the paired t-test results. Pre-trained BERT-based classifiers were significantly more sensitive to their training data source when these classifiers were built on unbalanced training corpora (p-value tends to be &lt; 0.001). In particular, the growth of training corpus size increased such sensitivity (t-test score increased from 4.09*** to 5.85 when testing on the clean corpus, and from 3.49** to 4.31*** when testing on the OCR'd corpus). Meanwhile, for fine-tuned BERT, classifiers showed sensitivity to the source of training data only with small unbalanced training corpora (t-test score was -2.86** when testing on the clean corpus, and -2.10* when testing on the OCR'd corpus). According to the F1 score of these classifiers' prediction results shown in Table <ref type="table" target="#tab_3">2</ref>, we found that, compared with fine-tuning on clean texts, fine-tuning on OCR'd texts improved BERT-based classifiers' performance by ∼2%, which suggests that potential OCR noise in the small-unbalanced corpus for BERT fine-tuning can boost the resulting embedding's encoding performance.</p></div>
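The paired t-tests over the 20 testing folds reduce to the following statistic; scipy.stats.ttest_rel computes the same t together with a p-value. The toy fold scores in the usage example are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    # Paired t statistic for per-fold F1 scores of two classifiers
    # evaluated on the same parallel testing folds:
    # t = mean(d) / (stdev(d) / sqrt(n)), d = per-fold differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical per-fold F1 scores for two classifiers.
t = paired_t([0.80, 0.82, 0.81, 0.79], [0.75, 0.78, 0.74, 0.76])
print(t)  # roughly 5.56
```

Pairing by fold is what makes the test valid here: each fold contains the same chunk pairs for both classifiers, so fold-to-fold difficulty cancels out of the differences. The approximate normality of those differences (checked above with the Shapiro-Wilk test) is the assumption the t-test rests on.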
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Impact of the amount of OCR noise on BERT encoding</head><p>Given three testing sample sets with different levels of OCR noise (see details of data preparation in section 4.2.1), Table <ref type="table" target="#tab_6">5</ref> shows the divergence of F1 score between classifiers built with either pre-trained or fine-tuned BERT embeddings on each sample set. This divergence was calculated by subtracting the classification results of the classifier trained on OCR'd texts from those of the classifier trained on clean texts. Overall, we found that classifiers obtained greater benefit from clean training data compared with OCR'd data, except in the case of fine-tuned BERT-based classifiers making predictions on the low-noise testing data. Regarding the classification divergence across the three testing sample sets, we observed a gradual decrease in difference on testing samples with low (4.88%), medium (3.96%), and high (0.70%) levels of OCR noise when classifiers employed pre-trained BERT for text encoding, while the pattern was the opposite in classifiers built with fine-tuned BERT (i.e., -1.96% for the low-noise group, 1.43% for the medium-noise group, and 3.79% for the high-noise group). We further compared the absolute differences of classification results between the two classifiers per embedding type, and found that testing samples with lower-level OCR noise were more sensitive to the training data source than those with higher-level noise in pre-trained BERT-based classifiers. On the contrary, for the classifiers built with fine-tuned BERT, the largest performance difference was found in the testing set with a high amount of OCR noise.</p><p>We draw three major conclusions. First, consistency of text quality across an embedding's pre-training corpus, downstream training corpus, and downstream testing corpus is helpful in improving pre-trained BERT's applicability for literary text classification.
Second, the heterogeneous nature of OCR noise can improve the generalization ability of fine-tuned embeddings to process texts with comparatively low levels of OCR noise. Finally, fine-tuned BERT-based classifiers are more stable with regard to changes in the source of training corpus than pre-trained BERT-based classifiers, which further confirms that fine-tuned BERT outperforms pre-trained BERT in its resilience to OCR errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Error analysis by content-based measurement</head><p>Figure <ref type="figure" target="#fig_3">3</ref> shows eight confusion matrix heatmaps for the eight classifiers trained on the large unbalanced corpora. In each matrix, the diagonal values in comparatively darker blue cells represent the ratio of correct predictions, while the other values indicate the ratio of misclassifications (actual vs. predicted). The higher the value, the darker the corresponding cell. For example, in the first matrix (fine-tuned, G→G), the value "0.45%" in the cell at the upper left corner indicates that 0.45% of "world war history" excerpts were misclassified as "agriculture" by the fine-tuned BERT-based classifier, which was trained and tested on Gutenberg texts. For both pre-trained and fine-tuned BERT-based classifications, we found that book excerpts in the business domain were more likely to be misclassified as fiction (25.4% on average) and social science (19.8% on average), while book excerpts in the medicine domain were more likely to be mistakenly classified as social science, especially with fine-tuned BERT-based classifiers trained on the OCR'd texts (32.86% misclassifications in H→G classification and 27.86% misclassifications in H→H). By looking more closely at social-science instances, we observed that the pattern of misclassifications was different in the classifier built with pre-trained BERT compared with that built with fine-tuned BERT. Specifically, in the classifications using pre-trained BERT for text encoding, prediction errors mainly concentrated in the domains of business (10% on average), medicine (8.5% on average), and fiction (7.5% on average).
Meanwhile, for fine-tuned BERT-based classification, fiction (17% on average) and medicine (11% on average) were the top two misclassifications for social-science excerpts.</p><p>Comparing prediction errors with respect to the source of data for training and testing, we found that the pattern of misclassification in fine-tuned BERT-based classifications tended to be similar across all four types of classification. However, the ratio of errors per domain in pre-trained BERT-based classifications tended to differ depending on the source of the classifiers' training corpus. For example, business instances tended to be misclassified as fiction (25%-28%) when the training corpus was clean, but as social science (23%-27%) when OCR'd texts were used for training. Similarly, medicine instances had a markedly higher ratio of misclassification as social science (27.89%-32.86%) with the OCR'd training corpus compared with the clean one (11.43%-16.43%). These observations reaffirm that fine-tuned BERT is more robust for processing OCR'd texts than pre-trained BERT.</p><p>We further looked into the prediction consistency of all BERT-based classifiers on each book in both its clean and OCR'd versions. Given two aligned lists (i.e., clean and OCR'd) of book-level average prediction accuracy across different classifiers, we found a large overlap of books with comparatively low accuracy in the clean versus the OCR'd corpus, which suggests that content-based characteristics of these particular books may be the main cause of recurring prediction mistakes. We verified this hypothesis by manually checking the books with the lowest prediction scores, and confirmed that these books had heterogeneous genre-related features that were confusing even for human readers. For instance, the book The Story of My Life by Helen Keller is generally considered a classic "social science" work because of its main subject and its many non-fiction features. 
However, it is also a classic autobiography, first published in 1903, composed of touching stories of a woman struggling with severe disability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G -&gt; G G -&gt; H H -&gt; G H -&gt; H</head><p>Therefore, it is less surprising and even understandable for the models to label its instances as "medicine" or "fiction" based on their learning of the training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>We have investigated the resilience of pre-trained and fine-tuned BERT embeddings for encoding OCR'd texts through a case study of classifying book excerpts into subject domains. To the best of our knowledge, this is the first empirical study to systematically quantify the influence of OCR quality on BERT. By varying the BERT embedding type and classification model settings, we built 12 BERT-based classifiers using book excerpt corpora extracted from a large parallel book corpus of aligned clean and OCR'd volumes sourced from two well-known digital libraries. Our analysis shows that the original BERT embedding, pre-trained on born-digital texts, is not resilient to OCR noise, at least as measured by classification accuracy. However, fine-tuning the pre-trained BERT on OCR'd texts significantly improves BERT's resilience to OCR noise, and hence will benefit downstream applications. Moreover, fine-tuned BERT outperforms the pre-trained one in its encoding stability with regard to changes in training corpus size and training data source. For both types of BERT embedding, unbalanced training corpora benefit the embeddings' resilience to OCR noise in downstream classifications. Our findings suggest that DH scholars should consider employing fine-tuned BERT for scholarly research based on digitized texts, particularly when that research involves document classification.</p><p>While our experiments yield significantly positive evidence for fine-tuned BERT embeddings' resilience to OCR noise in the use case of document classification, the impact of OCR noise on BERT for other downstream tasks remains under-investigated. For example, it is possible that BERT could react to OCR noise differently at more fine-grained levels, such as sentence-level tasks (e.g., next sentence prediction, sentence-based sentiment analysis) and word-level tasks (e.g., part-of-speech tagging). 
Therefore, future work focusing on BERT's performance on</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>w 1 ,</head><label>1</label><figDesc>w 2 , …, w n A clean or OCR'd book excerpt Text Encoding Classifier Construction Model-based Measurement ➔ BERT embedding types ➔ Sampling strategy of training corpora ➔ Source of training / testing data Content-based Measurement ➔ Book characteristics (e.g., genre, topics)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of study workflow</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of the amount of OCR noise measured by CER in three sampled testing subsets. Each set contains 1500 examples.</figDesc><graphic coords="7,203.15,54.92,238.98,140.56" type="bitmap" /></figure>
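Figure 2 bins the testing subsets by character error rate (CER). A standard way to compute CER, and a reasonable assumption for how it is measured here, is the Levenshtein edit distance between the aligned clean and OCR'd texts, normalized by the length of the clean reference; a minimal stdlib sketch:

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if equal)
        prev = curr
    return prev[-1]

def cer(clean, ocr):
    """Character error rate: edit distance normalized by clean-text length."""
    return levenshtein(clean, ocr) / max(len(clean), 1)

# Toy OCR confusions: 'rn' read as 'm', 'l' read as '1'
# -> 3 edits over 12 reference characters = 0.25
score = cer("modern novel", "modem nove1")
```

Higher scores would place an excerpt in the "high noise" subset of Figure 2.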
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Confusion matrices of classification models built on the large-unbalanced training corpora. Labels "A", "B", "F", "M", "W", "S" represent "agriculture", "business", "fiction", "medicine", "world war history" and "social science".</figDesc><graphic coords="11,40.61,35.48,682.78,281.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Statistics of three parallel training corpora</figDesc><table><row><cell></cell><cell>Fiction</cell><cell>Social_Science</cell><cell>Agriculture</cell><cell>World_War_History</cell><cell>Medicine</cell><cell>Business</cell><cell>Total</cell></row><row><cell>Small_Balanced(SB)</cell><cell>167</cell><cell>167</cell><cell>167</cell><cell>167</cell><cell>166</cell><cell>166</cell><cell>1000</cell></row><row><cell>Small_Unbalanced(SU)</cell><cell>355</cell><cell>152</cell><cell>148</cell><cell>130</cell><cell>122</cell><cell>93</cell><cell>1000</cell></row><row><cell>Large_Unbalanced(LU)</cell><cell>1164</cell><cell>423</cell><cell>409</cell><cell>341</cell><cell>359</cell><cell>304</cell><cell>3000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>, for further excerpt classification. With 2 types of BERT embedding, 3 types of training data sampling, and 2 aligned training corpora, this study built 12 classifiers in total. Considering that our primary goal is to explore BERT embeddings' resilience against OCR errors rather than to improve classification performance, we employed a basic three-layer multilayer perceptron for building classifiers. During training, the model was fed the set of training examples and learned a weighting matrix for predicting each example's probability of mapping into each domain class, and each example was assigned to the domain with the highest probability. Following the standard practice of applying deep learning techniques for classification<ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b29">30]</ref>, our model was optimized with a cross-entropy loss function during training to maximize predictive performance (i.e., F1 score). To compare the consistency of predictions with and without OCR errors, we considered two types of classification: (1) training and testing corpora are both clean or both noisy (i.e., containing OCR errors); and (2) one is clean and the other is noisy.</figDesc><table /></figure>
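As a sketch of the training objective described above: the classifier maps an excerpt's logits to a probability per domain via softmax, and the cross-entropy loss penalizes the negative log-probability of the gold domain. The logit values below are illustrative, not taken from the paper:

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, gold_index):
    """Negative log-probability of the gold domain class."""
    return -math.log(softmax(logits)[gold_index])

# Six domain classes as in the paper; logits are hypothetical classifier outputs
logits = [0.2, 1.5, 3.0, 0.1, -0.5, 0.7]   # A, B, F, M, W, S
probs = softmax(logits)
pred = probs.index(max(probs))              # argmax -> predicted domain
loss = cross_entropy(logits, gold_index=2)  # gold label: fiction (index 2)
```

Minimizing this loss over the training examples is what yields the weighting matrix described in the passage above.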
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Classification results on three training corpora (%). P, R, and F1 denote precision, recall, and F1 score, respectively. All evaluation indicators are at macro-level and represent the average value of results over 20 folds of testing samples. The highest F1 score per classification strategy in each training setting is highlighted in bold.</figDesc><table><row><cell></cell><cell></cell><cell cols="3">G→G</cell><cell cols="3">G→H</cell><cell cols="3">H→G</cell><cell cols="3">H→H</cell></row><row><cell></cell><cell></cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell></row><row><cell>SB</cell><cell>Pre-trained</cell><cell>49.88</cell><cell>71.67</cell><cell>53.24</cell><cell>49.97</cell><cell>70.44</cell><cell>52.64</cell><cell>47.14</cell><cell>74.50</cell><cell>53.33</cell><cell>46.97</cell><cell>73.25</cell><cell>53.05</cell></row><row><cell></cell><cell>Fine-tuned</cell><cell>69.06</cell><cell>79.75</cell><cell>72.65</cell><cell>70.00</cell><cell>79.17</cell><cell>72.99</cell><cell>68.07</cell><cell>79.06</cell><cell>71.54</cell><cell>68.93</cell><cell>78.07</cell><cell>71.70</cell></row><row><cell>SU</cell><cell>Pre-trained</cell><cell>70.31</cell><cell>67.05</cell><cell>66.28</cell><cell>70.67</cell><cell>64.68</cell><cell>65.48</cell><cell>60.24</cell><cell>66.38</cell><cell>60.89</cell><cell>62.32</cell><cell>65.01</cell><cell>60.98</cell></row><row><cell></cell><cell>Fine-tuned</cell><cell>75.23</cell><cell>77.71</cell><cell>74.39</cell><cell>75.25</cell><cell>77.50</cell><cell>74.71</cell><cell>75.79</cell><cell>78.83</cell><cell>76.27</cell><cell>74.94</cell><cell>79.45</cell><cell>76.20</cell></row><row><cell>LU</cell><cell>Pre-trained</cell><cell>64.30</cell><cell>74.16</cell><cell>67.71</cell><cell>65.79</cell><cell>72.69</cell><cell>67.88</cell><cell>59.59</cell><cell>73.44</cell><cell>64.38</cell><cell>60.17</cell><cell>72.82</cell><cell>64.86</cell></row><row><cell></cell><cell>Fine-tuned</cell><cell>76.02</cell><cell>79.51</cell><cell>76.60</cell><cell>75.71</cell><cell>79.78</cell><cell>76.71</cell><cell>74.60</cell><cell>80.33</cell><cell>76.10</cell><cell>73.86</cell><cell>80.01</cell><cell>75.72</cell></row></table></figure>
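Table 2's macro-level precision, recall, and F1 average per-class scores with equal weight, so minority domains (e.g., business) count as much as fiction. A minimal sketch with toy labels (not the paper's data):

```python
def macro_prf(actual, predicted, labels):
    """Macro-averaged precision, recall, F1: per-class scores averaged
    with equal weight, regardless of class frequency."""
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(a == c and p == c for a, p in zip(actual, predicted))
        fp = sum(a != c and p == c for a, p in zip(actual, predicted))
        fn = sum(a == c and p != c for a, p in zip(actual, predicted))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Toy example over three of the six domains
actual    = ["F", "F", "S", "S", "B"]
predicted = ["F", "S", "S", "S", "B"]
p, r, f1 = macro_prf(actual, predicted, ["F", "S", "B"])
```

Averaging such scores over the 20 testing folds would give values in the form reported in Table 2.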
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Paired t-tests showing the differences in classification results across training strategies. Statistical significance is represented by p-value (one-tailed), where *p &lt; 0.05, **p &lt; 0.01, and ***p &lt; 0.001.</figDesc><table><row><cell></cell><cell cols="2">G→G</cell><cell cols="2">G→H</cell><cell cols="2">H→G</cell><cell cols="2">H→H</cell></row><row><cell></cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell></row><row><cell>SU vs. SB</cell><cell>15.54***</cell><cell>2.33*</cell><cell>11.06***</cell><cell>2.05*</cell><cell>7.76***</cell><cell>6.44***</cell><cell>7.87***</cell><cell>5.07***</cell></row><row><cell>LU vs. SU</cell><cell>1.85*</cell><cell>2.65**</cell><cell>2.42*</cell><cell>1.99*</cell><cell>3.32**</cell><cell>-0.22</cell><cell>2.94**</cell><cell>-0.71</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Paired t-tests showing the differences in classification results across training data sources. Statistical significance is shown by p-value (one-tailed), where *p &lt; 0.05, **p &lt; 0.01, and ***p &lt; 0.001.</figDesc><table><row><cell></cell><cell cols="2">SB</cell><cell cols="2">SU</cell><cell cols="2">LU</cell></row><row><cell></cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell></row><row><cell>G→G vs. H→G</cell><cell>-0.13</cell><cell>1.41</cell><cell>4.09***</cell><cell>-2.86**</cell><cell>5.85***</cell><cell>0.73</cell></row><row><cell>G→H vs. H→H</cell><cell>-0.68</cell><cell>1.37</cell><cell>3.49**</cell><cell>-2.10*</cell><cell>4.31***</cell><cell>1.17</cell></row></table></figure>
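The t statistics in Tables 3 and 4 compare matched per-fold results of two settings. A stdlib sketch of the paired t statistic over hypothetical per-fold F1 scores (the one-tailed p-value would then come from a t distribution with n-1 degrees of freedom):

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic over matched folds (e.g., 20 testing folds):
    t = mean(d) / (stdev(d) / sqrt(n)), where d is the per-fold difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-fold F1 scores for two training strategies (not the paper's)
su = [74.1, 75.0, 73.8, 74.6, 75.2, 74.4]
sb = [71.9, 72.5, 72.1, 71.6, 72.8, 72.0]
t = paired_t(su, sb)   # large positive t -> SU consistently beats SB
```

A negative t (as in some fine-tuned cells of Table 3) simply means the first setting scored lower than the second on average.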
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5</head><label>5</label><figDesc>Divergence of classification results (F1 score) by changing the training corpus source from clean to OCR'd texts on three testing sample sets with different levels of OCR error.</figDesc><table><row><cell></cell><cell>Low Noisy</cell><cell>Medium Noisy</cell><cell>High Noisy</cell></row><row><cell>LU Pre-trained</cell><cell>4.88%</cell><cell>3.96%</cell><cell>0.70%</cell></row><row><cell>LU Fine-tuned</cell><cell>-1.96%</cell><cell>1.43%</cell><cell>3.79%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The number of epochs was optimized empirically by trying a set of values (i.e., 15, 20, 30, 50).</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>OCR'd texts both at different text granularities and for different downstream NLP tasks would be useful to deepen our understanding of how OCR impacts this contextualized embedding technology. Furthermore, since our corpora consist exclusively of English-language books from the 18th and 19th centuries, expanding this study to curated datasets from other historical periods, languages, and publication types would be a very worthwhile future exercise.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">DocBERT: BERT for document classification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Adhikari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.08398</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluating the stability of embedding-based word similarities</title>
		<author>
			<persName><forename type="first">M</forename><surname>Antoniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mimno</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="107" to="119" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Assessing the Impact of OCR Errors in Information Retrieval</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">T</forename><surname>Bazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Lorentz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Vargas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Moreira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of European Conference on Information Retrieval</title>
				<meeting>European Conference on Information Retrieval</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="102" to="109" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Language Models &amp; Literary Clichés: Analyzing North Korean Poetry with BERT</title>
		<author>
			<persName><surname>Ben</surname></persName>
		</author>
		<ptr target="https://digitalnk.com/blog/2020/10/01/language-models-literary-cliches-analyzing-north-korean-poetry-with-bert/" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Impact of OCR errors on the use of digital libraries: Towards a better access to information</title>
		<author>
			<persName><forename type="first">G</forename><surname>Chiron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Visani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-P</forename><surname>Moreux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</title>
				<meeting>2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Cuba Gyllensten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gogoulou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ekgren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sahlgren</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.semeval-1.12" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth Workshop on Semantic Evaluation</title>
				<meeting>the Fourteenth Workshop on Semantic Evaluation<address><addrLine>Barcelona</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="112" to="118" />
		</imprint>
	</monogr>
	<note>International Committee for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Classification and distribution of optical character recognition errors</title>
		<author>
			<persName><forename type="first">J</forename><surname>Esakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Lopresti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Sandberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Society for Optics and Photonics</title>
				<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">2181</biblScope>
			<biblScope unit="page" from="204" to="216" />
		</imprint>
	</monogr>
	<note>Document Recognition</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">What about Grammar? Using BERT Embeddings to Explore Functional-Semantic Shifts of Semi-Lexical and Grammatical Constructions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Fonteyn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Computational Humanities Research</title>
				<meeting>the Workshop on Computational Humanities Research</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="257" to="268" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Exponential smoothing: The state of the art</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Gardner</surname><genName>Jr</genName></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Forecasting</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="28" />
			<date type="published" when="1985">1985</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hengchen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="825" to="843" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The Gutenberg-HathiTrust parallel corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Worthey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Dubnicek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Capitanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kudeki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Downie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">iConference</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>Poster</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Macroanalysis: Digital methods and literary history</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Jockers</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>University of Illinois Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kanjirangat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mitrovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Antonucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rinaldi</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.semeval-1.26" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth Workshop on Semantic Evaluation</title>
				<meeting>the Fourteenth Workshop on Semantic Evaluation<address><addrLine>Barcelona</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="214" to="221" />
		</imprint>
	</monogr>
	<note>International Committee for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">BERT for Named Entity Recognition in Contemporary and Historical German</title>
		<author>
			<persName><forename type="first">K</forename><surname>Labusch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kulturbesitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Neudecker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zellhöfer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference on Natural Language Processing</title>
				<meeting>the 15th Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://aclanthology.org/W19-4800" />
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Chrupała</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hupkes</surname></persName>
		</editor>
		<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Leveraging contextual embeddings for detecting diachronic semantic shift</title>
		<author>
			<persName><forename type="first">M</forename><surname>Martinc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pollak</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.01072</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Quantitative analysis of culture using millions of digitized books</title>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Aiden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Pickett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hoiberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Clancy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Orwant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">331</biblScope>
			<biblScope unit="page" from="176" to="182" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Neural machine translation with BERT for post-OCR error detection and correction</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T H</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N.-V</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM/IEEE Joint Conference on Digital Libraries</title>
				<meeting>the ACM/IEEE Joint Conference on Digital Libraries</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="333" to="336" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Performance analysis for classification in balanced and unbalanced data set</title>
		<author>
			<persName><forename type="first">S</forename><surname>Padma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Manavalan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Conference on Industrial and Information Systems</title>
				<meeting>the 6th International Conference on Industrial and Information Systems</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="300" to="304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Document Embedding Techniques</title>
		<author>
			<persName><forename type="first">S</forename><surname>Palachy</surname></persName>
		</author>
		<ptr target="https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d%5C#ecd3" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Counterfactual Story Reasoning and Generation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bosselut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="5043" to="5053" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection</title>
		<author>
			<persName><forename type="first">D</forename><surname>Schlechtweg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McGillivray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hengchen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Dubossarsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tahmasebi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth Workshop on Semantic Evaluation</title>
				<meeting>the Fourteenth Workshop on Semantic Evaluation<address><addrLine>Barcelona</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="23" />
		</imprint>
	</monogr>
	<note>International Committee for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Literary Event Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sims</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bamman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3623" to="3634" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Assessing the Impact of OCR Quality on Downstream NLP Tasks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Van Strien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Beelen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Ardanuy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McGillivray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Agents and Artificial Intelligence</title>
				<meeting>the 12th International Conference on Agents and Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="484" to="496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Text analysis using deep neural networks in digital humanities and information science</title>
		<author>
			<persName><forename type="first">O</forename><surname>Suissa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elmalech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhitomirsky-Geffet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Association for Information Science and Technology</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Todorova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Computational Humanities Research</title>
				<meeting>the Workshop on Computational Humanities Research</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="310" to="339" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Impact analysis of OCR quality on research tasks in digital archives</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Traub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Van Ossenbruggen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hardman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Theory and Practice of Digital Libraries</title>
				<meeting>International Conference on Theory and Practice of Digital Libraries</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="252" to="263" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Do humanists need BERT?</title>
		<author>
			<persName><forename type="first">T</forename><surname>Underwood</surname></persName>
		</author>
		<ptr target="https://tedunderwood.com/category/methodology/genre-comparison/" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Genre Identification and the Compositional Effect of Genre in Literature</title>
		<author>
			<persName><forename type="first">J</forename><surname>Worsham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kalita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th International Conference on Computational Linguistics</title>
				<meeting>the 27th International Conference on Computational Linguistics<address><addrLine>Santa Fe, New Mexico, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1963" to="1973" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Towards literary genre identification: Applied neural networks for large text classification</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Worsham</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>University of Colorado Colorado Springs</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">A fast alignment scheme for automatic OCR evaluation of books</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">Z</forename><surname>Yalniz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Manmatha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 2011 International Conference on Document Analysis and Recognition</title>
				<meeting>2011 International Conference on Document Analysis and Recognition</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="754" to="758" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">An evaluation of text classification methods for literary study</title>
		<author>
			<persName><forename type="first">B</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="327" to="343" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
