<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ming</forename><surname>Jiang</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Yuerong</forename><surname>Hu</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Glen</forename><surname>Worthey</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ryan</forename><forename type="middle">C</forename><surname>Dubnicek</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ted</forename><surname>Underwood</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Stephen</forename><surname>Downie</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">University of Illinois</orgName>
								<address>
									<settlement>Urbana-Champaign</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0FBD0CFA81A156F1739416997D9FAF6C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T19:43+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Optical Character Recognition</term>
					<term>BERT Resilience</term>
					<term>Word Embeddings</term>
					<term>Text Analysis</term>
					<term>Parallel Corpora</term>
					<term>HathiTrust</term>
					<term>Digital Humanities</term>
					<term>Digital Libraries</term>
					<term>Data Curation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Digital humanities (DH) scholars have been increasingly interested in using BERT for document representation in computational text analysis. However, most word embeddings, including BERT embeddings, have been developed using "clean" corpora, while DH research is usually based on digitized texts with optical character recognition (OCR) errors. Will the errors introduced by the digitization process reduce BERT's performance and distort the research findings? To shed light on the impact of OCR quality on BERT models, we conducted an empirical study on the resilience of BERT embeddings (pre-trained and fine-tuned) to OCR errors by measuring BERT's ability to enable classification of book excerpts by subject domain. We developed specialized parallel corpora for this task consisting of matching pairs of OCR'd text (19,049 volumes) and "clean" re-keyed text (4,660 volumes) from English-language books in six domains published from 1780 to 1993. This study is the first to systematically quantify OCR impact on contextualized word embedding techniques with a use case of OCR'd book datasets curated by digital libraries (DL). Experimental results show that pre-trained BERT is less robust when used on OCR'd texts; however, fine-tuning pre-trained BERT on OCR'd texts significantly improves its resilience to OCR noise in classification tasks, as measured by changes in classifier performance. These findings should assist DH scholars who are interested in using BERT for scholarly purposes.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The accessibility of ever-growing digitized textual curations in digital libraries (DL) and the rapid development of natural language processing (NLP) techniques have opened up a variety of new research opportunities to humanities scholars for computational text analysis <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. In recent years, BERT (Bidirectional Encoder Representations from Transformers) has been widely used as a fundamental text representation tool in text-based computing, for it focuses on encoding the contextual meaning of words into a vector space <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b23">24]</ref>. There are two main reasons for its popularity. First, in encoding word tokens rather than word types (i.e., distinct words), BERT is helpful in identifying the correct meaning of a homonym within its context (e.g., bank in "river bank" and "savings bank"). Second, BERT can leverage the general linguistic knowledge it has learned from a massive, high-resource corpus such as Wikipedia to serve specialized and lower-resource downstream tasks, such as movie review sentiment classification <ref type="bibr" target="#b0">[1]</ref>. [CHR 2021: Computational Humanities Research Conference, November 17-19, 2021, Amsterdam, The Netherlands. mjiang17@illinois.edu (M. Jiang); yuerong2@illinois.edu (Y. Hu); gworthey@illinois.edu (G. Worthey); rdubnic2@illinois.edu (R.C. Dubnicek); tunder@illinois.edu (T. Underwood); jdownie@illinois.edu (J.S. Downie). ORCID: 0000-0002-3604-166X (M. Jiang); 0000-0001-8375-9108 (Y. Hu); 0000-0003-2785-0040 (G. Worthey); 0000-0001-7153-7030 (R.C. Dubnicek); 0000-0001-8960-1846 (T. Underwood); 0000-0001-9784-5090 (J.S. Downie).]
So far, BERT has produced promising improvements in both (1) fundamental text analysis, e.g., text segmentation <ref type="bibr" target="#b0">[1]</ref>, named entity recognition <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b15">16]</ref>, and post-OCR correction <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b19">20]</ref>; and (2) specific research topics, e.g., historical analysis of semantic change in lexical/grammatical constructions <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b8">9]</ref>, literary genre analysis <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b3">4]</ref>, literary event detection <ref type="bibr" target="#b24">[25]</ref>, and computational narrative intelligence <ref type="bibr" target="#b22">[23]</ref>.</p><p>Digital humanities (DH) scholars working with computational analysis have been increasingly interested in using this technique for their research on digitized texts. However, a majority of large DL text curations and other historical text collections are machine-transcribed and include varying degrees of optical character recognition (OCR) noise. Such noise might decrease the generally impressive performance of BERT because it was originally developed on born-digital texts without OCR errors <ref type="bibr" target="#b6">[7]</ref>. Even though existing OCR systems have significantly improved through advances in AI techniques (e.g., image recognition) and persistent efforts of digital curators (e.g., the Library of Congress, HathiTrust Digital Library), OCR noise can hardly ever be completely eliminated given its ubiquity, its uneven distribution, and the heterogeneous nature of its source texts. Meanwhile, advanced NLP techniques like BERT are generally limited in their transparency and interpretability, a limitation that becomes even more problematic when processing OCR'd texts <ref type="bibr" target="#b16">[17]</ref>.
Such uncertainty might reduce the credibility of digital humanities research when applying BERT-based computations to OCR'd texts for further analysis.</p><p>Therefore, we believe BERT's performance on OCR'd texts is an important problem to look into. This study aims to empirically investigate this problem with three research questions:</p><p>(1) Would the original BERT model <ref type="bibr" target="#b6">[7]</ref> (pre-trained on Wikipedia and free Web books) work as well with OCR'd texts containing noise? (2) If we fine-tune the pre-trained BERT using a corpus with a certain amount of OCR noise, would this result in any improvements for processing OCR'd texts in downstream tasks? and (3) What are the quantifiable impacts of OCR quality on both pre-trained and fine-tuned BERT models?</p><p>To shed light on the interaction between OCR'd texts and BERT, we focused on measuring the ability of BERT to encode digitized texts' semantics and comparing the performance of BERT encoding on clean (i.e., re-keyed) versus OCR'd texts. The texts we used were book excerpts generated from ∼4,000 pairs of book volumes selected from a parallel corpus of digital English-language books, with 4,660 human-proofread "clean" volumes from Project Gutenberg (Gutenberg) and their matching pairs of 19,049 OCR'd volumes from HathiTrust Digital Library (HathiTrust) <ref type="bibr" target="#b11">[12]</ref>. Books in this corpus, published from 1780 to 1993, cover six subject domains. We chose subject domain classification as the downstream application for BERT in order to quantify its encoding performance, because document classification in general is a popular application for digital humanists studying subject, genre, authorship, and many other features of their texts <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b26">27]</ref>.
Specifically, we investigated both the generic embedding obtained from the pre-trained BERT model and the domain-adapted embedding by fine-tuning the pre-trained BERT on the downstream training corpus (i.e., either clean or noisy).</p><p>The remainder of this paper is organized as follows. In section 2, we review related work on BERT and OCR'd texts. In section 3, we provide detailed information about the parallel book dataset that we created and leveraged, and how we built the book excerpt corpora needed for our experiments. In section 4, we describe our research design and workflow. We also give explanations for the specific decisions made and methods adopted. In section 5, we present our experimental results and findings. Finally, in section 6, we discuss our conclusions and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>BERT used in existing work for digital history and literary studies generally plays a text preprocessing role by encoding text information into vectors for further computation. Popular research topics in this field mainly focus on the diachronic analysis of literary texts <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b29">30]</ref> and narrative understanding <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b22">23]</ref>. Regarding data sources, commonly used corpora typically come from Project Gutenberg <ref type="bibr" target="#b24">[25]</ref>, the Corpus of Historical American English <ref type="bibr" target="#b8">[9]</ref>, and OCR'd text collections organized in DL <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b17">18]</ref>. Although BERT has shown its power in representing clean texts, some empirical studies <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b5">6]</ref> have observed a drop in its performance when processing digitized texts containing OCR errors. Inspired by these findings, we are interested in advancing the understanding of BERT's applicability to noisy OCR'd texts.</p><p>Based on a literature review on OCR noise analysis, common error types include character misidentification (e.g., "inserted"→"insorted"), broken words (e.g., un-rejoined hyphenated words "talking"→"talk-ing"), incorrectly joined words (e.g., "the belief"→"thebelief"), and meaningless symbols (e.g., OCR attempts to recognize hand-written marginalia) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b7">8]</ref>.
Given the various patterns and random distribution of OCR noise, even state-of-the-art techniques for OCR correction cannot completely filter the OCR noise out.</p><p>Prior work on the impact of uncorrected OCR'd texts on other NLP tasks can be divided into two groups: (1) those quantifying impact by measuring the performance differences of a set of popular NLP techniques applied on a parallel corpus consisting of OCR'd and clean texts <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b4">5]</ref>; and (2) those analyzing OCR impact by interviewing scholar-users for their feedback on the use of digital archives and NLP techniques for computational textual analysis <ref type="bibr" target="#b28">[29]</ref>. Popular NLP tasks adopted in existing studies include tokenization, sentence segmentation, named entity recognition, dependency parsing, topic modeling, information retrieval, text classification, collocation, and authorial attribution <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b4">5]</ref>. Most studies show that OCR errors lead to a consistent negative influence on NLP tasks, even for some tasks that have been considered "solved" (e.g., sentence segmentation) <ref type="bibr" target="#b25">[26]</ref>. In this research, we extend prior work by studying the impact of OCR quality on BERT-based text representations, where we particularly explore BERT's ability to encode the intrinsic semantic features of OCR-impacted texts in comparison with its encoding of parallel clean texts.</p></div>
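The error taxonomy above can be made concrete with a small noise-injection sketch. This is illustrative only: the substitution rule, error rates, and function name are hypothetical, not drawn from the paper or from any real OCR engine.

```python
import random

def inject_ocr_noise(text, seed=0):
    # Illustrative simulation of the four OCR error types discussed
    # above; all rates and substitutions are made-up examples.
    rng = random.Random(seed)
    # 1. character misidentification, e.g. "inserted" -> "insorted"
    text = text.replace("er", "or", 1)
    noisy = []
    for w in text.split():
        r = rng.random()
        if r > 0.9 and len(w) > 5:
            # 2. broken word: un-rejoined hyphenation, "talking" -> "talk-ing"
            mid = len(w) // 2
            noisy.append(w[:mid] + "-" + w[mid:])
        elif r > 0.8 and noisy:
            # 3. incorrectly joined words: "the belief" -> "thebelief"
            noisy[-1] = noisy[-1] + w
        elif r > 0.75:
            # 4. meaningless symbols, e.g. misread marginalia
            noisy.append(w + "^*")
        else:
            noisy.append(w)
    return " ".join(noisy)

print(inject_ocr_noise("the belief that talking machines were inserted here"))
```

Injecting such synthetic noise into clean text is one common way to study model robustness when a true parallel corpus is unavailable; this paper instead uses real OCR output paired with re-keyed text.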
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data and Corpora Preparation</head><p>The source data for this study is a parallel corpus of English monographs <ref type="bibr" target="#b11">[12]</ref> collected from two real-world digital libraries: (1) Gutenberg for a human-proofread "clean" corpus; and (2) HathiTrust for an OCR'd "noisy" corpus. This corpus has a total of 4,660 Gutenberg volumes in 6 domains (i.e., fiction, social science, agriculture, medicine, business, world war history), each of which is matched with several different copies (4 on average) of the same work held in HathiTrust.</p><p>Since classification is a supervised learning task, we started by preparing three parallel data splits from the raw corpus for training, validation, and testing, respectively. Considering the many-to-one matching relationship between HathiTrust and Gutenberg volumes, and in order to keep the clean and OCR'd versions of each data split aligned by volume while avoiding volume duplication within the clean splits, we first split the Gutenberg data by randomly selecting 10% of the 4,660 Gutenberg volumes for validation (465 volumes), 10% for testing (467 volumes), and the rest for training. Then we randomly picked one paired HathiTrust copy of each Gutenberg volume to build the corresponding training, validation, and testing splits of OCR'd texts.</p><p>Following <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b20">21]</ref>, we note that, in addition to text quality, data distribution and downstream corpus size also influence an embedding's encoding ability, especially for the fine-tuned BERT embedding.
Taking these two variables into consideration, we modified the original parallel training split by resampling the data into three types of parallel training corpora: (1) a small balanced corpus (SB) containing 1000 books with an equal number of books per genre; (2) a small unbalanced corpus (SU) containing 1000 books with a different number of books per genre; and (3) a large unbalanced corpus (LU) containing 3000 books with a different number of books per genre. Table <ref type="table" target="#tab_0">1</ref> shows the details of each type of training corpus. Given the highly skewed data distribution in the original parallel corpus (e.g., fiction volumes comprise 88%) <ref type="bibr" target="#b11">[12]</ref>, our unbalanced corpora were generated by slightly smoothing the distribution with the exponentially smoothed weighting method <ref type="bibr" target="#b9">[10]</ref>, where we empirically set the smoothing factor to 0.3.</p><p>There are two main challenges in the encoding of book content by BERT. First, book-length texts and the computational cost of BERT models make it expensive to encode each volume's full text. Moreover, BERT models are restricted to processing at most 512 tokens at a time, which limits their encoding abilities on long sentences. To address these issues, we followed prior work <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref> by parsing the full content per volume into a set of word sequences with at most n tokens and randomly sampled k continuous word sequences as a text chunk to feed into BERT. Referring to prior studies' parameter settings and our own hardware computing constraints, we set n = 128 and k = 15 (∼1920 tokens per chunk).
Recent studies on subject domain and genre classification <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32]</ref> show that book chunks should be sufficient for predicting an entire book's subject, and with this premise, we decided to focus on parallel book excerpts for our study. Although this method could not process complete volumes, the random sampling strategy is helpful in augmenting the book content to be trained or tested as much as possible, which compensates for the limits on text length.</p><p>To make each classifier's predictions on clean versus OCR'd test sets comparable, the sampled text chunks from each pair of test volumes were aligned by an existing text alignment algorithm <ref type="bibr" target="#b32">[33]</ref>. We manually examined a random sample of chunk pairs to ensure alignment accuracy. Furthermore, for a statistical significance test of the classification results, we grouped all the sampled chunk pairs into a set of parallel testing folds. In the end, our parallel testing corpus consists of 20 parallel testing folds, where each parallel fold contains one unique pair of text chunks extracted from a pair of Gutenberg and HathiTrust volumes (20 × 467 = 9340 parallel examples in total).</p></div>
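The excerpt-construction step described above (word sequences of at most n = 128 tokens, k = 15 contiguous sequences per chunk) can be sketched as follows. Whitespace tokenization and the function name are our simplifying assumptions; the paper does not specify its tokenizer at this stage.

```python
import random

N_TOKENS = 128   # max tokens per word sequence (from the paper)
K_SEQS = 15      # contiguous sequences per chunk (~1920 tokens)

def sample_chunk(volume_text, n=N_TOKENS, k=K_SEQS, rng=random):
    # Split the volume into word sequences of at most n tokens,
    # then sample k contiguous sequences as one excerpt/chunk.
    tokens = volume_text.split()
    seqs = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    if len(seqs) >= k:
        start = rng.randrange(len(seqs) - k + 1)
        window = seqs[start:start + k]
    else:
        window = seqs  # short volume: take everything available
    return [tok for seq in window for tok in seq]

chunk = sample_chunk("word " * 5000)
print(len(chunk))  # at most n * k = 1920 tokens
```

Because the start position is drawn at random, repeated sampling from the same volume yields different excerpts, which is what lets the method augment the trainable book content despite the length limit.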
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Research Design and Workflow</head><p>The primary goal of this study is to analyze the performance of BERT embeddings in encoding book excerpts into n D-dimensional (D=768) token vectors for book domain classification based on the parallel clean and OCR'd texts. We measured and compared BERT embeddings' encoding ability in different classifiers using macro-averaged precision (P), recall (R), and F1 score (F1). Considering the potential influence of experimental settings on BERT embeddings' performance, we analyzed the classification outcomes based on the model settings and data characteristics, respectively. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the overall research workflow.</p></div>
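Macro-averaged precision, recall, and F1 can be computed from scratch as below; this is equivalent in spirit to scikit-learn's classification_report with macro averaging, and the toy labels in the usage example are hypothetical.

```python
def macro_prf1(y_true, y_pred):
    # Per-class precision/recall/F1, then an unweighted average
    # across classes (macro averaging), so small domains count as
    # much as large ones.
    labels = sorted(set(y_true) | set(y_pred))
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p == lab)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lab and p == lab)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec)
        rs.append(rec)
        fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

y_true = ["fiction", "fiction", "medicine", "business"]
y_pred = ["fiction", "medicine", "medicine", "fiction"]
p, r, f = macro_prf1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.389
```

Macro averaging is the natural choice here given the skewed domain distribution noted in section 3: a micro-averaged score would be dominated by fiction.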
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Domain Classifier Construction</head><p>With the encoded BERT token representations per excerpt, we first generate a single chunk-level feature vector by averaging token vectors, a standard practice in prior work <ref type="bibr" target="#b21">[22]</ref>. The detailed implementation of model training is as follows. We used the Adam optimizer <ref type="bibr" target="#b14">[15]</ref> to train all classification models with 20 epochs<ref type="foot" target="#foot_0">1</ref>. As to the learning rate, for pre-trained BERT-based classifiers, we set this parameter as 2.0e-3 for the Gutenberg corpus and 2.5e-3 for the HathiTrust corpus respectively, while for fine-tuned classifiers, we set both to 2.5e-5. We tuned this parameter empirically, selecting the value that gave the best classifier performance on the validation set. The batch size was set as 40 (book excerpts) for all the models.</p></div>
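The pooling step and training configuration above can be sketched minimally. The pure-Python mean pooling stands in for the actual tensor operation, and the config dict simply records the hyperparameters stated in the text; all names are ours.

```python
def chunk_feature(token_vectors):
    # Average the D-dimensional (D=768 in the paper) BERT token
    # vectors of one excerpt into a single chunk-level feature vector.
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Hyperparameters as reported in section 4.1 (dict layout is ours).
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "epochs": 20,
    "batch_size": 40,  # book excerpts per batch
    "lr": {
        ("pre-trained", "gutenberg"): 2.0e-3,
        ("pre-trained", "hathitrust"): 2.5e-3,
        ("fine-tuned", "gutenberg"): 2.5e-5,
        ("fine-tuned", "hathitrust"): 2.5e-5,
    },
}

print(chunk_feature([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Note the two-orders-of-magnitude gap between the pre-trained and fine-tuned learning rates: when the encoder itself is being updated, a much smaller step size is needed to avoid destroying the pre-trained weights.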
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Analysis of BERT Encoding on Clean Versus OCR'd Texts</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Model-based measurement</head><p>Based on the classification results of 12 generated classifiers on our parallel testing corpus, we analyzed the relations among BERT embedding types (i.e., pre-trained or fine-tuned BERT), the source of training and testing data, and the sampling strategy of training corpora by pairwise comparison of any two of three variables. Our goals were: (1) finding the optimal BERT embedding with the highest resilience against OCR errors; and (2) identifying the optimal sampling strategy for building the training corpus that most significantly improves the BERT embedding performance.</p><p>Given that the above analysis primarily focused on the comparison of BERT-based classifiers' overall performance, we further proposed a fine-grained investigation of BERT embeddings' resilience to OCR errors regarding the amount of noise. To conduct this investigation, we first prepared three subsets of OCR'd testing data containing different amounts of OCR errors. The level of OCR noise was measured by the character-level error rate (CER) based on the comparison of each OCR'd book excerpt with its paired clean text. After sorting the OCR'd excerpts by their CER in ascending order, we sampled 1500 excerpts each from the top, middle, and bottom of the ranked list to form the low-, medium-, and high-noise testing subsets. Figure 2 displays the distribution of CER in each testing subset, where the average CER per subset is around 0.40, 0.54, and 0.65, respectively. We then evaluated each classifier's predictive performance on each subset. Note that, in this analysis, we only considered those classifiers trained on the corpus with the identified optimal sampling strategy.
To further examine the resilience of BERT embeddings to changes in the source of the downstream training corpus, rather than exploring each classifier's results individually, we measured, for each type of BERT embedding, the divergence between the classification results of the classifier trained on clean texts and the one trained on OCR'd texts.</p></div>
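CER as used above can be implemented as a normalized Levenshtein edit distance between an OCR'd excerpt and its aligned clean text. Normalizing by the clean text's length is one common convention and is our assumption here, since the paper does not spell out its exact formula.

```python
def cer(ocr_text, clean_text):
    # Character error rate: Levenshtein distance between the OCR'd
    # and clean strings, divided by the clean string's length.
    # Standard dynamic-programming edit distance, O(len(a)*len(b)).
    a, b = ocr_text, clean_text
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + cost  # substitution/match
                           ))
        prev = cur
    return prev[-1] / len(b)

print(cer("th8 b00k", "the book"))  # 3 substitutions / 8 characters = 0.375
```

Under this definition the reported subset averages of 0.40, 0.54, and 0.65 correspond to roughly 4 to 6.5 character edits per 10 clean characters, which conveys how heavily degraded even the "low-noise" subset is.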
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Content-based measurement</head><p>Although each book in the raw parallel corpus was assigned to a single subject domain tag, given the diversity of content-based characteristics (e.g., topics, genres, narrative styles) inherent in a book-sized text and its randomly sampled excerpts, it is possible that the input data itself might make it difficult for a BERT-based classifier to identify the annotated domain tag. Moreover, whether and how such challenges with OCR'd texts differ from those with clean texts is uncertain. For instance, if all BERT-based classifiers fail to classify either clean or OCR'd excerpts of the same book correctly, one potential reason for this result could be that the original book includes more than one subject. In contrast, if all classification models work well on the clean texts only, it is likely that OCR noise is resulting in different predictions. To address these concerns, we started by exploring semantic associations among misclassified domains by visualizing the confusion matrix of each classifier. To further capture book excerpts' individual features for understanding their influence on classification, we then grouped the predictions made per classifier on individual excerpts by book, to measure the consistency of classifiers' prediction accuracy at the book level. This measurement is based on calculating the number of testing excerpts of the same book that were assigned to the correct domain across different classifiers on average. Given the quantitative outcomes, we sampled some cases with poor prediction accuracy, and explored potential reasons for misclassification by close reading of the book content.</p></div>
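The book-level consistency measure can be sketched as follows. The input layout (classifier name mapped to per-excerpt triples) and the function name are assumptions for illustration, not the paper's actual data structures.

```python
from collections import defaultdict

def book_level_accuracy(predictions):
    # `predictions` maps classifier name to a list of
    # (book_id, true_domain, predicted_domain) triples, one per test
    # excerpt. Returns each book's excerpt accuracy pooled across
    # all classifiers, i.e. the cross-classifier consistency measure.
    per_book = defaultdict(list)
    for clf, rows in predictions.items():
        for book_id, true_dom, pred_dom in rows:
            per_book[book_id].append(1.0 if pred_dom == true_dom else 0.0)
    return {b: sum(hits) / len(hits) for b, hits in per_book.items()}

preds = {
    "pretrained_G": [("b1", "fiction", "fiction"), ("b2", "medicine", "social")],
    "finetuned_H":  [("b1", "fiction", "fiction"), ("b2", "medicine", "medicine")],
}
print(book_level_accuracy(preds))  # {'b1': 1.0, 'b2': 0.5}
```

Books whose pooled accuracy stays low in both the clean and OCR'd versions are the candidates for the close-reading step, since their errors likely stem from content rather than OCR noise.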
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Outcomes and Findings</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Resilience of BERT embeddings</head><p>Table <ref type="table" target="#tab_3">2</ref> provides an overview of the classification results grouped by (1) source of training and testing data (Gutenberg or HathiTrust); (2) sampling strategy of parallel training corpus (small-balanced, small-unbalanced and large-unbalanced); and (3) type of BERT embedding (pre-trained or fine-tuned). Overall, we observe that classifiers built with fine-tuned BERT outperformed those built with pre-trained BERT by 20% (F1 score) based on the balanced training corpora and 10% (F1 score) based on the unbalanced training corpora. This result indicates that the fine-tuning process, intended to adapt the generic BERT embedding space to fit a specific text corpus (either clean or OCR'd), will substantially improve the encoding ability of BERT for digitized literary texts even with the distortion of OCR noise.</p><p>Regarding the influence of training sampling strategies on BERT encoding, in general, unbalanced corpora were more helpful in training classifiers than balanced corpora, which suggests that excessive artificial intervention in the training data distribution can indeed hurt BERT's encoding ability. Table <ref type="table" target="#tab_4">3</ref> further shows the paired t-test scores of the statistical difference in performance between any two comparable classifiers that differ only in either size or data distribution of the training corpus. Note that the differences between any two compared classifiers' performances over 20 testing folds follow an approximately normal distribution based on the Shapiro-Wilk test. According to the results, pre-trained BERT-based classifiers are all sensitive to both size and data distribution in the training corpus (p-value &lt; 0.05 at least). However, the increase in size of the OCR'd training corpus has no significant impact on the fine-tuned BERT embedding.
This observation may be understood as a positive signal to humanities scholars that a small training corpus is enough to achieve optimal performance of fine-tuned BERT when working with OCR'd texts. Comparatively, training corpus size (t-test score from -0.71 to 3.32 where p-value &lt; 0.01 at most) is less influential on BERT embeddings' performance than is training data distribution (t-test score from 2.05 to 15.54 where the majority of p-values &lt; 0.001).</p><p>Similar to the analysis of training sampling strategies, we compared classifiers' performance with respect to the source of training data. Table <ref type="table" target="#tab_5">4</ref> shows the paired t-test results. Pre-trained BERT-based classifiers were significantly more sensitive to their training data source when these classifiers were built on unbalanced training corpora (p-value tends to be &lt; 0.001). In particular, the growth of training corpus size increased such sensitivity (t-test score increased from 4.09*** to 5.85 when testing on the clean corpus, and from 3.49** to 4.31*** when testing on the OCR'd corpus). Meanwhile, for fine-tuned BERT, classifiers showed sensitivity to the source of training data only with small unbalanced training corpora (t-test score was -2.86** when testing on the clean corpus, and -2.10* when testing on the OCR'd corpus). According to the F1 score of these classifiers' prediction results shown in Table <ref type="table" target="#tab_3">2</ref>, we found that, compared with fine-tuning on clean texts, fine-tuning on OCR'd texts improved BERT-based classifiers' performance by ∼2%, which suggests that potential OCR noise in the small-unbalanced corpus for BERT fine-tuning can boost the resulting embedding's encoding performance.</p></div>
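The paired t-tests over the 20 testing folds reduce to the following statistic; scipy.stats.ttest_rel computes the same t together with a p-value. The toy fold scores in the usage example are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    # Paired t statistic for per-fold F1 scores of two classifiers
    # evaluated on the same parallel testing folds:
    # t = mean(d) / (stdev(d) / sqrt(n)), d = per-fold differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))

# Hypothetical per-fold F1 scores for two classifiers.
t = paired_t([0.80, 0.82, 0.81, 0.79], [0.75, 0.78, 0.74, 0.76])
print(t)  # roughly 5.56
```

Pairing by fold is what makes the test valid here: each fold contains the same chunk pairs for both classifiers, so fold-to-fold difficulty cancels out of the differences. The approximate normality of those differences (checked above with the Shapiro-Wilk test) is the assumption the t-test rests on.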
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Impact of the amount of OCR noise on BERT encoding</head><p>Given three testing sample sets with different levels of OCR noise (see details of data preparation in section 4.2.1), Table <ref type="table" target="#tab_6">5</ref> shows the divergence of F1 score between classifiers built with either pre-trained or fine-tuned BERT embeddings on each sample set. This divergence was calculated by subtracting the classification results of the classifier trained on OCR'd texts from those of the classifier trained on clean texts. Overall, we found that classifiers obtained greater benefit from clean training data compared with OCR'd data, except in the case of fine-tuned BERT-based classifiers making predictions on the low-noise testing data. Regarding the classification divergence across the three testing sample sets, we observed a gradual decrease in difference on testing samples with low (4.88%), medium (3.96%), and high (0.70%) levels of OCR noise when classifiers employed pre-trained BERT for text encoding, while the pattern was the opposite in classifiers built with fine-tuned BERT (i.e., -1.96% for the low-noise group, 1.43% for the medium-noise group, and 3.79% for the high-noise group). We further compared the absolute differences of classification results between the two classifiers per embedding type, and found that testing samples with lower-level OCR noise were more sensitive to the training data source than those with higher-level noise in pre-trained BERT-based classifiers. On the contrary, for the classifiers built with fine-tuned BERT, the largest performance difference was found in the testing set with a high amount of OCR noise.</p><p>We draw three major conclusions. First, consistency of text quality across an embedding's pre-training corpus, downstream training corpus, and downstream testing corpus is helpful in improving pre-trained BERT's applicability for literary text classification.
Second, the heterogeneous nature of OCR noise can improve the generalization ability of fine-tuned embeddings to process texts with comparatively low levels of OCR noise. Finally, fine-tuned BERT-based classifiers are more stable with regard to changes in the source of training corpus than pre-trained BERT-based classifiers, which further confirms that fine-tuned BERT outperforms pre-trained BERT in its resilience to OCR errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Error analysis by content-based measurement</head><p>Figure <ref type="figure" target="#fig_3">3</ref> shows eight confusion matrix heatmaps for the eight classifiers trained on the large unbalanced corpora. In each matrix, the diagonal values in comparatively darker blue cells represent the ratio of correct predictions, while the other values indicate the ratio of misclassifications (actual vs. predicted). The higher the value, the darker the corresponding cell. For example, in the first matrix (fine-tuned, G→G), the value "0.45%" in the cell at the upper left corner indicates that 0.45% of "world war history" excerpts were misclassified as "agriculture" by the fine-tuned BERT-based classifier, which was trained and tested on Gutenberg texts. For both pre-trained and fine-tuned BERT-based classifications, we found that book excerpts in the business domain were more likely to be misclassified as fiction (25.4% on average) and social science (19.8% on average), while book excerpts in the medicine domain were more likely to be mistakenly classified as social science, especially with fine-tuned BERT-based classifiers trained on the OCR'd texts (32.86% misclassifications in H→G classification and 27.86% misclassifications in H→H). By looking more closely at social-science instances, we observed that the pattern of misclassifications was different in the classifier built with pre-trained BERT compared with that built with fine-tuned BERT. Specifically, in the classifications using pre-trained BERT for text encoding, prediction errors mainly concentrated in the domains of business (10% on average), medicine (8.5% on average), and fiction (7.5% on average).
Meanwhile, for fine-tuned BERT-based classification, fiction (17% on average) and medicine (11% on average) were the top two misclassifications for social-science excerpts.</p><p>Comparing prediction errors with respect to the source of data for training and testing, we found that the pattern of misclassification in fine-tuned BERT-based classifications tended to be similar across all four types of classification. However, the ratio of errors per domain in pre-trained BERT-based classifications tended to differ depending on the source of the classifiers' training corpus. For example, business instances tended to be misclassified as fiction (25%-28%) when the training corpus was clean, but as social science (23%-27%) when OCR'd texts were used for training. Similarly, medicine instances had a markedly higher ratio of misclassification as social science (27.89%-32.86%) with the OCR'd training corpus compared with the clean one (11.43%-16.43%). These observations reaffirm that fine-tuned BERT is more robust for processing OCR'd texts than pre-trained BERT.</p><p>We further looked into the prediction consistency of all BERT-based classifiers on each book in both its clean and OCR'd versions. Given two aligned lists (i.e., clean and OCR'd) of book-level average prediction accuracy across different classifiers, we found a large overlap of books with comparatively low accuracy in the clean versus the OCR'd corpus, which suggests that content-based characteristics of these particular books may be the main cause of recurring prediction mistakes. We verified this hypothesis by manually checking the books with the lowest prediction scores, and confirmed that these books had heterogeneous genre-related features that were confusing even for human readers. For instance, the book The Story of My Life by Helen Keller is generally considered a classic "social science" work because of its main subject and its many non-fiction features. 
However, it is also a classic autobiography, first published in 1903, composed of touching stories of a woman struggling with severe disability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G -&gt; G G -&gt; H H -&gt; G H -&gt; H</head><p>Therefore, it is less surprising and even understandable for the models to label its instances as "medicine" or "fiction" based on their learning of the training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>We have investigated the resilience of pre-trained and fine-tuned BERT embeddings for encoding OCR'd texts through a case study of classifying book excerpts into subject domains. To the best of our knowledge, this is the first empirical study to systematically quantify the influence of OCR quality on BERT. By varying the BERT embedding type and classification model settings, we built 12 BERT-based classifiers using book excerpt corpora extracted from a large parallel book corpus of aligned clean and OCR'd volumes sourced from two well-known digital libraries. Our analysis shows that the original BERT embedding, pre-trained on born-digital texts, is not resilient to OCR noise, at least as measured by classification accuracy. However, fine-tuning the pre-trained BERT on OCR'd texts significantly improves BERT's resilience to OCR noise, and hence will benefit downstream applications. Moreover, fine-tuned BERT outperforms the pre-trained one in its encoding stability with regard to changes in training corpus size and training data source. For both types of BERT embedding, unbalanced training corpora benefit the embeddings' resilience to OCR noise in downstream classifications. Our findings suggest that DH scholars should consider employing fine-tuned BERT for scholarly research based on digitized texts, particularly when that research involves document classification.</p><p>While our experiments yield significantly positive evidence for fine-tuned BERT embeddings' resilience to OCR noise in the use case of document classification, the impact of OCR noise on BERT for other downstream tasks remains under-investigated. For example, it is possible that BERT could react to OCR noise differently at more fine-grained levels, such as sentence-level tasks (e.g., next sentence prediction, sentence-based sentiment analysis) and word-level tasks (e.g., part-of-speech tagging). 
Therefore, future work focusing on BERT's performance on</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>w 1 ,</head><label>1</label><figDesc>w 2 , …, w n A clean or OCR'd book excerpt Text Encoding Classifier Construction Model-based Measurement ➔ BERT embedding types ➔ Sampling strategy of training corpora ➔ Source of training / testing data Content-based Measurement ➔ Book characteristics (e.g., genre, topics)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of study workflow</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of the amount of OCR noise measured by CER in three sampled testing subsets. Each set contains 1500 examples.</figDesc><graphic coords="7,203.15,54.92,238.98,140.56" type="bitmap" /></figure>
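Figure 2 bins the testing subsets by character error rate (CER). A standard way to compute CER, and a reasonable assumption for how it is measured here, is the Levenshtein edit distance between the aligned clean and OCR'd texts, normalized by the length of the clean reference; a minimal stdlib sketch:

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if equal)
        prev = curr
    return prev[-1]

def cer(clean, ocr):
    """Character error rate: edit distance normalized by clean-text length."""
    return levenshtein(clean, ocr) / max(len(clean), 1)

# Toy OCR confusions: 'rn' read as 'm', 'l' read as '1'
# -> 3 edits over 12 reference characters = 0.25
score = cer("modern novel", "modem nove1")
```

Higher scores would place an excerpt in the "high noise" subset of Figure 2.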
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Confusion matrices of classification models built on the large-unbalanced training corpora. Labels "A", "B", "F", "M", "W", "S" represent "agriculture", "business", "fiction", "medicine", "world war history" and "social science".</figDesc><graphic coords="11,40.61,35.48,682.78,281.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Statistics of three parallel training corpora</figDesc><table><row><cell></cell><cell>Fiction</cell><cell>Social_Science</cell><cell>Agriculture</cell><cell>World_War_History</cell><cell>Medicine</cell><cell>Business</cell><cell>Total</cell></row><row><cell>Small_Balanced(SB)</cell><cell>167</cell><cell>167</cell><cell>167</cell><cell>167</cell><cell>166</cell><cell>166</cell><cell>1000</cell></row><row><cell>Small_Unbalanced(SU)</cell><cell>355</cell><cell>152</cell><cell>148</cell><cell>130</cell><cell>122</cell><cell>93</cell><cell>1000</cell></row><row><cell>Large_Unbalanced(LU)</cell><cell>1164</cell><cell>423</cell><cell>409</cell><cell>341</cell><cell>359</cell><cell>304</cell><cell>3000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>, for further excerpt classification. With 2 types of BERT embedding, 3 types of training data sampling, and 2 aligned training corpora, this study built 12 classifiers in total. Considering that our primary goal is to explore BERT embeddings' resilience against OCR errors rather than to improve classification performance, we employed a basic three-layer multilayer perceptron for building classifiers. During training, the model was fed the set of training examples and learned a weighting matrix for predicting each example's probability of mapping into each domain class, and each example was assigned to the domain with the highest probability. Following the standard practice of applying deep learning techniques for classification<ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b29">30]</ref>, our model was optimized with a cross-entropy loss function during training to maximize predictive performance (i.e., F1 score). To compare the consistency of predictions with and without OCR errors, we considered two types of classification: (1) training and testing corpora are both clean or both noisy (i.e., containing OCR errors); and (2) one is clean and the other is noisy.</figDesc><table /></figure>
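As a sketch of the training objective described above: the classifier maps an excerpt's logits to a probability per domain via softmax, and the cross-entropy loss penalizes the negative log-probability of the gold domain. The logit values below are illustrative, not taken from the paper:

```python
import math

def softmax(logits):
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, gold_index):
    """Negative log-probability of the gold domain class."""
    return -math.log(softmax(logits)[gold_index])

# Six domain classes as in the paper; logits are hypothetical classifier outputs
logits = [0.2, 1.5, 3.0, 0.1, -0.5, 0.7]   # A, B, F, M, W, S
probs = softmax(logits)
pred = probs.index(max(probs))              # argmax -> predicted domain
loss = cross_entropy(logits, gold_index=2)  # gold label: fiction (index 2)
```

Minimizing this loss over the training examples is what yields the weighting matrix described in the passage above.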
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Classification results on three training corpora (%). P, R, and F1 denote precision, recall, and F1 score, respectively. All evaluation indicators are at macro-level and represent the average value of results over 20 folds of testing samples. The highest F1 score per classification strategy in each training setting is highlighted in bold.</figDesc><table><row><cell></cell><cell></cell><cell cols="3">G→G</cell><cell cols="3">G→H</cell><cell cols="3">H→G</cell><cell cols="3">H→H</cell></row><row><cell></cell><cell></cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell><cell>P</cell><cell>R</cell><cell>F1</cell></row><row><cell>SB</cell><cell>Pre-trained</cell><cell>49.88</cell><cell>71.67</cell><cell>53.24</cell><cell>49.97</cell><cell>70.44</cell><cell>52.64</cell><cell>47.14</cell><cell>74.50</cell><cell>53.33</cell><cell>46.97</cell><cell>73.25</cell><cell>53.05</cell></row><row><cell></cell><cell>Fine-tuned</cell><cell>69.06</cell><cell>79.75</cell><cell>72.65</cell><cell>70.00</cell><cell>79.17</cell><cell>72.99</cell><cell>68.07</cell><cell>79.06</cell><cell>71.54</cell><cell>68.93</cell><cell>78.07</cell><cell>71.70</cell></row><row><cell>SU</cell><cell>Pre-trained</cell><cell>70.31</cell><cell>67.05</cell><cell>66.28</cell><cell>70.67</cell><cell>64.68</cell><cell>65.48</cell><cell>60.24</cell><cell>66.38</cell><cell>60.89</cell><cell>62.32</cell><cell>65.01</cell><cell>60.98</cell></row><row><cell></cell><cell>Fine-tuned</cell><cell>75.23</cell><cell>77.71</cell><cell>74.39</cell><cell>75.25</cell><cell>77.50</cell><cell>74.71</cell><cell>75.79</cell><cell>78.83</cell><cell>76.27</cell><cell>74.94</cell><cell>79.45</cell><cell>76.20</cell></row><row><cell>LU</cell><cell>Pre-trained</cell><cell>64.30</cell><cell>74.16</cell><cell>67.71</cell><cell>65.79</cell><cell>72.69</cell><cell>67.88</cell><cell>59.59</cell><cell>73.44</cell><cell>64.38</cell><cell>60.17</cell><cell>72.82</cell><cell>64.86</cell></row><row><cell></cell><cell>Fine-tuned</cell><cell>76.02</cell><cell>79.51</cell><cell>76.60</cell><cell>75.71</cell><cell>79.78</cell><cell>76.71</cell><cell>74.60</cell><cell>80.33</cell><cell>76.10</cell><cell>73.86</cell><cell>80.01</cell><cell>75.72</cell></row></table></figure>
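Table 2's macro-level precision, recall, and F1 average per-class scores with equal weight, so minority domains (e.g., business) count as much as fiction. A minimal sketch with toy labels (not the paper's data):

```python
def macro_prf(actual, predicted, labels):
    """Macro-averaged precision, recall, F1: per-class scores averaged
    with equal weight, regardless of class frequency."""
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(a == c and p == c for a, p in zip(actual, predicted))
        fp = sum(a != c and p == c for a, p in zip(actual, predicted))
        fn = sum(a == c and p != c for a, p in zip(actual, predicted))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Toy example over three of the six domains
actual    = ["F", "F", "S", "S", "B"]
predicted = ["F", "S", "S", "S", "B"]
p, r, f1 = macro_prf(actual, predicted, ["F", "S", "B"])
```

Averaging such scores over the 20 testing folds would give values in the form reported in Table 2.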
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Paired t-tests showing the differences in classification results across training strategies. Statistical significance is represented by p-value (one-tailed), where *p &lt; 0.05, **p &lt; 0.01, and ***p &lt; 0.001.</figDesc><table><row><cell></cell><cell cols="2">G→G</cell><cell cols="2">G→H</cell><cell cols="2">H→G</cell><cell cols="2">H→H</cell></row><row><cell></cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell></row><row><cell>SU vs. SB</cell><cell>15.54***</cell><cell>2.33*</cell><cell>11.06***</cell><cell>2.05*</cell><cell>7.76***</cell><cell>6.44***</cell><cell>7.87***</cell><cell>5.07***</cell></row><row><cell>LU vs. SU</cell><cell>1.85*</cell><cell>2.65**</cell><cell>2.42*</cell><cell>1.99*</cell><cell>3.32**</cell><cell>-0.22</cell><cell>2.94**</cell><cell>-0.71</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Paired t-tests showing the differences in classification results across training data sources. Statistical significance is shown by p-value (one-tailed), where *p &lt; 0.05, **p &lt; 0.01, and ***p &lt; 0.001.</figDesc><table><row><cell></cell><cell cols="2">SB</cell><cell cols="2">SU</cell><cell cols="2">LU</cell></row><row><cell></cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell><cell>Pre-trained</cell><cell>Fine-tuned</cell></row><row><cell>G→G vs. H→G</cell><cell>-0.13</cell><cell>1.41</cell><cell>4.09***</cell><cell>-2.86**</cell><cell>5.85***</cell><cell>0.73</cell></row><row><cell>G→H vs. H→H</cell><cell>-0.68</cell><cell>1.37</cell><cell>3.49**</cell><cell>-2.10*</cell><cell>4.31***</cell><cell>1.17</cell></row></table></figure>
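The t statistics in Tables 3 and 4 compare matched per-fold results of two settings. A stdlib sketch of the paired t statistic over hypothetical per-fold F1 scores (the one-tailed p-value would then come from a t distribution with n-1 degrees of freedom):

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic over matched folds (e.g., 20 testing folds):
    t = mean(d) / (stdev(d) / sqrt(n)), where d is the per-fold difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-fold F1 scores for two training strategies (not the paper's)
su = [74.1, 75.0, 73.8, 74.6, 75.2, 74.4]
sb = [71.9, 72.5, 72.1, 71.6, 72.8, 72.0]
t = paired_t(su, sb)   # large positive t -> SU consistently beats SB
```

A negative t (as in some fine-tuned cells of Table 3) simply means the first setting scored lower than the second on average.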
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5</head><label>5</label><figDesc>Divergence of classification results (F1 score) by changing the training corpus source from clean to OCR'd texts on three testing sample sets with different levels of OCR error.</figDesc><table><row><cell></cell><cell>Low Noisy</cell><cell>Medium Noisy</cell><cell>High Noisy</cell></row><row><cell>LU Pre-trained</cell><cell>4.88%</cell><cell>3.96%</cell><cell>0.70%</cell></row><row><cell>LU Fine-tuned</cell><cell>-1.96%</cell><cell>1.43%</cell><cell>3.79%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The number of epochs was optimized empirically by trying a set of values (i.e., 15, 20, 30, 50).</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>OCR'd texts both at different text granularities and for different downstream NLP tasks would be useful to deepen our understanding of how OCR impacts this contextualized embedding technology. Furthermore, since our corpora consist exclusively of English-language books from the 18th and 19th centuries, expanding this study to curated datasets from other historical periods, languages, and publication types would be a very worthwhile future exercise.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">DocBERT: BERT for document classification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Adhikari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.08398</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Evaluating the stability of embedding-based word similarities</title>
		<author>
			<persName><forename type="first">M</forename><surname>Antoniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mimno</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="107" to="119" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Assessing the Impact of OCR Errors in Information Retrieval</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">T</forename><surname>Bazzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Lorentz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Vargas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">P</forename><surname>Moreira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of European Conference on Information Retrieval</title>
				<meeting>European Conference on Information Retrieval</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="102" to="109" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Language Models &amp; Literary Clichés: Analyzing North Korean Poetry with BERT</title>
		<author>
			<persName><surname>Ben</surname></persName>
		</author>
		<ptr target="https://digitalnk.com/blog/2020/10/01/language-models-literary-cliches-analyzing-north-korean-poetry-with-bert/" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Impact of OCR errors on the use of digital libraries: Towards a better access to information</title>
		<author>
			<persName><forename type="first">G</forename><surname>Chiron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Visani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-P</forename><surname>Moreux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</title>
				<meeting>2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Cuba Gyllensten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gogoulou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ekgren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sahlgren</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.semeval-1.12" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth Workshop on Semantic Evaluation</title>
				<meeting>the Fourteenth Workshop on Semantic Evaluation<address><addrLine>Barcelona</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="112" to="118" />
		</imprint>
	</monogr>
	<note>International Committee for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Classification and distribution of optical character recognition errors</title>
		<author>
			<persName><forename type="first">J</forename><surname>Esakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Lopresti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Sandberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Society for Optics and Photonics</title>
				<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="volume">2181</biblScope>
			<biblScope unit="page" from="204" to="216" />
		</imprint>
	</monogr>
	<note>Document Recognition</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">What about Grammar? Using BERT Embeddings to Explore Functional-Semantic Shifts of Semi-Lexical and Grammatical Constructions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Fonteyn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Computational Humanities Research</title>
				<meeting>the Workshop on Computational Humanities Research</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="257" to="268" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Exponential smoothing: The state of the art</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">S</forename><surname>Gardner</surname><genName>Jr</genName></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Forecasting</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="28" />
			<date type="published" when="1985">1985</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hengchen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Digital Scholarship in the Humanities</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="825" to="843" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The Gutenberg-HathiTrust parallel corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Worthey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Dubnicek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Capitanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kudeki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Downie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">iConference</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>Poster</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Macroanalysis: Digital methods and literary history</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Jockers</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<publisher>University of Illinois Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kanjirangat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mitrovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Antonucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rinaldi</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.semeval-1.26" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth Workshop on Semantic Evaluation</title>
				<meeting>the Fourteenth Workshop on Semantic Evaluation<address><addrLine>Barcelona</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="214" to="221" />
		</imprint>
	</monogr>
	<note>International Committee for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">BERT for Named Entity Recognition in Contemporary and Historical German</title>
		<author>
			<persName><forename type="first">K</forename><surname>Labusch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kulturbesitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Neudecker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zellhöfer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th Conference on Natural Language Processing</title>
				<meeting>the 15th Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://aclanthology.org/W19-4800" />
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Chrupała</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hupkes</surname></persName>
		</editor>
		<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Leveraging contextual embeddings for detecting diachronic semantic shift</title>
		<author>
			<persName><forename type="first">M</forename><surname>Martinc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Novak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pollak</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.01072</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Quantitative analysis of culture using millions of digitized books</title>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Aiden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Veres</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Pickett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hoiberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Clancy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Norvig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Orwant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">331</biblScope>
			<biblScope unit="page" from="176" to="182" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Neural machine translation with BERT for post-OCR error detection and correction</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T H</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jatowt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N.-V</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Coustaty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doucet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM/IEEE Joint Conference on Digital Libraries</title>
				<meeting>the ACM/IEEE Joint Conference on Digital Libraries</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="333" to="336" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Performance analysis for classification in balanced and unbalanced data set</title>
		<author>
			<persName><forename type="first">S</forename><surname>Padma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Manavalan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Conference on Industrial and Information Systems</title>
				<meeting>the 6th International Conference on Industrial and Information Systems</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="300" to="304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Document Embedding Techniques</title>
		<author>
			<persName><forename type="first">S</forename><surname>Palachy</surname></persName>
		</author>
		<ptr target="https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d%5C#ecd3" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Counterfactual Story Reasoning and Generation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bosselut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holtzman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="5043" to="5053" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection</title>
		<author>
			<persName><forename type="first">D</forename><surname>Schlechtweg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McGillivray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hengchen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Dubossarsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tahmasebi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourteenth Workshop on Semantic Evaluation</title>
				<meeting>the Fourteenth Workshop on Semantic Evaluation<address><addrLine>Barcelona</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="23" />
		</imprint>
	</monogr>
	<note>International Committee for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Literary Event Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sims</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bamman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3623" to="3634" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Assessing the Impact of OCR Quality on Downstream NLP Tasks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Van Strien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Beelen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Ardanuy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>McGillivray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Agents and Artificial Intelligence</title>
				<meeting>the 12th International Conference on Agents and Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="484" to="496" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Text analysis using deep neural networks in digital humanities and information science</title>
		<author>
			<persName><forename type="first">O</forename><surname>Suissa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elmalech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhitomirsky-Geffet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Association for Information Science and Technology</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>Todorova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Colavizza</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Computational Humanities Research</title>
				<meeting>the Workshop on Computational Humanities Research</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="310" to="339" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Impact analysis of OCR quality on research tasks in digital archives</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Traub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Van Ossenbruggen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hardman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Theory and Practice of Digital Libraries</title>
				<meeting>International Conference on Theory and Practice of Digital Libraries</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="252" to="263" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Do humanists need BERT?</title>
		<author>
			<persName><forename type="first">T</forename><surname>Underwood</surname></persName>
		</author>
		<ptr target="https://tedunderwood.com/category/methodology/genre-comparison/" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Genre Identification and the Compositional Effect of Genre in Literature</title>
		<author>
			<persName><forename type="first">J</forename><surname>Worsham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kalita</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 27th International Conference on Computational Linguistics</title>
				<meeting>the 27th International Conference on Computational Linguistics<address><addrLine>Santa Fe, New Mexico, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1963" to="1973" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Towards literary genre identification: Applied neural networks for large text classification</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Worsham</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>University of Colorado Colorado Springs</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">A fast alignment scheme for automatic OCR evaluation of books</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">Z</forename><surname>Yalniz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Manmatha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 2011 International Conference on Document Analysis and Recognition</title>
				<meeting>2011 International Conference on Document Analysis and Recognition</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="754" to="758" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">An evaluation of text classification methods for literary study</title>
		<author>
			<persName><forename type="first">B</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="327" to="343" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
