<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Impact of OCR Quality on BERT Embeddings in the Domain Classification of Book Excerpts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ming Jiang</string-name>
          <email>mjiang17@illinois.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuerong Hu</string-name>
          <email>yuerong2@illinois.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Glen Worthey</string-name>
          <email>gworthey@illinois.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan C. Dubnicek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ted Underwood</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J Stephen Downie</string-name>
          <email>jdownie@illinois.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Illinois</institution>
          ,
          <addr-line>Urbana-Champaign</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>266</fpage>
      <lpage>279</lpage>
      <abstract>
        <p>Digital humanities (DH) scholars have been increasingly interested in using BERT for document representation in computational text analysis. However, most word embeddings, including BERT embeddings, have been developed using “clean” corpora, while DH research is usually based on digitized texts with optical character recognition (OCR) errors. Will these errors introduced by the digitization process reduce BERT's performance and distort the research findings? To shed light on the impact of OCR quality on BERT models, we conducted an empirical study on the resilience of BERT embeddings (pre-trained and fine-tuned) to OCR errors by measuring BERT's ability to enable classification of book excerpts by subject domain. We developed specialized parallel corpora for this task consisting of matching pairs of OCR'd text (19,049 volumes) and “clean” re-keyed text (4,660 volumes) from English-language books in six domains published from 1780 to 1993. This study is the first to systematically quantify OCR impact on contextualized word embedding techniques with a use case of OCR'd book datasets curated by digital libraries (DL). Experimental results show that pre-trained BERT is less robust when used on OCR'd texts; however, fine-tuning pre-trained BERT on OCR'd texts significantly improves its resilience to OCR noise in classification tasks according to the changes of classifier performance. These findings should assist DH scholars who are interested in using BERT for scholarly purposes.</p>
      </abstract>
      <kwd-group>
        <kwd>Optical Character Recognition</kwd>
        <kwd>BERT Resilience</kwd>
        <kwd>Word Embeddings</kwd>
        <kwd>Text Analysis</kwd>
        <kwd>Parallel Corpora</kwd>
        <kwd>HathiTrust</kwd>
        <kwd>Digital Humanities</kwd>
        <kwd>Digital Libraries</kwd>
        <kwd>Data Curation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The accessibility of ever-growing digitized textual curations in digital libraries (DL) and the
rapid development of natural language processing (NLP) techniques have opened up a variety
of new research opportunities to humanities scholars for computational text analysis [
        <xref ref-type="bibr" rid="ref20">19, 12,
13</xref>
        ]. In recent years, BERT (Bidirectional Encoder Representations from Transformers) has
been widely used as a fundamental text representation tool in text-based computing, for it
focuses on encoding the contextual meaning of words into a vector space [
        <xref ref-type="bibr" rid="ref25 ref7">7, 24</xref>
        ]. There are
two main reasons for its popularity. First, in encoding word tokens rather than word types
(i.e., distinct words), BERT is helpful in identifying the correct meaning of a homonym within
its context (e.g., bank in “river bank” and “savings bank”). Second, BERT can leverage
the general linguistic knowledge it has learned from a massive, high-resource corpus such as
Wikipedia to serve specialized and lower-resource downstream tasks, such as movie review
sentiment classification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. So far, BERT has produced promising improvements in both
(1) fundamental text analysis, e.g., text segmentation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], named entity recognition [28, 16],
and post-OCR correction [
        <xref ref-type="bibr" rid="ref21">28, 20</xref>
        ]; and (2) specific research topics, e.g., historical analysis of
semantic change in lexical/grammatical constructions [
        <xref ref-type="bibr" rid="ref19 ref25 ref9">24, 18, 9</xref>
        ], literary genre analysis [
        <xref ref-type="bibr" rid="ref31 ref4">30,
4</xref>
], literary event detection [25], and computational narrative intelligence [
        <xref ref-type="bibr" rid="ref24">23</xref>
        ].
      </p>
      <p>
        Digital humanities (DH) scholars working with computational analysis have been
increasingly interested in using this technique for their research on digitized texts. However, a majority
of large DL text curations and other historical text collections are machine-transcribed and
include varying degrees of optical character recognition (OCR) noise. Such noise might
decrease the generally impressive performance of BERT because it was originally developed on
born-digital texts without OCR errors [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Even though existing OCR systems have
significantly improved through advances in AI techniques (e.g., image recognition) and persistent
efforts of digital curators (e.g., the Library of Congress, HathiTrust Digital Library), OCR
noise can hardly ever be completely eliminated given its ubiquity, its uneven distribution, and
the heterogeneous nature of its source texts. Meanwhile, advanced NLP techniques like BERT
are generally limited in their transparency and interpretability, which is even worse when
processing OCR’d texts [
        <xref ref-type="bibr" rid="ref18">17</xref>
        ]. Such uncertainty might reduce the credibility of digital humanities
research when applying BERT-based computations to OCR’d texts for further analysis.
      </p>
      <p>
        Therefore, we believe BERT’s performance on OCR’d texts is an important problem to look
into. This study aims to empirically investigate this problem with three research questions:
(1) Would the original BERT model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (pre-trained on Wikipedia and free Web books) work
as well with OCR’d texts containing noise? (2) If we fine-tune the pre-trained BERT using
a corpus with a certain amount of OCR noise, would this result in any improvements for
processing OCR’d texts in downstream tasks? and (3) What are the quantifiable impacts of
OCR quality on both pre-trained and fine-tuned BERT models?
      </p>
      <p>
        To shed light on the interaction between OCR’d texts and BERT, we focused on measuring
the ability of BERT to encode digitized texts’ semantics and comparing the performance of
BERT encoding on clean (i.e., re-keyed) versus OCR’d texts. The texts we used were book
excerpts generated from ∼4,000 pairs of book volumes selected from a parallel corpus of digital
English-language books, with 4,660 human-proofread “clean” volumes from Project Gutenberg
(Gutenberg) and their matching pairs of 19,049 OCR’d volumes from HathiTrust Digital
Library (HathiTrust) [12]. Books in this corpus cover six subject domains published from 1780
to 1993. We chose subject domain classification as the application downstream from BERT
in order to quantify its encoding performance, because document classification in general is a
popular application for digital humanists studying subject, genre, authorship, and many other
features of their texts [
        <xref ref-type="bibr" rid="ref28 ref34">34, 27</xref>
        ]. Specifically, we investigated both the generic embedding
obtained from the pre-trained BERT model and the domain-adapted embedding by fine-tuning
the pre-trained BERT on the downstream training corpus (i.e., either clean or noisy).
      </p>
      <p>The remainder of this paper is organized as follows. In section 2, we review related work on
BERT and OCR’d texts. In section 3, we provide detailed information about the parallel book
dataset that we created and leveraged, and how we built the book excerpt corpora needed for
our experiments. In section 4, we describe our research design and workflow. We also give
explanations for the specific decisions made and methods adopted. In section 5, we present our
experimental results and findings. Finally in section 6, we discuss our conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        BERT used in existing work for digital history and literary studies generally plays a text
preprocessing role by encoding text information into vectors for further computation. Popular
research topics in this field mainly focus on the diachronic analysis of literary texts [
        <xref ref-type="bibr" rid="ref19 ref25 ref31 ref9">24, 18,
9, 30</xref>
        ] and narrative understanding [
        <xref ref-type="bibr" rid="ref24">25, 23</xref>
        ]. Regarding data sources, commonly used corpora
typically come from Project Gutenberg [25], the Corpus of Historical American English [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
and OCR’d text collections organized in DL [
        <xref ref-type="bibr" rid="ref19 ref25">24, 18</xref>
        ]. Although BERT has shown its power
in representing clean texts, some empirical studies [
        <xref ref-type="bibr" rid="ref15 ref25 ref6">24, 14, 6</xref>
        ] have observed a drop in its
performance when processing digitized texts containing OCR errors. Motivated by this, we are
interested in advancing the understanding of BERT’s applicability on OCR’d noisy texts.
      </p>
      <p>
        Based on a literature review on OCR noise analysis, common error types include character
misidentification (e.g., “inserted”→“insorted”), broken words (e.g., un-rejoined hyphenated
words “talking”→“talk- ing”), incorrectly joined words (e.g., “the
belief”→“thebelief”), and meaningless symbols (e.g., OCR attempts to recognize hand-written marginalia)
[
        <xref ref-type="bibr" rid="ref8">3, 8</xref>
        ]. Given the various patterns and random distribution of OCR noise, even the
state-of-the-art techniques for OCR correction cannot completely filter the OCR noise out.
      </p>
      <p>
        Prior work on the impact of uncorrected OCR’d texts on other NLP tasks can be divided
into two groups: (1) those quantifying impact by measuring the performance differences of
a set of popular NLP techniques applied on a parallel corpus consisting of OCR’d and clean
texts [
        <xref ref-type="bibr" rid="ref11 ref27 ref5">11, 26, 5</xref>
        ]; and (2) those analyzing OCR impact by interviewing scholar-users for their
feedback on the use of digital archives and NLP techniques for computational textual analysis
[29]. Popular NLP tasks adopted in existing studies include tokenization, sentence
segmentation, named entity recognition, dependency parsing, topic modeling, information retrieval,
text classification, collocation, and authorial attribution [
        <xref ref-type="bibr" rid="ref11 ref27 ref5">11, 26, 5</xref>
        ]. Most studies show that
OCR errors lead to a consistent negative influence on NLP tasks, even for some tasks that have
been considered “solved” (e.g., sentence segmentation) [
        <xref ref-type="bibr" rid="ref27">26</xref>
        ]. In this research, we extend prior
work by studying the impact of OCR quality on BERT-based text representations, where we
particularly explore BERT’s ability to encode the intrinsic semantic features of OCR-impacted
texts in comparison with its encoding of parallel clean texts.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data and Corpora Preparation</title>
      <p>The source data for this study is a parallel corpus of English monographs [12] collected from
two real-world digital libraries: (1) Gutenberg for a human-proofread “clean” corpus; and, (2)
HathiTrust for an OCR’d “noisy” corpus. This corpus has a total of 4,660 Gutenberg volumes
in 6 domains (i.e., fiction, social science, agriculture, medicine, business, world war history),
each of which is matched with several different copies (4 on average) of the same work held in
HathiTrust.</p>
      <p>Since classification is a supervised learning task, we started by preparing three parallel data
splits from the raw corpus for training, validation, and testing, respectively. Given the
many-to-one matching relationship between HathiTrust and Gutenberg volumes, and in order to
keep the clean and OCR’d versions of each data split aligned by volume while avoiding volume
duplication in the clean splits, we first split the Gutenberg data by randomly selecting 10%
of the 4,660 Gutenberg volumes for validation (465 volumes), 10% for testing (467 volumes), and
the rest for training. We then randomly picked one paired HathiTrust copy of each Gutenberg
volume to build the corresponding training, validation, and testing splits of OCR’d texts.</p>
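      <p>As a minimal sketch, the volume-aligned splitting described above might look as follows; the function name, the volume identifiers, and the pairing map are hypothetical illustrations, not the study’s actual code.</p>

```python
import random

def make_parallel_splits(gutenberg_ids, hathi_copies, seed=42):
    """Split Gutenberg volumes into roughly 80/10/10 train/validation/test
    sets, then pick one matched HathiTrust copy per volume so the clean
    and OCR'd splits stay aligned without volume duplication."""
    rng = random.Random(seed)
    ids = list(gutenberg_ids)
    rng.shuffle(ids)
    n_val = len(ids) // 10
    n_test = len(ids) // 10
    splits = {
        "validation": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }
    # hathi_copies maps a Gutenberg id to its (on average ~4)
    # matched HathiTrust copies; sample one copy per volume.
    return {
        name: [(gid, rng.choice(hathi_copies[gid])) for gid in vols]
        for name, vols in splits.items()
    }
```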
      <p>
        As shown in [
        <xref ref-type="bibr" rid="ref2 ref22">2, 21</xref>
        ], data distribution and downstream corpus size, in addition to text quality, also influence the
embeddings’ encoding ability, especially for the fine-tuned BERT
embedding. Taking these two variables into consideration, we modified the original parallel training
split by resampling the data into three types of parallel training corpus: (1) a small balanced
corpus (SB) containing 1000 books with an equal number of books per genre; (2) a small
unbalanced corpus (SU) containing 1000 books with a different number of books per genre;
and (3) a large unbalanced corpus (LU) containing 3000 books with a different number of
books per genre. Table 1 shows the details of each type of training corpus. Given the highly
skewed data distribution in the original parallel corpus (e.g., fiction volumes comprise 88%)
[12], our unbalanced corpora were generated by a slight smoothing based on the exponentially
smoothed weighting method [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], where we empirically set the smoothing factor as 0.3.
      </p>
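      <p>The exponentially smoothed weighting with smoothing factor 0.3 can be sketched as below, following the common formulation in which each domain’s sampling share is proportional to its empirical share raised to the smoothing factor; the exact details in [10] may differ, and the function name is illustrative.</p>

```python
def smoothed_counts(domain_counts, total, alpha=0.3):
    """Resample a skewed domain distribution using exponential
    smoothing: each domain's target share is proportional to
    (n_i / N) ** alpha, with alpha = 0.3 as set in the paper."""
    n = sum(domain_counts.values())
    weights = {d: (c / n) ** alpha for d, c in domain_counts.items()}
    z = sum(weights.values())
    return {d: round(total * w / z) for d, w in weights.items()}
```

      <p>With alpha = 0.3, a domain holding 88% of the raw corpus (fiction) would be sampled well below 88% of the new corpus, while rare domains are boosted, which matches the “slight smoothing” described above.</p>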
      <p>
        There are two main challenges in the encoding of book content by BERT. First, book-length
texts and the computational cost of BERT models make it expensive to encode each
volume’s full text. Moreover, BERT models are restricted to processing at most 512 tokens
at a time, which limits their encoding abilities on long sentences. To address these issues,
we followed prior work [
        <xref ref-type="bibr" rid="ref32">31, 32</xref>
        ] by parsing the full content per volume into a set of word
sequences with at most n tokens and randomly sampled k continuous word sequences as a text
chunk to feed into BERT. Referring to prior studies’ parameter settings and our own hardware
computing constraints, we set n = 128 and k = 15 (∼1920 tokens per chunk). Recent studies
on subject domain and genre classification [
        <xref ref-type="bibr" rid="ref32">31, 32</xref>
        ] show that book chunks should be sufficient
for predicting an entire book’s subject, and with this premise, we decided to focus on parallel
book excerpts for our study. Although this method could not process complete volumes, the
random sampling strategy is helpful in augmenting the book content to be trained or tested
as much as possible, which compensates for the limits on text length.
      </p>
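      <p>The excerpt-sampling step (n = 128, k = 15, at most ∼1920 tokens per chunk) can be sketched as follows; the function name is hypothetical, and the original tokenization details are not reproduced here.</p>

```python
import random

def sample_chunk(tokens, n=128, k=15, rng=random):
    """Parse a volume's token stream into word sequences of at most
    n tokens each, then sample k contiguous sequences as one text
    chunk (n=128, k=15 yields up to ~1920 tokens per chunk)."""
    seqs = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    if k >= len(seqs):
        start = 0          # short volume: take everything available
    else:
        start = rng.randrange(len(seqs) - k + 1)
    return [tok for seq in seqs[start:start + k] for tok in seq]
```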
      <p>
        To make each classifier’s predictions on the clean versus OCR’d test sets comparable, the sampled
text chunks from each pair of test volumes were aligned by an existing text alignment algorithm
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. We manually examined a random sample of chunk pairs to ensure alignment accuracy.
Furthermore, for a statistical significance test of the classification results, we grouped all the
sampled chunk pairs into a set of parallel testing folds. In the end, our parallel testing corpus
consists of 20 parallel testing folds, where each parallel fold contains one unique pair of text
chunks extracted from a pair of Gutenberg and HathiTrust volumes (20 × 467 = 9,340 parallel
examples in total).
      </p>
      <p>[Figure 1 fragment: a clean or OCR’d book excerpt (w1, w2, …, wn) is encoded by BERT into token vectors (t1, t2, …, tn), which are mean-pooled into a chunk-level vector (d1) for book excerpt domain classification.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Research Design and Workflow</title>
      <p>[Figure 1: overall study workflow, covering model-based measurement (BERT embedding types; sampling strategy of training corpora; source of training/testing data), content-based measurement (book characteristics, e.g., genre, topics), and the resulting outcomes.]</p>
      <p>
The primary goal of this study is to analyze the performance of BERT embeddings in
encoding book excerpts into n D-dimensional (D=768) token vectors for book domain classification
based on the parallel clean and OCR’d texts. We measured and compared BERT embeddings’
encoding ability in different classifiers using macro-averaged precision (P), recall (R), and F1
score (F1). Considering the potential influence of experimental settings on BERT
embeddings’ performance, we analyzed the classification outcomes based on the model settings and
data characteristics respectively. Figure 1 visualizes the overall workflow of this study, which
includes two stages: (1) building classifiers based on text representations offered by BERT
embeddings on book excerpts; and (2) quantifying BERT embeddings’ performance in different
classification settings to analyze BERT embeddings’ resilience to OCR noise.</p>
      <sec id="sec-4-1">
        <title>4.1. Domain Classifier Construction</title>
        <p>
          With the encoded BERT token representations per excerpt, we first generate a single
chunk-level feature vector by averaging the token vectors, one of the standard practices used in
prior work [
          <xref ref-type="bibr" rid="ref23">22</xref>
          ], for further excerpt classification. With 2 types of BERT embedding, 3 types of
training data sampling, and 2 aligned training corpora, in total, this study built 12 classifiers.
Considering that our primary goal is to explore BERT embeddings’ resilience against OCR
errors rather than improving classification performance, we employed a basic
multilayer perceptron with three layers for building classifiers. With respect to
the training process, by feeding the set of training examples, the model was expected to learn
a weighting matrix for predicting the mapping probability per example into each domain
class, where each training example was assigned to the domain with the highest probability.
Following the standard practice of applying deep learning techniques for classification [
          <xref ref-type="bibr" rid="ref1 ref31">1, 30</xref>
          ],
our model was optimized by a cross-entropy loss function during training to maximize the
model predictability (i.e., F1 score). To compare the consistency of predictions with and
without OCR errors, we proposed two types of classifications: (1) both training and testing
corpora are either clean or noisy (i.e., containing OCR errors); and (2) one is clean and the
other is noisy.
        </p>
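          <p>The chunk-level encoding and a forward pass of a small feed-forward classifier can be sketched as below with NumPy stand-ins; the layer shapes, names, and weight format are illustrative assumptions, not the study’s actual implementation.</p>

```python
import numpy as np

def chunk_feature(token_vectors):
    """Mean-pool the n x D (D=768) BERT token vectors of one excerpt
    into a single D-dimensional chunk-level feature vector."""
    return np.asarray(token_vectors).mean(axis=0)

def mlp_predict(x, weights):
    """Forward pass of a small feed-forward classifier over the pooled
    feature; weights is a list of (W, b) layer parameters, e.g. three
    pairs for a three-layer model. Softmax over the last layer's
    logits gives one probability per subject domain."""
    h = x
    for w, b in weights[:-1]:
        h = np.maximum(h @ w + b, 0.0)   # ReLU hidden layers
    w, b = weights[-1]
    logits = h @ w + b
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()
```

          <p>During training, the predicted probabilities would be compared against the gold domain label with a cross-entropy loss, as described above.</p>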
        <p>
          The detailed implementation of model training is as follows. We used the Adam optimizer
[
          <xref ref-type="bibr" rid="ref12">15</xref>
          ] to train all classification models for 20 epochs.<sup>1</sup> As to the learning rate, for pre-trained
BERT-based classifiers, we set this parameter to 2.0e-3 for the Gutenberg corpus and 2.5e-3
for the HathiTrust corpus, respectively, while for fine-tuned classifiers, we set both to
2.5e-5. Our empirical setting for this parameter was based on the resulting classifier’s performance
on the validation set in order to find the optimal one. The batch size was set to 40 (book
excerpts) for all the models.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Analysis of BERT Encoding on Clean Versus OCR’d Texts</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Model-based measurement</title>
          <p>Based on the classification results of 12 generated classifiers on our parallel testing corpus, we
analyzed the relations among BERT embedding types (i.e., pre-trained or fine-tuned BERT),
the source of training and testing data, and the sampling strategy of training corpora by
pairwise comparison of any two of three variables. Our goals were: (1) finding the optimal
BERT embedding with the highest resilience against OCR errors; and (2) identifying the
optimal sampling strategy for building the training corpus that most significantly improves
the BERT embedding performance.</p>
          <p>Given that the above analysis primarily focused on the comparison of BERT-based classifiers’
overall performance, we further proposed a fine-grained investigation of BERT embeddings’
resilience to OCR errors regarding the amount of noise. To conduct this investigation, we first
prepared three subsets of OCR’d testing data containing different amounts of OCR errors.
The level of OCR noise was measured by the character-level error rate (CER) based on the
comparison of each OCR’d book excerpt with its paired clean text. After sorting the OCR’d
excerpts by their CER in an ascending order, from this ranked excerpt list, we separately
sampled 1500 excerpts at the top, middle, and bottom positions as the low-,
medium-, and high-noise testing subsets. Figure 2 displays the distribution of CER in each testing
subset, where the average CER per subset is around 0.40, 0.54, and 0.65, respectively. We
then evaluated each classifier’s predictability on each subset. Note that, in this analysis, we
only considered those classifiers trained on the corpus with the identified optimal sampling
strategy. To further look into the resilience of BERT embeddings with respect to the change of
the downstream classification’s training corpus source, rather than exploring each individual
classifier’s results, we measured the divergence of classification results between the classifier
trained on the clean versus the OCR’d texts for each type of BERT embedding.</p>
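          <p>The character-level error rate used to rank the OCR’d excerpts can be sketched as the Levenshtein edit distance between an OCR’d excerpt and its aligned clean text, normalized by the clean text’s length; since the study does not specify its exact CER formulation, this normalization is an assumption.</p>

```python
def cer(ocr_text, clean_text):
    """Character error rate: Levenshtein distance between the OCR'd
    excerpt and its aligned clean text, normalized by the clean
    text's length (dynamic-programming edit distance)."""
    prev = list(range(len(clean_text) + 1))
    for i, ca in enumerate(ocr_text, 1):
        cur = [i]
        for j, cb in enumerate(clean_text, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / max(len(clean_text), 1)
```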
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Content-based measurement</title>
          <p>Although each book in the raw parallel corpus was assigned to a single subject domain tag,
given the diversity of content-based characteristics (e.g., topics, genres, narrative styles)
inherent in a book-sized text and its randomly sampled excerpts, it is possible that the input
data itself might bring challenges for a BERT-based classifier to identify its annotated
domain tag.</p>
          <p><sup>1</sup> The number of epochs was optimized empirically by trying a set of values (i.e., 15, 20, 30, 50).</p>
          <p>[Figure 2: CER distributions of the low-, medium-, and high-noise testing subsets.]</p>
          <p>Moreover, whether and how the challenges occurring with OCR’d texts differ from
those occurring with clean texts is uncertain. For instance, if all BERT-based classifiers fail
to classify either clean or OCR’d excerpts of the same book correctly, one potential reason
for this result could be that the original book includes more than one subject. In contrast,
if all classification models work well on the clean texts only, it is likely that OCR noise is
resulting in diferent predictions. To address these concerns, we started by exploring semantic
associations among misclassified domains by visualizing the confusion matrix of each
classifier. To further capture book excerpts’ individual features for understanding their influence
on classification, we then grouped the predictions made per classifier on individual excerpts
by book, to measure the consistency of classifiers’ prediction accuracy at the book level. This
measurement is based on calculating the number of testing excerpts of the same book that
were assigned to the same correct domain across different classifiers on average. Given the
quantitative outcomes, we sampled some cases with poor prediction accuracy, and explored
potential reasons for misclassification by close reading of the book content.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Outcomes and Findings</title>
      <sec id="sec-5-1">
        <title>5.1. Resilience of BERT embeddings</title>
        <p>Table 2 provides an overview of the classification results grouped by (1) source of training
and testing data (Gutenberg or HathiTrust); (2) sampling strategy of parallel training corpus
(small-balanced, small-unbalanced and large-unbalanced); and (3) type of BERT embedding
(pre-trained or fine-tuned). Overall, we observe that classifiers built with fine-tuned BERT
outperformed those built with pre-trained BERT by 20% (F1 score) based on the balanced
training corpora and 10% (F1 score) based on the unbalanced training corpora. This result
indicates that the fine-tuning process, intended to adapt the generic pre-trained BERT
embedding space to fit into a specific text corpus (either clean or OCR’d), will substantially improve
the encoding ability of BERT for digitized literary texts even with the distortion of OCR noise.</p>
        <p>Regarding the influence of training sampling strategies to BERT encoding, in general,
unbalanced corpora were more helpful in training classifiers than balanced corpora, which suggests
that excessive artificial intervention in the training data distribution could indeed hurt BERT’s
encoding ability. Table 3 further shows the paired t-test scores of the statistical difference
of performance between any two comparable classifiers that differ only in either size or data
distribution of the training corpus. It is to be noted that differences between any two compared
classifiers’ performances over 20 testing folds follow an approximately normal distribution
based on the Shapiro-Wilk test. According to the results, pre-trained BERT-based classifiers
are all sensitive to both size and data distribution in the training corpus (p-value &lt; 0.05 at
least). However, the increase in size of the OCR’d training corpus has no significant impact
on fine-tuned BERT embedding. This observation may be understood as a positive signal to
humanities scholars that a small training corpus is enough to achieve optimal performance of
fine-tuned BERT when working with OCR’d texts. Comparatively, training corpus size (t-test
score from -0.71 to 3.32 where p-value &lt; 0.01 at most) is less influential on BERT
embeddings’ performance than is training data distribution (t-test score from 2.05 to 15.54 where the
majority of p-values &lt; 0.001).</p>
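        <p>The paired t-test over the 20 parallel testing folds can be sketched in plain Python as below; in practice a library routine such as scipy.stats.ttest_rel computes the same statistic along with p-values.</p>

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-fold scores (e.g., F1 on the 20
    parallel testing folds) for two classifiers being compared."""
    diffs = [x - y for x, y in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```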
        <p>Similar to the analysis of training sampling strategies, we compared classifiers’ performance
with respect to the source of training data. Table 4 shows the paired t-test results.
Pre-trained BERT-based classifiers were significantly more sensitive to their training data source
when these classifiers were built on unbalanced training corpora (p-value tends to be &lt; 0.001).
In particular, the growth of training corpus size increased such sensitivity (t-test score
increased from 4.09*** to 5.85 when testing on the clean corpus, and from 3.49** to 4.31***
when testing on the OCR’d corpus). Meanwhile, for fine-tuned BERT, classifiers only showed
their sensitivity to the source of training data in small unbalanced training corpora (t-test
score was -2.86** when testing on the clean corpus, and -2.10* when testing on the OCR’d
corpus). According to the F1 score of these classifiers’ prediction results shown in Table 2, we
found that, compared with fine-tuning on clean texts, fine-tuning on OCR’d texts improved
BERT-based classifiers’ performance by ∼2%, which suggests that potential OCR noise in the
small-unbalanced corpus for BERT fine-tuning can boost the resulting embedding’s encoding
performance.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Impact of the amount of OCR noise on BERT encoding.</title>
        <p>Given three testing sample sets with diferent levels of OCR noise (see details of data
preparation in section 4.2.1), Table 5 shows the divergence of F1 score between classifiers built with
either pre-trained or fine-tuned BERT embeddings on each sample set. This divergence was
calculated by subtracting the classification results of the classifier trained on OCR’d texts from
those of the classifier trained on clean texts.</p>
        <p>Overall, we found that classifiers obtained greater benefit from clean training data compared
with OCR’d data except in the case of fine-tuned BERT-based classifiers making predictions
on the low-noise testing data. Regarding the classification divergence across the three testing
sample sets, we observed a gradual decrease in difference on testing samples with low (4.88%),
medium (3.96%), and high (0.70%) level of OCR noise when classifiers employed pre-trained
BERT for text encoding, while the pattern was the opposite in classifiers built with fine-tuned
BERT (i.e., -1.96% for the low-noise group, 1.43% for the medium-noise group, and 3.79% for the
high-noise group). We further compared the absolute differences of classification results between
two classifiers per embedding type, and found that testing samples with lower-level OCR noise
were more sensitive to the training data source than those with higher-level noise in pre-trained
BERT-based classifiers. On the contrary, for the classifiers built with the fine-tuned BERT,
the largest performance difference was found in the testing set with a high amount of
OCR noise.
        <p>We draw three major conclusions. First, the consistency of text quality in an embedding’s
pre-training corpus, downstream training, and downstream testing corpus is helpful in
improving pre-trained BERT’s applicability for literary text classification. Second, the heterogeneous
nature of OCR noise can improve the generalization ability of fine-tuned embeddings to process
texts with comparatively low levels of OCR noise. Finally, fine-tuned BERT-based classifiers
are more stable with regard to changes in the source of training corpus than pre-trained
BERT-based classifiers, which further confirms that fine-tuned BERT outperforms pre-trained BERT
in its resilience to OCR errors.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Error analysis by content-based measurement.</title>
        <p>In each confusion matrix, the diagonal values represent the ratio of correct predictions, while the other values indicate the ratio of
misclassifications (actual vs. predicted). The higher the value, the darker its corresponding cell
color. For example, in the first matrix (fine-tuned, G→G), the value “0.45%” in the cell at
the upper left corner indicates that 0.45% of “world war history” excerpts were misclassified
as “agriculture” by the fine-tuned BERT-based classifier, which was trained and tested on
Gutenberg texts. For both pre-trained and fine-tuned BERT-based classifications, we found
that book excerpts in the business domain were more likely to be misclassified as fiction ( 25.4%
on average) and social science (19.8% on average), while book excerpts in the medicine
domain were more likely to be mistakenly classified as social science, especially with fine-tuned
BERT-based classifiers trained on the OCR’d texts (32.86% misclassifications in H →G
classification and 27.86% misclassifications in H →H). By looking more closely at social-science
instances, we observed that the pattern of misclassifications was diferent in the classifier built
with pre-trained BERT compared with that built with fine-tuned BERT. Specifically, in the
classifications using pre-trained BERT for text encoding, prediction errors mainly concentrated
in the domains of business (10% on average), medicine (8.5% on average), and fiction ( 7.5%
on average). Meanwhile, for fine-tuned BERT-based classification, fiction ( 17% on average)
and medicine (11% on average) were the top two misclassifications for social-science excerpts.</p>
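        <p>The misclassification percentages discussed here are read off row-normalized confusion matrices. A minimal sketch with invented counts (the six domain initials are assumed to abbreviate agriculture, business, fiction, medicine, social science, and world war history):</p>
        <preformat>
```python
import numpy as np

# Invented raw prediction counts: rows are actual domains, columns are
# predicted domains, in the order A, B, F, M, S, W.
domains = ["A", "B", "F", "M", "S", "W"]
counts = np.array([
    [90,  2,  3,  2,  2,  1],
    [ 3, 60, 20,  2, 14,  1],
    [ 1,  2, 92,  1,  3,  1],
    [ 2,  1,  4, 70, 22,  1],
    [ 2,  8, 15,  9, 64,  2],
    [ 3,  1,  4,  1,  2, 89],
])

# Row-normalize so each row sums to 100%: the diagonal holds the share of
# correct predictions, off-diagonal cells the misclassification shares.
percent = counts / counts.sum(axis=1, keepdims=True) * 100

b, f = domains.index("B"), domains.index("F")
print(f"business predicted as fiction: {percent[b, f]:.1f}%")  # 20.0%
```
        </preformat>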
        <p>Comparing prediction errors with respect to the source of data for training and testing, we
found that the pattern of misclassification in fine-tuned BERT-based classifications tended to
be similar among all four types of classification. However, the ratio of errors per domain in
pre-trained BERT-based classifications was likely to differ depending on the classifiers’
training corpus source. For example, business instances tended to be misclassified as fiction
(25%-28%) when the training corpus was clean, but as social science (23%-27%) when using
OCR’d texts for training. Similarly, medicine instances had a markedly higher ratio of
misclassification as social science (27.89%-32.86%) with the OCR’d training corpus compared
with the clean one (11.43%-16.43%). These observations reaffirm that fine-tuned BERT is
more robust for processing OCR’d texts compared with pre-trained BERT.</p>
        <p>We further looked into the prediction consistency of all BERT-based classifiers on each
book in both clean and OCR’d versions. Given two aligned lists (i.e., clean and OCR’d) of
book-level average prediction accuracy across different classifiers, we found that there was a
large overlap of books with comparatively low accuracy in clean versus OCR’d corpus, which
suggests that content-based characteristics of these particular books may be the main cause of
recurring prediction mistakes. We verified this hypothesis by manually checking the books with
the lowest prediction scores, and confirmed that these books had heterogeneous genre-related
features which were confusing even for human readers. For instance, the book The Story of My
Life by Helen Keller is generally considered a classic “social science” work because of its main
subject and its many non-fiction features. However, it is also a classic autobiography, first
published in 1903, composed of touching stories of a great woman struggling with severe
disability.</p>
        <p>Therefore, it is unsurprising, and even understandable, that the models labeled its instances as
“medicine” or “fiction”, given what they learned from the training data.</p>
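        <p>The book-level consistency check described above can be sketched as follows; the per-book accuracies are hypothetical:</p>
        <preformat>
```python
# Sketch of the book-level analysis: average each book's prediction accuracy
# across all classifiers for its clean and OCR'd versions, then intersect the
# lowest-scoring books. A large intersection suggests that content, not OCR
# noise, drives the recurring errors. All values are invented.

def lowest_k(book_acc, k):
    """Titles of the k books with the lowest average prediction accuracy."""
    ranked = sorted(book_acc.items(), key=lambda item: item[1])
    return {title for title, _ in ranked[:k]}

clean_acc = {"book_a": 0.95, "book_b": 0.40, "book_c": 0.88, "book_d": 0.35}
ocr_acc = {"book_a": 0.91, "book_b": 0.42, "book_c": 0.79, "book_d": 0.30}

overlap = lowest_k(clean_acc, 2).intersection(lowest_k(ocr_acc, 2))
print(sorted(overlap))  # prints ['book_b', 'book_d']
```
        </preformat>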
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>We have investigated the resilience of pre-trained and fine-tuned BERT embeddings for
encoding OCR’d texts through a case study of classifying book excerpts into subject domains. To
the best of our knowledge, this is the first empirical study to systematically quantify the
influence of OCR quality on BERT. By varying BERT embedding types and classification model
settings, we built 12 BERT-based classifiers using book excerpt corpora extracted from a large
parallel book corpus of aligned clean and OCR’d volumes sourced from two well-known digital
libraries. Our analysis shows that the original BERT embedding pre-trained on born-digital
texts is not resilient to OCR noise, at least according to its classification accuracy. However,
fine-tuning the pre-trained BERT on OCR’d texts significantly improves BERT’s resilience
to OCR noise, and hence will benefit downstream applications. Moreover, fine-tuned BERT
outperforms the pre-trained one in its encoding stability with regard to changes in training corpus
size and training data source. For both types of BERT embedding, unbalanced training
corpora benefit embeddings’ resilience to OCR noise in downstream classifications. Our findings
suggest that DH scholars should consider employing fine-tuned BERT for digitized-text-based
scholarly research, particularly when their research involves document classification.</p>
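      <p>As a concrete illustration of the recommendation above, here is a minimal fine-tuning sketch assuming the HuggingFace transformers library; the model checkpoint, excerpt length, and hyperparameters are illustrative, not the study’s exact configuration:</p>
      <preformat>
```python
# Minimal sketch: fine-tune pre-trained BERT for six-way domain classification
# of OCR'd book excerpts. Labels, data, and settings are placeholders.

DOMAINS = ["agriculture", "business", "fiction",
           "medicine", "social_science", "world_war_history"]

def make_excerpts(text, words_per_excerpt=150):
    """Split a volume's text into fixed-length word chunks (short tail dropped)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_excerpt])
        for i in range(0, len(words) - words_per_excerpt + 1, words_per_excerpt)
    ]

if __name__ == "__main__":
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(DOMAINS))
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # The OCR'd training excerpts would come from the digitized corpus;
    # OCR errors are deliberately left uncorrected.
    ocr_excerpts = ["Tke harvest of wheat and rye was abundant tbis year ..."]
    labels = torch.tensor([DOMAINS.index("agriculture")])

    batch = tokenizer(ocr_excerpts, truncation=True, padding=True,
                      max_length=256, return_tensors="pt")
    model.train()
    loss = model(**batch, labels=labels).loss  # cross-entropy over six domains
    loss.backward()
    optimizer.step()
```
      </preformat>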
      <p>While our experiments yield significantly positive evidence for fine-tuned BERT embeddings’
resilience to OCR noise in the use case of document classification, the impact of OCR noise on
BERT for other downstream tasks remains under-investigated. For example, it is possible that
BERT could react to OCR noise differently at more fine-grained levels, such as sentence-level
tasks (e.g., next sentence prediction, sentence-based sentiment analysis) and word-level tasks
(e.g., part-of-speech tagging). Therefore, future work focusing on BERT’s performance on
OCR’d texts both at different text granularities and for different downstream NLP tasks would
be useful to deepen our understanding of how OCR impacts this contextualized embedding
technology. Furthermore, since our corpora consist exclusively of English-language books from
the 18th and 19th centuries, expanding this study to curated datasets from other historical
periods, languages, and publication types would be a very worthwhile future exercise [13].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Adhikari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          . “
          <article-title>DocBERT: BERT for document classification”</article-title>
          .
          <source>In: arXiv preprint arXiv:1904.08398</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Antoniak</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          . “
          <article-title>Evaluating the stability of embedding-based word similarities”</article-title>
          .
          <source>In: Transactions of the Association for Computational Linguistics</source>
          <volume>6</volume>
          (
          <year>2018</year>
          ), pp.
          <fpage>107</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Bazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Lorentz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Vargas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. P.</given-names>
            <surname>Moreira</surname>
          </string-name>
          .
          <article-title>“Assessing the Impact of OCR Errors in Information Retrieval”</article-title>
          .
          <source>In: Proceedings of European Conference on Information Retrieval</source>
          . Springer.
          <year>2020</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          Ben.
          <article-title>Language Models &amp; Literary Clichés: Analyzing North Korean Poetry with BERT</article-title>
          .
          <year>2020</year>
          . url: https://digitalnk.com/blog/2020/10/01/language-models-literary-cliches-analyzing-north-korean-poetry-with-bert/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chiron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Visani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Moreux</surname>
          </string-name>
          . “
          <article-title>Impact of OCR errors on the use of digital libraries: Towards a better access to information”</article-title>
          .
          <source>In: Proceedings of 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          .
          IEEE
          .
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuba Gyllensten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gogoulou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ekgren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          . “
          <article-title>SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection”</article-title>
          .
          <source>In: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          . Barcelona (online):
          <source>International Committee for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>118</lpage>
          . url: https://aclanthology.org/2020.semeval-1.12.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . “BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Esakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Lopresti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Sandberg</surname>
          </string-name>
          . “
          <article-title>Classification and distribution of optical character recognition errors”</article-title>
          .
          <source>In: Document Recognition</source>
          . Vol.
          <volume>2181</volume>
          .
          <source>International Society for Optics and Photonics</source>
          .
          <year>1994</year>
          , pp.
          <fpage>204</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fonteyn</surname>
          </string-name>
          . “
          <article-title>What about Grammar? Using BERT Embeddings to Explore FunctionalSemantic Shifts of Semi-Lexical and Grammatical Constructions”</article-title>
          .
          <source>In: Proceedings of the Workshop on Computational Humanities Research</source>
          .
          <year>2020</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. S. Gardner</given-names>
            <surname>Jr</surname>
          </string-name>
          . “
          <article-title>Exponential smoothing: The state of the art”</article-title>
          .
          <source>In: Journal of Forecasting 4.1</source>
          (
          <year>1985</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Hill</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hengchen</surname>
          </string-name>
          . “
          <article-title>Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 34.4</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>825</fpage>
          -
          <lpage>843</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Worthey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Dubnicek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Capitanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kudeki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Downie</surname>
          </string-name>
          .
          <article-title>“The Gutenberg-HathiTrust parallel corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts”</article-title>
          .
          <source>In: iConference 2021 (Poster)</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Jockers</surname>
          </string-name>
          .
          <article-title>Macroanalysis: Digital methods and literary history</article-title>
          . University of Illinois Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kanjirangat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mitrovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Antonucci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          . “
          <article-title>SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces”</article-title>
          .
          <source>In: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          . Barcelona (online):
          <source>International Committee for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>214</fpage>
          -
          <lpage>221</lpage>
          . url: https://aclanthology.org/2020.semeval-1.26.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          . “
          <article-title>Adam: A method for stochastic optimization”</article-title>
          .
          <source>In: arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Labusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kulturbesitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Neudecker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zellhöfer</surname>
          </string-name>
          . “
          <article-title>BERT for Named Entity Recognition in Contemporary and Historical German”</article-title>
          .
          <source>In: Proceedings of the 15th Conference on Natural Language Processing</source>
          .
          <year>2019</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          , G. Chrupała,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          , and D. Hupkes, eds.
          <source>Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence</source>
          , Italy: Association for Computational Linguistics,
          <year>2019</year>
          . url: https://aclanthology.org/W19-4800.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Novak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pollak</surname>
          </string-name>
          . “
          <article-title>Leveraging contextual embeddings for detecting diachronic semantic shift”</article-title>
          .
          <source>In: arXiv preprint arXiv:1912.01072</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Aiden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Veres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Pickett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoiberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Clancy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Orwant</surname>
          </string-name>
          , et al. “
          <article-title>Quantitative analysis of culture using millions of digitized books”</article-title>
          .
          <source>In: Science 331.6014</source>
          (
          <year>2011</year>
          ), pp.
          <fpage>176</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T. T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jatowt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.-V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Neural machine translation with BERT for post-OCR error detection and correction”</article-title>
          .
          <source>In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries</source>
          .
          <year>2020</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Padma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manavalan</surname>
          </string-name>
          . “
          <article-title>Performance analysis for classification in balanced and unbalanced data set”</article-title>
          .
          <source>In: Proceedings of the 6th International Conference on Industrial and Information Systems</source>
          . IEEE.
          <year>2011</year>
          , pp.
          <fpage>300</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Palachy</surname>
          </string-name>
          . Document Embedding Techniques.
          <year>2019</year>
          . url: https://towardsdatascience.com/document-embedding-techniques-fed3e7a6a25d#ecd3.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          , E. Clark, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          . “
          <article-title>Counterfactual Story Reasoning and Generation”</article-title>
          .
          <source>In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          .
          <source>Hong Kong</source>
          , China: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>5043</fpage>
          -
          <lpage>5053</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schlechtweg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hengchen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dubossarsky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Tahmasebi</surname>
          </string-name>
          . “
          <article-title>SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection”</article-title>
          .
          <source>In: Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          . Barcelona (online):
          <source>International Committee for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Bamman</surname>
          </string-name>
          . “
          <article-title>Literary Event Detection”</article-title>
          .
          <source>In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          . Florence, Italy: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>3623</fpage>
          -
          <lpage>3634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>van Strien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Ardanuy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Colavizza</surname>
          </string-name>
          . “
          <article-title>Assessing the Impact of OCR Quality on Downstream NLP Tasks</article-title>
          .”
          <source>In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence</source>
          ,
          <fpage>1</fpage>
          .
          <year>2020</year>
          , pp.
          <fpage>484</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>O.</given-names>
            <surname>Suissa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmalech</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhitomirsky-Geffet</surname>
          </string-name>
          .
          “
          <article-title>Text analysis using deep neural networks in digital humanities and information science</article-title>
          ”
          .
          <source>In: Journal of the Association for Information Science and Technology</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorova</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Colavizza</surname>
          </string-name>
          . “
          <article-title>Transfer Learning for Historical Corpora: An Assessment on Post-OCR Correction and Named Entity Recognition</article-title>
          ”
          .
          <source>In: Proceedings of the Workshop on Computational Humanities Research</source>
          .
          <year>2020</year>
          , pp.
          <fpage>310</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Traub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>van Ossenbruggen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Hardman</surname>
          </string-name>
          . “
          <article-title>Impact analysis of OCR quality on research tasks in digital archives</article-title>
          ”
          .
          <source>In: Proceedings of International Conference on Theory and Practice of Digital Libraries</source>
          . Springer.
          <year>2015</year>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Underwood</surname>
          </string-name>
          . “
          <article-title>Do humanists need BERT?</article-title>
          ”.
          <year>2019</year>
          . URL: https://tedunderwood.com/category/methodology/genre-comparison/.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Worsham</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalita</surname>
          </string-name>
          . “
          <article-title>Genre Identification and the Compositional Effect of Genre in Literature</article-title>
          ”.
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics</source>
          . Santa Fe, New Mexico, USA,
          <year>2018</year>
          , pp.
          <fpage>1963</fpage>
          -
          <lpage>1973</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>I. Z.</given-names>
            <surname>Yalniz</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          . “
          <article-title>A fast alignment scheme for automatic OCR evaluation of books</article-title>
          ”.
          <source>In: Proceedings of the 2011 International Conference on Document Analysis and Recognition</source>
          . IEEE.
          <year>2011</year>
          , pp.
          <fpage>754</fpage>
          -
          <lpage>758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          . “
          <article-title>An evaluation of text classification methods for literary study</article-title>
          ”.
          <source>In: Literary and Linguistic Computing 23.3</source>
          (
          <year>2008</year>
          ), pp.
          <fpage>327</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>