<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Page Embeddings: Extracting and Classifying Historical Documents with Generic Vector Representations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carsten Schnober</string-name>
          <email>c.schnober@esciencecenter.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renate Smit</string-name>
          <email>renate.smit@huygens.knaw.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manjusha Kuruppath</string-name>
          <email>manjusha.kuruppath@huygens.knaw.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kay Pepping</string-name>
          <email>kay.pepping@huygens.knaw.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leon van Wissen</string-name>
          <email>l.vanwissen@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lodewijk Petram</string-name>
          <email>lodewijk.petram@huygens.knaw.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Humanities, University of Amsterdam (UvA)</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Huygens Institute, Royal Netherlands Academy of Arts and Sciences (KNAW)</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Netherlands eScience Center</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>999</fpage>
      <lpage>1011</lpage>
      <abstract>
<p>We propose a neural network architecture designed to generate region and page embeddings for boundary detection and classification of documents within a large and heterogeneous historical archive. Our approach is versatile and can be applied to other tasks and datasets. This method enhances the accessibility of historical archives and promotes a more inclusive utilization of historical materials. From its founding in 1602 until its demise at the end of the eighteenth century, the VOC engaged in long-distance trade between Asia and Europe. Additionally, within Asia, it competed with local shippers and merchants, and attempted to assert its influence over a vast region surrounding the Indian Ocean, centered around modern-day Indonesia. Today, the company is renowned for its modern organizational structure and notorious for its brutal conduct, including active engagement in the slave trade. The company's bureaucracy required detailed reports of all activities in Asia. As a result, hundreds of thousands of documents (as shown in Figure 1) were drawn up in all the company's Asian outposts, copied in Batavia, bundled, and sent to the Netherlands, where they are now preserved in the National Archives.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Sequence Tagging</kwd>
        <kwd>Document Metadata Enhancement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Advances in HTR (Handwritten Text Recognition) technology have enabled the digitization of
handwritten texts that could previously only be read by humans, often after special
training. Since 2019, Transkribus [9] and Loghi [
        <xref ref-type="bibr" rid="ref11">12</xref>
        ] have been used to automatically transcribe
the contents of the VOC archives [
        <xref ref-type="bibr" rid="ref18">22, 11</xref>
        ].
      </p>
      <p>
        In order to make the contents of the archive even more accessible, the task at hand is to
identify the boundaries between the different documents in the archival inventories and to
classify them. This poses challenges in defining what a document is, assessing the reusability
of traditional finding aids, such as the one created by the TANAP project (Towards A New Age
of Partnership, 1999-2007) [
        <xref ref-type="bibr" rid="ref1 ref17">1, 21</xref>
        ], and creating a useful categorization for documents. These
tasks are hugely important because they promote a more inclusive use of archival materials.
Users no longer have to rely on existing indices, often created from the point of view of ruling
institutions, and in the case of the VOC archives, the colonizer. Our approach helps to make
certain kinds of documents, such as letters from local rulers which have never been indexed
individually, more findable.
      </p>
      <p>
        Both our source code and the data [
        <xref ref-type="bibr" rid="ref18">22</xref>
        ] are publicly available.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Model and Embeddings</title>
      <p>We present a stacked embedding model for vectorizing digitized scans of historical documents,
as has been done e.g. for combining text and images [10] or different models [23]. We use representations
of regions (region embeddings) as building blocks for vectorized page representations (page
embeddings).
TANAP description (in Dutch): https://www.historici.nl/resource/tanap-towards-a-new-age-of-partnership/
Source code: https://github.com/LAHTeR/document_segmentation/</p>
      <p>Region representations are generated from the output of the Loghi HTR system [12]. They
can be derived from other systems, and can be generalized to any comparable workflow.</p>
      <sec id="sec-2-1">
        <title>Region Representations</title>
        <p>Our HTR system works on the level of scans, each comprising one (default) or two pages. A
scan is divided into regions, with the following information elements:
• Text: the text lines extracted from a region.
• Type: one of ten possible categories such as ‘paragraph’, ‘page-number’, or
‘signature-mark’.</p>
        <p>• Coordinates: a list of two-dimensional coordinates that define the contour of a region.</p>
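The three information elements listed above can be captured in a minimal data structure. The following sketch is our illustration, not the Loghi output format: the field names mirror the paper's list, while the class name, the example region types, and the sample values are assumptions.

```python
from dataclasses import dataclass

# A subset of the ten region categories named in the text; the full set
# depends on the HTR system's layout segmentation model.
REGION_TYPES = {"paragraph", "page-number", "signature-mark"}

@dataclass
class Region:
    """One layout region on a scan, as delivered by the HTR workflow."""
    text: list[str]                      # extracted text lines
    type: str                            # layout category, e.g. 'paragraph'
    coordinates: list[tuple[int, int]]   # contour points (skipped downstream)

region = Region(
    text=["Aan den Edelen Heer Gouverneur"],
    type="paragraph",
    coordinates=[(0, 0), (100, 0), (100, 40), (0, 40)],
)
```

Only `text` and `type` feed into the embeddings below; `coordinates` is retained here solely because the HTR output contains it.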
        <p>With our primary task of document boundary detection (Section 3) in mind, we performed
a manual analysis of 80 random documents to understand which features indicate document
boundaries. We identified a few clear textual patterns that indicate document beginnings,
such as salutations, lists of attendees or addressees, or the explicit mention of the document
type in a page header. Document endings are often indicated by signatures, closing salutations,
etc. (Figure 2). In roughly a quarter of the investigated documents, no textual clues explicitly
signalling document boundaries could be identified.</p>
        <p>Visual clues were more dispersed across individual instances, e.g. the presence of page
numbers on a page or large initials (Figures 5, 6). None of those clues could be derived from a region
without its context. Therefore, we decided to use only the text and the type features from the
list above, while skipping the coordinates.</p>
        <p>
          The text is embedded through a language model. For the latter, we use a SentenceBERT
model [17] for Dutch, based on RobBERT-2022 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In our initial task (Section 3), we have
compared the SentenceBERT results to using GysBERT v2 [13], a standard BERT model [5] for
historic Dutch. As described in [5], we use the special [CLS] token to represent the text of a
region. Ultimately, we concatenate the region type and the text embeddings to form a region
embedding.
SentenceBERT model for Dutch: https://huggingface.co/NetherlandsForensicInstitute/robbert-2022-dutch-sentence-transformers
GysBERT-v2: https://huggingface.co/emanjavacas/GysBERT-v2
        </p>
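The concatenation of the type and text embeddings described above can be sketched as follows. This is a minimal illustration: we assume a one-hot encoding for the region type and substitute a deterministic placeholder for the SentenceBERT sentence embedding; the function names and dimensions are ours.

```python
import numpy as np

# Illustrative subset of the ten region types; order defines the one-hot index.
REGION_TYPES = ["paragraph", "page-number", "signature-mark"]

def embed_text(text: str, dim: int = 128) -> np.ndarray:
    """Placeholder for the SentenceBERT text embedding of a region."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(dim).astype(np.float32)

def region_embedding(text: str, region_type: str) -> np.ndarray:
    """Concatenate a one-hot type vector with the text embedding."""
    type_vec = np.zeros(len(REGION_TYPES), dtype=np.float32)
    type_vec[REGION_TYPES.index(region_type)] = 1.0
    return np.concatenate([type_vec, embed_text(text)])

vec = region_embedding("Aan den Edelen Heer", "paragraph")
```

The resulting vector has one slot per region type followed by the text-embedding dimensions, so downstream layers see both signals in a single input.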
        <p>The SentenceBERT model clearly outperforms the GysBERT v2 approach (Table 1), while
requiring significantly less memory.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Page Representations</title>
        <p>Each region embedding in a page is fed into a bi-directional LSTM layer [7, 20] and a linear
layer, which generates a vector representation of the entire page.</p>
        <p>The LSTM layer iterates over the region embeddings in the order specified by the HTR output.
There are, however, special layout arrangements including marginalia, columns, injections etc.
that make the choice of the reading order subject to interpretation and use case.</p>
        <p>The resulting page embeddings serve as input for document boundary identification and
document classification (Sections 3, 4).</p>
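The page-embedding step described above can be sketched as a small PyTorch module. The layer sizes follow the figures reported in Section 3 (region dimension 128 plus type slots, page dimension 64); the pooling choice, i.e. concatenating the final hidden state of each LSTM direction before the linear layer, is our assumption rather than a detail given in the text.

```python
import torch
import torch.nn as nn

class PageEmbedder(nn.Module):
    """Bi-directional LSTM over a page's region embeddings, reduced by a
    linear layer to a single page vector (a sketch of the described model)."""

    def __init__(self, region_dim: int = 131, page_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(region_dim, page_dim,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * page_dim, page_dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (n_regions, region_dim), in the HTR reading order
        _, (h_n, _) = self.lstm(regions.unsqueeze(0))
        # concatenate the last hidden state of the forward and backward pass
        pooled = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.proj(pooled).squeeze(0)

page_vec = PageEmbedder()(torch.randn(12, 131))
```

Because the LSTM consumes regions in the order emitted by the HTR output, the interpretation issues around marginalia and columns noted above directly affect this input sequence.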
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Document Boundary Detection</title>
      <p>
        Related work on this task is highly data-specific and has been tailored towards modern business
documents [
        <xref ref-type="bibr" rid="ref13 ref6">15, 6</xref>
        ].
      </p>
      <p>In our context, a document is defined as a sequence of pages with a begin page, an end
page, and ≥ 0 pages in between (INSIDE). Document lengths vary between a single page and
800 pages.</p>
      <p>
        An inventory comprises between 155 and 2655 pages, with an average of 885. It also
contains pages that are not part of a document, for instance empty pages, covers, or tables of
contents. This fits the established IOB schema for sequence tagging (INSIDE-OUTSIDE-BEGIN)
[
        <xref ref-type="bibr" rid="ref14">16</xref>
        ]. From our annotations, we additionally have markers for the END pages of each document.
      </p>
      <p>This page-based conceptualization fails to model more fine-grained cases in which a
document begins on the same page as another document ends, with up to three documents on one page in our
annotations. Given that there is no objectively correct order of regions (see
Section 2), annotating on the region level would require a drastically increased annotation effort with multiple
annotators, which makes the generation of meaningful amounts of training data practically
impossible.</p>
      <sec id="sec-3-1">
        <title>Training Data</title>
        <p>As a primary dataset, we have manually annotated all pages of 16 randomly selected complete
inventories from the VOC archives for the purpose of training a machine learning sequence
tagger.</p>
        <p>
          From a user’s perspective, detecting the document boundaries is the most important part of
the task, as they segment an inventory into usable units, i.e. documents. As indicated in
Sections 1 and 2, the definition of a document is inherently ambiguous and use case-dependent.
While meaningful from an archival perspective, the documents defined in the context of the
TANAP project [
          <xref ref-type="bibr" rid="ref17">21</xref>
          ] turned out to be too coarse-grained for the purpose of historical research, which
focusses on content rather than chronological or administrative document boundaries.
Therefore, the annotations made for this work follow a more fine-grained definition of documents.
        </p>
        <p>In order to augment our data with a secondary and a tertiary dataset, we have re-used two
large sets of annotations that were created for unrelated purposes. Instead of annotating entire
inventories, the annotators focussed on finding specific documents within inventories and marked
their respective boundaries. Because not all documents in an inventory were annotated, we
cannot make assumptions about un-annotated pages, hence there are no OUTSIDE pages
available from these annotations. In order to approximate a realistic context, we have added blank
OUTSIDE pages around those documents.</p>
        <p>Furthermore, a part of these additional data (the secondary dataset) was originally
annotated for research on a specific category of documents (Generale Missiven). These
documents happen to be extraordinarily long, follow specific conventions, and cover specific topics.
Initial experiments have shown that adding those hundreds of non-representative documents
results in low accuracy for identifying other types of documents. To prevent that skew, we
have used random sub-samples of the secondary and tertiary datasets, each equal in size to
our primary dataset. The union of these three datasets results in our total training
dataset.</p>
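The balancing step described above, sub-sampling the secondary and tertiary datasets down to the size of the primary one before merging, can be sketched with the standard library. The function and variable names are ours; the units being sampled (inventories vs. documents) are an assumption of this illustration.

```python
import random

def balanced_union(primary, secondary, tertiary, seed=42):
    """Sub-sample the secondary and tertiary datasets to the size of the
    primary dataset, then merge all three into one training set."""
    rng = random.Random(seed)        # fixed seed for a reproducible split
    k = len(primary)
    sampled_secondary = rng.sample(list(secondary), min(k, len(secondary)))
    sampled_tertiary = rng.sample(list(tertiary), min(k, len(tertiary)))
    return list(primary) + sampled_secondary + sampled_tertiary

# e.g. 16 primary inventories, larger secondary and tertiary collections
train = balanced_union(range(16), range(100), range(200))
```

With equal-sized contributions from each source, no single collection (such as the homogeneous Generale Missiven) can dominate the learned decision boundaries.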
        <p>In total, the annotated dataset we use for training and validation comprises 12,000 pages
from 48 inventories. Roughly 8,200 of them are INSIDE pages, 2,200 boundary pages, and 1,800
OUTSIDE pages.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Data Model</title>
        <p>The boundary pages include 1,000 plain BEGIN and END pages respectively, plus 200 that are
both: pages on which one or more documents end and another one starts. In an initial
experiment, we trained a classifier that explicitly models each of these as separate classes. It achieved
a precision of only 0.06 on these pages. Qualitative analysis quickly revealed that these pages
were hardly distinguishable from others in terms of content and context, which explains the
catastrophic performance.</p>
        <p>We adapted our data model to merge problematic categories without compromising too
much on usefulness. The result is a slight variation of the original IOB format that identifies
documents by identifying INSIDE, OUTSIDE and BOUNDARY pages, with the latter unifying
BEGIN and END pages. Conceptually, this workaround results in a schema that resembles IOB.</p>
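The merged label scheme can be expressed as a mapping from the annotated classes to the three output classes. The enum and the annotation strings below follow the paper's terminology; the exact encoding of pages where one document ends and another begins is our assumption.

```python
from enum import Enum

class PageLabel(Enum):
    OUTSIDE = 0
    BOUNDARY = 1   # unifies annotated BEGIN, END, and combined BEGIN+END pages
    INSIDE = 2

def merge_label(annotated: str) -> PageLabel:
    """Map a fine-grained page annotation to the merged IOB-like scheme."""
    if annotated in {"BEGIN", "END", "BEGIN+END"}:
        return PageLabel.BOUNDARY
    return PageLabel[annotated]   # OUTSIDE and INSIDE pass through unchanged

labels = [merge_label(a) for a in ["OUTSIDE", "BEGIN", "INSIDE", "END"]]
```

Collapsing BEGIN and END into one BOUNDARY class removes exactly the distinction that the initial experiment showed to be unlearnable, while still segmenting an inventory into documents.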
      </sec>
      <sec id="sec-3-3">
        <title>Machine Learning Model</title>
        <p>
          We use the page embeddings introduced in Section 2 as input to another bi-directional LSTM
layer [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and the page labels introduced above as objectives for optimizing the neural network
          weights. The cross-entropy loss [
          <xref ref-type="bibr" rid="ref4">14</xref>
          ] is weighted by inverse class frequencies to balance out
the skewed distribution of page classes in the dataset.
        </p>
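The inverse-class-frequency weighting described above can be sketched in PyTorch. The page counts come from the dataset description in Section 3.1; the normalization of the weights is our choice for this illustration.

```python
import torch
import torch.nn as nn

# Approximate page counts from the annotated dataset:
# OUTSIDE ~1,800, BOUNDARY ~2,200, INSIDE ~8,200.
counts = torch.tensor([1800.0, 2200.0, 8200.0])

# Inverse class frequencies, normalized to sum to one, as loss weights:
# the rarest class (OUTSIDE) receives the largest weight.
weights = counts.sum() / counts
weights = weights / weights.sum()

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(5, 3)               # 5 pages, 3 classes
targets = torch.tensor([2, 2, 1, 0, 2])  # mostly INSIDE, as in the data
loss = loss_fn(logits, targets)
```

Without such weighting, a model predicting INSIDE for every page would already minimize the unweighted loss reasonably well, since roughly two thirds of all pages are INSIDE.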
        <p>The output of the LSTM is passed through a standard linear layer and a softmax layer [3] to
determine the output class per page. Figure 3 provides a schematic illustration of the model.
The final output additionally passes through a set of simple heuristics to avoid impossible output
sequences such as INSIDE-OUTSIDE and OUTSIDE-INSIDE – there must always be a BOUNDARY
page to indicate a document beginning or ending. This heuristic approach is skipped during
training.</p>
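One possible form of such a repair heuristic is sketched below. The paper does not spell out its exact rules, so the specific repair, reinterpreting the switching page as the missing BOUNDARY, is our assumption.

```python
def repair_sequence(labels: list[str]) -> list[str]:
    """Insert a BOUNDARY wherever the predicted sequence switches directly
    between INSIDE and OUTSIDE, which the data model forbids. One possible
    heuristic, not necessarily the authors' exact rule."""
    repaired = list(labels)
    for i in range(1, len(repaired)):
        prev, cur = repaired[i - 1], repaired[i]
        if {prev, cur} == {"INSIDE", "OUTSIDE"}:
            # Reinterpret the switching page as the missing boundary page.
            repaired[i] = "BOUNDARY"
    return repaired

fixed = repair_sequence(["OUTSIDE", "INSIDE", "INSIDE", "OUTSIDE"])
# fixed == ["OUTSIDE", "BOUNDARY", "INSIDE", "BOUNDARY"]
```

Because the label set has only three classes, a handful of such rules covers all forbidden transitions, which is why a CRF layer is unnecessary here.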
        <p>In more complex sequence tagging tasks like Named Entity Recognition (NER),
state-of-the-art models often combine an LSTM layer with an additional Conditional Random Field (CRF) [8]
to model transition probabilities, rendering illogical sequences improbable. Our task, however,
does not have different classes per page type, resulting in much fewer possible sequences, so
that we can define constraints heuristically instead of applying a CRF.</p>
        <p>We have used region and page embedding sizes of 128 and 64 dimensions respectively. Both
bi-directional LSTM layers use 64 dimensions as well. We iteratively increased all these
configurations up to 512 dimensions per layer. Those changes did not lead to changes in the results,
so we use the smallest network architecture to minimize resource consumption. All results
are thus based on the 128/64 embedding and layer sizes.</p>
        <p>The lion’s share of the training time is consumed by the inference of the text embeddings.
Since we do not adapt the language model weights, we can cache the text embeddings during
the first training iteration, which enables us to run many training epochs within seconds. The
model performance converged after 9 to 10 epochs – roughly 5 minutes on a consumer laptop
– so we stopped the training after 50 epochs.</p>
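The caching trick described above works because the language model is frozen: the embedding of a given text never changes across epochs. A minimal sketch using the standard library (the real pipeline would call the SentenceBERT model where the placeholder computation stands):

```python
import functools

@functools.lru_cache(maxsize=None)
def text_embedding(text: str) -> tuple[float, ...]:
    """Embed a region's text with the frozen language model. Since the
    model weights are not fine-tuned, each unique text needs to be
    embedded only once across all training epochs."""
    # Placeholder for the expensive SentenceBERT forward pass.
    return tuple(float(ord(c)) for c in text[:4])

text_embedding("Batavia")   # first epoch: computed and cached
text_embedding("Batavia")   # later epochs: served from the cache
info = text_embedding.cache_info()
```

After the first epoch every lookup is a cache hit, so subsequent epochs only train the comparatively tiny LSTM and linear layers.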
      </sec>
      <sec id="sec-3-4">
        <title>Results</title>
        <p>We have evaluated our model by randomly sampling 80% of the three datasets for training and
20% for validation respectively. Table 1 shows the total results per page type and per dataset.</p>
        <p>The division illustrates a clear difference in performance per sub-dataset: while detecting
boundaries for the Generale Missiven dataset is very accurate, the other datasets contain less
homogeneous document types and consequently yield significantly lower results. This is
confirmed by initial experiments in which we trained a model only on the Generale Missiven dataset,
which achieved precision and recall scores close to 1.0 for all page types.</p>
        <p>Qualitative analyses indicate, not surprisingly, that the data samples for which the model
performs best use more standardized language, such as formulaic document beginnings and
endings.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Document Classification</title>
      <p>As another use case, we use document classification, which is the task of assigning a label to a
document, again defined as a sequence of ≥ 1 pages.</p>
      <p>
        The TANAP project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] developed a categorization schema comprising 14 document main
classes, each divided into 2 to 23 sub-classes, resulting in a total of 164 classes. These categories
mirrored the VOC’s focus on administrative aspects, e.g. distinguishing between letters sent
to the Netherlands or within Asia. However, the large number of fine-grained sub-categories
often led to overlapping and ambiguous categorizations, which imposes additional difficulties
for both human annotators and a machine learning system. Therefore, we developed another
categorization system with 27 classes that define the document type, roughly oriented on the
TANAP main classes. For instance:
• Resolution (Dutch: Resolutie)
• Letter (Dutch: Brief)
• Minutes (Dutch: Notulen)
• ...
      </p>
      <p>On top of these, we introduce the special Front Matter as a 28th document type to mark pages
that contain text but are not part of a document, for instance tables of contents.</p>
      <p>In the document classification task, the page embeddings introduced in Section 2 serve as
input for a neural network with a slightly different architecture than the one described in
Section 3. Instead of an entire inventory, the input to the bi-directional LSTM layer is now a
subset of pages that represents a document. The output of the LSTM is passed through a linear
layer and a softmax layer to generate a single label for the input.
We use the same datasets as in Section 3. Again, we have manually annotated all the documents
in the primary dataset with the respective document types. For the secondary dataset, the
document type has been pre-selected (Generale Missiven). For the tertiary dataset, we
had TANAP document categories available, which we mapped to our document type
categorization.</p>
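The classification architecture described above can be sketched as a PyTorch module, analogous to the boundary-detection model but producing one label per document instead of one per page. Dimensions are illustrative; the 28 classes correspond to the 27 document types plus Front Matter.

```python
import torch
import torch.nn as nn

class DocumentClassifier(nn.Module):
    """Bi-directional LSTM over a document's page embeddings, followed by
    a linear layer and softmax yielding a single label for the document
    (a sketch of the described architecture)."""

    def __init__(self, page_dim: int = 64, hidden: int = 64, n_classes: int = 28):
        super().__init__()
        self.lstm = nn.LSTM(page_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * hidden, n_classes)

    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        # pages: (n_pages, page_dim) for one document
        _, (h_n, _) = self.lstm(pages.unsqueeze(0))
        logits = self.linear(torch.cat([h_n[0], h_n[1]], dim=-1))
        return torch.softmax(logits, dim=-1).squeeze(0)

probs = DocumentClassifier()(torch.randn(7, 64))  # a 7-page document
```

The only structural difference from the boundary tagger is the output head: here the final hidden states summarize the whole page sequence into one class distribution, rather than emitting a label at every step.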
      <p>However, the distribution of classes in the dataset remains extremely skewed. Out
of the 711 documents that we have sampled for the training data – 6,000 pages in total – 261
are of the special Front Matter type. Among the remaining documents, there are 183 letters, 86
registers, and 41 lists, but only one each of the types Invoice and Memorandum. Some other
document types are not present at all. In order to get a dataset that is useful for representative
experiments, we need to put significant additional effort into annotations, focussing on the
underrepresented categories, and/or find a trade-off when refining our data model so that it
remains meaningful but makes the dataset machine-learnable.</p>
      <p>At this point, we cannot draw empirical conclusions due to an incomplete and skewed
dataset, but we take the results shown in Table 2 as an indication that our page embeddings
can be used for document classification and other tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Future Work</title>
      <p>We have presented a deep learning approach for extracting documents from a typical historical HTR’d
dataset and applied it to the specific, relevant task of document boundary detection. The
method is generalizable to outputs from other HTR systems, as well as more broadly to any
other related text representations, and to other tasks.</p>
      <p>A qualitative analysis of the results on a larger dataset is pending to give practical meaning to
the empirically measured precision and recall scores. Due to the ambiguous nature of document
boundary definitions, outputs that are not identical with our human annotation could either
be incorrect or represent an alternative correct interpretation.</p>
      <p>In order to perform a full evaluation, we have set up an evaluation sheet in which multiple
human annotators can evaluate the correctness of the results. While empirical results are still lacking, the
transparent access to individual results has led to important insights about capabilities and
constraints of quantitative approaches. Figure 4 exemplifies how we display the results per
page as logged in Weights &amp; Biases [2].</p>
      <p>The unified architecture for creating region and page embeddings opens the door to a variety
of tasks, in which these embeddings form the vectorized input and can be fine-tuned per task.
As shown, task-specific page embeddings can be used for page sequence tagging (Section 3)
and page sequence classification (Section 4).</p>
      <p>
        Other future applications include text quality estimation for targeted post-processing.
Previous approaches rely on a mix of human-crafted rules, language-specific dictionaries, and basic
machine learning [
        <xref ref-type="bibr" rid="ref16 ref8">8, 19</xref>
        ]. Page embeddings might make those language-specific rules and
resources unnecessary.
      </p>
      <p>Furthermore, our design using stacked neural network layers allows for increasing the
number of embedding levels to individual regions, lines, or even words. In our context, this becomes
relevant when segmenting page regions instead of entire pages.</p>
      <sec id="sec-5-1">
        <title>Work in Progress</title>
        <p>We want to re-iterate that many aspects of this work are work in progress. Given the
practical tasks at hand, however, they always will be so to some extent because task definitions,
requirements, and the corresponding data models depend on availability and distribution of
data, specific use cases, and interpretation.</p>
        <p>We see these dynamics as a given in settings in which computational methods are developed
and applied for humanities research that inherently contains a degree of interpretation. We find
it important to publish the methodology and implementation along with preliminary results
in order to provide a starting point for researchers facing similar, but different, challenges.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>LAHTeR is a project of the Netherlands eScience Center, funded under grant number
NLESC.SSIH.2022a.SSIH014.</p>
      <p>GLOBALISE is a project of the Huygens Institute with the International Institute of Social
History, the Digital Infrastructure Department of the KNAW Humanities Cluster, VU
University, the University of Amsterdam, and the Dutch National Archives. It is funded by the Dutch</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Balk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. V.</given-names>
            <surname>Dijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kortlang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gaastra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Niemeijer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Koenders</surname>
          </string-name>
          . “
          <article-title>The Archives of the Dutch East India Company (VOC) and the Local Institutions in Batavia (Jakarta)”</article-title>
          . In:
          <article-title>The Archives of the Dutch East India Company (VOC) and the Local Institutions in Batavia (Jakarta)</article-title>
          .
          <source>Brill</source>
          ,
          <year>2007</year>
          . url: https://brill.com/display/title/14721
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Biewald</surname>
          </string-name>
          .
          <source>Experiment Tracking with Weights and Biases</source>
          .
          <year>2020</year>
          . url: https://www.wandb.com/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          .
          <article-title>Pattern recognition and machine learning</article-title>
          .
          <source>Information science and statistics</source>
          . New York: Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Delobelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Winters</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Berendt</surname>
          </string-name>
          . “
          <article-title>RobBERT: a Dutch RoBERTa-based Language Model”</article-title>
          . In:
          <article-title>Findings of the Association for Computational Linguistics: EMNLP 2020</article-title>
          . Ed. by
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          . Online: Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>3255</fpage>
          -
          <lpage>3265</lpage>
          . doi: 10.18653/v1/2020.findings-emnlp.292. url: https://aclanthology.org/2020.findings-emnlp.292.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . “
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          . In:
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          . Minneapolis, Minnesota: Association for Computational Linguistics,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi: 10.18653/v1/N19-1423. url: https://aclanthology.org/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Samanta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          , and
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Alahmadi</surname>
          </string-name>
          .
          <article-title>“A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain”</article-title>
          .
          <source>In: IEEE Access</source>
          <volume>10</volume>
          (
          <year>2022</year>
          ), pp.
          <fpage>11341</fpage>
          -
          <lpage>11353</lpage>
          . doi:
          <volume>10</volume>
          .1109/access.
          <year>2022</year>
          .
          <volume>3144185</volume>
          . url: https://i eeexplore.ieee.org/document/9684474./
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>“Long Short-Term Memory”</article-title>
          .
          In:
          <source>Neural Computation 9.8</source>
          (
          <year>1997</year>
          ), pp.
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          . doi: 10.1162/neco.1997.9.8.1735. url: https://doi.org/10.1162/neco.1997.9.8.1735.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Bidirectional LSTM-CRF Models for Sequence Tagging</article-title>
          .
          <year>2015</year>. url: http://arxiv.org/abs/1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Colutto</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Hackl</surname></string-name>, and
          <string-name><given-names>G.</given-names> <surname>Mühlberger</surname></string-name>.
          <article-title>“Transkribus - A Service Platform for Transcription, Recognition and Retrieval of Historical Documents”</article-title>.
          In: <source>2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</source>. Vol. <volume>04</volume>. <year>2017</year>, pp. <fpage>19</fpage>–<lpage>24</lpage>. doi: 10.1109/icdar.2017.307. url: https://ieeexplore.ieee.org/abstract/document/8270253.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>S.</given-names> <surname>Katiyar</surname></string-name> and
          <string-name><given-names>S. K.</given-names> <surname>Borgohain</surname></string-name>.
          <article-title>Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation</article-title>.
          <year>2021</year>. doi: 10.48550/arXiv.2102.11237. url: http://arxiv.org/abs/2102.11237.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name><given-names>R.</given-names> <surname>van Koert</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Klut</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Koornstra</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Maas</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Peters</surname></string-name>.
          <article-title>“Loghi: An End-to-End Framework for Making Historical Documents Machine-Readable”</article-title>.
          In: <source>Document Analysis and Recognition - ICDAR 2024 Workshops</source>. Ed. by
          <string-name><given-names>H.</given-names> <surname>Mouchère</surname></string-name> and
          <string-name><given-names>A.</given-names> <surname>Zhu</surname></string-name>. Cham: Springer Nature Switzerland, <year>2024</year>, pp. <fpage>73</fpage>–<lpage>88</lpage>. doi: 10.1007/978-3-031-70645-5_6.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [13]
          <string-name><given-names>E.</given-names> <surname>Manjavacas Arevalo</surname></string-name> and
          <string-name><given-names>L.</given-names> <surname>Fonteyn</surname></string-name>.
          <article-title>“Non-Parametric Word Sense Disambiguation for Historical Languages”</article-title>.
          In: <source>Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities</source>. Taipei, Taiwan: Association for Computational Linguistics, <year>2022</year>, pp. <fpage>123</fpage>–<lpage>134</lpage>. url: https://aclanthology.org/2022.nlp4dh-1.16.
        </mixed-citation>
      </ref>
      <ref id="ref12a">
        <mixed-citation>
          [14]
          <string-name><given-names>A.</given-names> <surname>Mao</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Mohri</surname></string-name>, and
          <string-name><given-names>Y.</given-names> <surname>Zhong</surname></string-name>.
          <article-title>Cross-Entropy Loss Functions: Theoretical Analysis and Applications</article-title>.
          <year>2023</year>. url: http://arxiv.org/abs/2304.07288.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mungmeeprued</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehta</surname>
          </string-name>
          , and
          <string-name><given-names>A.</given-names> <surname>Lipani</surname></string-name>.
          <article-title>“Tab this folder of documents: page stream segmentation of business documents”</article-title>.
          In: <source>Proceedings of the 22nd ACM Symposium on Document Engineering</source>. San Jose, California: ACM, <year>2022</year>, pp. <fpage>1</fpage>–<lpage>10</lpage>. doi: 10.1145/3558100.3563852. url: https://dl.acm.org/doi/10.1145/3558100.3563852.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ramshaw</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          .
          <article-title>Text Chunking using Transformation-Based Learning</article-title>.
          <year>1995</year>. doi: 10.48550/arXiv.cmp-lg/9505040. url: http://arxiv.org/abs/cmp-lg/9505040.
        </mixed-citation>
      </ref>
      <ref id="ref14a">
        <mixed-citation>
          [17]
          <string-name><given-names>N.</given-names> <surname>Reimers</surname></string-name> and
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>.
          <article-title>“Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”</article-title>.
          In: <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>. Hong Kong, China: Association for Computational Linguistics, <year>2019</year>, pp. <fpage>3980</fpage>–<lpage>3990</lpage>. doi: 10.18653/v1/D19-1410. url: https://www.aclweb.org/anthology/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schneider</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maurer</surname>
          </string-name>
          .
          <article-title>“Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction”</article-title>.
          In: <source>Journal of Data Mining &amp; Digital Humanities</source> <year>2022</year>. Digital humanities in languages (<year>2022</year>). doi: 10.46298/jdmdh.8561. url: https://jdmdh.episciences.org/10239.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schnober</surname>
          </string-name>
          .
          <source>text_quality. Version 0.3.1</source>.
          <year>2023</year>. doi: 10.5281/zenodo.8189892. url: https://www.github.com/laHTeR/htr-quality-classifier.
        </mixed-citation>
      </ref>
      <ref id="ref16a">
        <mixed-citation>
          [20]
          <string-name><given-names>M.</given-names> <surname>Schuster</surname></string-name> and
          <string-name><given-names>K.</given-names> <surname>Paliwal</surname></string-name>.
          <article-title>“Bidirectional recurrent neural networks”</article-title>.
          In: <source>Signal Processing, IEEE Transactions on</source> <volume>45</volume> (<year>1997</year>), pp. <fpage>2673</fpage>–<lpage>2681</lpage>. doi: 10.1109/78.650093.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Smit</surname>
          </string-name>
          .
          <article-title>Reusing traditional finding aids for the GLOBALISE infrastructure</article-title>
          .
          <year>2024</year>. url: https://globalise.huygens.knaw.nl/from-abc-to-voc-volume-utilizing-traditional-findingaids-for-the-globalise-infrastructure/.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [22]
          <source>VOC transcriptions v2 - GLOBALISE. Version V1</source>.
          <year>2024</year>. doi: 10622/lvxsbw. url: https://hdl.handle.net/10622/LVXSBW.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>