=Paper=
{{Paper
|id=Vol-2723/long20
|storemode=property
|title=From Historical Newspapers to Machine-Readable Data: The Origami OCR Pipeline
|pdfUrl=https://ceur-ws.org/Vol-2723/long20.pdf
|volume=Vol-2723
|authors=Bernhard Liebl,Manuel Burghardt
|dblpUrl=https://dblp.org/rec/conf/chr/LieblB20
}}
==From Historical Newspapers to Machine-Readable Data: The Origami OCR Pipeline==
Bernhard Liebl, Manuel Burghardt
Computational Humanities Group, Leipzig University, Leipzig, Germany

CHR 2020: Workshop on Computational Humanities Research, November 18–20, 2020, Amsterdam, The Netherlands

'''Abstract.''' While historical newspapers have recently gained a lot of attention in the digital humanities, transforming them into machine-readable data by means of OCR poses some major challenges. In order to address these challenges, we have developed an end-to-end OCR pipeline named Origami. This pipeline is part of a current project on the digitization and quantitative analysis of the German newspaper "Berliner Börsen-Zeitung" (BBZ), from 1872 to 1931. The Origami pipeline reuses existing open source OCR components and on top offers a new configurable architecture for layout detection, a simple table recognition, a two-stage X-Y cut for reading order detection, and a new robust implementation for document dewarping. In this paper we describe the different stages of the workflow and discuss how they meet the above-mentioned challenges posed by historical newspapers.

'''Keywords:''' end-to-end OCR, historical newspapers, layout detection, deep neural networks

==1. Introduction==

In recent decades a large number of newspapers have been digitized (e.g. Google newspapers, https://news.google.com/newspapers; the Europeana collection of newspapers, https://www.europeana.eu/de/collections/topic/18-newspapers; ZEFYS, the Zeitungsinformationssystem der Staatsbibliothek zu Berlin, http://zefys.staatsbibliothek-berlin.de), providing access to a unique collection of historiographical data. This opens up a number of opportunities for historical research, but also entails many challenges; the opportunities and challenges of digitized newspapers were recently discussed in a workshop titled "Digitised newspapers – a new Eldorado for historians?" (https://impresso.github.io/eldorado/). A major technical challenge lies in the conversion of scanned newspapers into machine-readable data that can be processed and analyzed in a quantitative way by means of text mining techniques. This conversion is typically referred to as optical character recognition (OCR). However, OCR is a complex topic that involves a number of different processing steps; for an overview, see the workflows of the OCR-D project [3, 33, 34] or the OCR4all project [40]. These steps are particularly challenging in the case of historical newspapers, as they oftentimes have rather low paper quality, use various historical fonts, or require the recognition of complex page layouts in order to separate text from images and tables.

In this paper we present the Origami OCR pipeline, which was developed as part of a current project that aims at processing the Berliner Börsen-Zeitung (BBZ), a German newspaper with a focus on finance and economics, for the period 1872–1931. Our source code is publicly available on GitHub (https://github.com/poke1024/origami); as this is still partially a work in progress, we appreciate any feedback on Origami.
Origami tries to take into account the specific challenges of digitizing historical newspapers such as the BBZ. At its core, Origami is a new end-to-end (i.e. from document scan to PAGE XML [37]) pipeline that covers all essential aspects of a text digitization pipeline, such as (1) tools for generating and annotating ground truth, (2) dewarping, (3) segmentation (using state-of-the-art deep neural networks), (4) layout detection (through a configurable architecture offering heuristic layout rules), (5) separator-aware reading order detection, (6) line-level OCR, and (7) exporting to PAGE XML [37]. Origami uses existing technologies (e.g. Tesseract [45] for baseline detection and Calamari [52] for line-level OCR), but also contains some new modules previously not available as working implementations.

Apart from OCR-D and the closed source products ABBYY FineReader (https://www.abbyy.com) and Transkribus (https://transkribus.eu), we know of no other framework or pipeline currently available that implements all necessary steps to perform end-to-end OCR for historical newspapers. Origami is easy to set up and built for processing hundreds of thousands of documents. Employing a Calamari-based OCR with a custom model and without using post-correction, Origami achieves overall word error rates (WER) of around 2% on typical BBZ pages with complex layouts and both Antiqua and blackletter typefaces ("overall" meaning that correct segmentation and correct reading order are taken into account). Judging from publicly available data, this seems better than the performance obtainable with the best commercial systems currently available for newspapers in this time period [24, 9].

==2. End-to-end OCR Workflows for Historical Newspapers: State of the Art==

A review of related research shows that the digitization of historical newspapers is carried out in quite different ways and with quite different objectives [54, 51, 25]. As Wulfman et al. summarize, in practice there are often opposing goals "of emphasizing the quantity or the quality of digital editions" [54]. Along these lines, Smith & Cordell note that "[t]oo often, discussions of OCR quality in digital humanities (and adjacent fields) begin and end with frustration over its imperfections" [44]. Consequently, one option is to accept these imperfections from the outset and try to remedy them later, by means of post-processing (for an example see [55]). However, post-processing is not a viable option when the base quality of the OCR is too low. Naturally, this raises the question of what "too low" actually means, and the answer depends strongly on the specific application scenario [50, 14, 49, 21, 48]. The moving target of "good quality" is also strongly influenced by the available technology at a given time. In 2009, Tanner et al. reported that any text printed "pre-1900 will be fortunate to exceed 85% accuracy (15 in 100 characters wrong)" [48]. Ten years later, the best solutions for such documents achieve a character accuracy of above 99% [9].

The best way to digitize a corpus is associated with even more difficult decisions about software packages and processing workflows [40, 34]. The task of building a high-quality, end-to-end OCR pipeline that produces good results by leveraging recent advances in technology is quite complex. Although there are literally hundreds of OCR tools (see https://www.digitisation.eu/tools-resources/tools-for-text-digitisation/), the number of working, open source tools for building the core parts of high-quality OCR pipelines boils down to probably less than ten established modules. Unfortunately, all of those have their own limitations,
such as dependency on document layouts (e.g. no tables, only columns, no columns), issues with image resolution, or unsatisfying quality of pretrained font models [34, 40].

Further issues arise when trying to combine these single components to create an end-to-end OCR pipeline. To date, the OCR-D project [3, 33] seems to be the de facto standard when it comes to providing a coherent end-to-end OCR workflow. OCR-D has been gathering, combining and improving the best non-commercial OCR software components from the last decades and integrating them into the OCR-D framework since 2015. Nevertheless, we could not readily use this framework for our historical newspapers. On the one hand, this was due to the limitations of single components (see above). On the other hand, there were problems because of the lack of flexibility to adapt individual components in the workflow or to add new ones. Some practical issues we encountered when trying to use the OCR-D workflow for our historical newspaper include the following (we would like to emphasize that these problems could only be observed in the context of the digitization of the BBZ; the applicability of OCR-D for many other scenarios – especially the digitization of the Union Catalogue of Books of the 16th–18th century (VD16, VD17, VD18), for which OCR-D was originally designed – is beyond question):

• Pretrained font models. Although Calamari and OCR-D Calamari offer good OCR models for reading 19th-century Fraktur fonts, no pretrained models exist that recognize both Antiqua and German Fraktur. (Also, the models are only trained for binarized input, which is a strategy we find problematic, as our results in Section 3.6 demonstrate.) Additional stages that filter regions by font type need to be employed.

• Dewarping. OCR-D offers the only modern (DNN-based) dewarping implementation at the page level (i.e. dewarping a whole page vs. dewarping single smaller regions). Unfortunately, this implementation currently seems to break down with higher resolutions (see Section 3.3).

• Layout detection. OCR-D offers a curated set of segmentation and layout algorithms, including what seems to be the only public and currently maintained layout detection from ICDAR's 2013 "Competition on Historical Newspaper Layout Analysis" [2]. However, this solution does not seem to be based on the current technical state of the art for image segmentation, namely DNNs [28, 3]. The same is true for other layout detection modules currently offered by OCR-D (i.e. those based on Tesseract and CIS OCR-D, https://github.com/cisocrgroup/ocrd_cis). In general, document types that are not simple multi-column layouts currently pose a huge issue for many modules. For example, as described in its documentation, the CIS OCR-D segmentation module will usually fail if there are any tables in the document. On the other hand, while there have been recent advances in DNN-based table detection such as PubTabNet [56] and TableBank [27], publicly available implementations of these approaches are scarce.

• Monolithic modules. While the OCR-D workflow as a whole is highly configurable and modular, single OCR-D modules are, for historical reasons, rather monolithic and do not allow for easy extension, for instance by means of an API. For the case of the BBZ this proved problematic, as – just as one example – we wanted to extend the X-Y cut functionality to detect separator information, which is valuable for determining reading orders.
However, X-Y cut implementations available through OCR-D are hard-coded and do not allow for custom scoring functions (as, for example, in [31]).

Besides these practical issues with the generic OCR-D framework, we also encountered limitations with specific workflows for the digitization of historical newspapers as they are reported by several projects and institutions: Jerele et al., who digitized historical Slovenian newspapers with ABBYY FineReader (in addition, Aletheia [12] was used for ground truth production), report issues with segmentation as being the "second major error inducer" (the first being "bad condition of newspaper originals") [22]. The challenge of segmentation is also underlined by Dannélls et al. [14], who used a combination of ABBYY FineReader and Tesseract to digitize the Swedish newspaper corpus KubHist. The BBZ corpus we work with was also originally processed using ABBYY FineReader 11. Probably also as a result of segmentation issues, we found typical word error rates above 10%.

A very recent (May 2020) project for digitizing historical German-Brazilian Newspapers (GBN, https://github.com/sulzbals/gbn) is based on a modified OCR-D workflow and is similar to our project in various aspects. For example, GBN seems to use similar region merge operations as Origami's layout stage (see Section 3.4). However, the overall scope of the project is quite different. Although GBN offers segmentation of documents into regions and lines using a DNN (the latter is achieved through a repurposing and extension of the DNN-based SBB textline detector from Q4/2019, available at https://github.com/qurator-spk/sbb_textline_detection, and therefore partially overlaps with our BBZ-specific DNN work from this period [28]), GBN's segmentation seems to generate purely binary labels (i.e. it only differentiates text from background), and – in contrast to Origami – does not know about other content such as tables, illustrations or separators. GBN is thus neither aware of separators, nor does it attempt to detect tables or reading order. Furthermore, GBN offers deskewing, but no dewarping.

Most recently, the Impresso project (https://impresso-project.ch/), a large project working on making 200 years of historical newspapers digitally available through new, innovative interfaces, used Transkribus HTR for their OCR stage [46], and – as far as we understand it – for the whole OCR workflow (i.e. ground truth annotation, page segmentation and so on). While the results from Impresso are very promising [46], we decided against using Transkribus for our project for two reasons. First, neither the exact software nor the performance of Transkribus is very well documented [40]. Reul et al. note: "to the best of our knowledge the exact state of the software actually incorporated in Transkribus is not publicly known" [40]. Transkribus therefore seems like a black box, which contradicts our goal of open sourcing as many components of our work in this project as possible (e.g. the final line-level OCR models). Second, the closed source approach (Transkribus is run as a "fee-based cooperative" [40]) usually prevents projects from "running the advanced recognition tools on their own hardware" [40]. Digitizing a large scanned corpus (a total of 642,480 pages in our case) on third-party services did not seem an appealing option for our project.

All in all, current end-to-end OCR workflows are not appropriate when it comes to the digitization of historical newspapers. Specific approaches for historical newspapers also have some severe limitations that are mostly connected to poor segmentation of the complex page layouts.
To address these existing challenges and limitations, we present Origami as an open source OCR pipeline that is optimized for the digitization of historical newspapers. Whenever possible, we use working open source solutions (e.g. line-level OCR, baseline detection) and experiment with new approaches where we found the results of current open source implementations not adequate for our task; the latter include dewarping, layout and reading order detection, binarization, and the available Fraktur models.

==3. The Origami OCR Pipeline==

Our OCR pipeline currently consists of nine subsequent stages that are shown in Table 1. Each stage runs as a batch and saves its data independently as files. The first two stages divide the page into different regions (e.g. text vs. tables). Stages 3 and 4 then produce a deskewed and dewarped page, where text lines and borders are set straight. Stage 5 refines the regions found in stages 1 and 2 using heuristic rules. Stages 6, 7 and 8 detect lines, determine reading order, and perform the OCR for each line. Stage 9 produces a final output file that contains all detected information. In the following sections we describe all stages in more detail.

Table 1: Processing stages of the Origami pipeline.

stage | name     | description
1     | segment  | separates text, tables, and illustrations (pixel by pixel)
2     | contours | finds and simplifies polygonal forms for regions and separators
3     | flow     | estimates skew and warp at various points of the page
4     | dewarp   | computes a global dewarping transformation
5     | layout   | merges and splits regions using heuristic rules
6     | lines    | detects baselines for all regions
7     | order    | determines reading order using X-Y cuts
8     | ocr      | runs OCR on line level using Calamari
9     | compose  | combines all data into one final document

At first glance, this workflow seems similar to established OCR workflows like the one used in OCR-D. There is one important difference though. Most OCR workflows operate on sets of images. For example, as shown in Figure 2, in OCR-D workflows, page images are usually binarized and cropped before being split into smaller region and line images. The final OCR then uses these line images. One big advantage of this approach is that batch interfaces and data formats are easy to understand, as they are modularized and clean: there are only images in each step. A considerable disadvantage of this approach, however, is that context information is lost. For example, a batch operating on a line image will have no way of knowing where this line originally came from or what its relation to other lines or regions on the page was. Yet this information can be useful in further steps like layout detection.

[Figure 2: Typical OCR-D workflow (simplified): the page image is binarized, cropped and dewarped, then split into region images (segmentation, deskewing) and further into line images, on which the OCR produces line texts.]

One of Origami's main contributions is experimenting with a different, more complex approach in terms of data management. Instead of producing new images with each step, Origami batches write the knowledge gained about the page into a set of custom, but well-defined file formats (details of the file formats are documented at https://github.com/poke1024/origami/blob/master/docs/formats.md). Figure 1 shows the whole pipeline: solid black boxes represent the batches from Table 1, framed non-solid boxes are data files (called artifacts in Origami), and arrows indicate reading and writing of artifacts.

[Figure 1: Origami's internal data flow. Filled black boxes are the stages from Table 1, triggered by the user; framed non-solid boxes are data artifacts (pixelwise segmentation, region and line contours, warp estimation, dewarping grid, table separators, reading order, OCR-ed line text), stored as zip or json files; arrows indicate reading and writing of artifacts, ending in the final TXT and PAGE XML output.]

For example, the segment stage writes a zip file that contains png files with the pixelwise page segmentations. Similarly, the contours stage writes a zip file that contains descriptions of polygonal regions. Subsequent stages read these files and produce new data based on them. The original page image remains untouched during the entire process, i.e. it is not split or subdivided into new image files. In mathematical terms, we store the parameters of a function composition, but not the intermediary results themselves. A minimal toy sketch of this idea follows below.
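To make the artifact idea concrete, the following toy sketch mimics two Origami-style batch stages. All names, structures and the use of JSON files here are illustrative stand-ins for Origami's actual zip-based formats (documented at the link above), not the real implementation:

```python
# Hypothetical sketch of Origami's data management idea: each batch stage
# reads earlier artifacts and writes new ones; the page scan itself is never
# rewritten or subdivided into new image files.
import json
import tempfile
from pathlib import Path

def stage_segment(work: Path) -> None:
    # in reality: pixelwise DNN output; here: one hand-made TEXT region
    regions = [{"label": "TEXT", "points": [[0, 0], [100, 0], [100, 50], [0, 50]]}]
    (work / "segmentation.json").write_text(json.dumps({"regions": regions}))

def stage_contours(work: Path) -> None:
    # reads the segmentation artifact, writes (here: trivially) simplified polygons
    seg = json.loads((work / "segmentation.json").read_text())
    (work / "contours.json").write_text(json.dumps({"polygons": seg["regions"]}))

work = Path(tempfile.mkdtemp())
stage_segment(work)
stage_contours(work)
print(sorted(p.name for p in work.iterdir()))  # the accumulated artifacts
```

The point is that each stage only accumulates knowledge about the page as parameters; pixels are sampled from the untouched scan only once, at OCR time.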
===3.1. Segment (stage 1)===

Stage 1 uses a custom-trained deep neural network (DNN) to determine different types of regions and separators on a pixel-by-pixel basis. For example, a region predictor classifies regions using the labels TEXT (i.e. text), TABULAR (tables and tabular multi-column structures inside body text), ILLUSTRATION (borders, images and other illustrations) and BACKGROUND (background including dirt and frame borders). Similarly, a separator predictor differentiates H (horizontal), V (vertical), and T (table column) separators.

The specific setup we use and describe here follows the optimal configuration found in a previous evaluation study [28]. In contrast to recent hybrid approaches like Barman et al. [5] – who combine a segmentation DNN with separate OCR information to produce segmentation data – we use a single, but complex DNN that is trained to detect all relevant classes (we expect this network to learn some OCR features as well). In contrast to dhSegment [35] – which uses a variant of ResNet-50 – we use the larger Inception-ResNet-v2 [6] for region segmentation (trained with categorical cross-entropy loss) [28]. For separator detection, we use EfficientNet-B2 (trained with generalized dice loss) [28]. Both our models employ Feature Pyramid Networks (FPNs) as a segmentation architecture.

Before passing data to the DNN, we rescale pages to a total resolution of 1280 × 2400 pixels, then run each network on three overlapping vertical tiles (each having a resolution of 1280 × 896 pixels). These resolutions are rather high compared to standard dhSegment workflows [35]. We showed that this configuration reaches a pixel accuracy of 98.86% for regions and 99.79% for separators [28]. For each task, we use confidence voting over models trained from five folds. Figure 3 shows the output of the neural networks for a typical page from the BBZ. Among other classes, our DNN differentiates between plain vertical separators that occur between regions of body text (V), and vertical lines that delimit columns in tables (T).

[Figure 3: A sample page from the Berliner Börsen-Zeitung: (a) the input document, a scan of the BBZ from January 27, 1872, Morgenausgabe, front page; (b) predictions for the region labels TEXT (blue), TABULAR (orange) and ILLUSTRATION (green); (c) horizontal separators H (red), vertical text separators V (green) and table column separators T (blue), all automatically inferred from the scanned image by our DNN in stage 1.]

===3.2. Contours (stage 2)===

In this stage, we convert the pixel-based segmentation data from the previous step into polygonal forms and simplify them (this process is sometimes also referred to as vectorization). We do this for two reasons: (1) to increase the performance of subsequent pipeline stages, since processing simplified vector shapes is often much faster than looking at a high number of pixels; (2) to be able to use established geometric algorithms in later stages of the pipeline (e.g. Voronoi diagrams), many of which are usually formulated and implemented in terms of polygonal data. Having found the contours, this step also employs a simple heuristic to remove textual noise: we remove all contours that are narrower than a pre-defined threshold and lie on the left or right border. A rough sketch of such a vectorization step is shown below.
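As an illustration of region vectorization (a sketch only; Origami's actual parameters, simplification tolerance and geometry handling may differ), OpenCV can trace contours from one label channel and simplify them:

```python
# Hedged sketch of the contours stage's core idea, not Origami's actual code:
# convert a pixelwise label mask into simplified polygons with OpenCV.
import cv2
import numpy as np

mask = np.zeros((2400, 1280), np.uint8)              # stand-in for a TEXT channel
cv2.rectangle(mask, (100, 200), (600, 900), 255, -1)  # one toy region

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
polygons = []
for c in contours:
    poly = cv2.approxPolyDP(c, 2.0, True)   # vectorize and simplify the outline
    x, y, w, h = cv2.boundingRect(poly)
    # toy version of the noise heuristic: drop narrow shapes hugging a border
    if w < 20 and (x < 50 or x + w > mask.shape[1] - 50):
        continue
    polygons.append(poly.reshape(-1, 2))
print(len(polygons), "region polygon(s)")
```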
The determination of region contours is a standard problem of computer vision [47]; we use OpenCV [7] for this task. Extracting exact inner polylines from thick separator pixel groups, on the other hand, poses a more difficult problem, and good algorithms are hard to find. Though this might seem a marginal problem at first, the same challenge occurs when extracting baselines from pixel-based DNN output. Implementations for the latter task can get complex, as, for example, the implementation of the OCR-D baseline detector (https://github.com/qurator-spk/sbb_textline_detection) shows. Note also that we do not want to simply estimate the separators as straight lines by using principal component analysis (PCA) or Hough transforms [7], since separator lines might be curved back and forth, and we will use that curvature later for warping estimation.

One approach that seems to work very well is to find the contours and then compute a thinning transform like the medial axis or the straight skeleton [1]. The latter are well understood and common algorithms of computational geometry. Unfortunately, computing skeletons in traditional geometry libraries such as CGAL [10] is magnitudes too slow for our use case with highly detailed polygons (we ported CGAL's skeleton functionality to scikit-geometry for this project). We therefore implemented a much faster discrete (i.e. image-based) skeleton based on the algorithm in [15]. We extend this algorithm to extract detailed paths between detected nodes. Based on the known separator orientation, we then convert the extracted network structure into a directed acyclic graph in order to be able to efficiently find the longest path in terms of the euclidean distance [42] (see the sketch below). Our implementation is rather simple (about 200 lines of code), fast (we use Numba [26] for just-in-time compilation) and robust. With this modification, our contours step runs in less than one second for a typical page.

Combining this polyline extraction with a pixelwise baseline detection DNN in a two-stage process would yield a design similar to Grüning et al. [19] or the SBB textline detector. This approach could replace Tesseract as the baseline extraction technology in stages 3 and 6.
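The following minimal sketch illustrates the longest-path idea on a toy skeleton graph. It uses networkx for brevity; Origami's actual implementation is a compact Numba-compiled routine, so this is an analogy, not the real code:

```python
# Orient skeleton edges along the separator's known direction to obtain a DAG,
# then take the longest path by euclidean edge length as the separator polyline.
import math
import networkx as nx

nodes = {0: (10, 5), 1: (300, 8), 2: (620, 4), 3: (300, 60)}  # (x, y) samples
edges = [(0, 1), (1, 2), (1, 3)]                              # skeleton edges

g = nx.DiGraph()
for a, b in edges:
    # horizontal separator: orient every edge left-to-right to obtain a DAG
    a, b = (a, b) if nodes[a][0] <= nodes[b][0] else (b, a)
    g.add_edge(a, b, weight=math.dist(nodes[a], nodes[b]))

path = nx.dag_longest_path(g, weight="weight")
print([nodes[i] for i in path])  # polyline of the longest separator path
```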
===3.3. Flow and dewarp (stages 3 & 4)===

This section covers both stages 3 and 4, since they are conceptually interwoven. The reason we split them in two in our architecture is that we want to emphasize that our design does not treat them as one monolithic stage, which is a common practice in many other systems. For example, it would be rather easy to replace stage 3 with a different implementation (e.g. a DNN-based warping detection) in the future.

There are numerous approaches to dewarping (for an overview, see Chapter 8 in [8] and [17]), but in practice, very few implementations are actually available. A number of recent approaches use deep neural networks (DNNs). For example, OCR-D's official page dewarping module ocrd-anybaseocr-dewarp (the module originates from the DFKI's anyOCR package, see https://github.com/OCR-D/ocrd_anybaseocr) is based on the pix2pixHD GAN [4, 9, 34]. DocUNet, a different recent DNN approach, is based on U-Nets [30]. While we found that the ocrd-anybaseocr-dewarp module works quite well for low-resolution documents, we were not able to apply it to our use case, as we received pixel clutter when running it on the full page resolution (we used binarized input, as the module documentation states [34]). As shown in Figure 4, dewarping the original image (a) is achieved, but the output is not readable anymore (b). Only when we reduced the input to a low-resolution detail region of the page did we obtain readable output (c) (though still of somewhat worse quality than the binarized input).

[Figure 4: Results of running various dewarping implementations on a warped sample page: (a) detail from page 10 from August 22, 1888 (the warped original); (b) ocrd-anybaseocr-dewarp run on the full page (detail); (c) ocrd-anybaseocr-dewarp run on a small region; (d) Origami's dewarping run on the full page (detail).]

For the digitization of newspapers, layout analysis and reading order analysis strongly benefit from a fully dewarped page, so the reported issues with ocrd-anybaseocr-dewarp disqualified it from being used in our pipeline. Another reason to abandon ocrd-anybaseocr-dewarp is that it was trained to expect and generate binarized images, which would conflict with Origami's non-binarized line OCR model (we use Origami with a line OCR model that relies on non-binarized input, as we obtained lower character error rates than with binarized input in previous experiments; also see Section 3.6). As an alternative, we looked into dewarping with more traditional (i.e. non-DNN) approaches. However, there seem to be only two working open source implementations available: OCR-D's ocrd-cis-ocropy-dewarp and Leptonica (http://www.leptonica.org/dewarping.html). Both employ rather simple warping detection heuristics that are not well suited for multi-column pages and complex layouts. Neither can therefore be expected to produce good output when applied at the region or page level [34]. To summarize, none of the available implementations are suitable for dewarping high-resolution grayscale images with complex layouts, nor can they be extended or modularized to add this missing functionality.
Consequently, we decided to implement a dewarping step for the page level from scratch. It can take any form of warping information – given as a number of points and flow lines on the page – and generate a dewarped page, thereby separating the inner workings of the warping estimation (for example via some form of external baseline detection algorithm) from the actual dewarping calculation. In contrast to a neural network, the process is highly transparent from the algorithmic perspective and also scales to any resolution and color depth. A sample output of our approach can be seen in Figure 4 (d). Our approach is described in more detail in the following passage.

After first experimenting with constrained cylinder models, such as in [53], we settled on the versatile vector field model of Schneider et al. [41] (somewhat similar grid-based approaches have been proposed by others, for example [39]). Instead of the baseline detection described by Schneider et al., we use Tesseract's baseline detection, which we run for each of the detected regions from our previous step. As we only present one homogeneous, single-column text region at a time, this works very well. (Unfortunately, Tesseract's official API only provides straight baselines at the moment, and our approach – and the remaining warping visible in Figure 5 – could be improved by either accessing Tesseract's internal spline baselines or using a different, e.g. DNN-based, baseline estimation altogether.) In addition to baselines, we extract direction information from the vectorized separator paths from the previous stage. Figure 5 shows an example of this process.

[Figure 5: Sampling page warp from baselines and separator directions. Orange arrows show the baselines' up vectors. Red, blue and yellow lines show various separator types and their extracted directions.]

In a next step, all sample points (i.e. points and their derived flow directions) are used to build two vector fields that model page flow lines in horizontal and vertical direction respectively. As proposed in [41], we use a Delaunay triangulation to interpolate all missing values. Since we work with fewer samples (we generate one sample per line), we also need to extrapolate outside of the convex hull of the sample points (this does not seem to be necessary in [41]). This turns out to be an important but complex technical detail, necessary for good results with our current setup. In a final step, we build a combined dewarping grid that models discrete intersections of the flows of both vector fields (instead of processing them one after another, as suggested in [41]). We also sample and modify grid borders to ensure that no content is dewarped outside the final page frame. The dewarping representation resulting from this step is stored as a grid of point locations. A much simplified sketch of the interpolation and remapping is shown below.
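The following sketch only illustrates the interpolation and extrapolation idea; scipy's griddata (which triangulates the samples with Delaunay internally) and the toy displacement values are assumptions for illustration, not Origami's actual code or data:

```python
# Simplified sketch after the vector field model of Schneider et al. [41]:
# interpolate sparse warp samples over the page, extrapolate outside the
# convex hull with nearest neighbors, then resample the page in one pass.
import numpy as np
import cv2
from scipy.interpolate import griddata

h, w = 600, 400
# sample points on baselines with measured vertical displacement (toy values)
samples = np.array([[50, 100], [350, 120], [50, 500], [350, 480]], float)
dy = np.array([0.0, 20.0, 0.0, -20.0])   # how far each sample is warped down

ys, xs = np.mgrid[0:h, 0:w]
field = griddata(samples, dy, (xs, ys), method="linear")     # Delaunay-based
nearest = griddata(samples, dy, (xs, ys), method="nearest")
field = np.where(np.isnan(field), nearest, field)            # extrapolation

map_x = xs.astype(np.float32)
map_y = (ys + field).astype(np.float32)   # sample from the warped location
page = np.full((h, w), 255, np.uint8)     # stand-in for the page image
dewarped = cv2.remap(page, map_x, map_y, cv2.INTER_LINEAR)
```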
===3.4. Layout and lines (stages 5 & 6)===

This rule-based stage attempts to fix typical segmentation errors (for an overview, see [2, 43]) by merging and splitting regions. As illustrated in Figure 6, some regions that got connected by the DNN in stage 1 should not be connected, while other regions that did not get connected should actually get merged.

[Figure 6: Two examples of typical segmentation errors that the layout stage tries to fix with heuristic rules: (a) undersegmentation – two table columns that should be separate regions, not one; (b) oversegmentation – a table (yellow) that should consist of one region, not two. Green lines in (b) are detected table separators T.]

This stage addresses these issues through a configurable pipeline of fast geometric operators. In terms of mathematical morphology, these operators can be likened to a series of highly selective dilations and erosions. The name "layout" for this stage is a bit of a misnomer, since the layout detection is actually spread over various stages (starting at stage 1) – yet this is the stage that finally commits to an overall region layout. Although similar operations have been implemented in many segmentation solutions through binary morphological operations (like dilation and erosion) on a pixel level [2, 11], we are not aware of a description of this specific framework of selective polygonal operators for this use case. Details of the most useful operators are presented in the following.

'''Definitions.''' We first define some useful symbols. Note that we use the terms shape and region interchangeably. H(A) is a hull operator on a shape A, as described below; Λ(A) refers to the number of baselines detected in a region A; d(A, B) is the shortest distance between A and B; and x₀(A) (y₀(A)) and x₁(A) (y₁(A)) refer to the minimum and maximum x (y) coordinates of the region. We denote the area of region A as |A| and the length of the interval [u₀, u₁] as |u|. We refer to heuristic parameters as α or, for multiple parameters, as α₁, ..., αₙ (these parameters are always local to the operator described). All configured parameters relating to lengths or areas on the page are handled independently of resolution by scaling them according to the page size. Finally, we define a measure of overlap

OVL(A, B) = |A ∩ B| / min(|A|, |B|)

for shapes or intervals A, B, and a measure of cohesion

COH(S) = (∑_{x∈S} |x|) / |H(∪S)|

for some set of regions S.

'''The hull operator.''' This operator extends a geometric region to some form of hull that is defined by additional parameters. It never grows the region beyond its axis-aligned bounding box (AABB). As shown in Figure 7, Origami supports expanding a shape to a concave hull (the latter is defined by two concavity parameters, see [36]), the convex hull, or a rectangular form (the AABB). The rationale for this step is to find a better approximation of an area that might only have been partially detected by the DNN. For example, if we know we are dealing with a Manhattan layout, we might benefit from finding overlaps in rectangular shapes. Note that we also apply these hulls when merging shapes in the other operators. For the BBZ, we use convex hull operators.

[Figure 7: Hull operators supported in Origami: base form (a), concave hull (b), convex hull (c), and rectangular form (d).]

'''The overlap merge operator.''' This operator merges two regions A, B of the same region type, e.g. TEXT, if OVL(A, B) ≥ α, i.e. if the green area in the leftmost example in Figure 8 (a) gets too big in comparison to the unmerged regions. The merged region is given as H(A ∪ B). A hedged illustration of OVL, COH and this merge is given below.
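The following sketch restates OVL, COH and the overlap merge using shapely (the geometry library and the α value are assumptions for illustration, not Origami's implementation):

```python
# OVL(A, B) = |A ∩ B| / min(|A|, |B|); COH(S) = (sum of |x| for x in S) / |H(∪S)|,
# here with H configured as the convex hull, as used for the BBZ.
from shapely.geometry import box
from shapely.ops import unary_union

def ovl(a, b):
    return a.intersection(b).area / min(a.area, b.area)

def coh(shapes):
    return sum(s.area for s in shapes) / unary_union(shapes).convex_hull.area

a, b = box(0, 0, 100, 50), box(80, 0, 180, 50)   # two overlapping TEXT regions
alpha = 0.15
if ovl(a, b) >= alpha:
    merged = unary_union([a, b]).convex_hull      # the merged region H(A ∪ B)
    print(round(coh([a, b]), 3), merged.bounds)
```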
'''The adjacency merge operator.''' This operator merges regions of the same type (e.g. TEXT) that are near to each other, subject to additional constraints. We use this to extend regions that are vertically aligned to the left or right (e.g. for finding header texts), and regions that are horizontally aligned to the bottom or right (e.g. for merging regions of the same table). Specifically, we merge A and B if (1) the shapes fulfill a specified horizontal (or vertical) alignment, e.g. OVL([x₀(A), x₁(A)], [x₀(B), x₁(B)]) > α₁, (2) COH(A ∪ B) > α₂, (3) d(A, B) < α₃, and (4) max(Λ(A), Λ(B)) ≤ α₄ (see Figure 8). Additionally, we check if a newly merged region overlaps with other region types or with separators; if either of the two overlaps is above a certain threshold, we do not merge. To determine adjacency efficiently, we build a segment-based Voronoi diagram from the region contours.

[Figure 8: (a) Overlap merge (left) and adjacency merge (right), when H (the hull operator) is configured to form the convex hull (the additional area added by applying H is indicated in darker gray in the merged forms at the bottom). (b) The four conditions for merging adjacent regions: (1) vertical alignment (indicated as a green area), (2) cohesion, a measure of the excess area we would add when actually merging the regions (indicated in green), (3) shortest distance between the regions (green arrow), (4) number of text lines in the involved regions.]

'''The sequential merge operator.''' Sequential merge is similar to adjacency merge, but compares longer runs of regions along a reading order determined by an X-Y cut on the regions (see Section 3.5). This has proven useful to merge cluttered table regions as shown in Figure 6 (b). We greedily look for runs of regions S with COH(S) > α₁ and only stop extending a run if we encounter COH(S) < α₂, in which case we start a new run. As with adjacency merge, we only merge if d(A, B) < α₃ and no overlaps with other region types or separators exist.

'''The fix-spillover operator.''' This operator fixes undersegmentation, i.e. cases where the DNN recognized one region where it should have recognized separate regions, as shown in Figure 6 (a). It is implemented as a morphological analysis followed by a split. As shown in Figure 9, we first binarize a region using a Sauvola threshold (a), then convolve vertically with a kernel that matches the line height (b). For each column, we now find the 0.1 quantile (c). Finally, we convolve horizontally (d). In this signal, we find peaks of a certain minimum width, which are the desired whitespace columns. The region is then split at these locations. The whole operation depends on proper deskewing and dewarping. A simplified sketch follows below.

[Figure 9: Steps involved in spillover detection: (a) binarization, (b) vertical convolution, (c) quantiles, (d) horizontal convolution. (c) and (d) show horizontal pixel offset on the x-axis and grayscale value on the y-axis (1 is white, 0 is black).]
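The sketch below runs the fix-spillover analysis on synthetic input; scikit-image and scipy are assumed stand-ins here, and all thresholds are toy values, not Origami's configuration:

```python
# Simplified sketch of spillover detection: binarize (Sauvola), convolve
# vertically at line height, take per-column 0.1 quantiles, convolve
# horizontally, and report wide white peaks as split positions.
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks
from skimage.filters import threshold_sauvola

def split_columns(region: np.ndarray, line_height: int, min_gap: int):
    binary = (region > threshold_sauvola(region, window_size=25)).astype(float)
    smooth = uniform_filter1d(binary, size=line_height, axis=0)   # step (b)
    q = np.quantile(smooth, 0.1, axis=0)                          # step (c)
    profile = uniform_filter1d(q, size=min_gap)                   # step (d)
    peaks, _ = find_peaks(profile, height=0.9, width=min_gap)
    return peaks  # x offsets of whitespace columns to split at

region = np.full((200, 600), 220, np.uint8)       # light background
for y in range(10, 190, 40):                      # toy text lines, two columns
    region[y:y + 12, 20:280] = 40
    region[y:y + 12, 320:580] = 40
print(split_columns(region, line_height=40, min_gap=20))
```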
'''Lines stage.''' After the layout stage completes, the newly determined regions are used to extract new baselines. This task is much easier than before (i.e. in stage 3), since dewarping will have made lines (roughly) horizontal and parallel. This step currently uses Tesseract internally.

===3.5. Reading order (stage 7)===

Reading order is determined using a recursive X-Y cut (also known as RXYC) [32, 11, 31] on the dewarped regions (proper dewarping is crucial here, since for the X-Y cut to work, whitespace areas and separator lines need to be straight and axis-aligned). We project bounding boxes as described in [20] and use a custom scoring scheme (similarly non-standard schemes have been investigated by Meunier [31]). Origami supports three variants when choosing a cut: (1) widest gap, (2) longest cut, (3) largest whitespace area (the product of (1) and (2)). We found that (3) works best for our corpus. We combine this score with a separator scoring scheme that prefers cuts that run along aligned separators, while it penalizes cuts that cross unaligned separators (e.g. we will penalize a vertical cut that would split a horizontal separator). For this, we measure the ratio of the whitespace gap that is covered by a separator.

As illustrated in Figure 10, we employ a two-step X-Y cut approach that refines regions into lines for those regions whose bounding boxes overlap with each other (and therefore do not provide whitespace gaps for proper cuts). This approach can fix some simple border cases of notoriously difficult L-shapes [31].

[Figure 10: Example of our two-stage X-Y cut approach. The numbers in circles indicate the determined reading order. In this example, 12-13-14-15 originally belonged to one region and 16 to another region. On the region level, these regions' bounding boxes would intersect, and a clean X-Y cut is not feasible. We detect this case, split these two regions into their lines, and re-run an X-Y cut on the line level (split lines are underlined in orange).]

Figure 11 provides some examples of the consolidated information about layout regions, reading order, and table structure (i.e. table regions and their columns) at this stage of the workflow.

[Figure 11: Examples of completed layout and reading order analysis (numbers in circles indicate reading order). The black borders stem from page dewarping.]

Before X-Y cuts are performed on the region level, a refinement step resamples evidence from the DNN output of stage 1 at the line level: for each line in a region, we check if the majority of the DNN pixel predictions for the line's area corresponds to the region type (e.g. TEXT). If this is not the case, the line gets deleted (and the region is reduced accordingly). This step resolves overlaps at some region boundaries and thus provides a more robust basis for X-Y cuts. A toy version of the basic recursive cut is sketched below.
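The toy sketch below implements a basic recursive X-Y cut with the "largest whitespace area" score, i.e. variant (3) above. Separator-aware scoring and the two-stage line-level refinement are omitted; this is illustrative, not Origami's implementation:

```python
# Recursive X-Y cut: find the whitespace gap maximizing gap width × cut length,
# split the boxes there, and recurse on both halves to obtain a reading order.
def xy_cut(boxes):
    """boxes: (x0, y0, x1, y1) tuples; returns them in reading order."""
    if len(boxes) <= 1:
        return list(boxes)
    best = None
    for axis in (0, 1):                        # 0: vertical cut, 1: horizontal
        spans = sorted((b[axis], b[axis + 2]) for b in boxes)
        reach = spans[0][1]
        for (_, s1), (n0, _) in zip(spans, spans[1:]):
            reach = max(reach, s1)
            if n0 > reach:                     # whitespace gap (reach, n0)
                other = 1 - axis
                cut_len = (max(b[other + 2] for b in boxes)
                           - min(b[other] for b in boxes))
                score = (n0 - reach) * cut_len  # gap width × cut length
                if best is None or score > best[0]:
                    best = (score, axis, (reach + n0) / 2)
    if best is None:
        return list(boxes)                     # overlapping boxes: no clean cut
    _, axis, pos = best
    first = [b for b in boxes if b[axis + 2] <= pos]
    second = [b for b in boxes if b[axis + 2] > pos]
    return xy_cut(first) + xy_cut(second)

# left column first, then the right column top to bottom:
print(xy_cut([(0, 0, 40, 100), (60, 0, 100, 45), (60, 55, 100, 100)]))
```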
===3.6. OCR and compose (stages 8 & 9)===

We use an OCR model specifically trained for the BBZ with Calamari, as well as a custom network architecture and various data augmentation techniques. The model uses ensemble voting with 5 folds and non-binarized line images as input. We refer to this model as Origami's BBZ model, as Origami supports specifying other (and binarizing) OCR models as well. We described the training of Origami's BBZ model in detail in a separate paper [29].

The following results should be understood as a measure of what can be achieved with Origami under optimal conditions, i.e. when using a tailored OCR model for a specific corpus. While we believe that the results illustrate the pipeline's potential, they do not provide a general baseline for applying Origami to other corpora or for other pipelines' general performance. To really assess the quality of Origami in comparison to other systems such as OCR-D or Transkribus, one would need to define a mapping between similar sets of stages and then test against a shared set of inputs and validation outputs. Defining such a methodology is a very challenging task though (for a discussion of the problems involved, see [38, 13, 23]), which we may investigate in more detail in future work.

Table 2 lists the results on three typical BBZ test pages, which are shown in Figure 12. As indicated by the light areas in Figure 12 (a), we excluded from the evaluation of page (a) for all models a badly readable header area (which also contained various recurring phrases trained through other pages' header areas and would thus have given Origami an unfair advantage) and several table regions with low-quality text (and fractions). No parts of the evaluation pages were included in Origami's training set.

Table 2: CERs and WERs for the final stage without table cell text (best scores are bold). All models used the same line image input from previous Origami stages. Non-Origami models' input was always binarized using an Otsu threshold. Comparisons were calculated after harmonizing obviously different transcription schemes (e.g. dashes and Fraktur-s). Page (a) contains mixed Fraktur and Antiqua content, which explains the bad scores for the Fraktur-only models. The bad scores on (c) for the non-Origami models seem to result from (c)'s lower line height (this hypothesis is supported by the fact that larger title headings come out correctly most of the time, whereas the smaller body text is mostly garbage).

model                                           | CER (a) | CER (b) | CER (c) | WER (a) | WER (b) | WER (c)
OCR-D gt4histocr-calamari [i]                   | 16.37%  | 2.98%   | 18.98%  | 53.18%  | 15.18%  | 60.4%
Calamari fraktur_19th_century [ii], 2019 [iii]  | 28.83%  | 4.7%    | 36.98%  | 70.41%  | 22.13%  | 84.87%
Calamari fraktur_19th_century [ii], 2020        | 14.23%  | 1.37%   | 20.3%   | 42.23%  | 5.99%   | 60.02%
Origami, trained on BBZ (v19-h2-e-v1)           | '''0.76%''' | '''0.3%''' | '''0.32%''' | '''2.4%''' | '''1.39%''' | '''1.75%'''

[i] Pretrained models from https://github.com/OCR-D/ocrd_calamari
[ii] Pretrained models from https://github.com/Calamari-OCR/calamari_models
[iii] Last year's pretrained model, i.e. git commit hash e568b1d

[Figure 12: Page scans used to measure overall OCR performance: (a) January 5, 1872, Morgenausgabe, header page, 4582 × 6507 pixels, overall bad printing quality; (b) January 5, 1872, Morgenausgabe, page 2, 4301 × 6521 pixels; (c) November 1, 1918, Morgenausgabe, page 2, 2400 × 3555 pixels. Light areas in (a) were excluded from evaluation. (a) contains both Fraktur and Antiqua, whereas (b) and (c) contain only Fraktur. Note that (a) and (b) are high resolution images (body text line height about 50 pixels), whereas (c) is a lower resolution image (body text line height about 25 pixels).]

Origami's BBZ model outperforms the best other currently available Fraktur models by at least 1% for CER (character error rate) and by at least 5% for WER (word error rate); CERs and WERs were evaluated using Dinglehopper (https://github.com/qurator-spk/dinglehopper). In general, the gap seems to be larger. The absolute WERs of around 2% obtained in this small evaluation compare quite favorably to the WERs of above 5% recently reported for ABBYY 10 and 11 when run on newspapers published between 1884 and 1947 [24]. Our results for page (a) illustrate the high error rates one obtains when applying pretrained Fraktur-only models to mixed Antiqua-Fraktur content. Page (b) shows the performance under ideal conditions (high image resolution, only Fraktur text) with pretrained third-party models. With page (c), the third-party models' quality plummets due to the lower image resolution, whereas Origami's own model roughly retains its performance. This might make Origami's BBZ model an option for working with lower resolution input data.

Origami's BBZ model understands both Fraktur and Antiqua at similar error levels. It was also trained to recognize numerical fractions and a number of special symbols found in the BBZ. Finally, it was trained to correctly handle the increased letter-spacing (Sperrsatz) that is commonly found in the BBZ.

To minimize information lost through resampling, Origami combines deskewing, dewarping and scaling into a single linear remapping from the source image to the final line image, which is directly fed into the line-level OCR network (see Figure 13). Many pipelines perform these operations step by step using subsequent (low-quality nearest-neighbor) resampling on already binarized input (see the top half of Figure 13), which seems not ideal given that resampling is a lossy process [16].

[Figure 13: Line image sampling implemented in Origami. Conventional OCR pipelines (top path: scan → binarize → deskew → dewarp → scale → crop line) perform deskewing, dewarping, scaling and other preprocessing operations (e.g. binarization) in subsequent isolated steps, thereby risking degraded quality with each sampling step (cropping does not involve sampling). Origami (bottom path) combines deskewing, dewarping and scaling into one single remapping (a set of piecewise linear mappings built through corresponding control points [18]), so the original source image is only sampled once.]
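The sketch below shows one possible way to compose scale, dewarp and deskew into a single coordinate map, so the scan is sampled exactly once; the use of OpenCV's remap and all numbers are illustrative assumptions (Origami itself builds piecewise linear mappings from corresponding control points [18]):

```python
# Compose deskew, dewarp and scale into one map from line-image coordinates
# back to source-scan coordinates; the scan is then sampled in a single pass.
import numpy as np
import cv2

scan = np.full((3000, 2000), 255, np.uint8)  # stand-in for the original scan

out_h, out_w = 48, 1200                      # target line image size
ys, xs = np.mgrid[0:out_h, 0:out_w].astype(np.float32)

# 1. scale: map line-image coordinates into dewarped-page coordinates
u, v = xs * 2.0 + 300.0, ys * 2.0 + 900.0    # toy scale and line origin

# 2. dewarp: toy displacement field (would come from the dewarping grid)
v = v + 10.0 * np.sin(u / 400.0)

# 3. deskew: rotate by the estimated page skew angle
theta = np.deg2rad(0.7)
u, v = (np.cos(theta) * u - np.sin(theta) * v,
        np.sin(theta) * u + np.cos(theta) * v)

# one sampling pass over the untouched scan yields the final line image
line = cv2.remap(scan, u.astype(np.float32), v.astype(np.float32),
                 cv2.INTER_LINEAR)
print(line.shape)
```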
Any output of this stage is finally composed into a plain text version and a schema-validated PAGE XML file (the latter also contains polygonal coordinates for all regions and the obtained reading order). Figure 14 shows an example of the plain text output that features both body text and tables.

[Figure 14: Detail of a scan of the BBZ from January 6, 1872, Morgenausgabe, page 1 (a) and the corresponding OCR result from the plain text compose stage (b).]

==4. Conclusion==

In this paper we presented Origami, a new end-to-end pipeline for performing OCR on historical newspapers. Origami is robust enough for large scale batches in production environments, but also small and flexible enough to allow for easy experimentation with new algorithmic approaches. Origami offers a reliable, table-aware layout detection based on deep neural networks, high-resolution deskewing and dewarping, selective polygonal merge and split operators, and separator-aware, two-stage X-Y cuts. We also demonstrated (1) the feasibility of an all-in-one transformation for extracting scaled, dewarped line images directly from the original document scan without intermediary sampling, and (2) the validity of a new non-standard paradigm regarding binarization that only relies on thresholding as a source of information for micro decisions (e.g. layout operators), but not as general pre-processing for all subsequent stages. Our evaluations showed how the pipeline's mixed Fraktur-Antiqua OCR model with non-binarized input (trained with Calamari) clearly outperforms other publicly available single-typeface ensemble models in terms of error rates for our use case.

In our current project on digitizing the Berliner Börsen-Zeitung, the Origami pipeline proved to be the ideal experimental platform for researching how existing high-quality OCR components can be leveraged and how new functionalities can be implemented. As Origami is lean, flexible and generally easy to use (all components are bundled and can be set up with just a few commands), we hope it will be useful for others as well. We also hope that the Origami workflow will spark some more discussion of the experimental features we implemented, for instance the general role and necessity of binarization as a pre-processing step.
==Acknowledgments==

This research was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number BU 3502/1-1.

==References==

[1] O. Aichholzer et al. "A Novel Type of Skeleton for Polygons". In: Journal of Universal Computer Science 1 (Jan. 1995), pp. 752–761. doi: 10.1007/978-3-642-80350-5_65.
[2] A. Antonacopoulos et al. "ICDAR 2013 Competition on Historical Newspaper Layout Analysis (HNLA 2013)". In: Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR 2013). Washington, DC, USA, 2013, pp. 1454–1458.
[3] K. Baierer et al. "OCR-D kompakt: Ergebnisse und Stand der Forschung in der Förderinitiative". In: BIBLIOTHEK – Forschung und Praxis (June 2020). doi: 10.18452/21548.
[4] V. K. Bajjer Ramanna, S. S. Bukhari, and A. Dengel. "Document Image Dewarping Using Deep Learning". In: Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM 2019). Prague, Czech Republic: Insticc, Feb. 2019, pp. 524–531. doi: 10.5220/0007368405240531.
[5] R. Barman et al. "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers". In: CoRR abs/2002.06144 (May 2020). arXiv: 2002.06144 [cs].
[6] S. Bianco et al. "Benchmark Analysis of Representative Deep Neural Network Architectures". In: IEEE Access 6 (2018), pp. 64270–64277. issn: 2169-3536. doi: 10.1109/ACCESS.2018.2877890. arXiv: 1810.00736.
[7] G. Bradski. "The OpenCV Library". In: Dr. Dobb's Journal of Software Tools (2000).
[8] S. S. Bukhari. "Generic Methods for Document Layout Analysis and Preprocessing". Doctoral thesis. Technische Universität Kaiserslautern, 2012, p. 219.
[9] S. S. Bukhari et al. "anyOCR: An Open-Source OCR System for Historical Archives". In: Proceedings of the 14th International Conference on Document Analysis and Recognition (ICDAR 2017). Vol. 1. Kyoto, Japan, 2017, pp. 305–310.
[10] F. Cacciola. 2D Straight Skeleton and Polygon Offsetting. Tech. rep. CGAL Editorial Board, 2020.
[11] R. Cattoni et al. Geometric Layout Analysis Techniques for Document Image Understanding: A Review. Tech. rep. Via Sommarive 18, I-38050 Povo, Trento, Italy: ITC-irst, 1998, pp. 1–68.
[12] C. Clausner, S. Pletschacher, and A. Antonacopoulos. "Aletheia – an Advanced Document Layout and Text Ground-Truthing System for Production Environments". In: Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR 2011). Beijing, China: IEEE Computer Society, 2011, pp. 48–52. isbn: 978-0-7695-4520-2. doi: 10.1109/ICDAR.2011.19.
[13] C. Clausner, S. Pletschacher, and A. Antonacopoulos. "Flexible Character Accuracy Measure for Reading-Order-Independent Evaluation". In: Pattern Recognition Letters 131 (Mar. 2020), pp. 390–397. issn: 0167-8655. doi: 10.1016/j.patrec.2020.02.003.
[14] D. Dannélls, T. Johansson, and L. Björk. "Evaluation and Refinement of an Enhanced OCR Process for Mass Digitisation". In: Proceedings of the Digital Humanities in the Nordic Countries 4th Conference (DHN 2019). Ed. by C. Navarretta, M. Agirrezabal, and B. Maegaard. Copenhagen, Denmark, Mar. 2019, pp. 112–123. url: http://www.ceur-ws.org/Vol-2364/9_paper.pdf.
[15] M. Dirnberger, A. Neumann, and T. Kehl. "NEFI: Network Extraction from Images". In: CoRR abs/1502.05241 (2015). arXiv: 1502.05241 [cs].
[16] N. A. Dodgson. Image Resampling. Tech. rep. UCAM-CL-TR-261. University of Cambridge, Computer Laboratory, Aug. 1992. url: https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-261.html.
[17] C. A. Glasbey and K. V. Mardia. "A Review of Image-Warping Methods". In: Journal of Applied Statistics 25.2 (1998), pp. 155–171. doi: 10.1080/02664769823151.
[18] A. Goshtasby. "Piecewise Linear Mapping Functions for Image Registration". In: Pattern Recognition 19.6 (1986), pp. 459–466. issn: 0031-3203. doi: 10.1016/0031-3203(86)90044-0.
[19] T. Grüning et al. "A Two-Stage Method for Text Line Detection in Historical Documents". In: International Journal on Document Analysis and Recognition (IJDAR) 22.3 (Sept. 2019), pp. 285–302. issn: 1433-2833, 1433-2825. doi: 10.1007/s10032-019-00332-1.
[20] J. Ha, R. M. Haralick, and I. T. Phillips. "Recursive X-Y Cut Using Bounding Boxes of Connected Components". In: Proceedings of the 3rd International Conference on Document Analysis and Recognition (ICDAR 1995). Vol. 2. Montreal, Quebec, Canada, 1995, pp. 952–955.
[21] M. J. Hill and S. Hengchen. "Quantifying the Impact of Dirty OCR on Historical Text Analysis: Eighteenth Century Collections Online as a Case Study". In: Digital Scholarship in the Humanities 34.4 (Apr. 2019), pp. 825–843. issn: 2055-7671. doi: 10.1093/llc/fqz024.
[22] I. Jerele et al. "Optical Character Recognition of Historical Texts: End-User Focused Research for Slovenian Books and Newspapers from the 18th and 19th Century". In: Review of the National Center for Digitization 21 (2012), pp. 117–126. issn: 1820-0109.
[23] R. Karpinski, D. Lohani, and A. Belaïd. "Metrics for Complete Evaluation of OCR Performance". In: Proceedings of the International Conference on Image Processing, Computer Vision, & Pattern Recognition (IPCV 2018). Las Vegas, Nevada, USA, July 2018, pp. 23–29.
[24] L. Wilms. Newspaper OCR Quality: What Have We Learned? https://lab.kb.nl/about-us/blog/newspaper-ocr-quality-what-have-we-learned. July 2020.
[25] E. Klijn. "The Current State-of-Art in Newspaper Digitization". In: D-Lib Magazine 14.1/2 (Jan. 2008). issn: 1082-9873.
[26] S. K. Lam, A. Pitrou, and S. Seibert. "Numba: A LLVM-Based Python JIT Compiler". In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (LLVM 2015). New York, NY, USA: Association for Computing Machinery, 2015, pp. 1–6. isbn: 978-1-4503-4005-2. doi: 10.1145/2833157.2833162.
[27] M. Li et al. "TableBank: Table Benchmark for Image-Based Table Detection and Recognition". In: CoRR abs/1903.01949 (2019). arXiv: 1903.01949 [cs].
[28] B. Liebl and M. Burghardt. "An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers". In: CoRR abs/2004.07317 (2020). arXiv: 2004.07317 [cs].
[29] B. Liebl and M. Burghardt. "On the Accuracy of CRNNs for Line-Based OCR: A Multi-Parameter Evaluation". In: CoRR abs/2008.02777 (2020). arXiv: 2008.02777 [cs].
[30] K. Ma et al. "DocUNet: Document Image Unwarping via a Stacked U-Net". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018). Salt Lake City, UT, USA, June 2018, pp. 4700–4709.
[31] J.-L. Meunier. "Optimized XY-Cut for Determining a Page Reading Order". In: Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR 2005). Vol. 1. Seoul, South Korea, 2005, pp. 347–351.
[32] G. Nagy and S. C. Seth. "Hierarchical Image Representation with Application to Optically Scanned Documents". In: Proceedings of the International Conference on Pattern Recognition. 1984, pp. 347–349.
[33] C. Neudecker et al. "OCR-D: An End-to-End Open Source OCR Framework for Historical Printed Documents".
In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage (DATeCH 2019). New York, NY, USA: Association for Computing Machinery, May 2019, pp. 53–58. isbn: 978-1-4503-7194-0. doi: 10.1145/3322905.3322917.
[34] E. Engl et al. OCR-D Workflows. https://github.com/OCR-D/ocrd-website/blob/master/site/en/workflows.md. 2020.
[35] S. A. Oliveira, B. Seguin, and F. Kaplan. "dhSegment: A Generic Deep-Learning Approach for Document Segmentation". In: Proceedings of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR 2018). Niagara Falls, NY, USA: IEEE Computer Society, Aug. 2018, pp. 7–12. isbn: 978-1-5386-5875-8. doi: 10.1109/ICFHR-2018.2018.00011. arXiv: 1804.10371.
[36] J.-S. Park and S.-J. Oh. "A New Concave Hull Algorithm and Concaveness Measure for N-Dimensional Datasets". In: Journal of Information Science and Engineering 29 (Mar. 2013), pp. 379–392.
[37] S. Pletschacher and A. Antonacopoulos. "The PAGE (Page Analysis and Ground-Truth Elements) Format Framework". In: Proceedings of the 20th International Conference on Pattern Recognition (ICPR 2010). Istanbul, Turkey, Aug. 2010, pp. 257–260.
[38] S. Pletschacher, C. Clausner, and A. Antonacopoulos. "Europeana Newspapers OCR Workflow Evaluation". In: Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing (HIP 2015). Gammarth, Tunisia: ACM Press, 2015, pp. 39–46. isbn: 978-1-4503-3602-4. doi: 10.1145/2809544.2809554.
[39] M. Rahnemoonfar and A. Antonacopoulos. "Restoration of Arbitrarily Warped Historical Document Images Using Flow Lines". In: Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR 2011). Beijing, China, Sept. 2011, pp. 905–909.
[40] C. Reul et al. "OCR4all – an Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings". In: CoRR abs/1909.04032 (2019). arXiv: 1909.04032 [cs].
[41] D. C. Schneider, M. Block, and R. Rojas. "Robust Document Warping with Interpolated Vector Fields". In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR 2007). Curitiba, Paraná, Brazil: IEEE Computer Society, Sept. 2007, pp. 113–117.
[42] R. Sedgewick and K. Wayne. Algorithms. Fourth edition. Addison-Wesley Professional, 2011. isbn: 0-321-57351-X.
[43] F. Shafait. "Geometric Layout Analysis of Scanned Documents". Dissertation. Technical University of Kaiserslautern, Germany, Apr. 2008.
[44] D. Smith and R. Cordell. A Research Agenda for Historical and Multilingual Optical Character Recognition. Northeastern University, 2018.
[45] R. Smith. "An Overview of the Tesseract OCR Engine". In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR 2007). Vol. 2. Curitiba, Paraná, Brazil, Sept. 2007, pp. 629–633.
[46] P. Ströbel and S. Clematide. Improving OCR of Black Letter in Historical Newspapers: The Unreasonable Effectiveness of HTR Models on Low-Resolution Images. Utrecht, The Netherlands, July 2019. doi: 10.5167/uzh-177164. url: https://dev.clariah.nl/files/dh2019/boa/0694.html.
[47] S. Suzuki and K. Abe. "Topological Structural Analysis of Digitized Binary Images by Border Following". In: Computer Vision, Graphics, and Image Processing 30 (1985), pp. 32–46.
[48] S. Tanner, T. Muñoz, and P. H. Ros. "Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive".
In: D-Lib Magazine 15.7/8 (July 2009). issn: 1082-9873.
[49] M. C. Traub, J. van Ossenbruggen, and L. Hardman. "Impact Analysis of OCR Quality on Research Tasks in Digital Archives". In: Research and Advanced Technology for Digital Libraries. Ed. by S. Kapidakis, C. Mazurek, and M. Werla. Cham: Springer International Publishing, 2015, pp. 252–263. isbn: 978-3-319-24592-8.
[50] D. van Strien et al. "Assessing the Impact of OCR Quality on Downstream NLP Tasks". In: Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2020). Valletta, Malta: Science and Technology Publications, 2020, pp. 484–496. isbn: 978-989-758-395-7. doi: 10.5220/0009169004840496.
[51] H. Walravens, ed. The Impact of Digital Technology on Contemporary and Historic Newspapers. Berlin, Boston: De Gruyter Saur, 2008. isbn: 978-3-598-44126-4. doi: 10.1515/9783598441264.
[52] C. Wick, C. Reul, and F. Puppe. "Calamari – a High-Performance Tensorflow-Based Deep Learning Package for Optical Character Recognition". In: CoRR abs/1807.02004 (2018). arXiv: 1807.02004 [cs].
[53] M. Wu et al. "A Model Based Book Dewarping Method to Handle 2D Images Captured by a Digital Camera". In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR 2007). Vol. 1. Curitiba, Paraná, Brazil, Sept. 2007, pp. 158–162.
[54] C. E. Wulfman et al. Complexities in the Use, Analysis, and Representation of Historical Digital Periodicals. Utrecht, The Netherlands, July 2019. doi: 10.34894/WTP281. url: https://dataverse.nl/api/access/datafile/19099.
[55] T.-I. Yang, A. J. Torget, and R. Mihalcea. "Topic Modeling on Historical Newspapers". In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Portland, OR, USA: Association for Computational Linguistics, June 2011, pp. 96–104.
[56] X. Zhong, E. ShafieiBavani, and A. Jimeno-Yepes. "Image-Based Table Recognition: Data, Model, and Evaluation". In: CoRR abs/1911.10683 (2019). arXiv: 1911.10683 [cs].