=Paper=
{{Paper
|id=Vol-3290/long_paper8670
|storemode=property
|title=Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches
|pdfUrl=https://ceur-ws.org/Vol-3290/long_paper8670.pdf
|volume=Vol-3290
|authors=Sven Najem-Meyer,Matteo Romanello
|dblpUrl=https://dblp.org/rec/conf/chr/Najem-MeyerR22
}}
==Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches==
Sven Najem-Meyer¹,∗, Matteo Romanello²

¹ EPFL, Lausanne, Switzerland. ² UNIL, Lausanne, Switzerland

Abstract: Page layout analysis is a fundamental step in document processing which enables the segmentation of a page into regions of interest. With highly complex layouts and mixed scripts, scholarly commentaries are text-heavy documents which remain challenging for state-of-the-art models. Their layout varies considerably across editions, and their most important regions are mainly defined by semantic rather than graphical characteristics such as position or appearance. This setting calls for a comparison between textual, visual and hybrid approaches. We therefore assess the performances of two transformers (LayoutLMv3 and RoBERTa) and an object-detection network (YOLOv5). While results show a clear advantage in favor of the latter, we also list several caveats to this finding. In addition to our experiments, we release a dataset of ca. 300 annotated pages sampled from 19th century commentaries.

Keywords: Page Layout Analysis, Historical Documents, Classical Commentaries, Digital Humanities

1. Introduction

1.1. Page layout analysis

Automatically transcribing a page by means of optical character recognition (OCR) often results in losing crucial information about its layout. This loss can be critical for further analyses, which typically require accessory regions such as running headers and footnotes to be separated from the main text. Capturing information about page layout is also of key importance for the automatic or semi-automatic markup of digitized documents, as the textual information contained in each page region can be automatically marked up, provided that a mapping is established between region types and markup elements.
To tackle this issue, we focus on Page Layout Analysis¹, which aims at segmenting a page into homogeneous regions and at classifying those regions according to their contents [7, 11]. Region contents can be of both textual and visual nature, and the two modalities can be leveraged separately or in combination. Purely textual approaches construe layout analysis as a natural language processing (NLP) problem: they aim at delimiting and labeling the text sequence composing a region. Visual approaches, on the other hand, treat the task as a computer vision problem and aim at detecting and classifying image regions. Finally, hybrid approaches leverage both modalities to detect and classify image regions and their corresponding text sequences.

Visual approaches are often considered the standard choice. This trend is probably encouraged by the recent progress of pre-trained convolutional neural networks (CNNs) and by their ability to deal with non-textual regions. These approaches unsurprisingly show their best performances in distinguishing regions with highly contrasting graphical attributes, such as tables, illustrations and drop capitals. Yet, regions are often characterised by semantic rather than graphical features.

CHR 2022: Computational Humanities Research Conference, December 12–14, 2022, Antwerp, Belgium. ∗ Corresponding author: sven.najem-meyer@epfl.ch (S. Najem-Meyer); matteo.romanello@unil.ch (M. Romanello). ORCID: 0000-0002-3661-4579 (S. Najem-Meyer), 0000-0002-7406-6286 (M. Romanello). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

¹ We use Page Layout Analysis rather than the more generic Document Layout Analysis because the latter includes recognizing regions that span multiple pages, which is beyond the scope of this study.
In this case, it makes sense to opt for textual or hybrid approaches. While purely textual approaches prove their usefulness when a page's image is not aligned to its text or not available at all (e.g. [17]), they end up discarding relevant information when it is. Hybrid models hence make use of images, text and their corresponding coordinates. Notice that this can be done either by enhancing an image-based model with text embeddings (e.g. by addition or concatenation) or by providing a textual model with text coordinates in parallel to a visual backbone. Which of these three approaches is best suited to analysing text-heavy documents remains to be addressed.

1.2. Background: the case of classical commentaries

In this paper, we focus on historical classical commentaries. We place them in the broader category of text-heavy documents as they mostly contain text, as opposed to more visual documents like illuminated manuscripts. The research project Ajax Multi-Commentary² serves as the context for this work. It aims to create an automated pipeline to convert digitized commentaries into a body of structured information to aid in their comparative analysis. Within this pipeline, page layout analysis plays a crucial role, as it can enable the (semi-)automatic markup of information contained within commentaries.

Together with critical editions and translations, commentaries are one of the main genres of publication in literary and textual scholarship. Providing in-depth analyses, often side by side with a critical edition of the chosen text, commentaries can have very sophisticated layouts with considerable variations between editions (Fig. 1). A common layout type has the commentary section as a single- or double-column footnote section positioned below the primary text or its translation. In other layout types, however, commentary sections span the entire page. Conversely, regions with similar placement and appearance can have different functions.
Besides the complexity of their layout, commentaries also feature a specific prose style, clearly recognizable by its intertwining of multiple scripts and its pervasive use of abbreviations. Comments generally follow a set pattern such as line number, commented word or excerpt, comment, for instance "1 (line number). ... (Excerpt): cp. Tr. 689-691. The passage in Aesch. Ag. 587-598 is scarcely a true parallel [...] (comment)". As our project's pipeline starts with commentary images and ends with text mining, we regard page layout analysis as a crucial step, in which primary text, margin notes, line numbers and commentary sections ought to be precisely segmented. This task remains challenging given the characteristics listed above. While information about layout is mainly conveyed by semantic rather than graphical clues, visual features are not irrelevant: commentary regions are generally written in a smaller font and are often punctuated by bold line-number anchors. Besides, for Greek commentaries, the script provides a good visual feature to differentiate between the primary text and its translation.

² https://github.com/AjaxMultiCommentary

Figure 1: Example pages from 19th century commentaries on Sophocles' Ajax. The commentaries are by (from left to right): Lobeck (1835), Schneidewin (1853), Campbell (1881), Jebb (1896) and Wecklein (1894). Commentary sections are highlighted in blue, primary texts in pink and critical apparatus in yellow.

1.3. Goals

In this challenging and mostly uncharted setting, our primary goal is to assess the performances of textual, visual and hybrid approaches. For each of these approaches, we ask the following questions:

• RQ1: How well do state-of-the-art models perform over commentaries of works written in different scripts (e.g. Latin or Polytonic Greek), belonging to different literary genres and having different layout types? Which of the textual, visual and hybrid approaches is best suited for the task?
• RQ2: What is the impact of the quantity of training data?
• RQ3: How well do models generalize to layout types they have not seen during training?

For textual and hybrid approaches only, we address two additional questions:

• RQ4: How do hybrid models perform on languages they have not been pre-trained on?
• RQ5: To what extent do textual and visual features separately account for the model's decisions?

2. Related work

Generic approaches to page layout analysis have seen considerable progress in recent years. Overtaking CNNs, image transformers such as DiT [12] or LayoutLMv3 [10] can be used for several visual or multi-modal document analysis tasks. However, contrary to newspapers, magazines and the scientific press, commentaries remain barely explored as far as layout analysis is concerned. We therefore surveyed studies on historical documents, as they share many similarities with commentaries.

Simistira et al. [19] report the performances of several pixel classification algorithms for the Competition on Layout Analysis for Challenging Medieval Manuscripts at ICDAR 2017. The tasks include region detection for text, comments and images. Results show a net advantage in favor of convolutional neural networks (CNNs), with intersection over union (IoU) scores ranging up to .90 for comments. It must be noted, however, that comments consistently take the form of marginal glosses and thereby possess very distinctive graphical features.

Mehri et al. [14] also report performances of pixel-based approaches for the Competition on Historical Book Analysis at ICDAR 2019. The competition is based on two challenges: distinguishing between text and images, and classifying various text fonts. The best models use fully convolutional networks and reach near-perfect scores in the first challenge (.99 F-score and above), but this binary classification is relatively easy for this type of pre-trained network.
Results on the second challenge, though slightly lower, are particularly interesting for our research: they show that CNNs can leverage fine-grained information regarding font style (e.g. bold, italicized, etc.), which may help in our case.

Finally, Yang et al. [21] proposed a multimodal CNN to extract semantic structure from documents. The principle is to build a text embedding map which is accessible to the last layer of the model. Building on this idea, Barman et al. [2] report notable improvements when using textual features to segment historical newspapers. While the text-only models yield the lowest scores, combining text and images consistently outperforms image-only features by 3% mIoU on average.

3. Datasets

While ground truth datasets already exist for layout analysis of historical documents such as manuscripts and early printed books [20, 3], newspapers [1] and even for the semantic segmentation of geographical maps [16], no such dataset existed for scholarly commentaries or critical editions. We contribute to filling this gap by creating and releasing GT4HistCommentLayout, a dataset of page layout analysis annotations on 19th century commentaries to Ancient Greek and Latin works, written in English, French and German³. This new dataset complements GT4HistComment [18], which provides OCR ground-truth data for the same type of historical documents.

3.1. Layout annotation

To perform layout annotation we devise a content-based region taxonomy geared towards commentaries and critical editions (Fig. 2). It consists of 18 fine-grained classes, which are mapped to 8 coarse-grained classes in order to reduce the number of classes and class sparsity. Mapping is achieved by grouping region types with similar visual characteristics (e.g. numbers). The list of classes defined by our taxonomy is given in Table 1. For the experiments reported below, we exclusively consider coarse-grained classes.
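The fine- to coarse-grained mapping described above can be expressed as a simple lookup. The sketch below uses the class names from Table 1; the dictionary structure and function name are our illustrative assumptions, not the released code.

```python
# Hypothetical sketch of the fine- to coarse-grained class mapping (Table 1).
FINE_TO_COARSE = {
    "commentary": "commentary",
    "critical apparatus": "critical apparatus",
    "footnotes": "footnotes",
    "page number": "number",
    "text number": "number",
    "bibliography": "others",
    "handwritten marginalia": "others",
    "index": "others",
    "others": "others",
    "printed marginalia": "others",
    "table of contents": "others",
    "title": "others",
    "translation": "others",
    "appendix": "paratext",
    "introduction": "paratext",
    "preface": "paratext",
    "primary text": "primary text",
    "running header": "running header",
}

def coarsen(fine_label: str) -> str:
    """Map a fine-grained region label to its coarse-grained class."""
    return FINE_TO_COARSE[fine_label]
```

Grouping the 18 fine classes this way yields exactly the 8 coarse classes used in the experiments.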
Table 1: Complete list of fine- and coarse-grained page region classes used for layout annotation, with their corresponding mapping to SegmOnto's controlled vocabulary.

Fine | Coarse | SegmOnto Type:Subtype
commentary | commentary | MainZone:commentary
critical apparatus | critical apparatus | MarginTextZone:criticalApparatus
footnotes | footnotes | MarginTextZone:footnotes
page number | number | NumberingZone:pageNumber
text number | number | NumberingZone:textNumber
bibliography | others | MainZone:bibliography
handwritten marginalia | others | MarginTextZone:handwrittenNote
index | others | MainZone:index
others | others | CustomZone
printed marginalia | others | MarginTextZone:printedNote
table of contents | others | MainZone:ToC
title | others | TitlePageZone
translation | others | MainZone:translation
appendix | paratext | MainZone:appendix
introduction | paratext | MainZone:introduction
preface | paratext | MainZone:preface
primary text | primary text | MainZone:primaryText
running header | running header | RunningTitleZone

This taxonomy distinguishes between the original Greek text of the work being commented upon (Primary text), the commentary sections (Commentary), the commentator's translation of the commented text (Translation), the section containing information about manuscript readings and editorial conjectures (Critical apparatus), the paratextual elements — e.g. table of contents, appendices, indices, footnotes, introductory and prefatory materials — (Paratext) and finally page and line numbers (Number).

³ https://doi.org/10.5281/zenodo.7271729

In order to make the published dataset as widely reusable as possible, we mapped our classes to the SegmOnto controlled vocabulary [9, 8]. The only difficulty we encountered in the mapping to SegmOnto concerned the commentary region class, as it can be mapped both to a MainZone and to a MarginTextZone, depending on the commentary at hand.
In fact, in commentaries containing both primary text and commentary, commentary regions could be interpreted as marginalia to the commented text (i.e. a MarginTextZone); whereas in commentaries with no primary text, the commentary itself is undoubtedly the main region of the page (i.e. a MainZone). We address this issue by always considering commentary regions of type MainZone, based on the consideration that the area of the page such regions tend to occupy is roughly equal to the area of primary text or translation regions (when present).

Figure 2: The main layout elements of a scholarly commentary page.

Annotation was performed by three annotators using the VGG Image Annotator (VIA) tool [6]. While each commentary was annotated by a single annotator, all annotations were revised by an expert in order to ensure consistency in the application of the annotation guidelines. Manually annotated page regions were automatically resized to fit exactly the minimal bounding rectangle around the contained words.

3.2. Sampling and dataset composition

As a sampling strategy, we started with ca. 40 pages of annotation per commentary. We made sure that all page layout types (see Fig. 3 for selected examples) of any given commentary are included in the sample, because page layout can vary quite substantially throughout a commentary depending on the section contents.

Figure 3: Overview of various layouts. From left to right: introduction (from Wecklein), commentary and primary text (from Campbell), index (from Lobeck) and appendix (from Jebb).

The data used for experiments consist of an internal and an external dataset. The internal dataset comprises commentaries to Sophocles' Ajax, published from the beginning of the 19th century to date. Of these 12 commentaries, slightly less than half are in the public domain, while the remaining ones are still under copyright. The external dataset, instead, consists of commentaries to both Latin and Greek classical works, sampled to include works both in prose and poetry. It contains an English commentary to Tacitus' Annals (Latin prose), a German commentary to book 6 of Vergil's Aeneid (Latin poetry), and a German commentary to book 7 of Thucydides' History of the Peloponnesian War (Ancient Greek prose). The specific purpose of this external dataset is to evaluate with which accuracy layout analysis models trained on data from one specific genre and literature (i.e. Greek poetry, in the case of the Ajax) can be applied to commentaries about works from a wider variety of literary genres (see RQ3). Given this important distinction, the ground-truth dataset we release contains the public domain portion of the internal dataset, as well as the entire external dataset (as it consists exclusively of out-of-copyright documents). Detailed statistics about these datasets can be found in Table 2.

4. Experimental setup

4.1. Models

LayoutLMv3. For hybrid experiments, we use LayoutLMv3_BASE [10], a transformer which uses text, text coordinates and image as inputs. This choice is motivated by the need to have a state-of-the-art hybrid model easily comparable both with a textual approach (by pitting it against RoBERTa, see below) and with a visual approach (by way of token ablation). Contrary to its predecessor, LayoutLMv3 does not rely on a pre-trained CNN for its visual backbone, but uses a multi-modal transformer instead. The authors claim superior results to competing systems such as DocFormer or StructuralLM. Pre-tests showed LayoutLMv3 to be slightly superior to LayoutLMv2 at the cost of a longer training time. As the model converged after 30 epochs, we fine-tune each model for a total of 40 epochs using the recommended parameters and a maximum length of 512 tokens per instance. In the experiments below, we use LayoutLM for token classification, which opens three possible ways of labeling the data. The first method consists in annotating only the first word of a region.
This method has the downside of creating highly imbalanced classes, with a vast majority of words marked with a zero label and very few marked with their region's class. This method did not yield encouraging results in pre-tests and was therefore abandoned. The second method is inspired by the named entity recognition field and consists in labelling the first word of a region with BEGIN-[RegionClass] and the following words with INSIDE-[RegionClass]. Besides doubling the number of classes, this method leads to the creation of very long entities and performed poorly in pre-tests. We therefore opt for the third method, which consists in labeling all the words in a region with the region's label.

Table 2: Detailed statistics about the annotated data. For each annotated commentary we report the number of pages as well as the total number of regions per class.

Commentary | Pages | AppCrit | Comm. | Footn. | Num. | Others | Parat. | Primary t. | Running h.
Internal commentaries (public domain):
Lobeck 1835 | 61 | 0 | 20 | 13 | 227 | 32 | 61 | 6 | 67
Campbell 1881 | 42 | 26 | 52 | 11 | 112 | 20 | 42 | 16 | 26
Jebb 1896 | 43 | 25 | 50 | 8 | 87 | 55 | 43 | 11 | 18
Schneidewin 1853 | 62 | 0 | 84 | 3 | 126 | 10 | 62 | 20 | 42
Wecklein 1894 | 42 | 0 | 35 | 2 | 145 | 12 | 42 | 5 | 41
Total | 250 | 51 | 241 | 37 | 697 | 129 | 250 | 58 | 194
Internal commentaries (under copyright):
Colonna 1975 | 40 | 28 | 0 | 10 | 164 | 12 | 40 | 12 | 26
De Romilly 1976 | 41 | 28 | 33 | 4 | 140 | 18 | 41 | 8 | 30
Ferrari 1974 | 40 | 0 | 57 | 8 | 111 | 15 | 40 | 9 | 29
Garvie 1998 | 40 | 9 | 10 | 6 | 136 | 15 | 40 | 7 | 10
Kamerbeek 1953 | 40 | 0 | 30 | 12 | 38 | 9 | 40 | 10 | 0
Paduano 1982 | 40 | 0 | 22 | 0 | 139 | 20 | 40 | 9 | 15
Untersteiner 1934 | 40 | 0 | 27 | 0 | 76 | 16 | 40 | 7 | 26
Total | 281 | 65 | 179 | 40 | 804 | 105 | 281 | 62 | 136
External commentaries (public domain):
Classen & Steup 1889 | 41 | 0 | 44 | 0 | 74 | 3 | 19 | 22 | 37
Norden 1903 | 40 | 10 | 16 | 2 | 107 | 18 | 6 | 9 | 38
Furneaux 1896 | 40 | 30 | 60 | 8 | 140 | 44 | 5 | 31 | 37
Total | 121 | 40 | 120 | 10 | 321 | 65 | 30 | 62 | 112
Grand total (public domain) | 371 | 91 | 361 | 47 | 1018 | 194 | 280 | 120 | 306
Grand total (all) | 652 | 156 | 540 | 87 | 1822 | 299 | 561 | 182 | 442
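The retained labelling method, in which every word simply inherits its region's class, can be sketched as follows. The data shapes and function name are illustrative assumptions, not the released implementation.

```python
def label_tokens(regions):
    """Third labelling method: every word in a region receives the
    region's class label. `regions` is a list of
    (region_class, [word, ...]) pairs (an assumed input shape)."""
    tokens, labels = [], []
    for region_class, words in regions:
        tokens.extend(words)
        labels.extend([region_class] * len(words))
    return tokens, labels
```

Compared with BEGIN-/INSIDE- tagging, this keeps the class inventory small and avoids very long entities, at the cost of leaving region boundaries implicit.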
RoBERTa. Given the multilingualism of commentaries, it could have been relevant to use a multilingual transformer for text-only experiments. However, as LayoutLMv3 uses RoBERTa [13] to initialize its embeddings, we stick to the method used by its authors [10] and train RoBERTa_BASE for a fair comparison with the former. This bi-directional multi-head attention transformer was released as an improved version of BERT [5], pre-trained on 160GB of uncompressed English text from Wikipedia, BooksCorpus, CC-News, OpenWebText and Stories [13]. Regarding training and labelling, we use the same settings as for LayoutLM.

YOLOv5. For visual experiments, we use YOLOv5⁴,⁵, an object-detection model based on DarkNet. The choice of YOLO is mainly motivated by the encouraging results in historical document layout recognition obtained by [4]. In preliminary tests, the model performed best with an image resolution of 1280 and converged around epoch 250. Regarding model size, the larger version (YOLOv5x) did not yield considerably better results despite a much longer training time. All experiments are therefore run on YOLOv5m with a resolution of 1280 for 300 epochs. In order to assess the difficulty added by multiplying classes, we create two YOLO models:

• YOLO_Mono is trained for single-class object detection, which is enabled by labelling all regions identically.
• YOLO_Multi is trained for multi-class object detection, using our dataset's coarse labels (see Section 3.1).

YOLO_Mono+LayoutLM/RoBERTa. This model combines the two approaches, using YOLO_Mono to detect regions and LayoutLM/RoBERTa to classify them. The words contained within predicted regions are fed to LayoutLM/RoBERTa, and the majority class among the words is then used to label the region.

4.2. Implementation and training

We implement our experiments using HuggingFace transformers⁶ and YOLO's API⁷.
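The majority-vote step of the combined YOLO_Mono+LayoutLM/RoBERTa model described in Section 4.1 can be sketched in a few lines; the function name and input shape are our illustrative assumptions.

```python
from collections import Counter

def classify_region(word_labels):
    """Label a detected region with the majority class among the
    per-token labels predicted for the words it contains (sketch of
    the combination strategy, not the released implementation)."""
    return Counter(word_labels).most_common(1)[0][0]
```

A single mislabelled token thus no longer splits the region: the vote absorbs isolated prediction errors.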
Training was performed on two NVIDIA GeForce GTX TITAN X GPUs, each with 12.2 GB of memory. The code is made publicly available on GitHub⁸.

4.3. Evaluation methods

As LayoutLM and RoBERTa are used for token classification, they should be evaluated on an entity or word basis. However, in order to enable a meaningful comparison with YOLO, we group consecutive words with identical labels and build up a region from their bounding rectangle. Notice that LayoutLM and RoBERTa are severely disadvantaged by this evaluation procedure, as a single incorrectly labelled word within an actual region disrupts its unity. This problem is illustrated in Figure 4 in the appendix and is not straightforward to mitigate without visual operations or carefully tailored rules. Indeed, homogenising long strands of tokens could result in the absorption of tiny regions like line numbers. We therefore evaluate the results without post-processing them and compute all mean average precision (mAP) scores at a 0.5 IoU threshold⁹. It is worth noting that the obtained scores are approximately .10 mAP points below the scores produced by YOLO's built-in evaluation tool, a discrepancy already mentioned by [3]¹⁰.

⁴ https://github.com/ultralytics/yolov5
⁵ Despite multiple attempts, we couldn't get Kraken's (https://github.com/mittagessen/kraken) segmentation training to work on our infrastructure and therefore removed it from our experimental procedure.
⁶ https://github.com/huggingface/transformers
⁷ https://github.com/ultralytics/yolov5
⁸ https://github.com/AjaxMultiCommentary/ajmc/tree/main/ajmc/olr
⁹ We used the Python package mean-average-precision, https://github.com/bes-dev/mean_average_precision, version 2021.4.26.0

5. Experiments

We divide our experiments according to our research questions and list them in Table 3. As using only textual features is consistently reported to yield lower results [2, 21, 10], we test this approach in a single sub-experiment of the hybrid series.
This allows us to simplify our experimental design and to spare computing power while still being able to measure the benefits of adding image and coordinates. Results are presented in Table 4, and sample predictions are shown in Figure 4.

Table 3: Experimental design.

id | name | RQ | Train data | Test data | Languages
0A | Jebb - Base | RQ1 | Jebb | Jebb | en, gr
0B | Kamerbeek - Base | RQ1 | Kamerbeek | Kamerbeek | en, gr
1A | Jebb - Half trainset | RQ2 | Jebb | Jebb | en, gr
1C | Jebb - Token ablation | RQ5 | Jebb | Jebb | -
1D | Jebb, Kamerbeek - Base | RQ3 | Jebb, Kamerbeek | Jebb | en, gr
1E | Jebb - Text only | RQ1&5 | Jebb | Jebb | en, gr
2A | Campbell, Jebb - Transfer | RQ3 | Campbell | Jebb | en, gr
2B | Kamerbeek, Jebb - Transfer | RQ3 | Kamerbeek | Jebb | en, gr
2C | Garvie, Jebb - Transfer | RQ3 | Garvie | Jebb | en, gr
3A | Paduano - Base | RQ4 | Paduano | Paduano | it, gr
3B | Wecklein - Base | RQ4 | Wecklein | Wecklein | de, gr
4A | Omnibus (internal) | RQ1 | All (internal) | All (internal) | en, de, it, lat, gr
4B | Omnibus (external) | RQ1 | All (external) | All (external) | en, de, lat
4C | Omnibus - Transfer | RQ3 | All (internal) | All (external) | en, de, lat

RQ1: Which of the textual, visual and hybrid approaches performs best? As described in Section 1.3, our primary goal is to assess the performances of state-of-the-art models on classical commentaries and to investigate which of the three named approaches is the most appropriate for this kind of data.

Experimental design. As LayoutLM is pre-trained on English data, we first test its performances on two English commentaries: Jebb's (baseline, experiment 0A) and Kamerbeek's (0B). Besides its scholarly resonance, we chose Jebb's commentary as a baseline because it contains regions of all coarse classes. Kamerbeek's commentary, in contrast, presents an utterly different layout in which the commentary sections span an entire page.
¹⁰ See also https://github.com/bes-dev/mean_average_precision/issues/1

Additionally, we also train and test our models on a diverse set of commentaries on Sophocles' Ajax (experiment 4A) and on other Greek and Latin prose and poetry works (4B). We then test visual approaches by running the same experiments with YOLO_Multi. Finally, we test textual approaches with RoBERTa, using the same data as 0A (experiment 1E).

Results and discussion. Results show a net advantage in favor of YOLO_Multi, which overtakes LayoutLM by an average of .27 points over experiments 0A, 0B, 4A and 4B. Interestingly, and contrary to LayoutLM, YOLO_Multi completely misses footnotes in Jebb (N=8) and systematically incorporates them within the main paratext region. As for RoBERTa, its poor results are in line with the previously mentioned studies showing the inferiority of text-only approaches. This first series of experiments shows that image-based approaches can perform well even on regions with few distinctive graphical features, provided they have seen similar layouts during training.

RQ2: What is the impact of the training set's size? To address this question, we copy the setting of experiment 0A, only changing the size of the training set by sampling half of it randomly (experiment 1A).

Results and discussion. While both YOLO_Multi and LayoutLM show a performance drop in comparison with 0A, it is worth noting that depriving the former of half its training data only leads to a .05 decrease in mAP. The latter's case is more concerning and deserves a more thorough inquiry. First, the model did not seem to be penalised by the number of epochs, as its maximum score is already attained at epoch 33/50. Secondly, the difference in mAP does not reflect the difference in word-based F1-score, which decreases by only .10. In-depth analyses revealed predictions to be much more scattered, which drastically hampers homogeneous region building and accounts for the plunge in mAP scores.
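The region-building step that such scattered predictions disrupt, and the IoU criterion used for scoring (Section 4.3), can be sketched as follows. The data shapes and function names are our illustrative assumptions.

```python
from itertools import groupby

def words_to_regions(words):
    """Group consecutive words with identical labels and build a region
    from their bounding rectangle. `words` is a list of
    (label, (x_min, y_min, x_max, y_max)) pairs (assumed shapes)."""
    regions = []
    for label, group in groupby(words, key=lambda w: w[0]):
        boxes = [box for _, box in group]
        regions.append((label, (min(b[0] for b in boxes),
                                min(b[1] for b in boxes),
                                max(b[2] for b in boxes),
                                max(b[3] for b in boxes))))
    return regions

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0
```

Because `groupby` only merges consecutive identical labels, a single mislabelled word in the middle of a region splits it into several fragments, which is exactly why scattered predictions depress region-level mAP.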
The takeaway of this experiment is that 15 to 20 pages of ground-truth data already open the way to satisfactory results, whereas doubling this amount only accounts for an improvement of .05 mAP.

RQ3: How well do models generalize to layout types they have not encountered during training? We address the question of generalization in three ways. We first train a model on two commentaries and evaluate it on Jebb (baseline) to see whether mixing layout types can confuse the model. For this sub-experiment (1D) we use the commentaries by Jebb and Kamerbeek, two English commentaries with different layouts for which we already have individual baselines (cf. 0A and 0B). We then train three models on three English commentaries and evaluate them, again, on Jebb. We choose one commentary with an almost identical layout (Campbell, experiment 2A), and two commentaries with a completely different layout, Kamerbeek (2B) and Garvie (2C); in these two, commentary sections cover the main zone of the page. Finally, we train a model on all internal commentaries and test it on external commentaries (4C).

Results and discussion. First, it seems that mixing two types of layout in training did not confuse the model. On the contrary, YOLO_Multi shows a .15 increase in mAP between 0A and 1D. This improvement is probably due to the quantity of available data, as regions such as running headers, paratext, numbers and footnotes see their number of training instances doubled and their scores consistently improved. Interestingly, this correlation is not present for commentary sections: their AP remains at .90 despite a rise in N_t from 40 to 66. This plateau can be explained by the important change in the region's morphology between Kamerbeek and Jebb. More generally, this result suggests that using a single model with more data yields better results than individual models.
For experiments 2A, 2B and 2C, we generally observe a net decrease in performance when compared to the baseline. This being said, results are still better when generalizing to a similar layout type: experiment 2A is therefore above 2B and 2C for both LayoutLM and YOLO_Multi. These results also hint at the fact that LayoutLM seems to gain only little information from the textual channel, a trend to be confirmed below. Performances also decrease in experiment 4C, despite the broadness of the training set. This result provides compelling evidence of the model's struggles with completely unseen data. Indeed, even if many of the layout types are present in training, one must not understate the importance of other image features like the quality of the scan, the binarization threshold and so forth. To circumvent this problem, it may be sufficient to add very few images from the target data in training or fine-tuning. We leave this hypothesis to be tested in future work.

RQ4: How do hybrid models perform on languages they have not been pre-trained on? We then measure the impact of the text's language by training two models on an Italian (Paduano) and a German commentary (Wecklein) respectively (experiments 3A and 3B).

Results and discussion. As it appears, LayoutLM does not seem to be impacted by the commentary's main language. While results on German data fail to equal those of Jebb, the Italian data gets the best results for a single commentary overall. These results can be explained by the domain-specific prose style of commentary writing. As mentioned in Section 1.2, the text often patches Greek scripts, abbreviations, rare words and proper nouns together. To deal with this unusual distribution, LayoutLM's tokenizer has to chunk words into extremely tiny pieces to match them to its vocabulary. It is therefore very frequent to see words fed to the model as sequences of single-character embeddings.
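This fragmentation effect can be illustrated with a toy greedy longest-match subword tokenizer. This is a deliberately simplified stand-in: RoBERTa's actual byte-level BPE differs, and the vocabulary below is invented for illustration.

```python
def greedy_subword(word, vocab):
    """Toy greedy longest-match subword tokenizer (illustration only;
    LayoutLM/RoBERTa use byte-level BPE, which behaves differently).
    Out-of-vocabulary spans fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no vocabulary match: emit one character
            i += 1
    return pieces
```

A word covered by the vocabulary yields a few pieces, whereas a rare abbreviation or Greek word absent from it degenerates into one piece per character, which is the single-character-embedding situation described above.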
This setting lessens the model's capacity to rely on the knowledge acquired during pre-training and hence degrades its overall performance.

RQ5: To what extent do textual and visual features separately account for LayoutLM's decisions? To measure this last point more precisely, we run LayoutLM in token ablation mode (1C), feeding the model only with null tokens, thereby constraining its weights to rely solely on coordinates and images.

Results and discussion. RoBERTa's poor results (1E) already indicate that LayoutLM is highly dependent on coordinates and images. Experiment 1C confirms this intuition and contributes to explaining the model's indifference towards language. As a matter of fact, blanking textual inputs only diminishes the model's performance by .01 mAP. In some regions with consistent positioning, textual inputs even worsen the results: this is the case with commentary, critical apparatus and footnotes. However, the textual contents of page regions such as running headers and numbers do contain straightforwardly meaningful information: the former always contain identical words and the latter almost exclusively consist of Arabic numerals. This could explain why token ablation deteriorates the model's results in these two cases.

6. General discussion

YOLO_Mono and YOLO_Mono+LayoutLM. With a single class to predict, YOLO_Mono unsurprisingly surpasses YOLO_Multi and displays very encouraging results. The model is above .9 in experiments 0B, 3A and 4B, generalizes better than its rivals and even reaches a perfect mAP@0.5 in experiment 3B. Though these results can already be useful for other downstream tasks like pre-OCR region detection, we leverage YOLO_Mono's predictions and use them as a basis for LayoutLM, thereby addressing the problem of rebuilding regions. With an intriguing exception in experiment 0B, this method consistently improves LayoutLM.
This improvement is coherent with the caveats of our evaluation system (cf. Section 4.3). Rebuilding regions from labelled sequences can indeed lead to unwanted patchwork-like schemes, as small nested clusters divide regions and build new ones. Taking the majority class among the labelled tokens of an already predicted region, however, alleviates the harshness of region-based evaluation. This method calls for two further remarks. First, though we applied token classification to enable a fair comparison with the baseline settings, the fact that regions are predefined would allow for a sequence classification model, which could improve the results. Indeed, it may be difficult for the model to correctly label a single page-number token lost in a long sequence of text, whereas classifying an isolated line or page number within a pre-defined region could be an easier task. As a second remark, it is worth recalling that even if this approach remains inferior to YOLO_Multi in our case, it could prove more efficient on less domain-specific and noisy texts. Lifting this barrier could perhaps be achieved by using multilingual models such as LayoutXLM or by continuing the transformer's pre-training on domain-specific data, an investigation we plan to pursue in future work.

Inter- and intra-experiment variance. For experiments 0A, 0B, 3A and 3B, we train a single YOLO_Multi model per commentary. Despite similar training parameters and comparable amounts of data, we witness a strong variance between commentaries, with a gap of .18 between 0A and 0B. While this variance might be explained by layout particularities, we are also aware that it can be caused by the sparsity of the evaluation set. To acknowledge this limitation, we indicate the number of training and evaluation instances in Table 4. This sparsity also correlates with intra-experiment variance, i.e.
differences between each region's score.

Table 4: General results, where bold numbers mark the highest score within a single experiment. Each class cell reads AP (N_t/N_e), where N_t and N_e indicate the counts of instances in the training and evaluation sets respectively. Dashes stand for missing values.

| Exp | Model | mAP | App. crit. | Commentary | Footnote | Numbers | Others | Paratext | Primary text | Running h. |
|-----|---------|------|------------|-------------|-----------|----------|---------|-----------|--------------|------------|
| 0A | LLM | .38 | .12 (20/5) | .51 (40/10) | .50 (6/2) | .33 (63/24) | .32 (29/14) | .20 (7/4) | .34 (14/4) | .76 (29/11) |
| 0A | Y_Mono | .81 | – | – | – | – | – | – | – | – |
| 0A | Y+LLM | .45 | .45 (20/5) | .70 (40/10) | .00 (6/2) | .67 (63/24) | .19 (29/14) | .40 (7/4) | .50 (14/4) | .70 (29/11) |
| 0A | Y_Multi | .69 | .60 (20/5) | .90 (40/10) | .00 (6/2) | .81 (63/24) | .62 (29/14) | .95 (7/4) | .75 (14/4) | .89 (29/11) |
| 0B | LLM | .22 | – (0/0) | .05 (26/4) | .12 (10/2) | .50 (32/6) | .00 (6/2) | .36 (7/3) | – (0/0) | .71 (31/7) |
| 0B | Y_Mono | .93 | – | – | – | – | – | – | – | – |
| 0B | Y+LLM | .21 | – (0/0) | .00 (26/4) | .00 (10/2) | .17 (32/6) | .00 (6/2) | .61 (7/3) | – (0/0) | .90 (31/7) |
| 0B | Y_Multi | .51 | – (0/0) | 1.00 (26/4) | .25 (10/2) | 1.00 (32/6) | .00 (6/2) | .83 (7/3) | – (0/0) | 1.00 (31/7) |
| 1A | LLM | .14 | .04 (20/5) | .08 (40/10) | .03 (6/2) | .08 (63/24) | .01 (29/14) | .08 (7/4) | .02 (14/4) | .76 (29/11) |
| 1A | Y_Mono | .74 | – | – | – | – | – | – | – | – |
| 1A | Y+LLM | .35 | .27 (20/5) | .40 (40/10) | .00 (6/2) | .70 (63/24) | .08 (29/14) | .44 (7/4) | .17 (14/4) | .76 (29/11) |
| 1A | Y_Multi | .64 | .60 (20/5) | .80 (40/10) | .10 (6/2) | .94 (63/24) | .45 (29/14) | .83 (7/4) | .42 (14/4) | 1.00 (29/11) |
| 1C | LLM | .37 | .60 (20/5) | .80 (40/10) | 1.00 (6/2) | .01 (63/24) | .22 (29/14) | .21 (7/4) | .10 (14/4) | .05 (29/11) |
| 1C | Y+LLM | .43 | .40 (20/5) | .78 (40/10) | 1.00 (6/2) | .22 (63/24) | .19 (29/14) | .63 (7/4) | .17 (14/4) | .03 (29/11) |
| 1D | LLM | .26 | .15 (20/5) | .34 (66/10) | .00 (16/2) | .35 (95/24) | .22 (35/14) | .07 (14/4) | .11 (14/4) | .85 (60/11) |
| 1D | Y_Mono | .83 | – | – | – | – | – | – | – | – |
| 1D | Y+LLM | .43 | .40 (20/5) | .19 (66/10) | .50 (16/2) | .58 (95/24) | .27 (35/14) | .32 (14/4) | .34 (14/4) | .85 (60/11) |
| 1D | Y_Multi | .85 | .80 (20/5) | .90 (66/10) | .75 (16/2) | .85 (95/24) | .79 (35/14) | 1.00 (14/4) | .68 (14/4) | 1.00 (60/11) |
| 1E | RoB. | .10 | .20 (20/5) | .27 (40/10) | .00 (6/2) | .00 (63/24) | .19 (29/14) | .01 (7/4) | .12 (14/4) | .00 (29/11) |
| 1E | Y+R. | .11 | .27 (20/5) | .26 (40/10) | .00 (6/2) | .00 (63/24) | .08 (29/14) | .17 (7/4) | .11 (14/4) | .00 (29/11) |
| 2A | LLM | .18 | .01 (23/5) | .07 (46/10) | .00 (8/2) | .33 (96/24) | .00 (18/14) | .09 (12/4) | .11 (23/4) | .84 (32/11) |
| 2A | Y_Mono | .65 | – | – | – | – | – | – | – | – |
| 2A | Y+LLM | .29 | .20 (23/5) | .20 (46/10) | .00 (8/2) | .67 (96/24) | .06 (18/14) | .20 (12/4) | .25 (23/4) | .77 (32/11) |
| 2A | Y_Multi | .35 | .20 (23/5) | .00 (46/10) | .00 (8/2) | .73 (96/24) | .07 (18/14) | .20 (12/4) | .65 (23/4) | .97 (32/11) |
| 2B | LLM | .10 | .00 (0/5) | .07 (26/10) | .00 (10/2) | .21 (32/24) | .00 (6/14) | .27 (7/4) | .00 (0/4) | .23 (31/11) |
| 2B | Y_Mono | .52 | – | – | – | – | – | – | – | – |
| 2B | Y+LLM | .16 | .00 (0/5) | .16 (26/10) | .00 (10/2) | .21 (32/24) | .00 (6/14) | .50 (7/4) | .00 (0/4) | .45 (31/11) |
| 2B | Y_Multi | .20 | .00 (0/5) | .00 (26/10) | .00 (10/2) | .70 (32/24) | .00 (6/14) | .33 (7/4) | .00 (0/4) | .58 (31/11) |
| 2C | LLM | .01 | .00 (7/5) | .02 (9/10) | .00 (4/2) | .00 (96/24) | .00 (9/14) | .00 (5/4) | .03 (8/4) | .00 (16/11) |
| 2C | Y_Mono | .39 | – | – | – | – | – | – | – | – |
| 2C | Y+LLM | .06 | .00 (7/5) | .02 (9/10) | .00 (4/2) | .00 (96/24) | .06 (9/14) | .25 (5/4) | .17 (8/4) | .00 (16/11) |
| 2C | Y_Multi | .26 | .00 (7/5) | .20 (9/10) | .00 (4/2) | .70 (96/24) | .07 (9/14) | .00 (5/4) | .50 (8/4) | .58 (16/11) |
| 3A | LLM | .41 | – (0/0) | .60 (19/3) | – (0/0) | .26 (120/19) | .06 (17/3) | .83 (7/2) | .50 (14/1) | 1.00 (32/6) |
| 3A | Y_Mono | .90 | – | – | – | – | – | – | – | – |
| 3A | Y+LLM | .63 | – (0/0) | 1.00 (19/3) | – (0/0) | 1.00 (120/19) | .06 (17/3) | 1.00 (7/2) | 1.00 (14/1) | 1.00 (32/6) |
| 3A | Y_Multi | .58 | – (0/0) | 1.00 (19/3) | – (0/0) | 1.00 (120/19) | .67 (17/3) | 1.00 (7/2) | .00 (14/1) | 1.00 (32/6) |
| 3B | LLM | .35 | – (0/0) | .25 (31/4) | .00 (1/1) | .44 (126/19) | .00 (10/2) | .12 (2/3) | 1.00 (36/5) | 1.00 (31/6) |
| 3B | Y_Mono | 1.00 | – | – | – | – | – | – | – | – |
| 3B | Y+LLM | .43 | – (0/0) | .08 (31/4) | .00 (1/1) | .95 (126/19) | .00 (10/2) | .87 (2/3) | .76 (36/5) | .82 (31/6) |
| 3B | Y_Multi | .54 | – (0/0) | 1.00 (31/4) | .00 (1/1) | 1.00 (126/19) | .50 (10/2) | .00 (2/3) | 1.00 (36/5) | .83 (31/6) |
| 4A | LLM | .52 | .61 (96/20) | .61 (363/57) | .54 (60/18) | .32 (1248/254) | .10 (163/58) | .44 (88/33) | .65 (273/57) | .90 (379/92) |
| 4A | Y_Mono | .87 | – | – | – | – | – | – | – | – |
| 4A | Y+LLM | .57 | .45 (96/20) | .57 (363/57) | .66 (60/18) | .73 (1248/254) | .15 (163/58) | .62 (88/33) | .48 (273/57) | .89 (379/92) |
| 4A | Y_Multi | .79 | .81 (96/20) | .96 (363/57) | .61 (60/18) | .90 (1248/254) | .67 (163/58) | .85 (88/33) | .61 (273/57) | .92 (379/92) |
| 4B | LLM | .44 | .45 (32/8) | .81 (102/18) | .00 (7/3) | .34 (275/46) | .19 (53/12) | .42 (26/4) | .35 (53/9) | .94 (96/16) |
| 4B | Y_Mono | .93 | – | – | – | – | – | – | – | – |
| 4B | Y+LLM | .65 | .46 (32/8) | .90 (102/18) | .00 (7/3) | .98 (275/46) | .51 (53/12) | .50 (26/4) | .88 (53/9) | 1.00 (96/16) |
| 4B | Y_Multi | .74 | .85 (32/8) | .97 (102/18) | .00 (7/3) | .96 (275/46) | .64 (53/12) | .50 (26/4) | 1.00 (53/9) | 1.00 (96/16) |
| 4C | LLM | .31 | .06 (96/8) | .41 (363/18) | .33 (60/3) | .33 (1248/46) | .01 (163/12) | .25 (88/4) | .28 (273/9) | .77 (379/16) |
| 4C | Y_Mono | .65 | – | – | – | – | – | – | – | – |
| 4C | Y+LLM | .39 | .00 (96/8) | .49 (363/18) | .41 (60/3) | .43 (1248/46) | .24 (163/12) | .31 (88/4) | .28 (273/9) | 1.00 (379/16) |
| 4C | Y_Multi | .42 | .00 (96/8) | .89 (363/18) | .00 (60/3) | .60 (1248/46) | .37 (163/12) | .29 (88/4) | .33 (273/9) | .91 (379/16) |

As splitting at page level does not allow all classes to be precisely balanced, some end up being poorly distributed. Besides, footnotes are extremely rare, which can explain their poor results (mean AP of .13 over the four mentioned experiments). On the contrary, more frequent classes like commentaries and running headers yield much higher results, with mean APs of .98 and .93 respectively.

7. Conclusions and further work

Our main contributions lie in our experiments and in the release of an annotated dataset. As part of our annotated data is still under copyright, we also release YOLO models trained on the entirety of our data in the hope that they might constitute a useful starting point for similar research.¹¹ The key takeaways of this research are listed below:

• We show that an object detection model such as YOLO succeeds in classifying semantic regions of text-heavy documents, even when these regions feature few obvious graphical differences.

• Hybrid models like LayoutLM may be of help to researchers working with clean and generic English data. However, in a highly noisy, multilingual, domain-specific and historical setting, they tend to make little use of the textual channel and mainly base their decisions on coordinates and images.
• With 8 classes in total, we show that annotating 15 to 20 pages of ground-truth data already yields satisfactory results; doubling this amount improves results by only .05 on average.

Furthermore, because historical classical commentaries and critical editions are significantly similar in terms of layout, our approach lays the groundwork for developing a robust, generic model for the page layout analysis of these publications in the near future. Such a model, combined with existing open-source tools that can be chained into a seamless pipeline (e.g. eScriptorium for annotation and Kraken for OCR), has the potential to open up new perspectives for researchers exploiting openly available digitized editions and commentaries. Similarly, such a model could be useful to projects aimed at the creation of large-scale corpora of marked-up texts, such as the Free First Thousand Years of Greek (FF1KG) project [15]. In this project, dozens of summer interns have, over the years, manually annotated the layout of digitized critical editions, a tedious task that it will soon be possible to semi-automate.

As for future work, we think two strands of possible improvement are worth investigating. First, as mentioned in Section 5, we would like to explore the effect of adding minimal in-domain data when fine-tuning a generic model. Based on our experiments (notably 1A), our hypothesis is that providing only a few pages should improve the results on unseen commentaries. Secondly, we think our hybrid models would yield better results if they could use more meaningful representations of the text. It could therefore be worth experimenting with a multilingual model such as LayoutXLM or with domain-specific language-modelling pre-training.

¹¹ https://github.com/AjaxMultiCommentary/layout-yolo-models

Acknowledgments

This research has been supported by the Swiss National Science Foundation under an Ambizione grant (PZ00P1_186033).
We thank Carla Amaya, who helped in the ground-truth annotation process.

References

[1] R. Barman, M. Ehrmann, S. Clematide, and S. A. Oliveira. Datasets and Models for Historical Newspaper Article Segmentation. 2021. doi: 10.5281/zenodo.3706863. url: https://zenodo.org/record/3706863.

[2] R. Barman, M. Ehrmann, S. Clematide, S. A. Oliveira, and F. Kaplan. "Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers". In: arXiv:2002.06144 [cs] (2020). url: http://arxiv.org/abs/2002.06144.

[3] T. Clérice. YALTAi: Segmonto Manuscript and Early Printed Book Dataset. 2022. doi: 10.5281/zenodo.6814770. url: https://zenodo.org/record/6814770.

[4] T. Clérice. You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine. 2022. url: https://hal-enc.archives-ouvertes.fr/hal-03723208.

[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: arXiv:1810.04805 [cs] (2019). url: http://arxiv.org/abs/1810.04805.

[6] A. Dutta and A. Zisserman. "The VIA annotation software for images, audio and video". In: MM '19: Proceedings of the 27th ACM international conference on multimedia. New York, NY, USA: ACM, 2019, pp. 2276–2279. doi: 10.1145/3343031.3350535. url: https://doi.org/10.1145/3343031.3350535.

[7] S. Eskenazi, P. Gomez-Krämer, and J.-M. Ogier. "A comprehensive survey of mostly textual document segmentation algorithms since 2008". In: Pattern Recognition 64 (2017), pp. 1–14. doi: 10.1016/j.patcog.2016.10.023. url: https://linkinghub.elsevier.com/retrieve/pii/S0031320316303399.

[8] S. Gabay, J.-B. Camps, A. Pinche, and N. Carboni. SegmOnto, A Controlled Vocabulary to Describe the Layout of Pages. Paris/Genève, 2021. url: https://github.com/SegmOnto.

[9] S. Gabay, J.-B. Camps, A. Pinche, and C. Jahan. "SegmOnto: common vocabulary and practices for analysing the layout of manuscripts (and more)". In: ICDAR 2021 Workshop on Computational Paleography (IWCP). Lausanne, 2021, pp. 1–4. url: https://www.csmc.uni-hamburg.de/iwcp2021/files/abstracts/iwcp2021-paper-7.pdf.

[10] Y. Huang, T. Lv, L. Cui, Y. Lu, and F. Wei. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. 2022. url: http://arxiv.org/abs/2204.08387.

[11] K. Kise. "Page Segmentation Techniques in Document Analysis". In: Handbook of Document Image Processing and Recognition. Ed. by D. Doermann and K. Tombre. London: Springer London, 2014, pp. 135–175. doi: 10.1007/978-0-85729-859-1_5. url: http://link.springer.com/10.1007/978-0-85729-859-1_5.

[12] J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei. DiT: Self-supervised Pre-training for Document Image Transformer. 2022. url: http://arxiv.org/abs/2203.02378.

[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. "RoBERTa: A Robustly Optimized BERT Pretraining Approach". In: arXiv:1907.11692 [cs] (2019). url: http://arxiv.org/abs/1907.11692.

[14] M. Mehri, P. Heroux, R. Mullot, J.-P. Moreux, B. Couasnon, and B. Barrett. "ICDAR2019 Competition on Historical Book Analysis - HBA2019". In: 2019 International Conference on Document Analysis and Recognition (ICDAR). Sydney, Australia: IEEE, 2019, pp. 1488–1493. doi: 10.1109/icdar.2019.00239. url: https://ieeexplore.ieee.org/document/8978192/.

[15] L. Muellner. "The Free First Thousand Years of Greek". In: Digital Classical Philology. Ed. by M. Berti. Berlin, Boston: De Gruyter Saur, 2019, pp. 7–18. url: https://www.degruyter.com/document/doi/10.1515/9783110599572-002/html.

[16] R. Petitpierre. Historical City Maps Semantic Segmentation Dataset. 2021. doi: 10.5281/zenodo.5513639. url: https://zenodo.org/record/5513639.

[17] M. Riedl, D. Betz, and S. Padó. "Clustering-Based Article Identification in Historical Newspapers". In: Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Minneapolis, USA: Association for Computational Linguistics, 2019, pp. 12–17. doi: 10.18653/v1/W19-2502. url: https://www.aclweb.org/anthology/W19-2502.

[18] M. Romanello, C. Amaya, B. Robertson, and S. Najem-Meyer. AjaxMultiCommentary/GT-commentaries-OCR: Version 1.0. 2021. doi: 10.5281/zenodo.5526670. url: https://doi.org/10.5281/zenodo.5526670.

[19] F. Simistira, M. Bouillon, M. Seuret, M. Wursch, M. Alberti, R. Ingold, and M. Liwicki. "ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts". In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Kyoto: IEEE, 2017, pp. 1361–1370. doi: 10.1109/icdar.2017.223. url: http://ieeexplore.ieee.org/document/8270154/.

[20] D. Stoekl Ben Ezra, B. Brown-DeVost, P. Jablonski, H. Lapin, B. Kiessling, and E. Lolli. "BiblIA - a General Model for Medieval Hebrew Manuscripts and an Open Annotated Dataset". In: The 6th International Workshop on Historical Document Imaging and Processing. HIP '21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 61–66. doi: 10.1145/3476887.3476896. url: https://doi.org/10.1145/3476887.3476896.

[21] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. L. Giles. Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Network. 2017. url: http://arxiv.org/abs/1706.02337.

Appendix

Figure 4: Examples of pages by Jebb (left) and Kamerbeek (right) with the predictions of LayoutLM's and YOLO_Multi's best checkpoints from experiment 4A.