=Paper=
{{Paper
|id=Vol-3180/paper-83
|storemode=property
|title=Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-83.pdf
|volume=Vol-3180
|authors=Maud Ehrmann,Matteo Romanello,Sven Najem-Meyer,Antoine Doucet,Simon Clematide
|dblpUrl=https://dblp.org/rec/conf/clef/EhrmannRNDC22
}}
==Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents==
Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents Maud Ehrmann1 , Matteo Romanello2 , Sven Najem-Meyer1 , Antoine Doucet3 and Simon Clematide4 1 Digital Humanities Laboratory, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland 2 University of Lausanne, Lausanne, Switzerland 3 University of La Rochelle, La Rochelle, France 4 Department of Computational Linguistics, University of Zurich, Zurich, Switzerland Abstract This paper presents an overview of the second edition of HIPE (Identifying Historical People, Places and other Entities), a shared task on named entity recognition and linking in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, HIPE-2022 confronts systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. This shared task is part of the ongoing efforts of the natural language processing and digital humanities communities to adapt and develop appropriate technologies to efficiently retrieve and explore information from historical texts. On such material, however, named entity processing techniques face the challenges of domain heterogeneity, input noisiness, dynamics of language, and lack of resources. In this context, the main objective of HIPE-2022, run as an evaluation lab of the CLEF 2022 conference, is to gain new insights into the transferability of named entity processing approaches across languages, time periods, document types, and annotation tag sets. Tasks, corpora, and results of participating teams are presented. Compared to the condensed overview [1], this paper contains more refined statistics on the datasets, a break down of the results per type of entity, and a discussion of the ‘challenges’ proposed in the shared task. Keywords Named entity recognition and classification, Entity linking, Historical texts, Information extraction, Digitised newspapers, Classical commentaries, Digital humanities 1. Introduction Through decades of massive digitisation, an unprecedented amount of historical documents became available in digital format, along with their machine-readable texts. While this represents a major step forward in terms of preservation and accessibility, it also bears the potential for new ways to engage with historical documents’ contents. The application of machine reading to CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy $ maud.ehrmann@epfl.ch (M. Ehrmann); matteo.romanello@unil.ch (M. Romanello); sven.najem-meyer@epfl.ch (S. Najem-Meyer); antoine.doucet@univ-lr.fr (A. Doucet); simon.clematide@uzh.ch (S. Clematide) 0000-0001-9900-2193 (M. Ehrmann); 0000-0002-7406-6286 (M. Romanello); 0000-0002-3661-4579 (S. Najem-Meyer); 0000-0001-6160-3356 (A. Doucet); 0000-0003-1365-0662 (S. Clematide) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) historical documents is potentially transformative and the next fundamental challenge is to adapt and develop appropriate technologies to efficiently search, retrieve and explore information from this ‘big data of the past’ [2]. 
Semantic indexing of historical documents is in great demand among humanities scholars, and the interdisciplinary efforts of the digital humanities (DH), natural language processing (NLP), computer vision and cultural heritage communities are progressively pushing forward the processing of facsimiles, as well as the extraction, linking and representation of the complex information enclosed in transcriptions of digitised collections [3]. In this regard, information extraction techniques, and particularly named entity (NE) processing, can be considered among the first and most crucial processing steps. Yet, the recognition, classification and disambiguation of NEs in historical texts is not straight- forward, and performances are not on par with what is usually observed on contemporary well-edited English news material [4]. In particular, NE processing on historical documents faces the challenges of domain heterogeneity, input noisiness, dynamics of language, and lack of resources [5]. Although some of these issues have already been tackled in isolation in other contexts (with e.g., user-generated text), what makes the task particularly difficult is their simultaneous combination and their magnitude: texts are severely noisy, and domains and time periods are far apart. Motivation and Objectives. As the first evaluation campaign of its kind on multilingual historical newspaper material, the CLEF-HIPE-2020 edition 1 [6, 7] proposed the tasks of NE recognition and classification (NERC) and entity linking (EL) in ca. 200 years of historical newspapers written in English, French and German. HIPE-2020 brought together 13 teams who submitted a total of 75 runs for 5 different task bundles. The main conclusion of this edition was that neural-based approaches can achieve good performances on historical NERC when provided with enough training data, but that progress is still needed to further improve performances, adequately handle OCR noise and small-data settings, and better address entity linking. HIPE-2022 attempts to drive further progress on these points, and also confront systems with new challenges. An additional point is that in the meantime several European cultural heritage projects have prepared additional NE-annotated text material, thus opening a unique window of opportunity to organize a second edition of the HIPE evaluation lab in 2022. HIPE-20222 shared task focuses on named entity processing in historical documents covering the period from the 18th to the 20th century and featuring several languages. Compared to the first edition, HIPE-2022 introduces several novelties: • the addition of a new type of document alongside historical newspapers, namely classical commentaries3 ; • the consideration of a broader language spectrum, with 5 languages for historical news- papers (3 for the previous edition), and 3 for classical commentaries; • the confrontation with heterogeneous annotation tag sets and guidelines. 1 https://impresso.github.io/CLEF-HIPE-2020 2 https://hipe-eval.github.io/HIPE-2022/ 3 Classical commentaries are scholarly publications dedicated to the in-depth analysis and explanation of ancient literary works. As such, they aim to facilitate the reading and understanding of a given literary text. Overall, HIPE-2022 confronts participants with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation schemas. 
The objectives of the evaluation lab are to contribute new insights on how best to ensure the transferability of NE processing approaches across languages, time periods, document and annotation types, and to answer the question whether one architecture or model can be optimised to perform well across settings and annotation targets in a cultural heritage context. In particular, the following research questions are addressed: 1. How well can general prior knowledge transfer to historical texts? 2. Are in-domain language representations (i.e. language models learned on the historical document collections) beneficial, and under which conditions? 3. How can systems adapt and integrate training material with different annotations? 4. How can systems, with limited additional in-domain training material, (re)target models to produce a certain type of annotation? Recent work on NERC showed encouraging progress on several of these topics: Beryozkin et al. [8] proposed a method to deal with related, but heterogeneous tag sets. Several researchers successfully applied meta-learning strategies to NERC to improve transfer learning: Li et al. [9] improved results for extreme low-resource few-shot settings where only a handful of annotated examples for each entity class are used for training; Wu et al. [10] presented techniques to improve cross-lingual transfer; and Li et al. [11] tackled the problem of domain shifts and heterogeneous label sets using meta-learning, proposing a highly data-efficient domain adaptation approach. The remainder of this paper is organized as follows. Sections 2 and 3 present the tasks and the material used for the evaluation. Section 4 details the evaluation framework, with evaluation metrics and the organisation of system submissions around tracks and challenges. Section 5 introduces the participating systems, while Section 6 presents and discusses their results. Finally, Section 7 summarizes the benefits of the task and concludes.4 2. Task Description HIPE-2022 focuses on the same tasks as CLEF-HIPE-2020, namely: Task 1: Named Entity Recognition and Classification (NERC) • Subtask NERC-Coarse: this task includes the recognition and classification of high-level entity types (person, organisation, location, product and domain-specific entities, e.g. mythological characters or literary works in classical commentaries). • Subtask NERC-Fine: includes the recognition and classification of entity mentions according to fine-grained types, plus the detection and classification of nested entities of depth 1. This subtask is proposed for English, French and German only. 4 For space reasons, the discussion of related work is included in the extended version of this overview [12]. Table 1 Overview of HIPE-2022 datasets with an indication of which tasks they are suitable for according to their annotation types. Dataset alias Document type Languages Suitable for hipe2020 historical newspapers de, fr, en NERC-Coarse, NERC-Fine, EL newseye historical newspapers de, fi, fr, sv NERC-Coarse, NERC-Fine, EL sonar historical newspapers de NERC-Coarse, EL letemps historical newspapers fr NERC-Coarse, NERC-Fine topres19th historical newspapers en NERC-Coarse, EL ajmc classical commentaries de, fr, en NERC-Coarse, NERC-Fine, EL Task 2: Named Entity Linking (EL) This task corresponds to the linking of named entity mentions to a unique item ID in Wikidata, our knowledge base of choice, or to a NIL value if the mention does not have a corresponding item in the knowledge base (KB). 
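For illustration, the sketch below shows one possible in-memory view of what these two tasks require a system to produce for each mention: a coarse type, optionally a fine-grained type and nested mentions (NERC-Fine), and a Wikidata QID or NIL (EL). The class and field names are ours and purely illustrative; they do not correspond to the official HIPE-2022 data schema.

```python
# Illustrative only: one possible in-memory view of what HIPE-2022 systems must
# predict per mention. Class and field names are ours, not the official data schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Mention:
    surface: str                         # mention string as it appears in the (OCRed) text
    start: int                           # token offsets within the document
    end: int
    coarse_type: str                     # NERC-Coarse label, e.g. "pers", "loc", "org"
    fine_type: Optional[str] = None      # NERC-Fine label, e.g. "pers.ind" (de/fr/en only)
    nested: List["Mention"] = field(default_factory=list)  # nested entities of depth 1
    wikidata_qid: Optional[str] = None   # EL target: a QID such as "Q1741", or None for NIL


# A location mention, refined at the fine-grained level and linked to Wikidata.
vienna = Mention(surface="Wien", start=12, end=13,
                 coarse_type="loc", fine_type="loc.adm.town", wikidata_qid="Q1741")

# A person with no Wikidata entry: the EL target is NIL (here, wikidata_qid stays None).
unknown = Mention(surface="Hr. Gruber", start=40, end=42, coarse_type="pers")
```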
We will allow submissions of both end-to-end systems (NERC and EL) and of systems performing exclusively EL on gold entity mentions provided by the organizers (EL-only). 3. Data HIPE-2022 data consists of six NE-annotated datasets composed of historical newspapers and classic commentaries covering ca. 200 years. Datasets originate from the previous HIPE-2020 campaign, from HIPE organisers’ previous research project, and from several European cultural heritage projects which agreed to postpone the publication of 10% to 20% of their annotated material to support HIPE-2022. Original datasets feature several languages and were annotated with different entity tag sets and according to different annotation guidelines. See Table 1 for an overview. 3.1. Original Datasets Historical newspapers. The historical newspaper data is composed of several datasets in English, Finnish, French, German and Swedish which originate from various projects and national libraries in Europe: • HIPE-2020 data corresponds to the datasets of the first HIPE-2020 campaign. They are composed of articles from Swiss, Luxembourgish and American newspapers in French, German and English (19C-20C) that were assembled during the impresso project5 [13]. Together, the train, dev and test hipe2020 datasets contain 17,553 linked entity mentions, classified according to a fine-grained tag set, where nested entities, mention components and metonymic senses are also annotated [14]. 5 https://impresso-project.ch • NewsEye data corresponds to a set of NE-annotated datasets composed of newspaper articles in French, German, Finnish and Swedish (19C-20C) [15]. Built in the context of the NewsEye project6 , the newseye train, dev and test sets contain 36,790 linked entity mentions, classified according to a coarse-grained tag set and annotated on the basis of guidelines similar to the ones used for hipe2020. Roughly 20% of the data was retained from the original dataset publication and is published for the first time for HIPE-2022, where it is used as test data (thus the previously published test set became a second dev set in HIPE-2022 data distribution). • SoNAR data is an NE-annotated dataset composed of newspaper articles from the Berlin State library newspaper collections in German (19C-20C), produced in the context of the SoNAR project7 . The sonar dataset contains 1,125 linked entity mentions, classified according to a coarse-grained tag set. It was thoroughly revised and corrected on NE and EL levels by the HIPE-2022 organisers. It is split in a dev and test set – without providing a dedicated train set. • Le Temps data: a previously unpublished, NE-annotated diachronic dataset composed of historical newspaper articles from two Swiss newspapers in French (19C-20C) [4]. This dataset contains 11,045 entity mentions classified according to a fine-grained tag set similar to hipe2020. • Living with Machines data corresponds to an NE-annotated dataset composed of news- paper articles from the British Library newspapers in English (18C-19C) and assembled in the context of the Living with Machine project8 . The topres19th dataset contains 4,601 linked entity mentions, exclusively of geographical types annotated following their own annotation guidelines [16]. Part of this data has been retained from the original dataset publication and is used and released for the first time for HIPE-2022. Historical commentaries. 
The classical commentaries data originates from the Ajax Multi- Commentary project and is composed of OCRed 19C commentaries published in French, German and English [17], annotated with both universal NEs (person, location, organisation) and domain- specific NEs (bibliographic references to primary and secondary literature). In the field of classical studies, commentaries constitute one of the most important and enduring forms of scholarship, together with critical editions and translations. They are information-rich texts, characterised by a high density of NEs. These six datasets compose the HIPE-2022 corpus. They underwent several preparation steps, with conversion to the tab-separated HIPE format, correction of data inconsistencies, metadata consolidation, re-annotation of parts of the datasets, deletion of extremely rare entities (esp. for topres19th), and rearrangement or composition of train and dev splits9 . 6 https://www.newseye.eu/ 7 https://sonar.fh-potsdam.de/ 8 https://livingwithmachines.ac.uk/ 9 Additional information is available online by following the links indicated for each datasets in Table 1. Table 2 Entity types used for NERC tasks, per dataset and with information whether nesting and linking apply. *: these types are not present in letemps data. **: linking applies, unless the token is flagged as InSecondaryReference. Dataset Coarse tag set Fine tag set Nesting Linking hipe2020 pers pers.ind yes yes letemps pers.coll pers.ind.articleauthor org* org.adm yes yes org.ent org.ent.pressagency prod* prod.media no yes prod.doctr time* time.date.abs no no loc loc.adm.town yes yes loc.adm.reg loc.adm.nat loc.adm.sup loc.phys.geo yes yes loc.phys.hydro loc.phys.astro loc.oro yes yes loc.fac yes yes loc.add.phys yes yes loc.add.elec loc.unk no no newseye pers pers.articleauthor yes yes org - yes yes humanprod - yes yes loc - no yes topres19th loc - no yes building - no yes street - no yes ajmc pers pers.author yes yes** pers.editor pers.myth pers.other work work.primlit yes yes** work.seclit work.fragm loc - yes yes** object object.manuscr yes no object.museum date - yes no scope - yes no sonar pers - no yes loc - no yes org - no yes 3.2. Corpora Characteristics Overall, the HIPE-2022 corpus covers five languages (English, French, Finnish, German and Swedish), with a total of over 2.3 million tokens (2,211,449 for newspapers and 111,218 for classical commentaries) and 78,000 entities classified according to five different entity typologies and linked to Wikidata records. Detailed statistics about the datasets are provided in Table 3, 4 and 5 The datasets in the corpus are quite heterogeneous in terms of annotation guidelines. Two datasets – hipe2020 and letemps – follow the same guidelines [14, 18], and newseye was annotated using a slightly modified version of these guidelines. In the sonar dataset, persons, locations and organisations were annotated, whereas in topres19th only toponyms were considered. Compared to the other datasets, ajmc stands out for having being annotated according to domain-specific guidelines [19], which focus on bibliographic references to primary and secondary literature. This heterogeneity of guidelines leads to a wide variety of entity types and sub-types for the NERC task (see Table 2 and 5). Among these types, only persons, locations and organisations are found in all datasets (except for topres19th), thus constituting a set of “universal” entity types. Certain entity types are under-represented in some datasets (e.g. 
objects, locations and dates in ajmc) and, as such, constitute good candidates for the application of data augmentation strategies. Moreover, while nested entities are annotated in all datasets except topres19th and sonar, only hipe2020 and newseye have a sizable number of such entities. Detailed information about entity mentions that are affected by OCR mistakes is provided in ajmc and hipe2020 (only for the test set for the latter). As OCR noise constitutes one of the main challenges of historical NE processing [5], this information can be extremely useful to explain differences in performance between datasets or between languages in the same dataset. For instance, looking at the percentage of noisy mentions for the different languages in ajmc, we find that it is three times higher in French documents than in the other two languages. HIPE-2022 datasets show significant differences in terms of lexical overlap between train, dev and test sets. Following the observations of Augenstein et al. [20] and Taillé et al. [21] on the impact of lexical overlap on NERC performance, we computed the percentage of mention overlap between data folds for each dataset, based on the number of identical entity mentions (in terms of surface form) between train+dev and test sets (see Table 6). Evaluation results obtained on training and test sets with low mention overlap, for example, can be taken as an indicator of the ability of the models to generalise well to unseen mentions. We find that ajmc, letemps and topres19th have a mention overlap which is almost twice that of hipe2020, sonar and newseye. Finally, regarding entity linking, it is interesting to observe that the percentage of NIL entities (i.e. entities not linked to Wikidata) varies substantially across datasets. The Wikidata coverage is drastically lower for newseye than for the other newspaper datasets (44.36%). Conversely, only 1.45% of the entities found in ajmc cannot be linked to Wikidata. This fact is not at all surprising considering that commentaries mention mostly mythological figures, scholars of the past and literary works, while newspapers mention many relatively obscure or unknown individuals, for whom no Wikidata entry exists. Table 3 Overview of newspaper corpora statistics (hipe-2022 release v2.1). NIL percentages are computed based on linkable entities (i.e., excluding time entities for hipe2020). Dataset Lang. 
Fold Docs Tokens Mentions All Fine Nested %noisy %NIL hipe2020 de Train 103 86,446 3,494 3,494 158 - 15.70 Dev 33 32,672 1,242 1,242 67 - 18.76 Test 49 30,738 1,147 1,147 73 12.55 17.40 Total 185 149,856 5,883 5,883 298 - 16.66 en Train - - - - - - - Dev 80 29,060 966 - - - 44.18 Test 46 16,635 449 - - 5.57 40.28 Total 126 45,695 1,415 - - - 42.95 fr Train 158 166,218 6,926 6,926 473 - 25.26 Dev 43 37,953 1,729 1,729 91 - 19.81 Test 43 40,855 1,600 1,600 82 11.25 20.23 Total 244 245,026 10,255 10,255 646 - 23.55 Total 555 440,577 17,553 16,138 944 - 22.82 newseye de Train 7 374,250 11,381 21 876 - 51.07 Dev 12 40,046 539 5 27 - 22.08 Dev2 12 39,450 882 4 64 - 53.74 Test 13 99,711 2,401 13 89 - 48.52 Total 44 553,457 15,203 43 1,056 - 49.79 fi Train 24 48,223 2,146 15 224 - 40.31 Dev 24 6,351 223 1 25 - 40.36 Dev2 21 4,705 203 4 22 - 42.86 Test 24 14,964 691 7 42 - 47.47 Total 93 74,243 3,263 27 313 - 41.99 fr Train 35 255,138 10,423 99 482 - 42.42 Dev 35 21,726 752 3 29 - 30.45 Dev2 35 30,457 1,298 10 63 - 38.91 Test 35 70,790 2,530 34 131 - 44.82 Total 140 378,111 15,003 146 705 - 41.92 sv Train 21 56,307 2,140 16 110 - 32.38 Dev 21 6,907 266 1 7 - 25.19 Dev2 21 6,987 311 1 20 - 37.30 Test 21 16,163 604 0 26 - 35.43 Total 84 86,364 3,321 18 163 - 32.82 Total 361 1,092,175 36,790 234 2,237 - 44.36 letemps fr Train 414 379,481 9,159 9,159 69 - - Dev 51 38,650 869 869 12 - - Test 51 48,469 1,017 1,017 12 - - Total 516 466,600 11,045 11,045 93 - - topres19th en Train 309 123,977 3,179 - - - 18.34 Dev 34 11,916 236 - - - 13.98 Test 112 43,263 1,186 - - - 17.2 Total 455 179,156 4,601 - - - 17.82 Total 455 179,156 4,601 - - - 17.82 sonar de Train - - - - - - - Dev 10 17,477 654 - - - 22.48 Test 10 15,464 471 - - - 33.33 Total 20 32,941 1,125 - - - 27.02 Total 20 32,941 1,125 - - - 27.02 Grand Total (newspapers) 1,907 2,211,449 71,114 27,417 3,274 30.23 Table 4 Corpus statistics for the ajmc dataset (HIPE-2022 release v2.1). Dataset Lang. Fold Docs Tokens Mentions All Fine Nested %noisy %NIL ajmc de Train 76 22,694 1,738 1,738 11 13.81 0.92 Dev 14 4,703 403 403 2 11.41 0.74 Test 16 4,846 382 382 0 10.99 1.83 Total 106 32,243 2,523 2,523 13 13.00 1.03 en Train 60 30,929 1,823 1,823 4 10.97 1.66 Dev 14 6,507 416 416 0 16.83 1.70 Test 13 6,052 348 348 0 10.34 2.61 Total 87 43,488 2,587 2,587 4 11.83 1.79 fr Train 72 24,670 1,621 1,621 9 30.72 0.99 Dev 17 5,426 391 391 0 36.32 2.56 Test 15 5,391 360 360 0 27.50 2.80 Total 104 35,487 2,372 2,372 9 31.16 1.52 Grand Total (ajmc) 297 111,218 7,482 7,482 26 1.45 Table 5 Entity counts by coarse type (HIPE-2022 release v2.1). Although they appear under the same label, identical types present in different data sets may be annotated differently. hipe2020 letemps newseye sonar topsres19th ajmc de en fr fr de fi fr sv de en de en fr Universal pers 1849 558 3706 4086 4061 1212 6201 1132 399 910 844 839 loc 2923 565 4717 6367 6620 1338 5502 1446 477 3727 43 45 24 org 652 194 1125 592 3584 350 1758 230 249 Space building 563 street 316 Time time 236 46 397 date 2 20 5 prod 223 52 310 Man-made humanprod 6620 1338 5502 1446 object 12 3 10 work 465 678 557 scope 1091 997 937 Table 6 Overlap of mentions between test and train (plus dev) sets as percentage of the total number of mentions. Dataset Lang. 
% overlap Folds ajmc de 31.43 train+dev vs test en 30.50 train+dev vs test fr 27.53 train+dev vs test Total 29.87 hipe2020 de 16.22 train+dev vs test en 6.22 dev vs test fr 19.14 train+dev vs test Total 17.12 letemps fr 25.70 train+dev vs test sonar de 10.13 dev vs test newseye fr 14.79 train+dev vs test de 20.77 train+dev vs test fi 6.63 train+dev vs test sv 10.36 train+dev vs test Total 16.18 topres19th en 32.33 train+dev vs test 3.3. HIPE-2022 Releases HIPE-2022 data is released as a single package consisting of the neatly structured and homo- geneously formatted original datasets. The data is released in IOB format with hierarchical information, similarly to CoNLL-U10 , and consists of UTF-8 encoded, tab-separated values (TSV) files containing the necessary information for all tasks (NERC-Coarse, NERC-Fine, and EL). There is one TSV file per dataset, language and split. Original datasets provide different document metadata with different granularity. This information is kept in the files in the form of metadata blocks that encode as much information as necessary to ensure that each document is self-contained with respect to HIPE-2022 settings. Metadata blocks use namespacing to distinguish between mandatory shared task metadata and dataset-specific metadata. HIPE-2022 data releases are published on the HIPE-eval GitHub organisation repository11 and on Zenodo12 . Various licences (of type CC-BY and CC-BY-NC-SA) apply to the original datasets – we refer the reader to the online documentation. 10 https://universaldependencies.org/format.html 11 https://github.com/impresso/CLEF-HIPE-2020/tree/master/data 12 https://doi.org/10.5281/zenodo.6579950 4. Evaluation Framework 4.1. Task Bundles, Tracks and Challenges To accommodate the different dimensions that characterise the HIPE-2022 shared task (lan- guages, document types, entity tag sets, tasks) and to foster research on transferability, the evaluation lab is organised around tracks and challenges. Challenges guide participation towards the development of approaches that work across settings, e.g. with documents in at least two different languages or annotated according to two different tag sets or guidelines, and provide a well-defined and multi-perspective evaluation frame. To manage the total combinations of datasets, languages, document types and tasks, we defined the following elements (see also Figure 1): • Task bundle: a task bundle is a predefined set of tasks as in HIPE-2020 (see bundle table in Fig. 1). Task bundles offer participating teams great flexibility in choosing which tasks to compete for, while maintaining a manageable evaluation frame. Concretely, teams were allowed to submit several ‘submission bundles’, i.e. a triple composed of dataset/language/taskbundle, with up to 2 runs each. • Track: a track corresponds to a triple composed of dataset/language/task and forms the basic unit for which results are reported. • Challenge: a challenge corresponds to a predefined set of tracks. A challenge can be seen as a kind of tournament composed of tracks. HIPE-2022 specifically evaluates 3 challenges: 1. Multilingual Newspaper Challenge (MNC): This challenge aims at fostering the de- velopment of multilingual NE processing approaches on historical newspapers. The requirements for participation in this challenge are that submission bundles consist only of newspaper datasets and include at least two languages for the same task (so teams had to submit a minimum of two submission bundles for this challenge). 2. 
Multilingual Classical Commentary Challenge (MCC): This challenge aims at adapt- ing NE solutions to domain-specific entities in a specific digital humanities text type of classic commentaries. The requirements are that submission bundles consist only of the ajmc dataset and include at least three languages for the same task. 3. Global Adaptation Challenge (GAC): Finally, the global adaptation challenge aims at assessing how efficiently systems can be retargeted to any language, document type and guidelines. Bundles submitted for this challenge could be the same as those submitted for MNC and MCC challenges. The requirements are that they consist of datasets of both types (commentaries and newspaper) and include at least two languages for the same task. Figure 1: Overview of HIPE-2022 evaluation setting. 4.2. Evaluation Measures As in HIPE-2020, NERC and EL tasks are evaluated in terms of Precision, Recall and F-measure (F1-score) [22]. Evaluation is carried out at entity level according to two computation schemes: micro average, based on true positives, false positives, and false negative figures computed over all documents, and macro average, based on averages of micro figures per document. Our definition of macro differs from the usual one: averaging is done at document level and not at entity type level. This allows to account for variance in document length and entity distribution within documents and avoids distortions that would occur due to the unevenly distributed entity classes. Both NERC and EL benefit from strict and fuzzy evaluation regimes, depending on how strictly entity type and boundaries correctness are judged. For NERC (Coarse and Fine), the strict regime corresponds to exact type and boundary matching, and the fuzzy to exact type and overlapping boundaries. It is to be noted that in the strict regime, predicting wrong boundaries leads to a ‘double’ punishment of one false negative (entity present in the gold standard but not predicted by the system) and one false positive (entity predicted by the system but not present in the gold standard). Although it penalizes harshly, we keep this metric to be consistent with CoNLL and refer to the fuzzy regime when boundaries are of less importance. The definition of strict and fuzzy regimes differs for entity linking. In terms of boundaries, EL is always evaluated according to overlapping boundaries in both regimes (what is of interest is the capacity to provide the correct link rather than the correct boundaries). EL strict regime considers only the system’s top link prediction (NIL or Wikidata QID), while the fuzzy regime expands system predictions with a set of historically related entity QIDs. For example, “Germany” QID is complemented with the QID of the more specific “Confederation of the Rhine” entity and both are considered as valid answers. The resource allowing for such historical normalization was compiled by the task organizers for the entities of the test data sets (for hipe2020 and ajmc datasets), and are released as part of the HIPE-scorer. For this regime, participants were invited to submit more than one link, and F-measure is additionally computed with cut-offs @3 and @5 (meaning, counting a true positive if the ground truth QID can be found within the first 3 or 5 candidates). 4.3. System Evaluation, Scorer and Evaluation Toolkit Teams were asked to submit system responses based on submission bundles and to specify at least one challenge to which their submitted bundles belong. 
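As an illustration of the micro/macro distinction defined in Section 4.2 — where macro averaging is done at document level rather than over entity types — a minimal computation could look as follows. This is a sketch only, not the official HIPE-scorer, and it assumes that the per-document true positive, false positive and false negative counts have already been produced under the chosen (strict or fuzzy) matching regime.

```python
# Illustration of the HIPE micro/macro averaging scheme (not the official HIPE-scorer).
# Assumes per-document TP/FP/FN counts were already computed under the chosen regime.
from typing import Dict, List, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1-score from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(doc_counts: List[Dict[str, int]]) -> Dict[str, Tuple[float, float, float]]:
    """doc_counts: one {"tp": ..., "fp": ..., "fn": ...} dict per document."""
    # Micro: pool the counts over all documents, then compute P/R/F once.
    tp = sum(d["tp"] for d in doc_counts)
    fp = sum(d["fp"] for d in doc_counts)
    fn = sum(d["fn"] for d in doc_counts)
    micro = prf(tp, fp, fn)

    # Macro (HIPE variant): compute P/R/F per document, then average over documents.
    # Note that this differs from the common "macro over entity types" definition.
    per_doc = [prf(d["tp"], d["fp"], d["fn"]) for d in doc_counts]
    macro = tuple(sum(values) / len(per_doc) for values in zip(*per_doc))
    return {"micro": micro, "macro": macro}


if __name__ == "__main__":
    # Two toy documents of very different size; macro weights them equally.
    print(micro_macro([{"tp": 90, "fp": 10, "fn": 20}, {"tp": 2, "fp": 3, "fn": 5}]))
```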
Micro and macro scores were computed and published for each track, but only micro figures are reported here. The evaluation of challenges, which corresponds to an aggregation of tracks, was defined as follows: given a specific challenge and the tracks submitted by a team for this challenge, the submitted systems are rewarded points according to their F1-based rank for each track (considering only the best of the submitted runs for a given track). The points obtained are summed over all submitted tracks, and systems/teams are ranked according to their total points. Further details on system submission and evaluation can be found in the HIPE Participation Guidelines [23]. The evaluation is performed using the HIPE-scorer13 . Developed during the first edition of HIPE, the scorer has been improved with minor bug fixes and additional parameterisation (input format, evaluation regimes, HIPE editions). Participants could use the HIPE-scorer when developing their systems. After the evaluation phase, a complete evaluation toolkit was also released, including the data used for evaluation (v2.1), the system runs submitted by participating teams, and all the evaluation recipes and resources (e.g. historical mappings) needed to replicate the present evaluation14 . 13 https://github.com/hipe-eval/HIPE-scorer 14 https://github.com/hipe-eval/HIPE-2022-eval 5. System Descriptions In this second HIPE edition, 5 teams submitted a total of 103 system runs. Submitted runs do not cover all of the 35 possible tracks (dataset/language/task combinations), nevertheless we received submission for all datasets, with most of them focusing on NERC-Coarse. 5.1. Baselines As a neural baseline (Neur-bsl) for NERC-Coarse and NERC-fine, we fine-tuned separately for each HIPE-2022 dataset XLM-R𝐵𝐴𝑆𝐸 , a multilingual transformer-based language representation model pre-trained on 2.5TB of filtered CommonCrawl texts [24]. The models are implemented using HuggingFace15 [25]. Since transformers rely primarily on subword-based tokenisers, we chose to label only the first subwords. This allows to map the model outputs to the original text more easily. Tokenised texts are split into input segments of length 512. For each HIPE-2022 dataset, fine-tuning is performed on the train set (except for sonar and hipe2020-en which has only dev sets) for 10 epochs using the default hyperparameters (Adam 𝜖 = 10e−8, Learning rate 𝛼 = 5e−5). The code of this baseline (configuration files, scripts) is published in a dedicated repository on the HIPE-eval Github organisation16 , and results are published in the evaluation toolkit. For entity linking in EL-only setting, we provide the NIL baseline (Nil-bsl), where each entity link is replaced with the NIL value. 5.2. Participating Systems The following system descriptions are compiled from information provided by the participants. More details on the implementation and results can be found in the system papers of the participants [26]. Team L3i, affiliated with La Rochelle University and with the University of Toulouse, France, successfully tackled an impressive amount of multilingual newspaper datasets with strong runs for NERC-coarse, NERC-fine and EL. For the classical commentary datasets (ajmc) the team had excellent results for NERC17 . For NERC, L3i – the winning team in HIPE’s 2020 edition – builds on their transformer-based approach [27]. 
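Returning briefly to the neural baseline of Section 5.1, the "label only the first subword" strategy can be sketched with the HuggingFace tokenizer API as below. This is a simplified, hypothetical reconstruction with toy IOB labels and no training loop; the organisers' actual baseline configuration and scripts are published in the HIPE-2022-baseline repository referenced above.

```python
# Rough sketch of first-subword-only labelling for token classification
# (simplified; not the organisers' published baseline code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")


def align_labels(words, word_labels, label2id, max_length=512):
    """Tokenise a pre-split sentence and label only the first subword of each word;
    all other subwords (and special tokens) get -100 and are ignored by the loss."""
    enc = tokenizer(words, is_split_into_words=True,
                    truncation=True, max_length=max_length)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            labels.append(-100)            # special token or continuation subword
        else:
            labels.append(label2id[word_labels[word_id]])
        previous = word_id
    enc["labels"] = labels
    return enc


# Example with toy IOB labels (the real label set depends on the dataset).
label2id = {"O": 0, "B-pers": 1, "I-pers": 2, "B-loc": 3, "I-loc": 4}
example = align_labels(["M.", "Perrochon", "quitte", "Lausanne", "."],
                       ["B-pers", "I-pers", "O", "B-loc", "O"], label2id)
```

Fine-tuning then proceeds with a standard token classification head on top of XLM-R, using the per-dataset training sets and the default hyperparameters quoted in Section 5.1.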
Using transformer-based adapters [28], parameter-efficient fine-tuning in a hierarchical multitask setup (NERC-coarse and NERC-fine) has been shown to work well with historical noisy texts [27]. The innovation for this year’s submission lies in the addition of context information in the form of external knowledge from two sources (inspired by [29]). First, French and German Wikipedia documents based on dense vector representations computed by a multilingual Sentence-BERT model [30], including a k-Nearest-Neighbor search functionality provided by ElasticSearch framework. Second, English Wikidata knowledge graph (KG) embeddings that are combined with the first paragraph of English Wikipedia pages (Wikidata5m) [31]. For the knowledge graph embeddings, two methods are tested on the HIPE-2022 data: 1) the one-stage KG Embedding Retrieval Module that retrieves 15 https://github.com/huggingface/transformers/ 16 https://github.com/hipe-eval/HIPE-2022-baseline/ 17 The EL results for ajmc were low, probably due to some processing issues. top-𝑘 KG “documents” (in this context, a document is an ElasticSearch retrieval unit that consists of an entity identifier, an entity description and an entity embedding) via vector similarity on the dense entity embedding vector space; 2) the two-stage KG Embedding Retrieval Module that retrieves the single top similar document first and then in a second retrieval step gets the 𝑘 most similar documents based on that first entity. All context enrichment techniques work by simply concatenating the original input segment with the retrieved context segments and processing the contextualized segments through their “normal” hierarchical NER architecture. Since the L3i team’s internal evaluation on HIPE-2022 data (using a multilingual BERT base pre-trained model) indicated that the two-stage KG retrieval was the best context generator overall, it was used for one of the two officially submitted runs. The other “baseline” run did not use any context enrichment techniques. Both runs additionally used stacked monolingual BERT embeddings for English, French and German, for the latter two languages in the form of Europeana models that were built from digitized historical newspaper text material. Even with improved historical monolingual BERT embeddings, the context-enriched run was consistently better in terms of F1-score in NERC-Coarse and -Fine settings. Team histeria, affiliated with the Bayerische Staatsbibliothek München, Germany, the Digital Philology department of the University of Vienna, Austria and the NLP Expert Center, Volkswagen AG Munich, Germany, focused on the ajmc dataset for their NERC-coarse submission (best results for French and English, second best for German), but also provides experimental results for all languages of the newseye datasets18 . Their NER tagging experiments tackle two important questions: a) How to build an optimal multilingual pre-trained BERT language representation model for historical OCRized documents? They propose and release hmBERT19 , which includes English, Finnish, French, German and Swedish in various model and vocabulary sizes, and specifically apply methods to deal with OCR noise and imbalanced corpus sizes per language. In the end, roughly 27GB of text per language is used in pre-training. b) How to fine-tune a multilingual pre-trained model given comparable NER annotations in multiple languages? 
They compare a single-model approach (training models separately for each language) with a one-model approach (training only one model that covers all languages). The results indicate that, most of the time, the single-model approach works slightly better, but the difference may not be large enough to justify the considerably greater effort to train and apply the models in practice. histeria submitted two runs for each ajmc datasets, using careful hyperparameter grid search on the dev sets in the process. Both runs build on the one-model approach in a first multilingual fine-tuning step. Similar to [29], they build monolingual models by further fine- tuning on language-specific training data20 . Run 1 of their submission is based on hmBERT with vocabulary size 32k, while run 2 has a vocabulary size of 64k. Somewhat unexpectedly, the larger vocabulary does not improve the results in general on the development set. For 18 Note that these experiments are evaluated using the officially published Newseye test sets [15] (released as dev2 dataset as part of HIPE-2022) and not the HIPE-2022 newseye test sets, which were unpublished prior to the HIPE 2022 campaign. 19 For English data, they used the Digitised Books. c. 1510 - c. 1900, all other languages use Europeana newspaper text data. 20 This improves the results by 1.2% on average on the HIPE-2022 data. the test set, though, the larger vocabulary model is substantially better overall. Similar to the team L3i, histeria also experimented with context enrichment techniques suggested by [29]. However, for the specific domain of classical commentaries, general-purpose knowledge bases such as Wikipedia could not improve the results. Interestingly, L3i also observed much less improvement with Wikipedia context enrichment on ajmc in comparison to the hipe2020 newspaper datasets. In summary, histeria outperformed the strong neural baseline by about 10 F1-score percentage points in strict boundary setting, thereby demonstrating the importance of carefully constructed domain-specific pre-trained language representation models. Team Aauzh, affiliated with University of Zurich, Switzerland and University of Milan, Italy, focused on the multilingual newspaper challenge in NERC-coarse setting and experimented with 21 different monolingual and multilingual, as well as contemporary and historical transformer- based language representation models available on the HuggingFace platform. For fine-tuning, they used the standard token classification head of the transformer library for NER tagging with default hyperparameters and trained each dataset for 3 epochs. In a preprocessing step, token-level NER IOB labels were mapped onto all subtokens. At inference time, a simple but effective summing pooling strategy for NER for aggregating subtoken-level to token-level labels was used [32]. Run 2 of Aauzh are the predictions of the best single model. Run 1 is the result of a hard-label ensembling from different pre-trained models: in case of ties between O and B/I labels, the entity labels were preferred. The performance of the submitted runs varies strongly in comparison with the neural baseline: for German and English it generally beats the baseline clearly for hipe2020 and sonar datasets, but suffers on French hipe2020 and German/Finnish newseye datasets. This again indicates that in transfer learning approaches to historical NER, the selection of pre-trained models has a considerable impact. 
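A hard-label ensemble of the kind used for Aauzh's run 1 can be sketched as follows: a per-token majority vote over the label sequences predicted by several fine-tuned models, with ties between "O" and entity labels resolved in favour of the entity label. The code is our own illustrative reconstruction, not the team's implementation.

```python
# Illustrative reconstruction of hard-label ensembling with an entity-preference
# tie-break: per-token majority vote; ties between "O" and B-/I- labels go to the entity.
from collections import Counter
from typing import List


def hard_label_ensemble(predictions: List[List[str]]) -> List[str]:
    """predictions: one label sequence per model, all aligned to the same tokens."""
    ensembled = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        best = max(counts.values())
        tied = [label for label, count in counts.items() if count == best]
        # Prefer any entity label over "O"; ties between distinct entity labels
        # are broken arbitrarily in this sketch.
        entity_labels = [label for label in tied if label != "O"]
        ensembled.append(entity_labels[0] if entity_labels else tied[0])
    return ensembled


model_a = ["B-pers", "I-pers", "O", "B-loc"]
model_b = ["B-pers", "O",      "O", "B-loc"]
print(hard_label_ensemble([model_a, model_b]))
# ['B-pers', 'I-pers', 'O', 'B-loc'] – the O/I-pers tie on token 2 goes to the entity label
```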
The team also performed some post-submission experiments to investigate the effect of design choices: Applying soft-label ensembling using averaged token-level probabilities turned out to improve results on the French newseye datasets by 1.5 percentage point in micro average and 2.4 points in macro average (F1-score). For all languages of the newseye, they also tested a one-model approach with multilingual training. The best multilingual dbmdz Europeana BERT model had a better performance on average (58%) than the best monolingual models (56%). However, several other multilingual pre-trained language models had substantially worse performance, resulting in 57% ensemble F1-score (5 models), which was much lower than 67% achieved by the monolingual ensemble. Team Sbb, affiliated with the Berlin State Library, Germany, participated exclusively in the EL-only subtask, but covered all datasets in English, German and French. Their system builds on models and methods developed in the HIPE-2020 edition [33]. Their approach uses Wikipedia sentences with an explicit link to a Wikipedia page as textual representations of its connected Wikidata entity. The system makes use of the metadata of the HIPE-2022 documents to exclude entities that were not existing at the time of its publication. Going via Wikipedia reduces the amount of accessible Wikidata IDs, however, for all datasets but ajmc the coverage is still 90%. Given the specialised domain of ajmc, a coverage of about 55% is to be accepted. The entity linking is done in the following steps: a) A candidate lookup retrieves a given number of candidates (25 for submission run 1, 50 for submission run 2) using a nearest neighbour index based on word embeddings of Wikipedia page titles. An absolute cut-off value is used to limit the retrieval (0.05 for submission 1 and 0.13 for submission 2). b) A probabilistic candidate sentence matching is performed by pairwise comparing the sentence with the mention to link and a knowledge base text snippet. To this end, a BERT model was fine-tuned on the task of whether or not two sentences mention the same entity. c) The final ranking of candidates includes the candidate sentence matching information as well as lookup features from step (a) and more word embedding information from the context. A random forest model calculates the overall probability of a match between the entity mention and an entity linking candidate. If the probability of a candidate is below a given threshold (0.2 for submission run 1 and 2), it is discarded. The random forest model was trained on concatenated training sets of the same language across datasets. There are no conclusive insights from HIPE-2022 EL-only results whether run 1 or 2 settings are preferable. Post-submission experiments in their system description paper investigate the influence of specific hyperparameter settings on the system performances. 6. Results and Discussion We report results for the best run of each team and consider micro Precision, Recall and F1-score exclusively. Results for NERC-Coarse and NERC-Fine for all languages and datasets according to both evaluation regimes are presented in Table 7 and 8 respectively. Table 10 reports performances for EL-only, with a cut-off @1. We refer the reader to the HIPE-2022 website and the evaluation toolkit for more detailed results21 , and to the extended overview paper for further discussion of the results [12]. 6.1. General Observations All systems now use transformer-based approaches with strong pre-trained models. 
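For a concrete picture of the multi-stage EL-only pipeline described above for team SBB, a highly simplified schematic follows. Function names and object interfaces are illustrative placeholders rather than SBB's actual implementation; only the default numbers (25 candidates, 0.05 lookup cut-off, 0.2 final threshold) echo the run 1 settings reported in their description.

```python
# Schematic, hypothetical reconstruction of a three-stage EL-only pipeline:
# candidate lookup, pairwise sentence matching, and feature-based re-ranking.
from typing import Dict, List


def link_mention(mention_sentence: str,
                 candidate_index,            # placeholder: nearest-neighbour index over KB entries
                 matcher,                     # placeholder: fine-tuned sentence-pair classifier
                 ranker,                      # placeholder: trained ranking model (e.g. random forest)
                 n_candidates: int = 25,
                 lookup_cutoff: float = 0.05,
                 min_probability: float = 0.2) -> str:
    """Return a Wikidata QID for the mention in `mention_sentence`, or 'NIL'."""
    # a) Candidate lookup: nearest neighbours of the mention in embedding space,
    #    keeping at most n_candidates above an absolute similarity cut-off.
    candidates: List[Dict] = candidate_index.lookup(mention_sentence,
                                                    k=n_candidates,
                                                    cutoff=lookup_cutoff)
    if not candidates:
        return "NIL"

    # b) Candidate sentence matching: probability that the mention sentence and a
    #    knowledge-base text snippet mention the same entity.
    for cand in candidates:
        cand["match_prob"] = matcher.predict(mention_sentence, cand["kb_snippet"])

    # c) Final ranking: combine lookup and matching features into one probability
    #    per candidate; discard everything below the threshold.
    scored = [(ranker.score(cand), cand["qid"]) for cand in candidates]
    probability, qid = max(scored)
    return qid if probability >= min_probability else "NIL"
```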
The choice of the pre-trained model – and the corresponding text types used in pre-training – have a strong influence on performance. The quality of available multilingual pre-trained models for fine-tuning on NER tasks proved to be competitive compared to training individual monolingual models. However, to get the max- imum performance out of it, the multilingual fine-tuning in a first phase must be complemented by a monolingual second phase. NERC. In general, the systems demonstrated a good ability to adapt to heterogeneous annotation guidelines. They achieved their highest F1-scores for the NERC-Coarse task on ajmc, a dataset annotated with domain-specific entities and of relatively small size compared to the newspaper datasets, thus confirming the ability of strong pre-trained models to achieve good results when fine-tuned on relatively small datasets. The good results obtained on ajmc, however, may be partly due to the relatively high mention overlap between train and test sets (see Section 3.2). Moreover, it is worth noting that performances on the French subset of the ajmc dataset do not substantially degrade despite the high rate of noisy mentions (three times higher than English and German), which shows a good resilience of transformer-based models to OCR noise on this specific dataset. 21 See https://hipe-eval.github.io/HIPE-2022 and https://github.com/hipe-eval/HIPE-2022-eval Table 7 Results for NERC-Coarse (micro P, R and F1-score). Bold font indicates the highest, and underlined font the second-highest value. Strict Fuzzy Strict Fuzzy Strict Fuzzy P R F P R F P R F P R F P R F P R F hipe2020 French German English Aauzh .718 .675 .696 .825 .776 .800 .716 .735 .725 .812 .833 .822 .538 490 .513 .726 .661 .692 L3i .786 .831 .808 .883 .933 .907 .784 .805 .794 .865 .888 .876 .624 .617 .620 .793 .784 .788 Neur-bsl .730 .785 .757 .836 .899 .866 .665 .746 .703 .750 .842 .793 .432 .532 .477 .564 .695 .623 letemps sonar topRes19th French German English Aauzh .589 .710 .644 .642 .773 .701 .512 .548 .529 .655 .741 .695 .816 .760 .787 .869 .810 .838 Neur-bsl .595 .744 .661 .639 .800 .711 .267 .361 .307 .410 .554 .471 .747 .782 .764 .798 .836 .816 ajmc French German English HISTeria .834 .850 .842 .874 .903 .888 .930 .898 .913 .938 .953 .945 .826 .885 .854 .879 .943 .910 L3i .810 .842 .826 .856 .889 .872 .946 .921 .934 .965 .940 .952 .824 .876 .850 .868 .922 .894 Neur-bsl .707 .778 .741 .788 .867 .825 .792 .846 .818 .846 .903 .873 .680 .802 .736 .766 .902 .828 newseye French German Aauzh .655 .657 .656 .785 .787 .786 .395 .421 .408 .480 .512 .495 Neur-bsl .634 .676 .654 .755 .805 .779 .429 .537 .477 .512 .642 .570 Finnish Swedish Aauzh .618 .524 .567 .730 .619 .670 .686 .604 .643 .797 .702 .746 Neur-bsl .605 .687 .644 .715 .812 .760 .588 .728 .651 .675 .836 .747 Table 8 Results for NERC-Fine and Nested (micro P, R and F1-score). French German English Strict Fuzzy Strict Fuzzy Strict Fuzzy P R F P R F P R F P R F P R F P R F hipe2020 (Fine) L3i .702 .782 .740 .784 .873 .826 .691 .747 .718 .776 .840 .807 - - - - - - Neur-bsl .685 .733 .708 .769 .822 .795 .584 .673 .625 .659 .759 .706 - - - - - - hipe2020 (Nested) L3i .390 .366 .377 .416 .390 .403 .714 .411 .522 .738 .425 .539 - - - - - - ajmc (Fine) L3i .646 .694 .669 .703 .756 .728 .915 .898 .906 .941 .924 .933 .754 .848 .798 .801 .899 .847 Neur-bsl .526 .567 .545 .616 .664 .639 .819 .817 .818 .866 .864 .865 .600 .744 .664 .676 .839 .749 EL-only. 
Entity linking on already identified mentions appears to be considerably more challenging than NERC, with F1-scores varying considerably across datasets. The linking of toponyms in topres19th is where systems achieved the overall best performances. Conversely, EL-only on historical commentaries (ajmc) appears to be the most difficult, with the lowest Table 9 NERC-Coarse recall by entity type. Figures are from the same runs reported in the generic NERC-Coarse table (although recall scores from another run submitted by a team may be better in some cases.) Dataset Lang. Type Neur-bsl Aauzh L3i HISTeria hipe2020 fr pers .934 .882 .970 - loc .930 .760 .954 - org .593 .580 .724 - prod .603 .721 .929 - time .810 .739 .803 - de pers .936 .884 .974 - loc .874 .881 .929 - org .407 .560 .645 - prod .579 .607 .722 - time .882 .930 .906 - en pers .872 .846 .872 - loc .867 .646 .790 - org .128 .478 .637 - prod - - .722 - time .538 .938 1.000 - letemps fr pers .841 .816 - - loc .832 .827 - - newseye de pers .529 .486 - - loc .658 .561 - - org .338 .330 - - humanprod - .100 - - fi pers .863 .899 - - loc .836 .698 - - org .324 .491 - - humanprod .429 .789 - - fr .889 .888 - - loc .785 .818 - - org .539 .540 - - humanprod .600 .704 - - sv pers .684 .765 - - loc .898 .751 - - org .326 .533 - - humanprod .571 .889 - - sonar de pers .394 .656 - - loc .808 .836 - - org .128 .496 - - topres19th en loc .892 .860 - - building .600 .738 - - street .661 .685 - - ajmc de pers .953 - .938 .930 loc - - 1.000 .500 work .729 - .958 .985 scope .861 - .939 .950 en pers .821 - .901 .945 loc - - .800 1.000 work .926 - .947 .947 scope .887 - .921 .921 fr pers .878 - .928 .906 loc - - .444 .556 work .700 - .831 .910 scope .876 - .883 .879 Table 10 Results for EL-only and End-to-end EL (micro P, R and F1-score @1). For End-to-end EL, only Team L3i submitted runs for hipe-2020. Bold font indicates the highest value. Strict Relaxed Strict Relaxed Strict Relaxed P R F P R F P R F P R F P R F P R F hipe2020 French German English EL-only L3i .602 .602 .602 .620 .620 .620 .481 .481 .481 .497 .497 .497 .546 .546 .546 .546 .546 .546 SBB .707 .515 .596 .730 .532 .616 .603 .435 .506 .626 .452 .525 .503 .323 .393 .503 .323 .393 Nil-bsl .209 .209 .209 .209 .209 .209 .481 .314 .380 .481 .314 .380 .228 .228 .228 .228 .228 .228 End-to-end EL L3i .546 .576 .560 .563 .594 .578 .446 .451 .449 .462 .466 .464 .463 .474 .469 .463 .474 .469 sonar topres19th German English SBB .616 .446 .517 .616 .446 .517 .778 .559 .651 .781 .562 .654 Nil-bsl 333 .333 .333 .333 .333 .333 - - - - - - newseye French German SBB .534 .361 .431 .539 .364 .435 .522 .387 .444 .535 .396 .455 Nil-bsl .448 .448 .448 .448 .448 .448 .485 .485 .485 .485 .485 .485 ajmc French German English SBB .621 .378 .470 .614 .373 .464 .712 .389 .503 .712 .389 .503 .578 .284 .381 .578 .284 .381 Nil-bsl .037 .037 .037 .037 .037 .037 .049 .049 .049 .049 .049 .049 .046 .046 .046 .046 .046 .046 F1-scores compared to the other datasets. The EL-only performances of the SBB system on the ajmc dataset deserve some further considerations, as they are well representative of the challenges faced when applying a generic entity linking system to a domain-specific dataset. 
Firstly, SBB team reported that ajmc is the dataset with the lowest Wikidata coverage: only 57% of the Wikidata IDs in the test set are found in the knowledge base used by their system (a combination of Wikidata record and Wikipedia textual content), whereas the coverage for all other datasets ranges between 86% (hipe2020) and 99% (topres19th). The reason for the low coverage in ajmc is that, when constructing the knowledge base, only Wikidata records describing persons, locations and organisations were kept. In contrast, a substantial number of entities in ajmc are literary works, which would have required to retain also records with Wikidata type “literary work” (Q7725634) when building the KB. Secondly, a characteristic of ajmc is that both person and work mentions are frequently abbreviated, and these abbreviations tend to be lacking as lexical information in large-scale KBs such as Wikidata. Indeed, an error analysis of SBB’s system results shows that only 1.4% of the correctly predicted entity links (true positives) correspond to abbreviated mentions, which nevertheless represent about 47% of all linkable mentions. 6.2. Observations on Challenges A complementary way of looking at the results is to consider them in light of the specific challenges raised by historical documents. Such challenges are one of the novelties of HIPE- 2022 and were introduced as “thematic aggregations” of system rankings across datasets and languages (see Section 4).22 Multilingual Newspaper Challenge (MNC). Overall, four teams participated in this challenge: three teams in the NERC-Coarse task, two in EL-only and one team in end-to-end EL. The top-ranked team, Aauzh, was able to tackle five different newspaper datasets in five languages in total, with performances above the strong neural baseline in 6 out of 10 cases. However, it should be noted that they used two different systems, one based on the best fine-tuned model (selected by development set performance) and another one being an ensemble of all their fine-tuned models. Thus, despite leading in the MNC challenge, their work does not answer the question of which single system works best across datasets and in different languages. Conversely, the second-ranked system by L3I team submitted for fewer datasets and languages, but showed an overall higher quality of predictions both for NERC-Coarse and EL-only for languages they covered. It would be interesting to see how this system performs on the remaining languages and datasets, especially in comparison with the baseline. In general, one aspect of the MNC challenge which remained unexplored is entity linking beyond a standard set of languages such as English, German and French, as no runs for EL-only on Finnish and Swedish newspapers were submitted. Multilingual Classical Commentary Challenge (MCC). This challenge had in total three participants: two teams participated in this challenge in the NERC-Coarse task, and one team participated in the end-to-end EL and EL-only. Given the overlap between MCC and GAC challenges in terms of languages and datasets, teams participating in the latter were also considered for the former. The NERC-Coarse results showed how a BERT-based multilingual model pre-trained on large corpora of historical documents and fine-tuned on domain-specific data performs better or on-par with a system implementing a more complex transformer-based architecture. 
An interesting insight which emerged from this challenge is that methods employing context enrich- ment techniques which rely on lexical information from Wikipedia do not yield performances improvements as they do when applied to other document types, such as newspapers. Regarding EL, MCC exemplified well some characteristics an EL system needs to have for it to be applied successfully across different domains. In particular, assumptions about which entities are to be retained when constructing a knowledge base for this task need to be relaxed. Moreover, an aspect that emerged from this challenge and will deserve more research in the future is the linking of abbreviated entities (e.g. mentions of literary works in commentaries), which proved to be challenging for participating systems. 22 The challenge leaderboards can be found at the HIPE-2022 results page in the Challenge Evaluation Results section, see https://hipe-eval.github.io/HIPE-2022/results. Global Adaptation Challenge (GAC). The high level of difficulty entailed by this challenge is reflected by the number of participants: one team for the NERC-Coarse task and one team for end-to-end EL and EL-only. Both teams had already participated in the first edition of HIPE with very good results, and tackled this year the challenge of adapting their systems to work in a multi-domain and multilingual scenario. In general, the results of this challenge confirm that the systems proposed by L3I and SBB respectively for the tasks of NERC-Coarse and EL are suitable to be applied to data originating from heterogeneous domains. They also show that EL across languages and domains remains a more challenging task than NERC, calling for more future research on this topic. Moreover, no team has worked on adapting annotation models to be able to combine different NER training datasets with sometimes incompatible annotations and benefit from a larger dataset overall. This data augmentation strategy to global adaptation, which could be beneficial for underrepresented entity types (e.g. dates or locations in the ajmc dataset), remains to be explored in future work. 7. Conclusion and Perspectives From the perspective of natural language processing, this second edition of HIPE provided the possibility to test the robustness of existing approaches and to experiment with transfer learning and domain adaptation methods, whose performances could be systematically evaluated and compared on broad historical and multilingual data sets. Besides gaining new insights with respect to domain and language adaptation and advancing the state of the art in semantic indexing of historical material, the lab also contributed an unprecedented set of multilingual and historical NE-annotated datasets that can be used for further experimentation and benchmarking. From the perspective of digital humanities, the lab’s outcomes will help DH practitioners in mapping state-of-the-art solutions for NE processing of historical texts, and in getting a better understanding of what is already possible as opposed to what is still challenging. Most importantly, digital scholars are in need of support to explore the large quantities of digitised text they currently have at hand, and NE processing is high on the agenda. Such processing can support research questions in various domains (e.g. history, political science, literature, historical linguistics) and knowing about their performance is crucial in order to make an informed use of the processed data. 
From the perspective of cultural heritage professionals, who increasingly focus on advancing the use of artificial intelligence methods on cultural heritage text collections [34, 35], the HIPE-2022 shared task and datasets represent an excellent opportunity to experiment with multilingual and multi-domain data of varying quality and annotation depth, a setting close to the real-world scenarios they are often confronted with.

Overall, HIPE-2022 has contributed to further advancing the state of the art in semantic indexing of historical documents. By expanding the language spectrum and the range of document types, and by integrating datasets with various annotation tag sets, this second edition has set the bar high, and much remains to explore and experiment with.

8. Acknowledgments

The HIPE-2022 team expresses its greatest appreciation to the HIPE-2022 partnering projects, namely AjMC, impresso-HIPE-2020, Living with Machines, NewsEye, and SoNAR, for contributing (and hiding) their NE-annotated datasets. We particularly thank Mariona Coll-Ardanuy (LwM), Ahmed Hamdi (NewsEye) and Clemens Neudecker (SoNAR) for their support regarding data provision, and the members of the HIPE-2022 advisory board, namely Sally Chambers, Frédéric Kaplan and Clemens Neudecker.

References

[1] M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, S. Clematide, Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents, in: A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, A. Hanbury, M. Potthast, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science (LNCS), Springer, 2022.

[2] F. Kaplan, I. di Lenardo, Big Data of the Past, Frontiers in Digital Humanities 4 (2017) 1–21. URL: https://www.frontiersin.org/articles/10.3389/fdigh.2017.00012/full. doi:10.3389/fdigh.2017.00012.

[3] M. Ridge, G. Colavizza, L. Brake, M. Ehrmann, J.-P. Moreux, A. Prescott, The past, present and future of digital scholarship with newspaper collections, in: DH 2019 book of abstracts, Utrecht, The Netherlands, 2019, pp. 1–9. URL: http://infoscience.epfl.ch/record/271329.

[4] M. Ehrmann, G. Colavizza, Y. Rochat, F. Kaplan, Diachronic evaluation of NER systems on old newspapers, in: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), Bochumer Linguistische Arbeitsberichte, Bochum, 2016, pp. 97–107. URL: https://infoscience.epfl.ch/record/221391.

[5] M. Ehrmann, A. Hamdi, E. L. Pontes, M. Romanello, A. Doucet, Named Entity Recognition and Classification on Historical Documents: A Survey, arXiv:2109.11406 [cs], 2021 (to appear in ACM Computing Surveys, 2022). URL: http://arxiv.org/abs/2109.11406.

[6] M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide, Overview of CLEF HIPE 2020: Named entity recognition and linking on historical newspapers, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 288–310.

[7] M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide, Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, volume 2696, CEUR-WS, Thessaloniki, Greece, 2020, p. 38. URL: https://infoscience.epfl.ch/record/281054. doi:10.5281/zenodo.4117566.
[8] G. Beryozkin, Y. Drori, O. Gilon, T. Hartman, I. Szpektor, A joint named-entity recognizer for heterogeneous tag-sets using a tag hierarchy, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 140–150. URL: https://aclanthology.org/P19-1014.

[9] J. Li, B. Chiu, S. Feng, H. Wang, Few-shot named entity recognition via meta-learning, IEEE Transactions on Knowledge and Data Engineering (2020) 1–1.

[10] Q. Wu, Z. Lin, G. Wang, H. Chen, B. F. Karlsson, B. Huang, C. Lin, Enhanced meta-learning for cross-lingual named entity recognition with minimal resources, CoRR abs/1911.06161 (2019). URL: http://arxiv.org/abs/1911.06161.

[11] J. Li, S. Shang, L. Shao, MetaNER: Named entity recognition with meta-learning, in: Proceedings of The Web Conference 2020, WWW ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 429–440. URL: https://doi.org/10.1145/3366423.3380127. doi:10.1145/3366423.3380127.

[12] M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, S. Clematide, Extended Overview of HIPE-2022: Named Entity Recognition and Linking on Multilingual Historical Documents, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CEUR-WS, 2022.

[13] M. Ehrmann, M. Romanello, S. Clematide, P. B. Ströbel, R. Barman, Language Resources for Historical Newspapers: The Impresso Collection, in: Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 958–968.

[14] M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide, Impresso Named Entity Annotation Guidelines, Annotation Guidelines, Ecole Polytechnique Fédérale de Lausanne (EPFL) and Zurich University (UZH), 2020. URL: https://zenodo.org/record/3585750. doi:10.5281/zenodo.3604227.

[15] A. Hamdi, E. Linhares Pontes, E. Boros, T. T. H. Nguyen, G. Hackl, J. G. Moreno, A. Doucet, A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 2328–2334. doi:10.1145/3404835.3463255.

[16] M. Coll Ardanuy, D. Beavan, K. Beelen, K. Hosseini, J. Lawrence, Dataset for Toponym Resolution in Nineteenth-Century English Newspapers, 2021. URL: https://doi.org/10.23636/b1c4-py78. doi:10.23636/b1c4-py78.

[17] M. Romanello, S. Najem-Meyer, B. Robertson, Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs, in: The 6th International Workshop on Historical Document Imaging and Processing (HIP ’21), Association for Computing Machinery, Lausanne, 2021. URL: https://doi.org/10.1145/3476887.3476911. doi:10.1145/3476887.3476911.

[18] S. Rosset, C. Grouin, P. Zweigenbaum, Entités Nommées Structurées : Guide d’annotation Quaero, Technical Report 2011-04, LIMSI-CNRS, Orsay, France, 2011.

[19] M. Romanello, S. Najem-Meyer, Guidelines for the Annotation of Named Entities in the Domain of Classics, 2022. URL: https://doi.org/10.5281/zenodo.6368101. doi:10.5281/zenodo.6368101.

[20] I. Augenstein, L. Derczynski, K. Bontcheva, Generalisation in named entity recognition: A quantitative analysis, Computer Speech & Language 44 (2017) 61–83. URL: http://www.sciencedirect.com/science/article/pii/S088523081630002X. doi:10.1016/j.csl.2017.01.012.
[21] B. Taillé, V. Guigue, P. Gallinari, Contextualized Embeddings in Named-Entity Recognition: An Empirical Study on Generalization, in: J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, F. Martins (Eds.), Advances in Information Retrieval, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 383–391. doi:10.1007/978-3-030-45442-5_48.

[22] J. Makhoul, F. Kubala, R. Schwartz, R. Weischedel, Performance measures for information extraction, in: Proceedings of the DARPA Broadcast News Workshop, 1999, pp. 249–252.

[23] M. Ehrmann, M. Romanello, A. Doucet, S. Clematide, HIPE 2022 Shared Task Participation Guidelines, Technical Report, Zenodo, 2022. URL: https://zenodo.org/record/6045662. doi:10.5281/zenodo.6045662.

[24] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, 2020. arXiv:1911.02116.

[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

[26] G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CEUR-WS, 2022.

[27] E. Boros, A. Hamdi, E. Linhares Pontes, L. A. Cabrera-Diego, J. G. Moreno, N. Sidere, A. Doucet, Alleviating digitization errors in named entity recognition for historical documents, in: Proceedings of the 24th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Online, 2020, pp. 431–441. doi:10.18653/v1/2020.conll-1.35.

[28] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for NLP, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 2790–2799. URL: https://proceedings.mlr.press/v97/houlsby19a.html.

[29] X. Wang, Y. Shen, J. Cai, T. Wang, X. Wang, P. Xie, F. Huang, W. Lu, Y. Zhuang, K. Tu, W. Lu, Y. Jiang, DAMO-NLP at SemEval-2022 Task 11: A knowledge-based system for multilingual named entity recognition, 2022. URL: https://arxiv.org/abs/2203.00545. doi:10.48550/arXiv.2203.00545.

[30] N. Reimers, I. Gurevych, Making monolingual sentence embeddings multilingual using knowledge distillation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 4512–4525. URL: https://aclanthology.org/2020.emnlp-main.365. doi:10.18653/v1/2020.emnlp-main.365.

[31] X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li, J. Tang, KEPLER: A unified model for knowledge embedding and pre-trained language representation, 2019. URL: https://arxiv.org/pdf/1911.06136.pdf. arXiv:1911.06136.
[32] J. Ács, Á. Kádár, A. Kornai, Subword pooling makes a difference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 2284–2295. URL: https://aclanthology.org/2021.eacl-main.194. doi:10.18653/v1/2021.eacl-main.194.

[33] K. Labusch, C. Neudecker, Named entity disambiguation and linking on historic newspaper OCR with BERT, in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, number 2696 in CEUR Workshop Proceedings, CEUR-WS, 2020. URL: http://ceur-ws.org/Vol-2696/paper_163.pdf.

[34] T. Padilla, Responsible Operations: Data Science, Machine Learning, and AI in Libraries, Technical Report, OCLC Research, USA, 2020. doi:10.25333/xk7z-9g97.

[35] M. Gregory, C. Neudecker, A. Isaac, G. Bergel, et al., AI in Relation to GLAMs Task Force - Report and Recommendations, Technical Report, Europeana Network Association, 2021. URL: https://pro.europeana.eu/project/ai-in-relation-to-glams.