<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computer Science Review</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NERMuD at EVALITA 2023: Overview of the Named-Entities Recognition on Multi-Domain Documents Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Teresa Paccosi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Psicologia e Scienze Cognitive, Università di Trento</institution>
          ,
          <addr-line>Corso Bettini 84, I-38068 Rovereto (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, I-38121 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>29</volume>
      <issue>2018</issue>
      <fpage>282</fpage>
      <lpage>289</lpage>
      <abstract>
        <p>In this paper, we describe NERMuD, a Named-Entities Recognition (NER) shared task presented at the EVALITA 2023 evaluation campaign. NERMuD is organized into two diferent sub-tasks: a domain-agnostic classification and a domainspecific one. We display the evaluation of the system presented by the only task participant, ExtremITA. ExtremITA proposes a unified approach for all the tasks of EVALITA 2023, and it addresses in our case only the domain-agnostic sub-task. We present an updated version of KIND, the dataset distributed for the training of the system. We then provide the baselines proposed, the results of the evaluation, and a brief discussion.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Named-entity recognition (NER) is one of the most common and important tasks in the field of Natural Language Processing (NLP). It involves identifying and classifying mentions of entities in texts, and it is widely used in applications such as text understanding [<xref ref-type="bibr" rid="ref1">1</xref>], information retrieval [<xref ref-type="bibr" rid="ref2">2</xref>], knowledge base construction [<xref ref-type="bibr" rid="ref3">3</xref>], and the protection of personal data [<xref ref-type="bibr" rid="ref4">4</xref>]. These entities can belong to a set of predefined categories, with people, locations, and organizations being the most common ones.</p>
      <p>Manually annotated data play a crucial role in training and evaluating NER systems, similar to other NLP tasks. Systems trained on datasets from specific domains often do not perform well when applied to different types of texts [5].</p>
      <p>NER has been addressed in almost all languages, indicating a significant interest in the topic [6]. It is an important task in its own right, as it can be used to process large archival collections. While NER is considered a solved task, some studies have shown that there is always room for improvement depending on factors such as labels, languages, and topics [7]. It is worth noting that, despite the great number of studies on this topic, datasets and tasks for NER often focus on news and, more recently, social media, as seen in initiatives like I-CAB [8], NEEL-IT 2016 [9] and NER 2011 [10].</p>
      <p>The rest of this article is structured as follows. Section 2 describes the task, and Section 3 gives an overview of the dataset provided. In Section 4 we portray the baseline and the evaluation metric, while in Section 5 we describe the work of the participant ExtremITA. In the end, Section 6 contains a brief discussion, while in Section 7 we draw some conclusions.</p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Task description</title>
      <p>In this Section, we describe NERMuD, a task presented at EVALITA 2023 [11] that involves the extraction and classification of named entities – including persons, organizations, and locations – from documents in various domains.</p>
      <p>NERMuD 2023 includes two different sub-tasks:
• Domain-agnostic classification (DAC). Participants are required to select and classify entities into three categories (person, organization, location) from different types of texts (news, fiction, political speeches) using a single general model.
• Domain-specific classification (DSC). Participants are required to make use of a different model for each of the above types, trying to increase the accuracy of every considered type.</p>
      <p>Each participant can submit up to 3 runs for each sub-task.</p>
      <p>The runs should be contained in a TSV file with fields delimited by a tab, and they should follow the same format as the training dataset. No missing data are allowed: a label should be predicted for each token in the test set.</p>
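The "no missing data" constraint above can be checked mechanically before submission. The sketch below is illustrative only: the helper name is ours, and it assumes the CoNLL-style layout used by the training data (one token and one tab-separated label per non-empty line, blank lines separating sentences).

```python
# Hedged sketch of a run-file check, assuming a CoNLL-style TSV layout:
# one "token<TAB>label" pair per non-empty line.
def check_run(lines):
    """Return (line_number, reason) pairs for malformed lines of a run."""
    problems = []
    for i, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:                 # blank lines separate sentences
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append((i, "expected 2 tab-separated fields"))
        elif not fields[1]:
            problems.append((i, "missing label"))
    return problems
```

A run passes only if the returned list is empty, i.e. every token in the test set received exactly one label.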
    </sec>
    <sec id="sec-2">
      <title>3. Available dataset</title>
      <p>The corpus that can be used for training is the Kessler Italian Named-entities Dataset (KIND) [12], presented in 2021 at the Language Resources and Evaluation Conference (LREC). KIND is available and freely downloadable on Github (https://github.com/dhfbk/KIND).</p>
      <p>The original dataset comprises over one million tokens and includes annotations for three entity classes: person, location, and organization. The majority of the dataset, approximately 600K tokens, features manual gold annotations across three distinct domains: news, literature, and political discourses. This specific subset can be used as the training data for the NERMuD 2023 task, which focuses on Named Entity Recognition and Multi-domain Classification.</p>
      <p>All the texts used for the annotation are publicly available, under a license that allows both research and commercial use. In particular, the texts used for the NERMuD task come from:
• Wikinews (WN), a source providing news texts from the last few decades;
• some Italian fiction books (FIC) in the public domain, freely accessible for use;
• writings and speeches from Alcide De Gasperi (ADG), a collection of texts including the works and speeches of the Italian politician.</p>
      <p>Since the dataset is already publicly released and available, a new set of data has been annotated and shared using the same guidelines (available on the KIND repository on Github).</p>
      <p>The dataset has been collected in full compliance with ethical standards, ensuring that it aligns with the terms of use of the sources and that it respects the intellectual property and privacy rights of the original authors of the texts. Table 1 displays an overview of the dataset.</p>
      <p>In the next subsections, we provide a quick description of the domains included in the dataset. For more information about the creation of the dataset, the text processing, and the annotation guidelines, please refer to [12].</p>
      <sec id="sec-2-1">
        <title>3.1. Wikinews (WN)</title>
        <p>Wikinews is a multi-language free-content project of collaborative journalism. The Italian chapter contains more than 11,000 news articles (https://it.wikinews.org/wiki/Speciale:Statistiche), released under the Creative Commons Attribution 2.5 License (https://creativecommons.org/licenses/by/2.5/).</p>
        <p>In building the dataset, we randomly choose 1,198 articles evenly distributed in the last 20 years, for a total of 364,816 tokens.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Literature (FIC)</title>
        <p>For the annotation of fiction literature, we have included 86 book chapters from a collection of 11 publicly available Italian-authored books. This annotated dataset comprises a total of 219,638 tokens. While the majority of the selected books are novels, we have also included a mix of epistles and biographies. The plain texts come from the Liber Liber website (https://www.liberliber.it/).</p>
        <p>In particular, we select: Il giorno delle Mésules (Ettore Castiglioni, 1993, 12,853 tokens), L’amante di Cesare (Augusto De Angelis, 1936, 13,464 tokens), Canne al vento (Grazia Deledda, 1913, 13,945 tokens), 1861-1911 - Cinquant’anni di vita nazionale ricordati ai fanciulli (Guido Fabiani, 1911, 10,801 tokens), Lettere dal carcere (Antonio Gramsci, 1947, 10,655 tokens), Anarchismo e democrazia (Errico Malatesta, 1974, 11,557 tokens), L’amore negato (Maria Messina, 1928, 31,115 tokens), La luna e i falò (Cesare Pavese, 1950, 10,705 tokens), La coscienza di Zeno (Italo Svevo, 1923, 56,364 tokens), Le cose più grandi di lui (Luciano Zuccoli, 1922, 20,989 tokens), L’occhio del lago (Tullio Giordana, 1899, 27,190 tokens).</p>
        <p>We prioritized selecting texts in the public domain that are as recent as possible (considering that, under the current legislation, copyright expires 70 years after the death of the author). This choice was made to ensure that the model trained on this data would be well-suited for application to novels written in recent years. By focusing on more contemporary texts, the language used in these novels is expected to be more similar to the language used in present-day novels. Additionally, for the test data, we specifically chose works by the author Tullio Giordana. His works are therefore not included in the train or the dev sets, so as not to have a model possibly biased in terms of style.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Alcide De Gasperi’s Writings (ADG)</title>
        <p>Finally, we annotate 173 documents (164,537 tokens) from the corpus described in [13], spanning 50 years of European history. The corpus is composed of a comprehensive collection of Alcide De Gasperi’s public documents, 2,762 in total, written or transcribed between 1901 and 1954, and it is available for consultation on the Alcide Digitale website (https://alcidedigitale.fbk.eu/).</p>
      </sec>
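For readers who want to experiment with the data, a minimal reader can be sketched as follows; it assumes the CoNLL-style layout commonly used for this kind of dataset (one "token TAB label" pair per line, blank lines separating sentences), and the helper name is ours, not part of the KIND distribution.

```python
# Hedged sketch: group "token<TAB>label" lines into sentences, assuming
# a CoNLL-style layout with blank lines as sentence separators.
def read_sentences(lines):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                 # sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split("\t")
        current.append((token, label))
    if current:                      # flush the last sentence
        sentences.append(current)
    return sentences
```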
    </sec>
    <sec id="sec-3">
      <title>4. Baseline and Evaluation</title>
      <p>During the definition of the task, we proposed two baselines: an old-style Conditional Random Field [14], and a plain BERT [15] implementation. These options represent the most effective algorithms that can be implemented without the use of GPUs, as well as the simplest algorithms that can be performed using transformers. Both implementations of the baselines can be found on Github (https://github.com/dhfbk/bert-ner).</p>
      <p>The CRF model is based on the classifier available in scikit-learn out-of-the-box. In addition to standard features extracted from the text, including vector information from fastText models [16], we also used a set of gazetteers (lists of persons, organizations, and locations) collected from the Italian Wikipedia using some of the classes contained in DBpedia [17]: Person, Organization, and Place, respectively.</p>
      <p>The BERT NER classification model is inspired by the blog post of Tobias Sterbak (https://bit.ly/ner-bert), using BertForTokenClassification (https://bit.ly/BertForTokenClassification) from Hugging Face.</p>
      <p>Final results will be calculated in terms of macro-average F1. The evaluation script is released in the official KIND Github project (https://github.com/dhfbk/KIND).</p>
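The macro-average F1 used for scoring can be illustrated with a small dependency-free sketch. This is not the official evaluation script released in the KIND repository; the per-label, token-level averaging shown here is our assumption for illustration.

```python
# Illustrative macro-average F1 over token labels, ignoring the "O" class.
# Each label's F1 is computed from its own TP/FP/FN counts, then the
# per-label scores are averaged with equal weight (macro-averaging).
from collections import defaultdict

def macro_f1(gold, pred, ignore="O"):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1
            if g != ignore:
                fn[g] += 1
    labels = set(tp) | set(fp) | set(fn)
    scores = []
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because every label contributes equally, a class that is rare in the test set (such as ORG in the fiction domain) weighs as much as a frequent one, which explains why missing a single rare entity can visibly move the final score.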
    </sec>
    <sec id="sec-4">
      <title>5. Participants</title>
      <p>The task has only one participant, the “ExtremITA” group [18], who participated in all the tasks presented at EVALITA 2023 with two unified multi-task learning approaches.</p>
      <p>The purpose of ExtremITA is to investigate how the adoption of a Large Language Model can be taken to its extreme consequences by proposing a single model capable of tackling a wide array of heterogeneous tasks (among them, NERMuD). The authors tested two different models:</p>
      <p>extremIT5 - An Encoder-Decoder model based on IT5 [19] consisting of approximately 110 million parameters. This model is trained by concatenating the name of the task and the input sentence/paragraph in the input texts, each representing an example from a generic EVALITA task. Its purpose is to generate a piece of text that solves the target task. For NERMuD, in particular, the list of expected Named Entities is reported as a sequence of text spans, each associated with the corresponding entity type (in the form [〈entity_type〉] 〈text_span_that_evokes_entity〉).</p>
      <p>extremITLLaMA - An instruction-tuned Decoder-only model, built upon the LLaMA foundational models [20], with a total of 7 billion parameters. The initial model was trained using the LoRA technique [21] on Italian translations of Alpaca [22] instruction data. The adapters are then merged into the original model. A final fine-tuning phase using LLaMA is then performed. For each example from EVALITA, an input text is paired with a manually crafted question that simulates an instruction to be solved, representing the specific task. The natural language instruction used in NERMuD is “Scrivi le menzioni di entità nel testo, indicandone il tipo: [PER] (persona), [LOC] (luogo), [ORG] (organizzazione).” (“Write the entities’ mentions in the text, indicating their type: [PER] (person), [LOC] (location), [ORG] (organization).”)</p>
      <p>In both cases, NERMuD was transformed into a sequence-to-sequence task from its original token classification format.</p>
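The token-classification-to-sequence reformulation can be sketched as follows: BIO-style token labels are turned into the "[entity_type] text_span" target format described above. The function name and the single-space separator between entities are our assumptions for illustration, not details taken from the ExtremITA system.

```python
# Hedged sketch: convert BIO-tagged tokens into a "[TYPE] span" target
# string for a sequence-to-sequence model. Contiguous B-/I- tags of the
# same entity are merged into one span.
def bio_to_target(tokens, tags):
    entities, span, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if span:                         # flush the previous entity
                entities.append((etype, " ".join(span)))
            span, etype = [tok], tag[2:]
        elif tag.startswith("I-") and span:  # continue the current entity
            span.append(tok)
        else:                                # "O" tag ends any open entity
            if span:
                entities.append((etype, " ".join(span)))
            span, etype = [], None
    if span:
        entities.append((etype, " ".join(span)))
    return " ".join(f"[{t}] {s}" for t, s in entities)
```

The inverse step, parsing the generated string back into labeled spans, is what the evaluation ultimately needs; a generation error there (a misspelled span, a missing bracket) directly costs recall, which is one practical downside of the sequence-to-sequence recasting.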
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <p>The evaluation of ORG entities for the fiction domain is missing, as none of the classifiers was able to correctly identify the only ORG entity present in the test set (the word “Borsa” in the sentence “Ha avuto disgrazie alla Borsa”). Overall, the BERT baseline outperforms ExtremITA in most runs, with the exception of LOC extraction in fictional texts, where extremITLLaMA performs better. This difference in performance can likely be attributed to the textual data used to train the models.</p>
      <p>In general, it is possible to notice that the best ExtremITA run almost always comes out on top of the classification in terms of precision.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions</title>
      <p>In this paper we described the first evaluation task for multi-domain named-entity recognition in Italian texts. The task evaluated the performance of participant systems in terms of extracting entities that refer to persons, organizations, and locations. The texts used for the task cover three different domains: news, political speeches, and fiction.</p>
      <p>Unfortunately, the task attracted only one participant, ExtremITA, who however presented an interesting and very innovative multi-task approach, probably the first one dealing with so many different tasks in Italian. Although in general the results of ExtremITA do not overcome the two strong baselines proposed (CRF w/ gazetteers, and BERT), the difference in terms of F1 is very small, demonstrating a promising future for this kind of approach.</p>
      <p>As an outcome of the task, a new version of the KIND dataset is released, increasing its size with respect to the previous version.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , X. Han,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Q. Liu,
          <article-title>ERNIE: Enhanced language representation with informative entities</article-title>
          , in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1451</lpage>
          . URL: https://aclanthology.org/P19-1139. doi:10.18653/v1/P19-1139.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          , X. Cheng, H. Li,
          <article-title>Named entity recognition in query</article-title>
          ,
          <source>in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '09, Association for Computing Machinery, New York, NY, USA,
          <year>2009</year>
          , p.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          . URL: https://doi.org/10.1145/1571941.1571989. doi:10.1145/1571941.1571989.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shaked</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soderland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>Unsupervised named-entity extraction from the web: An experimental study</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>165</volume>
          (
          <year>2005</year>
          )
          <fpage>91</fpage>
          -
          <lpage>134</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0004370205000366. doi:10.1016/j.artint.2005.03.001.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Paccosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmero Aprosio</surname>
          </string-name>
          ,
          <article-title>Redit: A tool and</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>