<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PharmaER.IT: an Italian Dataset for Entity Recognition in the Pharmaceutical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Rigutini</string-name>
        </contrib>
        <aff>expert.ai, Siena (Italy)</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Despite significant advances in Natural Language Processing, applying state-of-the-art models to real-world business contexts remains challenging. A key obstacle is the mismatch between widely used academic benchmarks and the noisy, imbalanced data often encountered in domains such as finance, law, and medicine, especially in non-English languages, where resources are typically scarce. To address this gap, we introduce PharmaER.IT, a new dataset for entity recognition in the pharmaceutical and medical domain for the Italian language. PharmaER.IT is constructed from drug information leaflets obtained from the Agenzia Italiana del Farmaco, and annotated using either semi-automatic or fully automatic methods. The dataset comprises two complementary corpora: (1) the GOLD corpus, consisting of 57 leaflets annotated via a committee-based algorithm followed by expert manual validation, yielding 16833 high-quality entity mentions; and (2) the SILVER corpus, containing 2138 leaflets annotated solely through the automatic pipeline, without any human curation. We establish reference performance by evaluating a range of token classification models and several LLMs under zero-shot conditions.</p>
      </abstract>
      <kwd-group>
        <kwd>NER</kwd>
        <kwd>Pharmaceutical NER</kwd>
        <kwd>Dataset</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP), and arguably among the most in demand in industrial applications. While recent advances in transformer-based models [1] and Large Language Models (LLMs) [2, 3, 4, 5] have significantly improved entity extraction performance on standard benchmarks, their application to real-world business and professional contexts remains difficult. A primary challenge is the discrepancy between academic datasets and the often noisy, domain-specific texts encountered in real-world practice. This challenge is further exacerbated in specialized domains such as finance, law, and medicine. When dealing with languages other than English, annotated resources are even scarcer, or non-existent.</p>
      <p>In the medical and pharmaceutical domain, accurate entity recognition is critical for applications ranging from drug safety monitoring to automated clinical documentation. However, the Italian language remains underrepresented in the landscape of medical NER resources, limiting the development and evaluation of robust systems for local healthcare and regulatory contexts. Existing datasets are either too small, lack sufficient domain specificity, or are unavailable for public use due to privacy or licensing restrictions.</p>
      <p>In this article we present PharmaER.IT, a novel dataset for NER in the pharmaceutical field for the Italian language. The dataset is derived from Riassunti delle Caratteristiche del Prodotto (RCPs), the official drug information leaflets made publicly available by the Agenzia Italiana del Farmaco (AIFA). PharmaER.IT is composed of two complementary corpora: a curated GOLD corpus, consisting of 57 RCPs annotated using a committee-based approach and refined through expert manual validation, and a SILVER corpus, comprising 2138 RCPs automatically annotated without human intervention. This dual-corpus structure enables both evaluation and large-scale experimentation, facilitating the development of high-quality models as well as scalable weakly supervised approaches. To establish baseline performance, we evaluate a range of token classification models and several LLMs under diverse zero-shot settings.</p>
      <sec id="sec-1-1">
        <title>2. Related work</title>
        <p>CoNLL-2003 [6] was one of the first NER datasets, and it is still a reference corpus for NER. It consisted of news articles annotated with four entity types: person (PER), organization (ORG), location (LOC) and miscellaneous (MISC). Recently, numerous NER datasets have been released, many of which have been constructed using semi-automatic or fully automated annotation methods [7, 8, 2, 9], significantly expanding the entity tag-set. For instance, in Pile-NER [2], annotations were distilled from ChatGPT, resulting in about 45 thousand examples in English and more than 13 thousand distinct entity types. Even so, these resources do not target documents from vertical domains, such as finance or health. Other works proposed domain-specific NER corpora, such as in the financial [10, 11] and healthcare [12, 13] domains.</p>
        <p>URL 1: the URL scheme of the RCPs on the AIFA website: https://api.aifa.gov.it/aifa-bdf-eif-be/1.0.0/organizzazione/[sis]/farmaci/[aic]/stampati?ts=RCP</p>
        <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy. * Corresponding author. † These authors contributed equally. azugarini@expert.ai (A. Zugarini); lrigutini@expert.ai (L. Rigutini). ORCID 0000-0003-0344-1656 (A. Zugarini), 0000-0002-6309-2542 (L. Rigutini). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
        <p>However, these datasets are in English and mainly consist of well-curated, isolated sentences. In contrast, our work focuses on technical descriptions of pharmaceutical drugs.</p>
        <p>• Riassunto delle Caratteristiche del Prodotto (RCP). RCPs are documents aimed at healthcare professionals, with a more complex structure and more technical language and content. RCPs are approved documents, part of the marketing authorization for a drug, and adopt a medical-scientific terminology. In particular, RCPs contain detailed information on how to use the medicine, for example: therapeutic indications (what the medicine treats), dosage and method of administration, contraindications, special warnings, mechanism of action, and side effects.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Italian NER Datasets</title>
        <p>The availability of NER datasets in Italian is extremely limited, particularly outside the traditional general-purpose domains and entity label sets [14]. Indeed, most NER datasets focus on news and social media contents [15, 16, 17, 18, 19].</p>
        <p>Recently, Multinerd [20] was introduced, a multilingual dataset covering Italian and a set of 15 distinct entity types, among which the Disease and Biological Entity classes. However, its examples originate from Wikipedia and Wikinews sentences, which are typically educational and encyclopedic. In PharmaER.IT, instead, we collected drug leaflets, which present a highly technical and specialized lexicon. As an alternative strategy, [21] proposed to translate existing English healthcare NER datasets into Italian. Nonetheless, automatic translation may introduce errors or segmentation issues, especially in such a vertical domain.</p>
      </sec>
      <sec id="sec-1-2b">
        <title>3. Data Collection</title>
        <p>In order to create a highly pharmaceutical-oriented dataset in Italian, we collected documents from the AIFA website.</p>
        <p>Target documents. The AIFA is the official government institution that regulates the distribution of drugs in Italy1. The agency maintains the list of drugs authorized for sale in Italy, the list of pharmaceutical companies producing them, and all the documentation made available by the manufacturer for each drug, including the drug leaflet. The leaflet is the short information document that accompanies the drug in the package and comes in two types:</p>
        <p>• Foglietto Illustrativo (FI) - the Package Leaflet. This is a document aimed at patients, with a simplified structure and language.</p>
        <p>• Riassunto delle Caratteristiche del Prodotto (RCP). This is a document intended for healthcare professionals, with a more complex structure and more technical language and content.</p>
        <p>Given the strong technical content, we chose to use RCPs to build our dataset. We chose to use only the RCPs and to ignore the FIs since (1) the contents of the FIs are a subset of the contents of the RCPs, and (2) the RCPs contain technical information relating to pharmacological properties and therapeutic indications that provides information of diagnostic-prescriptive value.</p>
        <p>Data download. For each drug authorized for sale in Italy, AIFA assigns a unique AIC (Authorization for Placing on the Market) identification code2 consisting of 9 digits: the 3 most significant digits on the left identify the type of packaging (capsules or syrup, mg, etc.), while the remaining 6 on the right (possibly padded with zeros) uniquely identify the drug (this part is also referred to as AIC6). Similarly, for the companies producing the authorized drugs, AIFA assigns a unique three-digit code called SIS.</p>
        <p>The open-data section of the AIFA website3 contains several databases, including the list of drugs approved and distributed in Italy. This list can be downloaded as a csv file4 and itemizes the AIC codes of the authorized class A and class H drugs, together with a series of auxiliary information. In addition, the AIFA website also provides an API endpoint to download all available documentation for authorized drugs. The API allows downloading a drug's RCP by specifying the SIS and AIC6 codes and the type of document required (RCP) in the request URL, according to the scheme reported in URL 1.</p>
        <p>2In collaboration with the European Medicines Agency (EMA), if the drug is intended for multiple European countries. 3https://www.aifa.gov.it/open-data 4https://www.aifa.gov.it/web/guest/liste-dei-farmaci</p>
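The request-URL scheme above can be sketched with a small helper. This is a minimal illustration, not official AIFA client code: the function name `rcp_url` is hypothetical, and the zero-padding of AIC6 to six digits is an assumption drawn from the remark that the 6 right-most digits may be padded with zeros; only the URL scheme itself is taken from URL 1.

```python
# Illustrative sketch: build an RCP download URL from the SIS and AIC6
# codes, following the scheme shown in URL 1. The zero-padding of AIC6
# and the helper name are assumptions, not official AIFA documentation.

AIFA_API = "https://api.aifa.gov.it/aifa-bdf-eif-be/1.0.0"

def rcp_url(sis: str, aic6: str) -> str:
    """Assemble the request URL for the RCP of a drug (SIS + AIC6)."""
    aic6 = aic6.zfill(6)  # assumed: the 6 right-most digits are zero-padded
    return f"{AIFA_API}/organizzazione/{sis}/farmaci/{aic6}/stampati?ts=RCP"

print(rcp_url("123", "4567"))
```

A downloader would then simply issue an HTTP GET against the returned URL for each (SIS, AIC6) pair listed in the csv file.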
      </sec>
      <sec id="sec-1-3">
        <p>Using the drug list and the download API, we collected 8634 RCP files in PDF format relating to class A and class H drugs, which were then converted to raw text files using a PDF-to-text conversion tool. Figure 1 shows an excerpt from the first page of a downloaded RCP.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Data annotation</title>
      <p>For data labeling, we followed a semi-automatic procedure which included a human in the loop. Specifically, we first exploited a "committee" approach based on the use of two different automatic annotation models. Secondly, the annotated documents were reviewed by humans, with particular attention to the cases of discordant annotations returned by the two automatic annotators.</p>
      <sec id="sec-2-1">
        <title>Tag-set</title>
        <p>In designing the tag-set, we identified three families of data points: Chemicals, Condition and Organism. In this way, for each family it is possible to define a subset of related entity types. In the first version of the data, only Condition has more than one entity type, but we intend to extend the groups in the future. The resulting tag-set is reported in Table 1.</p>
        <sec id="sec-2-1-1">
          <title>Automatic Pre-annotation</title>
          <p>In the first step of annotation, documents are automatically labeled using a committee approach. The idea is to employ a limited group of automatic annotators, usually consisting of different algorithms and models, to generate multiple annotations for a single file. The acceptability of each annotation is subsequently assessed by examining the levels of concordance and discordance among these automatic annotators. We selected two approaches considered very different from each other, so that the concordance cases would provide a higher degree of reliability: (a) a neural annotator based on the use of a generative LLM and (b) a symbolic pre-annotator based on the use of an NLP platform.</p>
          <p>LLM-based pre-annotator. This automatic annotator was based on the use of a generative LLM. In particular, using a prompt specifically designed and developed for this task (Prompt 1), for each data-point type this annotator asks the LLM to identify all the entities belonging to the target data-point that are present within the text provided as input. When the length of the content exceeds the input size of the LLM, the content of the drug information leaflet is divided into smaller chunks respecting sentence boundaries. We chose to use Llama 3.1 70B5, a state-of-the-art, open source generative language model released by Meta that has reported excellent performance on several NLP tasks.</p>
          <p>Given the generative nature of the LLM (as opposed to
the word-classifying nature of the task), the result
consists of a list of the identified entities without the position
in which they were found (the start-end pairs). To obtain
the final set of occurrences, a post-processing procedure
performs the placement of the entities in the text using a
string-matching search approach.</p>
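The placement step described above can be sketched as follows. This is a simplified illustration of string-matching placement, not the authors' actual post-processing code; the function name and span representation are assumptions.

```python
# Sketch of the post-processing step: the LLM returns entity strings
# without positions, so each occurrence is located in the source text
# via string matching, yielding (start, end, label) spans.

def place_entities(text, entities):
    """Map (surface, label) pairs to (start, end, label) spans."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        while start != -1:
            spans.append((start, start + len(surface), label))
            start = text.find(surface, start + 1)  # next occurrence
    return sorted(spans)

text = "Il paracetamolo tratta la febbre. Il paracetamolo va dosato."
spans = place_entities(text, [("paracetamolo", "Chemical"), ("febbre", "Condition")])
```

Note that a real pipeline would also need to handle casing and inflection mismatches between the generated entity string and the source text, which plain `str.find` does not.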
          <p>Rule-based pre-annotator. This annotator was based on the use of deep linguistic analysis. In particular, we used an NLP platform that, thanks to integrated linguistic resources (knowledge graph, semantic disambiguator and linguistic rules), allowed us to identify the occurrences of entities within the RCPs. For this task we used the proprietary NLP platform of expert.ai6, an integrated environment for deep language understanding that provides a complete natural language workflow with end-to-end support for annotation, labeling, model training, testing and workflow orchestration. To increase the recognition performance, the selected NLP platform was specialized by integrating knowledge and linguistic rules for the medical and pharmaceutical domains.</p>
          <p>Given the lower generalization capacity typical of expert system approaches, an additional verification step was included for the entities identified by the rule-based annotator but missed by the LLM-based annotator. In particular, for such cases of discordance, a further query was performed to an LLM, asking it to confirm the extraction performed by the linguistic annotator. The prompt used is reported in Prompt 2. For this additional step, we used the OpenAI GPT APIs7.</p>
          <p>5https://ai.meta.com/blog/meta-llama-3-1/ 6https://www.expert.ai/products/expert-ai-platform/ 7https://openai.com/api/</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Human review</title>
        <p>The output of the automatic pre-annotation phase consists of duplicate versions of the same RCP, each with labels inserted by the two different automatic annotators.</p>
        <p>To produce the single and final annotated version, a subsequent review phase was necessary in which, for each document, the outputs of the two models were analyzed in order to be accepted or rejected, and in which any tags missed during the pre-annotation phase could also be inserted. In particular, a merged version of each RCP was created, reporting the outputs of the two annotators and highlighting the cases of agreement (both models had identified the occurrence of an entity) and disagreement (only one of the two models had hypothesized the occurrence of an entity). These "merged" documents were then distributed to human experts, who examined the annotations inserted by the pre-annotation phase (accepting or rejecting them), with the possibility of adding new ones.</p>
        <p>For this human validation phase, we employed a panel of five human reviewers and assigned each of them a set of RCPs randomly drawn from the total. To subsequently measure the degree of consistency of the final annotation outputs, we designed the assignment so that part of the documents were blindly shared between two reviewers. In this way, we obtained a total of 57 RCPs selected for human review, 6 of which were randomly and blindly assigned to two reviewers. This step was performed using the annotation support of the expert.ai natural language platform6.</p>
        <p>Review guidelines. To improve the consistency of the final annotations, a document containing guidelines was drafted and provided to the reviewers. In this document, a set of indications on how to consider ambiguous cases was specified, mainly based on the context in which they appear.</p>
        <p>Annotation quality assessment. To estimate the quality of the final annotation outputs, we exploited the set of 6 RCPs reviewed by a pair of human experts. In particular, we denote with A1(d) and A2(d) the two sets of annotations resulting from the review phase of reviewer rev1 and reviewer rev2, respectively.</p>
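The merge step described above can be sketched as follows. This is an illustrative simplification, not the expert.ai platform's merging logic: annotations are reduced to hashable (start, end, label) spans, an assumed representation, and each span is flagged as an agreement or a disagreement between the two automatic annotators.

```python
# Sketch of the "merged" document construction: combine the spans from
# the LLM-based and rule-based annotators, marking each span as
# 'agree' (produced by both) or 'disagree' (produced by only one),
# so that human reviewers can focus on the discordant cases.

def merge_annotations(ann_llm, ann_rules):
    """Return {span: status}, status in {'agree', 'disagree'}."""
    ann_llm, ann_rules = set(ann_llm), set(ann_rules)
    merged = {}
    for span in ann_llm | ann_rules:
        both = span in ann_llm and span in ann_rules
        merged[span] = "agree" if both else "disagree"
    return merged

a = [(0, 10, "Chemical"), (20, 26, "Condition")]   # LLM-based output
b = [(0, 10, "Chemical"), (40, 48, "Organism")]    # rule-based output
merged = merge_annotations(a, b)
```

In the actual pipeline the disagreements flowing from the rule-based annotator alone additionally pass through the LLM confirmation query of Prompt 2 before reaching the reviewers.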
        <sec id="sec-2-2-1">
          <title>Prompt 2</title>
          <p>The prompt used for the LLM-based validation step for the linguistic-based pre-annotator.</p>
          <p>On the same document d, we calculated several standard inter-annotator agreement measures8:</p>
          <p>(a) Joint Probability of Agreement, which measures the chance of having a match between the annotations resulting from the two reviewers: JPA = #(A1 ∩ A2) / #(A1 ∪ A2).</p>
          <p>(b) Conditional Probability of Agreement of rev_r, which measures the naive probability that annotations resulting from reviewer r have a match with the annotations resulting from the other reviewer: CPA_r = #(A1 ∩ A2) / #(A_r), r ∈ {1, 2}.</p>
          <p>(c) Coverage of rev_r, which measures the probability that a randomly selected annotation in A1 ∪ A2 comes from reviewer r: Cov_r = #(A_r) / #(A1 ∪ A2), r ∈ {1, 2}.</p>
          <p>(d) Cohen's kappa (κ), which extends the Joint Probability of Agreement taking into account that agreement may occur by chance [22]: κ = (p_o - p_e) / (1 - p_e), where p_o = JPA is the observed agreement, p_e = #(A1) · #(A2) / N^2 estimates the probability of a random agreement, and N = #(A1 ∪ A2) is the total number of annotations.</p>
          <p>The resulting Cohen's kappa values show a substantial agreement in the annotated data [23].</p>
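The agreement measures defined above translate directly into set operations. The sketch below is a minimal illustration, assuming each reviewer's annotations on a document are comparable as hashable items; the function name is hypothetical.

```python
# Illustrative computation of JPA, CPA, coverage and Cohen's kappa for
# two reviewers' annotation sets on the same document, following the
# definitions in the text (with N = #(A1 ∪ A2)).

def agreement(a1, a2):
    a1, a2 = set(a1), set(a2)
    union, inter = a1 | a2, a1 & a2
    n = len(union)
    jpa = len(inter) / n                              # (a) joint agreement
    cpa = {r: len(inter) / len(s) for r, s in ((1, a1), (2, a2))}   # (b)
    cov = {r: len(s) / n for r, s in ((1, a1), (2, a2))}            # (c)
    p_e = len(a1) * len(a2) / n ** 2                  # chance agreement
    kappa = (jpa - p_e) / (1 - p_e)                   # (d) Cohen's kappa
    return jpa, cpa, cov, kappa

jpa, cpa, cov, kappa = agreement({"a", "b", "c"}, {"b", "c", "d"})
```

Running the same computation per RCP and micro-averaging over documents gives the per-data-point figures reported for the 6 doubly-reviewed RCPs.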
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>These values were evaluated for each document d, and then averaged over all RCPs (micro-average), separately for each data point.</p>
        <p>8https://en.wikipedia.org/wiki/Inter-rater_reliability 9https://huggingface.co/datasets/expertai/PharmaER.IT</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Experiments</title>
      <p>NER has been traditionally tackled as a token
classification problem, with models fine-tuned on the
downstream task. With the emergence of LLMs, alternative
approaches to NER based on prompting in zero-shot or
few-shot settings have gained popularity. Therefore, we
established a set of baselines on PharmaER.IT using both
strategies and a wide range of models.</p>
      <p>Experimental setup</p>
      <p>In particular, we assessed on PharmaER.IT Llama-3.1-8B, LLaMAntino-3-8B [27], Minerva-7B [28], Velvet-14B11, Salamandra-7B [29], EuroLLM-9B [30] and Mistral-Small-3.1-24B12. We tested two different prompts: a simple one, where the LLM is asked to generate a structured JSON with the entity types as keys and the extracted entities as values, and a second one, in which a definition and some annotation guidelines are specified for each class. In this second evaluation, we also considered SLIMER-IT-8B [5], a fine-tuned version of LLaMAntino for zero-shot NER that follows the approach of [31]. Differently from the rest, SLIMER-IT-8B extracts one entity type at a time, thus each context is repeated 4 times.</p>
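The simpler of the two zero-shot prompts can be sketched as follows. The exact wording is an assumption (only the JSON output format, with entity types as keys and extracted entities as values, is taken from the text), and the tag-set names are illustrative.

```python
# Sketch of the simple zero-shot prompt and of the parsing of the
# model's JSON reply. Prompt wording and helper names are assumptions.
import json

ENTITY_TYPES = ["Chemical", "Condition", "Organism"]  # illustrative tag-set

def build_prompt(passage: str) -> str:
    keys = ", ".join(f'"{t}"' for t in ENTITY_TYPES)
    return (
        "Extract all entities from the text below. Answer with a JSON "
        f"object whose keys are {keys} and whose values are lists of "
        f"entity strings.\n\nText: {passage}"
    )

def parse_response(response: str) -> dict:
    """Parse the model output, tolerating missing keys."""
    data = json.loads(response)
    return {t: list(data.get(t, [])) for t in ENTITY_TYPES}

reply = '{"Chemical": ["ibuprofene"], "Condition": ["cefalea"]}'
parsed = parse_response(reply)
```

In practice the parser must also survive malformed generations; models that could not follow the format at all are exactly those reported below as unable to extract entities.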
      <p>Document Chunking. PharmaER.IT documents are characterized by their considerable length and dense presence of annotated entities, which poses specific challenges for NER models based on transformer architectures with fixed-length input windows. As shown in Figure 2, many documents exceed the standard maximum token limit (e.g., 512 tokens). Documents' size is also problematic for LLMs, which, despite supporting longer contexts, still face practical limits, especially when there are hundreds of entities per document. Therefore, documents are split into chunks. For encoders, we tokenize documents with their respective tokenizers, setting a maximum length of 512 tokens and a window stride of 64. Conversely, for LLMs, text is split into passages of sentences having at most 768 characters.</p>
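The character-based chunking used for LLMs can be sketched as a greedy packing of sentences. This is a rough illustration under a naive period-based sentence splitter, which is an assumption; any proper sentence segmenter would take its place.

```python
# Sketch of character-based chunking for LLMs: sentences are greedily
# packed into passages of at most 768 characters, never splitting a
# sentence across two passages (a sentence longer than the limit is
# kept whole in its own passage).

MAX_CHARS = 768

def chunk(text: str, max_chars: int = MAX_CHARS):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    passages, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and len(candidate) > max_chars:
            passages.append(current)   # close the current passage
            current = sent
        else:
            current = candidate
    if current:
        passages.append(current)
    return passages

doc = "Frase uno. " * 100
passages = chunk(doc)
```

For the encoder side, the Hugging Face tokenizers' sliding-window options (maximum length 512, stride 64) play the analogous role at the token level.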
      <sec id="sec-3-1">
        <title>Training</title>
        <p>Encoder models were fine-tuned on the train/validation/test split reported in Table 3. To augment the training data, we also added the Silver corpus to the train set, and we evaluated its impact on performance. In all the experiments we kept the learning rate fixed to 5e-5, with 8 epochs and early stopping with patience 3. Batch size was set to 16 in all the experiments without silver data, and to 128 otherwise. Unlike encoder-based models, LLMs were used without fine-tuning, relying solely on zero-shot prompts for entity extraction.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Models</title>
        <p>We evaluated several state-of-the-art transformer-based architectures for token classification that are widely adopted: bert [24], roberta [25] and xlm-roberta [26]. We studied them in different sizes and in pre-trained versions either specialized for Italian10 or multilingual.</p>
        <p>Concerning LLMs, we considered several backbones ranging from 7B to 24B parameters, i.e. in the small-medium size tier. We paid particular attention to models that were either pre-trained or further adapted for the Italian language, or that explicitly included Italian in their pre-training corpora.</p>
        <p>Metrics. All the models were evaluated on the test set measuring the F1 score. However, due to the different chunking and the non-positional nature of generative models, LLMs and token classifiers were evaluated independently. We adopted the standard micro-F1 score (simply denoted as F1) for token classification models on their positional predictions. For LLMs, instead, evaluation occurs at document level: first, we collect in each passage all the unique text spans extracted per class by the LLM, then we measure the F1 score against all the unique target entities of the document, following the UniNER [2] implementation13. Please note that these two F1 scores are computed on fundamentally different values, and therefore they are not comparable.</p>
        <p>10https://huggingface.co/dbmdz/bert-base-italian-cased 11https://huggingface.co/Almawave/Velvet-14B 12mistralai/Mistral-Small-3.1-24B-Instruct-2503</p>
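The document-level evaluation for LLMs can be sketched as a set comparison. This mirrors, in simplified form, the set-based matching of the UniNER evaluation; it is not that implementation, and the (class, span) representation is an assumption.

```python
# Sketch of document-level micro-F1 for LLM extraction: per document,
# the unique (class, text-span) predictions are compared against the
# unique gold entities, and precision/recall/F1 are accumulated over
# the whole test set.

def micro_f1(pred_docs, gold_docs):
    """pred_docs / gold_docs: lists of sets of (class, span) pairs."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_docs, gold_docs):
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

pred = [{("Chemical", "aspirina"), ("Condition", "febbre")}]
gold = [{("Chemical", "aspirina"), ("Condition", "dolore")}]
score = micro_f1(pred, gold)
```

The positional micro-F1 used for token classifiers instead compares (start, end, label) spans, which is why the two scores are not comparable.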
        <sec id="sec-3-2-1">
          <title>Results</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Token Classification</title>
        <p>From Table 5, we can observe that the F1 score varies from about 66 to 72 across all models when fine-tuned on the training set without the silver corpus. Roberta architectures yield the best scores; in particular, xlm-roberta-large achieves the best result in the no-silver setting.</p>
        <sec id="sec-3-3-1">
          <title>Impact of Silver Partition</title>
          <p>The results, shown in Table 5, clearly demonstrate that augmenting the training set with pre-annotated (silver) documents significantly enhances model performance. All evaluated models benefit from this data augmentation, with improvements reaching up to 9.64 F1 points. Notably, smaller models gain the most from the additional data, effectively narrowing the performance gap between base and large architectures. As a result, the base versions of RoBERTa and XLM-RoBERTa emerge as the best and second-best performing models, respectively.</p>
          <p>LLMs with Definition and Guidelines. When the prompt is enriched with entity type definitions and annotation guidelines, all the LLMs generally improve their scores, with the exception of LLaMAntino, which registers a slight drop. In particular, all the models extracted some entities. This suggests that with appropriate prompt design there is room for improving these baselines. Results (precision, recall and F1 per model) are presented in Table 7.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Off-the-shelf LLMs</title>
        <p>Zero-shot extraction of pharmaceutical entities is a challenging task in such an unfamiliar domain. Albeit Mistral-Small and Llama-based models achieve relevant scores, other LLMs, such as Salamandra, Minerva and Velvet-14B, were not able to follow the provided instructions. Therefore, in Table 6 we report the F1 scores of only the models that were able to extract entities.</p>
        <p>13https://github.com/universal-ner</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions and future works</title>
      <sec id="sec-4-1">
        <p>In this work, we presented PharmaER.IT, an Entity Recognition dataset for the pharmaceutical domain in the Italian language. PharmaER.IT was created from AIFA drug information leaflets. It includes two corpora: a curated GOLD corpus of 57 RCPs created semi-automatically, and the SILVER corpus, consisting of 2138 RCPs annotated without human intervention.</p>
        <p>To establish comprehensive baselines, we assessed a selection of different NER models, based both on token classification and on zero-shot extraction with LLMs.</p>
        <p>The resulting PharmaER.IT dataset has been released on HuggingFace14.</p>
        <p>In the future, we intend to extend PharmaER.IT in two directions. On one side, we plan to increase the amount of manually labeled data and to extend the label set with more domain-specific tags. On the other hand, we aim to introduce relations between entities in order to extend the dataset to Relation Extraction.</p>
        <p>Acknowledgements</p>
        <p>We thank S. Ligabue, V. Masucci, M. Spagnolli and S. M. Marotta for their support in the data preparation and annotation process. The work was partially funded by:</p>
        <p>• "MAESTRO - Mitigare le Allucinazioni dei Large Language Models: ESTRazione di informazioni Ottimizzate", a project funded by Provincia Autonoma di Trento with the Lp 6/99 Art. 5: ricerca e sviluppo, PAT/RFS067-05/06/2024-0428372, CUP C79J2300117000115;</p>
        <p>• "ReSpiRA - REplicabilità, SPIegabilità e Ragionamento", a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D2200090000416;</p>
        <p>• Villanova, a project financed by IPICEI-CIS, Prog. n. SA. 102519, CUP B29J2400085000517.</p>
        <p>14https://huggingface.co/datasets/expertai/PharmaER.IT 15MAESTRO: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/C79J23001170001 16RESPIRA: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B43D22000900004 17Villanova: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B29J24000850005</p>
        <p>References</p>
        <p>[1] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist model for named entity recognition using bidirectional transformer, 2023. arXiv:2311.08526.</p>
        <p>[2] W. Zhou, S. Zhang, Y. Gu, M. Chen, H. Poon, UniversalNER: Targeted distillation from large language models for open named entity recognition, arXiv preprint arXiv:2308.03279 (2023).</p>
        <p>[3] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.</p>
        <p>[4] O. Sainz, et al., GoLLIE: Annotation guidelines improve zero-shot information-extraction, 2024. arXiv:2310.03668.</p>
        <p>[5] A. Zamai, L. Rigutini, M. Maggini, A. Zugarini, SLIMER-IT: Zero-shot NER on Italian language, arXiv preprint arXiv:2409.15933 (2024).</p>
        <p>[6] E. F. T. K. Sang, F. D. Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003.</p>
        <p>[7] D. S. Menezes, P. Savarese, R. L. Milidiú, Building a massive corpus for named entity recognition using free open data sources, 2019. arXiv:1908.05758.</p>
        <p>[8] D. Alves, G. Thakkar, M. Tadić, Building and evaluating universal named-entity recognition English corpus, 2022. arXiv:2212.07162.</p>
        <p>[9] N. Ringland, X. Dai, B. Hachey, S. Karimi, C. Paris, J. R. Curran, NNE: A dataset for nested named entity recognition in English newswire, in: 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019. URL: https://aclanthology.org/P19-1510/.</p>
        <p>[10] L. Loukas, M. Fergadiotis, I. Chalkidis, E. Spyropoulou, P. Malakasiotis, I. Androutsopoulos, G. Paliouras, FiNER: Financial numeric entity recognition for XBRL tagging, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4419–4431.</p>
        <p>[11] A. Zugarini, A. Zamai, M. Ernandes, L. Rigutini, BUSTER: a 'business transaction entity recognition' dataset, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2023. doi:10.18653/v1/2023.emnlp-industry.57.</p>
        <p>[12] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database 2016 (2016).</p>
        <p>[13] C. Quirk, H. Poon, Distant supervision for relation extraction beyond the sentence boundary, arXiv preprint arXiv:1609.04873 (2016).</p>
        <p>[14] M. Marrero, J. Urbano, S. Sánchez-Cuadrado, J. Morato, J. M. Gómez-Berbís, Named entity recognition: Fallacies, challenges and opportunities, Computer Standards &amp; Interfaces 35 (2013) 482–489. doi:10.1016/j.csi.2012.09.004.</p>
        <p>[15] C. Bosco, V. Lombardo, L. Vassallo, A. Lesmo, Building a treebank for Italian: a data-driven annotation schema, in: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), 2000.</p>
        <p>[16] B. Magnini, E. Pianta, C. Girardi, M. Negri, L. Romano, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, I-CAB: the Italian content annotation bank, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy, 2006. URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/518_pdf.pdf.</p>
        <p>[17] V. Bartalesi Lenzi, M. Speranza, R. Sprugnoli, Named entity recognition on transcribed broadcast news at EVALITA 2011, in: Evaluation of Natural Language and Speech Tools for Italian, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 86–97.</p>
        <p>[18] P. Basile, A. Caputo, A. Gentile, G. Rizzo, Overview of the EVALITA 2016 named entity recognition and linking in Italian tweets (NEEL-IT) task, 2016.</p>
        <p>[19] T. Paccosi, A. Palmero Aprosio, KIND: an Italian multi-domain dataset for named entity recognition, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 501–507. URL: https://aclanthology.org/2022.lrec-1.52.</p>
        <p>[20] S. Tedeschi, R. Navigli, MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition (and disambiguation), in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 801–812. doi:10.18653/v1/2022.findings-naacl.60.</p>
        <p>[21] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localizing in-domain adaptation of transformer-based biomedical language models, Journal of Biomedical Informatics (2023). doi:10.1016/j.jbi.2023.104431.</p>
        <p>[22] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.</p>
        <p>[23] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159–174.</p>
        <p>[24] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
        <p>[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
        <p>[26] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).</p>
        <p>[27] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, arXiv preprint arXiv:2312.09993 (2023).</p>
        <p>[28] R. Orlando, L. Moroni, P.-L. H. Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 707–719.</p>
        <p>[29] A. Gonzalez-Agirre, M. Pàmies, J. Llop, I. Baucells, S. Da Dalt, D. Tamayo, J. J. Saiz, F. Espuña, J. Prats, J. Aula-Blasco, et al., Salamandra technical report, arXiv preprint arXiv:2502.08489 (2025).</p>
        <p>[30] P. H. Martins, J. Alves, P. Fernandes, N. M. Guerreiro, R. Rei, A. Farajian, M. Klimaszewski, D. M. Alves, J. Pombal, M. Faysse, et al., EuroLLM-9B: technical report, arXiv preprint arXiv:2506.04079 (2025).</p>
        <p>[31] A. Zamai, A. Zugarini, L. Rigutini, M. Ernandes, M. Maggini, Show less, instruct more: Enriching prompts with definitions and guidelines for zero-shot NER, arXiv preprint arXiv:2407.01272 (2024).</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: paraphrase and reword, improve writing style, and check grammar and spelling. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>