<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PharmaER.IT: an Italian Dataset for Entity Recognition in the Pharmaceutical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Rigutini</string-name>
        </contrib>
        <aff>expert.ai, Siena (Italy)</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Despite significant advances in Natural Language Processing, applying state-of-the-art models to real-world business contexts remains challenging. A key obstacle is the mismatch between widely used academic benchmarks and the noisy, imbalanced data often encountered in domains such as finance, law, and medicine, especially in non-English languages, where resources are typically scarce. To address this gap, we introduce PharmaER.IT, a new dataset for entity recognition in the pharmaceutical and medical domain for the Italian language. PharmaER.IT is constructed from drug information leaflets obtained from the Agenzia Italiana del Farmaco, and annotated using either semi-automatic or fully automatic methods. The dataset comprises two complementary corpora: (1) the GOLD corpus, consisting of 57 leaflets annotated via a committee-based algorithm followed by expert manual validation, yielding 16833 high-quality entity mentions; and (2) the SILVER corpus, containing 2138 leaflets annotated solely through the automatic pipeline, without any human curation. We establish reference performance by evaluating a range of token classification models and several LLMs under zero-shot conditions.</p>
      </abstract>
      <kwd-group>
        <kwd>NER</kwd>
        <kwd>Pharmaceutical NER</kwd>
        <kwd>Dataset</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP), and arguably among the most in demand in industrial applications. While recent advances in transformer-based models [1] and Large Language Models (LLMs) [2, 3, 4, 5] have significantly improved entity extraction performance on standard benchmarks, their application to real-world business and professional contexts remains difficult. A primary challenge is the discrepancy between academic datasets and the often noisy, domain-specific texts encountered in real-world practice. This challenge is further exacerbated in specialized domains such as finance, law, and medicine. When dealing with languages other than English, annotated resources are even scarcer, or non-existent.</p>
      <p>In the medical and pharmaceutical domain, accurate entity recognition is critical for applications ranging from drug safety monitoring to automated clinical documentation. However, the Italian language remains underrepresented in the landscape of medical NER resources, limiting the development and evaluation of robust systems for local healthcare and regulatory contexts. Existing datasets are either too small, lack sufficient domain specificity, or are unavailable for public use due to privacy or licensing restrictions.</p>
      <p>In this article we present PharmaER.IT, a novel dataset for NER in the pharmaceutical field for the Italian language. The dataset is derived from Riassunti delle Caratteristiche del Prodotto (RCPs), the official drug information leaflets made publicly available by the Agenzia Italiana del Farmaco (AIFA). PharmaER.IT is composed of two complementary corpora: a curated GOLD corpus, consisting of 57 RCPs annotated using a committee-based approach and refined through expert manual validation, and a SILVER corpus, comprising 2138 RCPs automatically annotated without human intervention. This dual-corpus structure enables both evaluation and large-scale experimentation, facilitating the development of high-quality models as well as scalable weakly supervised approaches. To establish baseline performance, we evaluate a range of token classification models and several LLMs under diverse zero-shot settings.</p>
      <sec id="sec-1-1">
        <title>2. Related work</title>
        <p>CoNLL-2003 [6] was one of the first NER datasets, and it is still a reference corpus for NER. It consisted of news articles annotated with four entity types: person (PER), organization (ORG), location (LOC) and miscellaneous (MISC). Recently, numerous NER datasets have been released, many of which have been constructed using semi-automatic or fully automated annotation methods [7, 8, 2, 9], significantly expanding the entity tag-set. For instance, in Pile-NER [2], annotations were distilled from ChatGPT, resulting in about 45 thousand examples in English and more than 13 thousand distinct entity types. Even so, these resources do not target documents from vertical domains, such as finance or health. Other works proposed domain-specific NER corpora, such as in the financial [10, 11] and healthcare [12, 13] domains.</p>
        <p>URL 1: the URL scheme of the RCPs on the AIFA website: https://api.aifa.gov.it/aifa-bdf-eif-be/1.0.0/organizzazione/[sis]/farmaci/[aic]/stampati?ts=RCP</p>
        <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24-26, 2025, Cagliari, Italy. * Corresponding author. † These authors contributed equally. azugarini@expert.ai (A. Zugarini); lrigutini@expert.ai (L. Rigutini). ORCID 0000-0003-0344-1656 (A. Zugarini), 0000-0002-6309-2542 (L. Rigutini). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
        <p>However, these datasets are in English and mainly consist of well-curated, isolated sentences. In contrast, our work focuses on technical descriptions of pharmaceutical drugs.</p>
        <p>• Riassunto delle Caratteristiche del Prodotto (RCP). RCPs are documents aimed at healthcare professionals, with a more complex structure and more technical language and content. RCPs are approved documents, part of the marketing authorization for a drug, and adopt a medical-scientific terminology. In particular, RCPs contain detailed information on how to use the medicine, for example: therapeutic indications (what the medicine treats), dosage and method of administration, contraindications, special warnings, mechanism of action, and side effects.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Italian NER Datasets</title>
        <p>The availability of NER datasets in Italian is extremely limited, particularly outside the traditional general-purpose domains and entity label sets [14]. Indeed, most NER datasets focus on news and social media contents [15, 16, 17, 18, 19].</p>
        <p>Recently, Multinerd [20] was introduced, a multilingual dataset covering Italian and a set of 15 distinct entity types, among which the Disease and Biological Entity classes. However, its examples originate from Wikipedia and Wikinews sentences, which are typically educational and encyclopedic. In PharmaER.IT, instead, we collected drug leaflets, which present a highly technical and specialized lexicon. As an alternative strategy, [21] proposed to translate existing English healthcare NER datasets into Italian. Nonetheless, automatic translation may introduce errors or segmentation issues, especially in such a vertical domain.</p>
      </sec>
      <sec id="sec-1-2b">
        <title>3. Data Collection</title>
        <p>In order to create a highly pharmaceutical-oriented dataset in Italian, we collected documents from the AIFA website.</p>
        <p>Target documents. The AIFA is the official government institution that regulates the distribution of drugs in Italy1. The agency maintains the list of drugs authorized for sale in Italy, the list of pharmaceutical companies producing them, and all the documentation made available by the manufacturer for each drug, including the drug leaflet. The leaflet is the short information document that accompanies the drug in the package and comes in two types:</p>
        <p>• Foglietto Illustrativo (FI) - the Package Leaflet. This is a document aimed at patients, with a simplified structure and language.</p>
        <p>• Riassunto delle Caratteristiche del Prodotto (RCP). This is a document intended for healthcare professionals, with a more complex structure and more technical language and content.</p>
        <p>Given the strong technical content, we chose to use RCPs to build our dataset. We chose to use only the RCPs and to ignore the FIs since (1) the contents of the FIs are a subset of the contents of the RCPs, and (2) the RCPs contain technical information relating to pharmacological properties and therapeutic indications that provides information of diagnostic-prescriptive value.</p>
        <p>Data download. For each drug authorized for sale in Italy, AIFA assigns a unique AIC (Authorization for Placing on the Market) identification code2 consisting of 9 digits: the 3 most significant digits on the left identify the type of packaging (capsules or syrup, mg, etc.), while the remaining 6 on the right (possibly padded with zeros) uniquely identify the drug (this part is also referred to as AIC6). Similarly, for the companies producing the authorized drugs, AIFA assigns a unique three-digit code called SIS.</p>
        <p>The open-data section of the AIFA website3 contains several databases, including the list of drugs approved and distributed in Italy. This list can be downloaded as a csv file4 and itemizes the AIC codes of the authorized class A and class H drugs, together with a series of auxiliary information. In addition, the AIFA website also provides an API endpoint to download all available documentation for authorized drugs. The API allows downloading a drug's RCP by specifying the SIS and AIC6 codes and the type of document required (RCP) in the request URL, according to the scheme reported in URL 1.</p>
        <p>2In collaboration with the European Medicines Agency (EMA), if the drug is intended for multiple European countries. 3https://www.aifa.gov.it/open-data 4https://www.aifa.gov.it/web/guest/liste-dei-farmaci</p>
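The request-URL scheme above can be sketched with a small helper. This is a minimal illustration, not official AIFA client code: the function name `rcp_url` is hypothetical, and the zero-padding of AIC6 to six digits is an assumption drawn from the remark that the 6 right-most digits may be padded with zeros; only the URL scheme itself is taken from URL 1.

```python
# Illustrative sketch: build an RCP download URL from the SIS and AIC6
# codes, following the scheme shown in URL 1. The zero-padding of AIC6
# and the helper name are assumptions, not official AIFA documentation.

AIFA_API = "https://api.aifa.gov.it/aifa-bdf-eif-be/1.0.0"

def rcp_url(sis: str, aic6: str) -> str:
    """Assemble the request URL for the RCP of a drug (SIS + AIC6)."""
    aic6 = aic6.zfill(6)  # assumed: the 6 right-most digits are zero-padded
    return f"{AIFA_API}/organizzazione/{sis}/farmaci/{aic6}/stampati?ts=RCP"

print(rcp_url("123", "4567"))
```

A downloader would then simply issue an HTTP GET against the returned URL for each (SIS, AIC6) pair listed in the csv file.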
      </sec>
      <sec id="sec-1-3">
        <p>Using the drug list and the download API, we collected 8634 RCP files in PDF format relating to class A and class H drugs, which were then converted to raw text files using a PDF-to-text conversion tool. Figure 1 shows an excerpt from the first page of a downloaded RCP.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Data annotation</title>
      <p>For data labeling, we followed a semi-automatic procedure which included a human in the loop. Specifically, we first exploited a "committee" approach based on the use of two different automatic annotation models. Secondly, the annotated documents were reviewed by humans, with particular attention to the cases of discordant annotations returned by the two automatic annotators.</p>
      <sec id="sec-2-1">
        <title>Tag-set</title>
        <p>In designing the tag-set, we identified three families of data points: Chemicals, Condition and Organism. In this way, for each family it is possible to define a subset of related entity types. In the first version of the data, only Condition has more than one entity type, but we intend to extend the groups in the future. The resulting tag-set is reported in Table 1.</p>
        <sec id="sec-2-1-1">
          <title>Automatic Pre-annotation</title>
          <p>In the first step of annotation, documents are automatically labeled using a committee approach. The idea is to employ a limited group of automatic annotators, usually consisting of different algorithms and models, to generate multiple annotations for a single file. The acceptability of each annotation is subsequently assessed by examining the levels of concordance and discordance among these automatic annotators. We selected two approaches considered very different from each other, so that the concordance cases would provide a higher degree of reliability: (a) a neural annotator based on the use of a generative LLM and (b) a symbolic pre-annotator based on the use of an NLP platform.</p>
          <p>LLM-based pre-annotator. This automatic annotator was based on the use of a generative LLM. In particular, using a prompt specifically designed and developed for this task (Prompt 1), for each data-point type this annotator asks the LLM to identify all the entities belonging to the target data-point that are present within the text provided as input. When the length of the content exceeds the input size of the LLM, the content of the drug information leaflet is divided into smaller chunks respecting sentence boundaries. We chose to use Llama 3.1 70B5, a state-of-the-art, open source generative language model released by Meta that has reported excellent performance on several NLP tasks.</p>
          <p>Given the generative nature of the LLM (as opposed to
the word-classifying nature of the task), the result
consists of a list of the identified entities without the position
in which they were found (the start-end pairs). To obtain
the final set of occurrences, a post-processing procedure
performs the placement of the entities in the text using a
string-matching search approach.</p>
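The placement step described above can be sketched as follows. This is a simplified illustration of string-matching placement, not the authors' actual post-processing code; the function name and span representation are assumptions.

```python
# Sketch of the post-processing step: the LLM returns entity strings
# without positions, so each occurrence is located in the source text
# via string matching, yielding (start, end, label) spans.

def place_entities(text, entities):
    """Map (surface, label) pairs to (start, end, label) spans."""
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        while start != -1:
            spans.append((start, start + len(surface), label))
            start = text.find(surface, start + 1)  # next occurrence
    return sorted(spans)

text = "Il paracetamolo tratta la febbre. Il paracetamolo va dosato."
spans = place_entities(text, [("paracetamolo", "Chemical"), ("febbre", "Condition")])
```

Note that a real pipeline would also need to handle casing and inflection mismatches between the generated entity string and the source text, which plain `str.find` does not.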
          <p>Rule-based pre-annotator. This annotator was based on the use of deep linguistic analysis. In particular, we used an NLP platform that, thanks to integrated linguistic resources (knowledge graph, semantic disambiguator and linguistic rules), allowed us to identify the occurrences of entities within the RCPs. For this task we used the proprietary NLP platform of expert.ai6, an integrated environment for deep language understanding that provides a complete natural language workflow with end-to-end support for annotation, labeling, model training, testing and workflow orchestration. To increase the recognition performance, the selected NLP platform was specialized by integrating knowledge and linguistic rules for the medical and pharmaceutical domains.</p>
          <p>Given the lower generalization capacity typical of expert system approaches, an additional verification step was included for the entities identified by the rule-based annotator but missed by the LLM-based annotator. In particular, for such cases of discordance, a further query was performed to an LLM, asking it to confirm the extraction performed by the linguistic annotator. The prompt used is reported in Prompt 2. For this additional step, we used the OpenAI GPT APIs7.</p>
          <p>5https://ai.meta.com/blog/meta-llama-3-1/ 6https://www.expert.ai/products/expert-ai-platform/ 7https://openai.com/api/</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Human review</title>
        <p>The output of the automatic pre-annotation phase consists of duplicate versions of the same RCP, each with labels inserted by the two different automatic annotators.</p>
        <p>To produce the single and final annotated version, a subsequent review phase was necessary in which, for each document, the outputs of the two models were analyzed in order to be accepted or rejected, and in which any tags missed during the pre-annotation phase could also be inserted. In particular, a merged version of each RCP was created, reporting the outputs of the two annotators and highlighting the cases of agreement (both models had identified the occurrence of an entity) and disagreement (only one of the two models had hypothesized the occurrence of an entity). These "merged" documents were then distributed to human experts, who examined the annotations inserted by the pre-annotation phase (accepting or rejecting them), with the possibility of adding new ones.</p>
        <p>For this human validation phase, we employed a panel of five human reviewers and assigned each of them a set of RCPs randomly drawn from the total. To subsequently measure the degree of consistency of the final annotation outputs, we designed the assignment so that part of the documents were blindly shared between two reviewers. In this way, we obtained a total of 57 RCPs selected for human review, 6 of which were randomly and blindly assigned to two reviewers. This step was performed using the annotation support of the expert.ai natural language platform6.</p>
        <p>Review guidelines. To improve the consistency of the final annotations, a document containing guidelines was drafted and provided to the reviewers. In this document, a set of indications on how to consider ambiguous cases was specified, mainly based on the context in which they appear.</p>
        <p>Annotation quality assessment. To estimate the quality of the final annotation outputs, we exploited the set of 6 RCPs reviewed by a pair of human experts. In particular, we denote with A1(d) and A2(d) the two sets of annotations resulting from the review phase of reviewer rev1 and reviewer rev2, respectively.</p>
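The merge step described above can be sketched as follows. This is an illustrative simplification, not the expert.ai platform's merging logic: annotations are reduced to hashable (start, end, label) spans, an assumed representation, and each span is flagged as an agreement or a disagreement between the two automatic annotators.

```python
# Sketch of the "merged" document construction: combine the spans from
# the LLM-based and rule-based annotators, marking each span as
# 'agree' (produced by both) or 'disagree' (produced by only one),
# so that human reviewers can focus on the discordant cases.

def merge_annotations(ann_llm, ann_rules):
    """Return {span: status}, status in {'agree', 'disagree'}."""
    ann_llm, ann_rules = set(ann_llm), set(ann_rules)
    merged = {}
    for span in ann_llm | ann_rules:
        both = span in ann_llm and span in ann_rules
        merged[span] = "agree" if both else "disagree"
    return merged

a = [(0, 10, "Chemical"), (20, 26, "Condition")]   # LLM-based output
b = [(0, 10, "Chemical"), (40, 48, "Organism")]    # rule-based output
merged = merge_annotations(a, b)
```

In the actual pipeline the disagreements flowing from the rule-based annotator alone additionally pass through the LLM confirmation query of Prompt 2 before reaching the reviewers.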
        <sec id="sec-2-2-1">
          <title>Prompt 2</title>
          <p>The prompt used for the LLM-based validation step for the linguistic-based pre-annotator.</p>
          <p>On the same document d, we calculated several standard inter-annotator agreement measures8:</p>
          <p>(a) Joint Probability of Agreement, which measures the chance of having a match between the annotations resulting from the two reviewers: JPA = #(A1 ∩ A2) / #(A1 ∪ A2).</p>
          <p>(b) Conditional Probability of Agreement of rev_r, which measures the naive probability that annotations resulting from reviewer r have a match with the annotations resulting from the other reviewer: CPA_r = #(A1 ∩ A2) / #(A_r), r ∈ {1, 2}.</p>
          <p>(c) Coverage of rev_r, which measures the probability that a randomly selected annotation in A1 ∪ A2 comes from reviewer r: Cov_r = #(A_r) / #(A1 ∪ A2), r ∈ {1, 2}.</p>
          <p>(d) Cohen's kappa (κ), which extends the Joint Probability of Agreement taking into account that agreement may occur by chance [22]: κ = (p_o - p_e) / (1 - p_e), where p_o = JPA is the observed agreement, p_e = #(A1) · #(A2) / N^2 estimates the probability of a random agreement, and N = #(A1 ∪ A2) is the total number of annotations.</p>
          <p>The resulting Cohen's kappa values show a substantial agreement in the annotated data [23].</p>
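The agreement measures defined above translate directly into set operations. The sketch below is a minimal illustration, assuming each reviewer's annotations on a document are comparable as hashable items; the function name is hypothetical.

```python
# Illustrative computation of JPA, CPA, coverage and Cohen's kappa for
# two reviewers' annotation sets on the same document, following the
# definitions in the text (with N = #(A1 ∪ A2)).

def agreement(a1, a2):
    a1, a2 = set(a1), set(a2)
    union, inter = a1 | a2, a1 & a2
    n = len(union)
    jpa = len(inter) / n                              # (a) joint agreement
    cpa = {r: len(inter) / len(s) for r, s in ((1, a1), (2, a2))}   # (b)
    cov = {r: len(s) / n for r, s in ((1, a1), (2, a2))}            # (c)
    p_e = len(a1) * len(a2) / n ** 2                  # chance agreement
    kappa = (jpa - p_e) / (1 - p_e)                   # (d) Cohen's kappa
    return jpa, cpa, cov, kappa

jpa, cpa, cov, kappa = agreement({"a", "b", "c"}, {"b", "c", "d"})
```

Running the same computation per RCP and micro-averaging over documents gives the per-data-point figures reported for the 6 doubly-reviewed RCPs.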
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>These values were evaluated for each document d, and then averaged over all RCPs (micro-average), separately for each data point.</p>
        <p>8https://en.wikipedia.org/wiki/Inter-rater_reliability 9https://huggingface.co/datasets/expertai/PharmaER.IT</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Experiments</title>
      <p>NER has been traditionally tackled as a token
classification problem, with models fine-tuned on the
downstream task. With the emergence of LLMs, alternative
approaches to NER based on prompting in zero-shot or
few-shot settings have gained popularity. Therefore, we
established a set of baselines on PharmaER.IT using both
strategies and a wide range of models.</p>
      <p>Experimental setup</p>
      <p>In particular, we assessed on PharmaER.IT Llama-3.1-8B, LLaMAntino-3-8B [27], Minerva-7B [28], Velvet-14B11, Salamandra-7B [29], EuroLLM-9B [30] and Mistral-Small-3.1-24B12. We tested two different prompts: a simple one, where the LLM is asked to generate a structured JSON with the entity types as keys and the extracted entities as values, and a second one, in which a definition and some annotation guidelines are specified for each class. In this second evaluation, we also considered SLIMER-IT-8B [5], a fine-tuned version of LLaMAntino for zero-shot NER that follows the approach of [31]. Differently from the rest, SLIMER-IT-8B extracts one entity type at a time, thus each context is repeated 4 times.</p>
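The simpler of the two zero-shot prompts can be sketched as follows. The exact wording is an assumption (only the JSON output format, with entity types as keys and extracted entities as values, is taken from the text), and the tag-set names are illustrative.

```python
# Sketch of the simple zero-shot prompt and of the parsing of the
# model's JSON reply. Prompt wording and helper names are assumptions.
import json

ENTITY_TYPES = ["Chemical", "Condition", "Organism"]  # illustrative tag-set

def build_prompt(passage: str) -> str:
    keys = ", ".join(f'"{t}"' for t in ENTITY_TYPES)
    return (
        "Extract all entities from the text below. Answer with a JSON "
        f"object whose keys are {keys} and whose values are lists of "
        f"entity strings.\n\nText: {passage}"
    )

def parse_response(response: str) -> dict:
    """Parse the model output, tolerating missing keys."""
    data = json.loads(response)
    return {t: list(data.get(t, [])) for t in ENTITY_TYPES}

reply = '{"Chemical": ["ibuprofene"], "Condition": ["cefalea"]}'
parsed = parse_response(reply)
```

In practice the parser must also survive malformed generations; models that could not follow the format at all are exactly those reported below as unable to extract entities.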
      <p>Document Chunking. PharmaER.IT documents are characterized by their considerable length and dense presence of annotated entities, which poses specific challenges for NER models based on transformer architectures with fixed-length input windows. As shown in Figure 2, many documents exceed the standard maximum token limit (e.g., 512 tokens). Documents' size is also problematic for LLMs, which, despite supporting longer contexts, still face practical limits, especially when there are hundreds of entities per document. Therefore, documents are split into chunks. For encoders, we tokenize documents with their respective tokenizers, setting a maximum length of 512 tokens and a window stride of 64. Conversely, for LLMs, text is split into passages of sentences having at most 768 characters.</p>
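The character-based chunking used for LLMs can be sketched as a greedy packing of sentences. This is a rough illustration under a naive period-based sentence splitter, which is an assumption; any proper sentence segmenter would take its place.

```python
# Sketch of character-based chunking for LLMs: sentences are greedily
# packed into passages of at most 768 characters, never splitting a
# sentence across two passages (a sentence longer than the limit is
# kept whole in its own passage).

MAX_CHARS = 768

def chunk(text: str, max_chars: int = MAX_CHARS):
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    passages, current = [], ""
    for sent in sentences:
        candidate = (current + " " + sent).strip()
        if current and len(candidate) > max_chars:
            passages.append(current)   # close the current passage
            current = sent
        else:
            current = candidate
    if current:
        passages.append(current)
    return passages

doc = "Frase uno. " * 100
passages = chunk(doc)
```

For the encoder side, the Hugging Face tokenizers' sliding-window options (maximum length 512, stride 64) play the analogous role at the token level.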
      <sec id="sec-3-1">
        <title>Training</title>
        <p>Encoder models were fine-tuned on the train/validation/test split reported in Table 3. To augment the training data, we also added the Silver corpus to the train set, and we evaluated its impact on performance. In all the experiments we kept the learning rate fixed to 5e-5, with 8 epochs and early stopping with patience 3. Batch size was set to 16 in all the experiments without silver data, and to 128 otherwise. Unlike encoder-based models, LLMs were used without fine-tuning, relying solely on zero-shot prompts for entity extraction.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Models</title>
        <p>We evaluated several state-of-the-art transformer-based architectures for token classification that are widely adopted: bert [24], roberta [25] and xlm-roberta [26]. We studied them in different sizes and in pre-trained versions either specialized for Italian10 or multilingual.</p>
        <p>Concerning LLMs, we considered several backbones ranging from 7B to 24B parameters, i.e. in the small-medium size tier. We paid particular attention to models that were either pre-trained or further adapted for the Italian language, or that explicitly included Italian in their pre-training corpora.</p>
        <p>Metrics. All the models were evaluated on the test set measuring the F1 score. However, due to the different chunking and the non-positional nature of generative models, LLMs and token classifiers were evaluated independently. We adopted the standard micro-F1 score (simply denoted as F1) for token classification models on their positional predictions. For LLMs, instead, evaluation occurs at document level: first, we collect in each passage all the unique text spans extracted per class by the LLM, then we measure the F1 score against all the unique target entities of the document, following the UniNER [2] implementation13. Please note that these two F1 scores are computed on fundamentally different values, and therefore they are not comparable.</p>
        <p>10https://huggingface.co/dbmdz/bert-base-italian-cased 11https://huggingface.co/Almawave/Velvet-14B 12mistralai/Mistral-Small-3.1-24B-Instruct-2503</p>
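The document-level evaluation for LLMs can be sketched as a set comparison. This mirrors, in simplified form, the set-based matching of the UniNER evaluation; it is not that implementation, and the (class, span) representation is an assumption.

```python
# Sketch of document-level micro-F1 for LLM extraction: per document,
# the unique (class, text-span) predictions are compared against the
# unique gold entities, and precision/recall/F1 are accumulated over
# the whole test set.

def micro_f1(pred_docs, gold_docs):
    """pred_docs / gold_docs: lists of sets of (class, span) pairs."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_docs, gold_docs):
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

pred = [{("Chemical", "aspirina"), ("Condition", "febbre")}]
gold = [{("Chemical", "aspirina"), ("Condition", "dolore")}]
score = micro_f1(pred, gold)
```

The positional micro-F1 used for token classifiers instead compares (start, end, label) spans, which is why the two scores are not comparable.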
        <sec id="sec-3-2-1">
          <title>Results</title>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Token Classification</title>
        <p>From Table 5, we can observe that the F1 score varies from about 66 to 72 across all models when fine-tuned on the training set without the silver corpus. Roberta architectures yield the best scores; in particular, xlm-roberta-large achieves the best result in the no-silver setting.</p>
        <sec id="sec-3-3-1">
          <title>Impact of Silver Partition</title>
          <p>The results, shown in Table 5, clearly demonstrate that augmenting the training set with pre-annotated (silver) documents significantly enhances model performance. All evaluated models benefit from this data augmentation, with improvements reaching up to 9.64 F1 points. Notably, smaller models gain the most from the additional data, effectively narrowing the performance gap between base and large architectures. As a result, the base versions of RoBERTa and XLM-RoBERTa emerge as the best and second-best performing models, respectively.</p>
          <p>LLMs with Definition and Guidelines. When the prompt is enriched with entity type definitions and annotation guidelines, all the LLMs generally improve their scores, with the exception of LLaMAntino, which registers a slight drop. In particular, all the models extracted some entities. This suggests that with appropriate prompt design there is room for improving these baselines. Results (precision, recall and F1 per model) are presented in Table 7.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Off-the-shelf LLMs</title>
        <p>Zero-shot extraction of pharmaceutical entities is a challenging task in such an unfamiliar domain. Albeit Mistral-Small and Llama-based models achieve relevant scores, other LLMs, such as Salamandra, Minerva and Velvet-14B, were not able to follow the provided instructions. Therefore, in Table 6 we report the F1 scores of only the models that were able to extract entities.</p>
        <p>13https://github.com/universal-ner</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions and future works</title>
      <sec id="sec-4-1">
        <p>In this work, we presented PharmaER.IT, an Entity Recognition dataset for the pharmaceutical domain in the Italian language. PharmaER.IT was created from AIFA drug information leaflets. It includes two corpora: a curated GOLD corpus of 57 RCPs created semi-automatically, and the SILVER corpus, consisting of 2138 RCPs annotated without human intervention.</p>
        <p>To establish comprehensive baselines, we assessed a selection of different NER models, based both on token classification and on zero-shot extraction with LLMs.</p>
        <p>The resulting PharmaER.IT dataset has been released on HuggingFace14.</p>
        <p>In the future, we intend to extend PharmaER.IT in two directions. On one side, we plan to increase the amount of manually labeled data and to extend the label set with more domain-specific tags. On the other hand, we aim to introduce relations between entities in order to extend the dataset to Relation Extraction.</p>
        <p>Acknowledgements</p>
        <p>We thank S. Ligabue, V. Masucci, M. Spagnolli and S. M. Marotta for their support in the data preparation and annotation process. The work was partially funded by:</p>
        <p>• "MAESTRO - Mitigare le Allucinazioni dei Large Language Models: ESTRazione di informazioni Ottimizzate", a project funded by Provincia Autonoma di Trento with the Lp 6/99 Art. 5: ricerca e sviluppo, PAT/RFS067-05/06/2024-0428372, CUP C79J2300117000115;</p>
        <p>• "ReSpiRA - REplicabilità, SPIegabilità e Ragionamento", a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D2200090000416;</p>
        <p>• Villanova, a project financed by IPICEI-CIS, Prog. n. SA. 102519, CUP B29J2400085000517.</p>
        <p>14https://huggingface.co/datasets/expertai/PharmaER.IT 15MAESTRO: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/C79J23001170001 16RESPIRA: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B43D22000900004 17Villanova: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B29J24000850005</p>
        <p>References</p>
        <p>[1] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist model for named entity recognition using bidirectional transformer, 2023. arXiv:2311.08526.</p>
        <p>[2] W. Zhou, S. Zhang, Y. Gu, M. Chen, H. Poon, UniversalNER: Targeted distillation from large language models for open named entity recognition, arXiv preprint arXiv:2308.03279 (2023).</p>
        <p>[3] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.</p>
        <p>[4] O. Sainz, et al., GoLLIE: Annotation guidelines improve zero-shot information-extraction, 2024. arXiv:2310.03668.</p>
        <p>[5] A. Zamai, L. Rigutini, M. Maggini, A. Zugarini, SLIMER-IT: Zero-shot NER on Italian language, arXiv preprint arXiv:2409.15933 (2024).</p>
        <p>[6] E. F. T. K. Sang, F. D. Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003.</p>
        <p>[7] D. S. Menezes, P. Savarese, R. L. Milidiú, Building a massive corpus for named entity recognition using free open data sources, 2019. arXiv:1908.05758.</p>
        <p>[8] D. Alves, G. Thakkar, M. Tadić, Building and evaluating universal named-entity recognition English corpus, 2022. arXiv:2212.07162.</p>
        <p>[9] N. Ringland, X. Dai, B. Hachey, S. Karimi, C. Paris, J. R. Curran, NNE: A dataset for nested named entity recognition in English newswire, in: 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019. URL: https://aclanthology.org/P19-1510/.</p>
        <p>[10] L. Loukas, M. Fergadiotis, I. Chalkidis, E. Spyropoulou, P. Malakasiotis, I. Androutsopoulos, G. Paliouras, FiNER: Financial numeric entity recognition for XBRL tagging, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4419–4431.</p>
        <p>[11] A. Zugarini, A. Zamai, M. Ernandes, L. Rigutini, BUSTER: a 'business transaction entity recognition' dataset, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2023. doi:10.18653/v1/2023.emnlp-industry.57.</p>
        <p>[12] J. Li, Y. Sun, R. J. Johnson, D. Sciaky, C.-H. Wei, R. Leaman, A. P. Davis, C. J. Mattingly, T. C. Wiegers, Z. Lu, BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database 2016 (2016).</p>
        <p>[13] C. Quirk, H. Poon, Distant supervision for relation extraction beyond the sentence boundary, arXiv preprint arXiv:1609.04873 (2016).</p>
        <p>[14] M. Marrero, J. Urbano, S. Sánchez-Cuadrado, J. Morato, J. M. Gómez-Berbís, Named entity recognition: Fallacies, challenges and opportunities, Computer Standards &amp; Interfaces 35 (2013) 482–489. doi:10.1016/j.csi.2012.09.004.</p>
        <p>[15] C. Bosco, V. Lombardo, L. Vassallo, A. Lesmo, Building a treebank for Italian: a data-driven annotation schema, in: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), 2000.</p>
        <p>[16] B. Magnini, E. Pianta, C. Girardi, M. Negri, L. Romano, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, I-CAB: the Italian content annotation bank, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy, 2006. URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/518_pdf.pdf.</p>
        <p>[17] V. Bartalesi Lenzi, M. Speranza, R. Sprugnoli, Named entity recognition on transcribed broadcast news at EVALITA 2011, in: Evaluation of Natural Language and Speech Tools for Italian, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 86–97.</p>
        <p>[18] P. Basile, A. Caputo, A. Gentile, G. Rizzo, Overview of the EVALITA 2016 named entity recognition and linking in Italian tweets (NEEL-IT) task, 2016.</p>
        <p>[19] T. Paccosi, A. Palmero Aprosio, KIND: an Italian multi-domain dataset for named entity recognition, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 501–507. URL: https://aclanthology.org/2022.lrec-1.52.</p>
        <p>[20] S. Tedeschi, R. Navigli, MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition (and disambiguation), in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 801–812. doi:10.18653/v1/2022.findings-naacl.60.</p>
        <p>[21] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localizing in-domain adaptation of transformer-based biomedical language models, Journal of Biomedical Informatics (2023). doi:10.1016/j.jbi.2023.104431.</p>
        <p>[22] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.</p>
        <p>[23] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159–174.</p>
        <p>[24] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
        <p>[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
        <p>[26] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).</p>
        <p>[27] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, arXiv preprint arXiv:2312.09993 (2023).</p>
        <p>[28] R. Orlando, L. Moroni, P.-L. H. Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 707–719.</p>
        <p>[29] A. Gonzalez-Agirre, M. Pàmies, J. Llop, I. Baucells, S. Da Dalt, D. Tamayo, J. J. Saiz, F. Espuña, J. Prats, J. Aula-Blasco, et al., Salamandra technical report, arXiv preprint arXiv:2502.08489 (2025).</p>
        <p>[30] P. H. Martins, J. Alves, P. Fernandes, N. M. Guerreiro, R. Rei, A. Farajian, M. Klimaszewski, D. M. Alves, J. Pombal, M. Faysse, et al., EuroLLM-9B: technical report, arXiv preprint arXiv:2506.04079 (2025).</p>
        <p>[31] A. Zamai, A. Zugarini, L. Rigutini, M. Ernandes, M. Maggini, Show less, instruct more: Enriching prompts with definitions and guidelines for zero-shot NER, arXiv preprint arXiv:2407.01272 (2024).</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: paraphrase and reword, improve writing style, and check grammar and spelling. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>