SLIMER-IT: Zero-Shot NER on Italian Language

Andrew Zamai¹٫², Leonardo Rigutini², Marco Maggini¹ and Andrea Zugarini²٫*
1 Università degli Studi di Siena, Italy
2 expert.ai, Siena, Italy

Abstract
Traditional approaches to Named Entity Recognition (NER) frame the task as a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER and apply it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER that leverages prompts enriched with definitions and guidelines. Comparisons with other state-of-the-art models demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.

Keywords
Named Entity Recognition, Zero-Shot NER, Large Language Models, Instruction tuning

1. Introduction

Named Entity Recognition (NER) plays a fundamental role in Natural Language Processing (NLP), often being a key component of information extraction pipelines. The task involves identifying and categorizing entities in a given text according to a predefined set of labels. While person, organization and location are the most common, applications of NER in certain fields may require the identification of domain-specific entities. Manually annotated data has always been critical for the training of NER systems [1]. Traditional methods tackle NER as a token classification problem, where models are specialized on a narrow domain and a predefined label set [2].
While achieving strong performance on the data distribution they were trained on, such models require extensive human annotations for the downstream task at hand. Additionally, they lack generalization capabilities when it comes to addressing out-of-distribution input domains and/or unseen labels [1, 3, 4]. On the contrary, Large Language Models (LLMs) have recently demonstrated strong zero-shot capabilities. Models like GPT-3 can tackle NER via In-Context Learning [5, 6], with instruction tuning further improving performance [7, 8, 9]. To this end, several models have been proposed to tackle zero-shot NER [10, 4, 3, 11, 12, 13]. In particular, SLIMER [13] proved to be particularly effective on unseen named entity types, by leveraging definitions and guidelines to steer the model generation.

However, little has been done for zero-shot NER on non-English data. More generally, as pointed out in [1], NER is understudied in languages like Italian, especially outside the traditional news domain and the person, location and organization classes.

To this end, we propose in this paper an evaluation framework for Zero-Shot NER, and we apply it to the Italian language. In addition, we fine-tune a version of SLIMER for Italian, which we call SLIMER-IT¹. In the experiments, we explore different LLM backbones and we assess the impact of Definition and Guidelines (D&G). When comparing SLIMER-IT with state-of-the-art approaches, either using models pre-trained on English or adapted for Italian, results demonstrate SLIMER-IT's superiority in labelling unseen entity tags.

Figure 1: SLIMER-IT instruction tuning prompt. Dedicated entity definition and guidelines steer the model labelling.

1 https://github.com/andrewzamai/SLIMER_IT

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
andrew.zamai@unisi.it (A. Zamai); lrigutini@expert.ai (L. Rigutini); marco.maggini@unisi.it (M. Maggini); azugarini@expert.ai (A. Zugarini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Work

Several works tackle Zero-Shot NER on English, such as InstructUIE [10], UniNER [4], GoLLIE [3], GLiNER [11], GNER [12] and SLIMER [13]. Most of them are based on the instruction tuning of an LLM and mainly differ in the prompt and output format design. GLiNER distinguishes itself by being a smaller encoder-only model, combined with a span classifier head, that achieves competitive performance at a lower computational cost.

As highlighted in SLIMER [13], most approaches mainly focus on zero-shot NER on Out-Of-Distribution (OOD) input domains, since they are typically fine-tuned on an extensive number of entity classes highly or completely overlapping between training and test sets. In view of this, we proposed a lighter instruction-tuning methodology for LLMs, training on data overlapping to a lesser degree with the test sets, while steering the model annotation process with a definition and guidelines for the NE category to be annotated. From this, the name SLIMER: Show Less, Instruct More Entity Recognition. Although the authors of GLiNER also propose a multilingual model and evaluate zero-shot generalizability across different languages, neither they nor any other work has addressed the task of Zero-Shot NER specifically for the Italian language.

NER for Italian. While NER has been extensively studied for English, less has been done in other languages, particularly outside the traditional general-purpose domains and entity label sets [14]. Indeed, in Italian, most NER datasets focus on news and, more recently, social media contents [15, 16, 17]. Currently, there has been no research into zero-shot NER, only a few exploratory studies into multi-domain NER. This challenge was introduced in the NERMuD task (NER Multi-Domain) at EVALITA 2023², in which one sub-task required developing a single model capable of classifying the common entities (person, organization, location) in different types of text, including news, fiction and political speeches. The ExtremITA team [18] addressed the challenge by proposing the adoption of a single LLM capable of tackling all the different tasks at EVALITA 2023, among which NERMuD. All the tasks were converted into text-to-text problems and two LLMs (LLaMA and T5 based) were instruction-tuned on the union of all the available datasets for the challenge.

2 https://www.evalita.it/campaigns/evalita-2023/tasks/

3. Zero-Shot NER Framework

In traditional machine-learning theory, a model f trained for a task (e.g. NER) represented by a dataset (X, Y) is typically evaluated on a held-out test set sampled from the same task and distribution as the training data. In zero-shot learning, instead, a model is expected to go beyond what it experienced during training. There are different levels of generalization indicating to what extent the model goes beyond what it directly learnt.

In the case of zero-shot NER, a model should be able to extract entities from inputs belonging to the same domain it was trained on (in-domain) and from other domains not encountered before (out-of-domain). Moreover, it should also generalize well to novel entity classes (unseen named entities). In our zero-shot evaluation framework we aim to measure each level independently. Hence, we define an evaluation benchmark that includes a collection of NER datasets divided by degree of generalization. In the following we describe the properties required for each level.

In-domain. This evaluation helps measure how well the model can generalize from its training data to similar, but not identical, data. The model is evaluated on the same input domains and named entities as those in the training set. This data often consists of the test partitions associated with each training set used for fine-tuning the model.

Out-Of-Domain (OOD). OOD evaluation tests the model's ability to generalize to input texts from domains that it has not encountered during training. While the named entities have been seen during training, this type of evaluation is particularly challenging because different input domains often exhibit unique linguistic patterns and domain-specific terminology.

Unseen Named Entities. This evaluation tests the model's ability to identify and classify entities that it has not encountered during its training phase. The tag set comprises fine-grained categories which are often specifically defined for the domain in which NER is deployed. Because of this, the input data may often also be Out-Of-Domain, making this evaluation subsume the previously mentioned OOD scenario as well.

4. SLIMER-IT

To adapt SLIMER to Italian, we translate the instruction-tuning prompt of [13], as shown in Figure 1. The prompt is designed to extract the occurrences of one entity type per call. While this has the drawback of requiring |NE| inference calls for each input text, it allows the model to better focus on a single NE type at a time.

As in [13], we query gpt-3.5-turbo-1106 via OpenAI's ChatGPT API to automatically generate a definition and guidelines for each needed entity tag. The definition of a NE is meant to be a short sentence describing the tag. The guidelines instead provide annotation instructions to align the model's labelling with the desired annotation scheme. Guidelines can be used to prevent the model from labelling certain edge cases or to provide examples of such NEs. Such an informative prompt is extremely valuable when dealing with unfamiliar entity tags, and can also be used to distinguish between polysemous categories.

Finally, the model is requested to generate the named entities in a parsable JSON format containing the list of NEs extracted for the given tag.

5. Experiments

Experiments aim to assess our approach on Italian. We study the impact of guidelines and the usage of different backbones. Then, we compare our approach against state-of-the-art alternatives.

5.1. Datasets

We construct the zero-shot NER framework (described in Section 3) for Italian upon the NERMuD shared task and the Multinerd dataset. In particular, we use NERMuD to build the in-domain and OOD evaluation sets, while Multinerd-IT is used to assess the behaviour in the unseen named entities scenario.

NERMuD. NERMuD [1] is a shared task organized at EVALITA 2023, built on the Kessler Italian Named-entities Dataset (KIND) [19]. It contains annotations for the three classic NER tags: person, organization and location. Examples are organized in three distinct domains: news, literature and political discourses. Unlike NERMuD, we restrict fine-tuning to a single domain. In such a way, we can evaluate both the in-domain and out-of-domain capabilities of the model. In particular, we designate the WikiNews (WN) subset, being the most generic domain, for training and in-domain evaluation, while the Fiction (FIC) and Alcide De Gasperi (ADG) splits are kept for out-of-domain evaluation only.

Multinerd-IT. To construct the unseen NEs evaluation set, we exploit Multinerd³ [20], a multilingual NER dataset made of 15 tags: person, organization, location, animal, biological entity, celestial body, disease, event, food, instrument, media, plant, mythological entity, time and vehicle. We keep the Italian examples only. Such a dataset constitutes a perfect choice to assess models' capabilities on unseen NEs. Indeed, the data belongs to the same news domain as the NERMuD split chosen for fine-tuning, but it includes a broader label set. Since we want to measure performance on never-seen-before entities, we exclude the entity types seen in training, i.e. person, organization and location. We also remove biological entity, being poorly represented, with a support of just 4 instances.

3 https://github.com/Babelscape/multinerd

5.2. Backbones

We implemented several versions of SLIMER-IT based on different backbone models. We consider similarly sized LLMs, all in the 7B-parameter range. In particular, we selected five backbones: Camoscio⁴ [21], LLaMA-2-7b-chat [22], Mistral-7B-Instruct [23], LLaMA-3-8B-Instruct and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA⁵ [24]. LLaMA-2-7b-chat was originally used in SLIMER [13], and LLaMA-3-8B-Instruct is its newest, improved version. Like the LLaMA family, Mistral-7B-Instruct is a multilingual model mainly oriented to English, but it has demonstrated greater fluency in Italian. Camoscio and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, instead, are two LLMs specifically fine-tuned on Italian instructions.

4 https://huggingface.co/teelinsan/camoscio-7b-llama
5 https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA

5.3. Compared Models

We compare the SLIMER-IT approach, implemented with different backbones, against other state-of-the-art approaches for zero-shot NER. All the methods are trained and evaluated in the defined zero-shot NER framework for a fair comparison. We evaluate against:

Token classification. Although certainly not suited for zero-shot NER, due to its architectural inability to cope with unseen tags, we decided to evaluate the best-known approach to NER as a baseline. As in NERMuD [1], we use the training framework dhfbk/bert-ner⁶. We fine-tune two different base models: bert-base-cased, pre-trained on English, and dbmdz/bert-base-italian-cased⁷, an Italian version.

6 https://github.com/dhfbk/bert-ner
7 https://huggingface.co/dbmdz/bert-base-italian-cased

GNER. It is the best-performing approach for zero-shot NER on the OOD English benchmark. In GNER [12], the authors propose a BIO-like generation, replicating in output the same input text along with a token-by-token BIO label. Here, we consider LLaMAntino-3 as its backbone.

GLiNER. Differently from all the other methods, GLiNER is based on a smaller encoder-only model, combined with a span classifier head, able to achieve competitive performance on the OOD English benchmark at a lower computational cost. We fine-tune it both with its original deberta-v3-large English backbone and with the Italian dbmdz/bert-base-italian-cased model.

extremITLLaMA. Already described in Section 2, it represents an interesting approach to compare against.

Table 1: Comparing SLIMER-IT based on different backbones, with and without Definition and Guidelines (D&G) in the prompt. LLMs with the † symbol were instruction-tuned on Italian. In parentheses, the (±ΔF1) in performance given by the usage of D&G.

Backbone               Params  w/ D&G  WN (In-Domain)  FIC (OOD)      ADG (OOD)      MN (unseen NEs)
Camoscio †             7B      False   81.80           82.44          79.01          32.28
                               True    81.50 (-0.30)   85.08 (+2.64)  76.00 (-3.01)  38.68 (+6.40)
LLaMA-2-chat           7B      False   80.69           80.45          73.81          32.38
                               True    83.24 (+2.55)   88.81 (+8.36)  79.26 (+5.45)  35.16 (+2.78)
Mistral-Instruct       7B      False   82.71           85.61          75.80          35.63
                               True    85.55 (+2.84)   92.78 (+7.17)  80.56 (+4.76)  40.64 (+5.01)
LLaMA-3-Instruct       8B      False   85.93           82.85          80.00          27.62
                               True    85.38 (-0.55)   84.38 (+1.53)  78.29 (-1.71)  50.74 (+23.12)
LLaMAntino-3-ANITA †   8B      False   84.12           77.06          74.35          30.90
                               True    85.78 (+1.66)   82.52 (+5.46)  81.65 (+7.30)  54.65 (+23.75)
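To make the per-entity-type scheme of Section 4 concrete, the extraction loop can be sketched as below. This is a minimal illustration, not the actual SLIMER_IT implementation: the prompt wording and the `build_prompt`, `parse_answer`, `extract_entities` and `query_llm` names are our own assumptions.

```python
import json

def build_prompt(text: str, tag: str, definition: str, guidelines: str) -> str:
    # One prompt per entity type: the model is steered by a short
    # definition and by annotation guidelines for that tag only.
    return (
        f"Definizione: {definition}\n"
        f"Linee guida: {guidelines}\n"
        f"Testo: {text}\n"
        f'Estrai le occorrenze di tipo "{tag}" e rispondi in JSON: {{"{tag}": [...]}}'
    )

def parse_answer(answer: str, tag: str) -> list[str]:
    # The model is asked for a parsable JSON object whose value is the
    # list of extracted surface forms; fall back to [] on malformed output.
    try:
        return list(json.loads(answer).get(tag, []))
    except json.JSONDecodeError:
        return []

def extract_entities(text: str, tag_info: dict, query_llm) -> dict:
    # |NE| inference calls per input text: one call per entity type.
    results = {}
    for tag, (definition, guidelines) in tag_info.items():
        results[tag] = parse_answer(query_llm(build_prompt(text, tag, definition, guidelines)), tag)
    return results
```

Since definitions and guidelines are generated once per tag (via gpt-3.5-turbo in the paper), they can be cached and reused across all input texts, so only the |NE| extraction calls scale with the corpus size.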
Being based on the Camoscio LLM, extremITLLaMA is compared with the SLIMER-IT approach implemented with the same backbone.

5.4. Experimental setup

We kept the same training configuration as SLIMER [13] for English, except that we trained on all the available samples. Depending on the backbone, the instruction-tuning prompt (see Figure 1) was adjusted according to the structure of its template (e.g. [INST] or <|start_header_id|> formats). For all the competitors, we replicated their training setup using their scripts and suggested hyper-parameters. For the evaluation, we use the micro-F1 as computed in the UniNER⁸ implementation.

8 https://github.com/universal-ner

5.5. Results

Impact of Definition and Guidelines (D&G). We compare SLIMER-IT with a version devoid of definition and guidelines in the prompt. To demonstrate the robustness of the approach, we train several SLIMER-IT instances, based on different LLM backbones. In Table 1, we report the results, highlighting the absolute difference in performance between the model steered by D&G and the one not using them. Generally, definition and guidelines yield improvements in F1. In particular, the gap is contained when evaluating on in-domain data, whereas it becomes significant on OOD data and even more substantial on unseen NEs. This is expected, since D&G help the most in conditions unseen during training. Notably, LLaMA-3-based backbones benefit the most from definition and guidelines, with improvements beyond 23 absolute F1 points, surpassing all the other models by substantial margins on never-seen-before entity tags. Some qualitative examples are shown in Appendix A.

Impact of Backbones. Regarding the choice of the SLIMER-IT backbone, we illustrate the results in Figure 2. We can observe no remarkable difference in the in-domain evaluation, where the most recent models outperform older ones, as one might expect. Globally as well, Camoscio and LLaMA-2-chat obtain lower scores than the rest of the backbones, with the only exception of the FIC dataset, where the LLaMA-3-based architecture underperforms. However, LLaMAntino-3-ANITA reaches the best performance on 3 out of 4 datasets, with a strong gap especially in the unseen named entities scenario, the most challenging one. Interestingly enough, thanks to their better understanding capabilities, backbones specialized on Italian are particularly effective in the unseen NEs scenario. This is the case for LLaMAntino-3-ANITA and even for Camoscio, which demonstrates higher F1 than LLaMA-2.

Figure 2: SLIMER-IT performance for different backbones.

Off-the-shelf Italian NER models. Although there has been no prior work defining a Zero-Shot NER evaluation framework for Italian, there exist fine-tuned, specialized state-of-the-art zero-shot NER models for the Italian language. In particular, we consider GLiNER-ML [11], a multilingual instance of GLiNER, as well as Universal-NER-ITA⁹ and GLiNER-ITA-Large¹⁰, both specialized on Italian. These models were trained on synthetic data covering a vast number of different entity classes (up to 97k). Thus, it is impossible to directly compare them in a pure zero-shot framework, since there are no entity tags actually never seen before during training. However, we still report their results against SLIMER-IT in Table 2. Despite this advantage, SLIMER-IT outperforms all these models by a large margin.

9 https://huggingface.co/DeepMount00/universal_ner_ita
10 https://huggingface.co/DeepMount00/GLiNER_ITA_LARGE

Table 2: Comparison with existing off-the-shelf models for zero-shot NER on Italian. We omit the in-domain evaluation to not disadvantage them against SLIMER-IT.

Model              FIC (OOD)  ADG (OOD)  MN (unseen NEs)
Universal-NER-ITA  32.4       43.2       12.8 (all seen)
GLiNER-ITA-Large   36.6       42.0       15.5 (all seen)
GLiNER-ML          46.5       49.4       17.4 (all seen)
SLIMER-IT          82.5       81.7       54.7

State-of-the-art comparison. Thanks to the definition of our zero-shot evaluation framework, we can compare different state-of-the-art approaches fairly. Results are outlined in Table 3. When evaluating in the same domain where the model was trained, encoder-only architectures obtain strong results despite being much smaller models. This result is not surprising, given the acknowledged performance of these architectures in supervised NER. More unexpected is their ability to generalize well to OOD inputs. GNER also proves to be quite competitive, achieving the best results in the in-domain evaluation and, in OOD, on the FIC dataset. However, all these approaches dramatically fail on never-seen-before tags, in contrast to SLIMER-IT, which achieves almost 55 F1 points. Compared with LLM-based approaches like GNER and extremITLLaMA, this proves once again that without definitions and guidelines LLMs struggle to tag novel kinds of entities.

Table 3: Comparing SLIMER-IT with state-of-the-art approaches trained in the same zero-shot setting, and adopting the same backbone when possible. *Note that extremITLLaMA was fine-tuned also on the FIC and ADG train sets for the NERMuD task, so these datasets are not actually OOD for this model.

Approach              Backbone          Language  Params  WN (In-Domain)  FIC (OOD)  ADG (OOD)  MN (unseen NEs)
Token classification  BERT-base         EN        0.11B   83.9            75.6       75.0       -
Token classification  BERT-base         IT        0.11B   89.8            87.0       82.3       -
GLiNER                deberta-v3-large  EN        0.44B   87.8            77.2       80.3       0.2
GLiNER                BERT-base         IT        0.11B   89.3            87.5       84.9       0.6
extremITLLaMA         Camoscio          IT        7B      89.1            90.3*      83.4*      0.2
SLIMER-IT             Camoscio          IT        7B      81.5            85.1       76.0       38.7
GNER                  LLaMAntino-3      IT        8B      90.3            88.9       82.5       1.2
SLIMER-IT             LLaMAntino-3      IT        8B      85.8            82.5       81.7       54.7

6. Conclusions

In this paper, we proposed an evaluation framework for Zero-Shot NER that we applied to Italian. Thanks to such a framework, we can better investigate the different zero-shot properties depending on the scenario (in-domain, OOD, unseen NEs). On top of that, we compared several state-of-the-art approaches, with a particular focus on SLIMER, which, thanks to the usage of definitions and guidelines, is well suited to deal with novel entity types. Indeed, SLIMER-IT, our fine-tuned model based on LLaMAntino-3, surpasses other state-of-the-art techniques by large margins. In the future, we plan to further extend the zero-shot NER benchmark and to implement an input caching mechanism for scalability to large label sets.

Acknowledgments

The work was partially funded by:

• "ReSpiRA - REplicabilità, SPIegabilità e Ragionamento", a project financed by FAIR, affiliated to spoke no. 2, falling within the PNRR MUR programme, Mission 4, Component 2, Investment 1.3, D.D. No. 341 of 03/15/2022, Project PE0000013, CUP B43D22000900004¹¹;
• "MAESTRO - Mitigare le Allucinazioni dei Large Language Models: ESTRazione di informazioni Ottimizzate", a project funded by Provincia Autonoma di Trento with the Lp 6/99 Art. 5: ricerca e sviluppo, PAT/RFS067-05/06/2024-0428372, CUP C79J23001170001¹²;
• "enRichMyData - Enabling Data Enrichment Pipelines for AI-driven Business Products and Services", a Horizon Europe (HE) project, grant agreement ID: 101070284¹³.

11 ReSpiRA: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/B43D22000900004
12 MAESTRO: https://www.opencup.gov.it/portale/web/opencup/home/progetto/-/cup/C79J23001170001
13 https://doi.org/10.3030/101070284
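For reference, the entity-level micro-F1 used throughout Section 5 can be approximated as in the simplified sketch below, which scores sets of (surface form, type) pairs per sentence. This is our own set-based simplification: the actual UniNER evaluation script may differ in how it handles duplicates and text normalization.

```python
def micro_f1(gold: list, pred: list) -> float:
    # gold/pred: one set of (surface_form, entity_type) pairs per sentence.
    # Counts are pooled over all sentences before computing P, R and F1,
    # which is what makes the score "micro"-averaged.
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # predicted pairs that match the gold annotation
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because counts are pooled globally, frequent tags dominate the score, which is why the unseen-NEs column reacts so strongly when a model hallucinates entities for unfamiliar tags.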
References

[1] A. P. Aprosio, T. Paccosi, NERMuD at EVALITA 2023: Overview of the named-entities recognition on multi-domain documents task, in: International Workshop on Evaluation of Natural Language and Speech Tools for Italian, 2023. URL: https://api.semanticscholar.org/CorpusID:261529782.
[2] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, IEEE Transactions on Knowledge and Data Engineering 34 (2020) 50-70.
[3] O. Sainz, et al., GoLLIE: Annotation guidelines improve zero-shot information-extraction, 2024. arXiv:2310.03668.
[4] W. Zhou, S. Zhang, Y. Gu, M. Chen, H. Poon, UniversalNER: Targeted distillation from large language models for open named entity recognition, arXiv preprint arXiv:2308.03279 (2023).
[5] A. Radford, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[6] T. Brown, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877-1901.
[7] J. Wei, et al., Finetuned language models are zero-shot learners, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=gEZrGCozdqR.
[8] H. W. Chung, et al., Scaling instruction-finetuned language models, 2022. arXiv:2210.11416.
[9] Y. Wang, et al., Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 5085-5109. URL: https://aclanthology.org/2022.emnlp-main.340. doi:10.18653/v1/2022.emnlp-main.340.
[10] X. Wang, W. Zhou, C. Zu, H. Xia, T. Chen, Y. Zhang, R. Zheng, J. Ye, Q. Zhang, T. Gui, et al., InstructUIE: Multi-task instruction tuning for unified information extraction, arXiv preprint arXiv:2304.08085 (2023).
[11] U. Zaratiana, N. Tomeh, P. Holat, T. Charnois, GLiNER: Generalist model for named entity recognition using bidirectional transformer, 2023. arXiv:2311.08526.
[12] Y. Ding, J. Li, P. Wang, Z. Tang, B. Yan, M. Zhang, Rethinking negative instances for generative named entity recognition, 2024. arXiv:2402.16602.
[13] A. Zamai, A. Zugarini, L. Rigutini, M. Ernandes, M. Maggini, Show less, instruct more: Enriching prompts with definitions and guidelines for zero-shot NER, 2024. URL: https://arxiv.org/abs/2407.01272. arXiv:2407.01272.
[14] M. Marrero, J. Urbano, S. Sánchez-Cuadrado, J. Morato, J. M. Gómez-Berbís, Named entity recognition: Fallacies, challenges and opportunities, Computer Standards & Interfaces 35 (2013) 482-489. doi:10.1016/j.csi.2012.09.004.
[15] B. Magnini, E. Pianta, C. Girardi, M. Negri, L. Romano, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, I-CAB: the Italian Content Annotation Bank, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy, 2006. URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/518_pdf.pdf.
[16] V. Bartalesi Lenzi, M. Speranza, R. Sprugnoli, Named entity recognition on transcribed broadcast news at EVALITA 2011, in: Evaluation of Natural Language and Speech Tools for Italian, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 86-97.
[17] P. Basile, A. Caputo, A. Gentile, G. Rizzo, Overview of the EVALITA 2016 named entity recognition and linking in Italian tweets (NEEL-IT) task, 2016.
[18] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-task sustainable scaling to large language models at its extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2023), Parma, Italy, September 7th-8th, 2023, volume 3473 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473/paper13.pdf.
[19] T. Paccosi, A. Palmero Aprosio, KIND: an Italian multi-domain dataset for named entity recognition, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 501-507. URL: https://aclanthology.org/2022.lrec-1.52.
[20] S. Tedeschi, R. Navigli, MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition (and disambiguation), in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 801-812. URL: https://aclanthology.org/2022.findings-naacl.60. doi:10.18653/v1/2022.findings-naacl.60.
[21] A. Santilli, E. Rodolà, Camoscio: an Italian instruction-tuned LLaMA, 2023. URL: https://arxiv.org/abs/2307.16456. arXiv:2307.16456.
[22] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[23] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
[24] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, 2024. URL: https://arxiv.org/abs/2405.07101. arXiv:2405.07101.

A. SLIMER-IT on some NE tags

In Table 4 we compare SLIMER-IT (LLaMAntino-based) with a version of it devoid of Definition and Guidelines (D&G), in order to get a better insight into the usefulness of such components in zero-shot NER. We present results both for unseen named entities (from Multinerd) and for previously seen tags (person, location and organization) on out-of-domain inputs (the ADG and FIC datasets). The D&G components improve performance by up to 37 points on unseen named entities, serving as a source of additional knowledge for the model and providing annotation directives about what should be labeled. Particularly for these named entities, the D&G enhance precision by reducing the number of false positives the model would otherwise generate. The performance gain provided by D&G for known tags on out-of-domain inputs is smaller, with improvements of up to 17 points on some named entity tags. In this context, the definitions and guidelines serve more as a reasoning support than as a source of additional knowledge.

Table 4: Some examples of definitions and guidelines (reported verbatim in Italian, as given to the model). Absolute F1 gains between SLIMER-IT and its version without definition and guidelines are reported. The first three rows are examples on unseen named entities; the last three are examples on known tags (person, organization and location) in Out-Of-Domain input distributions.

Corpo celeste (MN) | w/o D&G F1: 27.07 | w/ D&G F1: 64.00 | ΔF1: +36.93
Definizione: 'CORPO CELESTE' si riferisce a oggetti astronomici come pianeti, stelle, satelliti, costellazioni, galassie, comete e asteroidi.
Linee guida: Evita di etichettare come 'corpo celeste' entità non direttamente collegate al campo dell'astronomia. Ad esempio, 'Vergine' potrebbe riferirsi anche a un segno astrologico, quindi il contesto è importante. Assicurati di non includere nomi di fenomeni non astronomici come 'alba' o 'tramonto'. Potresti incontrare ambiguità quando un termine è usato sia in campo astronomico che in contesti non astronomici, ad esempio 'aurora' che può riferirsi sia all'evento astronomico che al nome di persona.

Pianta (MN) | w/o D&G F1: 13.76 | w/ D&G F1: 49.89 | ΔF1: +36.13
Definizione: 'PIANTA' si riferisce a organismi vegetali come alberi, arbusti, erbe e altre forme di vegetazione.
Linee guida: Quando identifichi entità 'pianta', assicurati di etichettare solo nomi di specie vegetali specifiche, come 'Fagus sylvatica', 'Suaeda vera', 'Betula pendula', evitando generici come 'alberi' o 'arbusti' se non accompagnati da una specificazione della specie.

Media (MN) | w/o D&G F1: 47.78 | w/ D&G F1: 65.86 | ΔF1: +18.08
Definizione: 'MEDIA' si riferisce a entità come nomi di giornali, riviste, libri, album musicali, film, programmi televisivi, spettacoli teatrali e altre opere creative e di comunicazione.
Linee guida: Assicurati di etichettare solo nomi specifici di opere creative e di comunicazione, evitando generici come 'musica' o 'libro'. Presta attenzione alle ambiguità, ad esempio 'Apple' potrebbe riferirsi alla società tecnologica o ad un'opera d'arte. Escludi i nomi di artisti, autori o registi, che dovrebbero essere etichettati come 'persona', e nomi generici di strumenti musicali o generi letterari che non rappresentano opere specifiche.

Luogo (FIC) | w/o D&G F1: 59.34 | w/ D&G F1: 76.32 | ΔF1: +16.98
Definizione: 'LUOGO' denota nomi propri di luoghi geografici, comprendendo città, paesi, stati, regioni, continenti, punti di interesse naturale, e indirizzi specifici.
Linee guida: Assicurati di non confondere i nomi di luoghi con nomi di persone, organizzazioni o altre entità. Ad esempio, 'Washington' potrebbe riferirsi alla città di Washington D.C. o al presidente George Washington, quindi considera attentamente il contesto. Escludi nomi di periodi storici, eventi o concetti astratti che non rappresentano luoghi fisici. Ad esempio, 'nel Rinascimento' è un periodo storico, non un luogo geografico.

Organizzazione (ADG) | w/o D&G F1: 55.56 | w/ D&G F1: 71.85 | ΔF1: +16.29
Definizione: 'ORGANIZZAZIONE' denota nomi propri di aziende, istituzioni, gruppi o altre entità organizzative. Questo tipo di entità include sia entità private che pubbliche, come società, organizzazioni non profit, agenzie governative, università e altri gruppi strutturati.
Linee guida: Annota solo nomi propri, evita di annotare sostantivi comuni come 'azienda' o 'istituzione' a meno che non facciano parte del nome specifico dell'organizzazione. Assicurati di non annotare nomi di persone come organizzazioni, anche se contengono termini che potrebbero sembrare riferimenti a entità organizzative. Ad esempio, 'Johnson & Johnson' è un'azienda, mentre 'Johnson' da solo potrebbe essere il cognome di una persona.

Persona (FIC) | w/o D&G F1: 79.72 | w/ D&G F1: 83.33 | ΔF1: +3.61
Definizione: 'PERSONA' denota nomi propri di individui umani. Questo tipo di entità comprende nomi di persone reali, famose o meno, personaggi storici, e può includere anche personaggi di finzione.
Linee guida: Fai attenzione a non includere titoli o ruoli professionali senza nomi propri (es. 'il presidente' non è una 'PERSONA', ma 'il presidente Barack Obama' sì).