<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SLIMER-IT: Zero-Shot NER on Italian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrew Zamai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Rigutini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Maggini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Siena</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>expert.ai</institution>
          ,
          <addr-line>Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Traditional approaches to Named Entity Recognition (NER) frame the task as a BIO sequence labeling problem. Although these systems often excel in the downstream task at hand, they require extensive annotated data and struggle to generalize to out-of-distribution input domains and unseen entity types. On the contrary, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities. While several works address Zero-Shot NER in English, little has been done in other languages. In this paper, we define an evaluation framework for Zero-Shot NER, applying it to the Italian language. Furthermore, we introduce SLIMER-IT, the Italian version of SLIMER, an instruction-tuning approach for zero-shot NER leveraging prompts enriched with definition and guidelines. Comparisons with other state-of-the-art models demonstrate the superiority of SLIMER-IT on never-seen-before entity tags.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Zero-Shot NER</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Instruction tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Named Entity Recognition (NER) plays a fundamental
role in Natural Language Processing (NLP), often being a
key component in information extraction pipelines. The
task involves identifying and categorizing entities in a
given text according to a predefined set of labels. While
person, organization, and location are the most common,
applications of NER in certain fields may require the
identification of domain-specific entities.</p>
      <p>Manually annotated data has always been critical for
the training of NER systems [1]. Traditional methods
tackle NER as a token classification problem, where
models are specialized on a narrow domain and a pre-defined
label set [2]. While achieving strong performance for
the data distribution they were trained on, they require
extensive human annotations relative to the downstream
task at hand. Additionally, they lack generalization
capabilities when it comes to addressing out-of-distribution
input domains and/or unseen labels [1, 3, 4].</p>
      <p>On the contrary, Large Language Models (LLMs)
have recently demonstrated strong zero-shot capabilities.
Models like GPT-3 can tackle NER via In-Context
Learning [5, 6], with Instruction-Tuning further improving
performance [7, 8, 9]. To this end, several models have been
proposed to tackle zero-shot NER [10, 4, 3, 11, 12, 13]. In
particular, SLIMER [13] proved to be particularly effective
on unseen named entity types, by leveraging definitions
and guidelines to steer the model generation. In this paper,
we define an evaluation framework for Zero-Shot NER and
apply it to Italian, introduce SLIMER-IT, the Italian version
of SLIMER, and assess the impact of Definition and
Guidelines (D&amp;G). When comparing SLIMER-IT with state-of-the-art
approaches, either using models pre-trained on English
or adapted for Italian, results demonstrate SLIMER-IT's
superiority in labelling unseen entity tags.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>Several works tackle Zero-Shot NER on English, such as
InstructUIE [10], UniNER [4], GoLLIE [3], GLiNER [11],
GNER [12] and SLIMER [13]. Most of them are based on
the instruction tuning of an LLM and mainly differ in the
prompt and output format design. GLiNER distinguishes
itself by being a smaller encoder-only model, combined
with a span classifier head, that achieves competitive
performance at a lower computational cost.</p>
        <p>As highlighted in SLIMER [13], most approaches
mainly focus on zero-shot NER in Out-Of-Distribution
input domains (OOD), since they are typically fine-tuned
on an extensive number of entity classes highly or
completely overlapping between training and test sets. In
view of this, we proposed a lighter instruction-tuning
methodology for LLMs, training on data overlapping to a
lesser degree with the test sets, while steering the model
annotation process with a definition and guidelines for
the NE category to be annotated. Hence the name
SLIMER: Show Less, Instruct More Entity Recognition.</p>
        <p>Although the authors of GLiNER also propose a
multilingual model and evaluate zero-shot generalizability
across different languages, neither they nor any other
work has addressed the task of Zero-Shot NER
specifically for the Italian language.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Zero-Shot NER Framework</title>
      <sec id="sec-3-1">
        <p>In traditional machine learning, a model
trained for a task (e.g. NER), represented by a dataset,
is typically evaluated on a held-out test set
sampled from the same task and distribution as the training data.
In zero-shot learning, instead, a model is expected to go
beyond what it experienced during training. There are
different levels of generalization, indicating to what
extent the model goes beyond what it directly learnt.</p>
        <p>In the case of zero-shot NER, a model should be able
to extract entities from inputs belonging to the same
domain it was trained on (in-domain) and across other
domains not encountered before (out-of-domain).
Moreover, it should also generalize well to novel entity classes
(unseen named entities). In our zero-shot evaluation
framework we aim to measure each level independently.
Hence, we define an evaluation benchmark that includes
a collection of NER datasets divided by degree of
generalization. In the following we describe the required
properties to fit in.</p>
        <p>In-domain. This evaluation helps measure how well
the model can generalize from its training data to similar,
but not identical, data. The model is evaluated on the
same input domains and named entities as those in the
training set. This data often consists of the test partitions
associated with each training set used for fine-tuning the
model.</p>
      </sec>
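      <sec id="sec-3-1-ex">
        <p>The degrees of generalization described in this section can be made concrete with a small helper that, given the training domains and label set, classifies an evaluation set. This is an illustrative sketch, not part of the original framework; the function name and signature are assumptions:</p>

```python
def generalization_level(train_domains, train_labels, eval_domain, eval_labels):
    """Classify an evaluation set by the degree of zero-shot
    generalization it probes: in-domain, out-of-domain, or unseen NEs."""
    if set(eval_labels) - set(train_labels):
        # Novel entity classes; the inputs may be OOD as well,
        # so this level subsumes the OOD scenario.
        return "unseen-NEs"
    if eval_domain not in train_domains:
        # Known tags, but a new input domain.
        return "out-of-domain"
    return "in-domain"
```

        <p>For instance, with training restricted to news and the tags person, organization and location, a fiction split with the same tags is out-of-domain, while any split introducing a tag such as disease probes unseen NEs.</p>
      </sec>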
      <sec id="sec-3-2">
        <title>Out-Of-Domain (OOD)</title>
        <p>OOD evaluation tests the model’s ability to generalize
to input texts from domains that it has not encountered
during training. While the named entities have been seen
during training, this type of evaluation is particularly
challenging because different input domains often exhibit
unique linguistic patterns and domain-specific terminology.</p>
        <p>Unseen Named Entities. This evaluation tests the
model’s ability to identify and classify entities that it has
not encountered during its training phase. The tag set
comprises fine-grained categories which are often specifically
defined for the domain in which NER is deployed.
Because of this, the input data may often also be
Out-Of-Domain (OOD), making this evaluation include the
previously mentioned OOD scenario as well.</p>
        <p>NER for Italian. While NER has been extensively studied
on English, less has been done in other languages,
particularly outside the traditional general-purpose
domains and entity label sets [14]. Indeed, in Italian, most
NER datasets focus on news and, more recently, social
media contents [15, 16, 17]. Currently, there has been no
research into zero-shot NER, only a few exploratory studies
into multi-domain NER. This challenge was introduced
in the NERMuD task (NER Multi-Domain) at EVALITA
2023, in which one sub-task required to develop a single
model capable of classifying the common entities (person,
organization, location) in different types of text,
including news, fiction and political speeches. The ExtremITA
team [18] addressed the challenge by proposing the
adoption of a single LLM capable of tackling all the different
tasks at EVALITA 2023, among which NERMuD. All the
tasks were converted into text-to-text problems and two
LLMs (LLaMA and T5 based) were instruction-tuned on
the union of all the available datasets for the challenge.</p>
        <p>2https://www.evalita.it/campaigns/evalita-2023/tasks/</p>
      </sec>
      <sec id="sec-3-3">
        <title>4. SLIMER-IT</title>
        <p>To adapt SLIMER to Italian, we translate the
instruction-tuning prompt of [13], as shown in Figure 1. The prompt
is designed to extract the occurrences of one entity type
per call. While this has the drawback of requiring |NE|
inference calls on each input text, it allows the model to
better focus on a single NE type at a time.</p>
        <p>As in [13], we query gpt-3.5-turbo-1106 via OpenAI’s
ChatGPT APIs to automatically generate a definition and
guidelines for each needed entity tag. The definition for
a NE is meant to be a short sentence describing the tag.
The guidelines instead provide annotation instructions
to align the model’s labelling with the desired annotation
scheme. Guidelines can be used to prevent the model
from labelling certain edge cases or to provide examples
of such NE. Such an informative prompt is extremely
valuable when dealing with unfamiliar entity tags, and
can also be used to distinguish between polysemous
categories.</p>
        <p>Finally, the model is requested to generate the named
entities in a parsable JSON format containing the list of
NEs extracted for the given tag.</p>
      </sec>
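      <sec id="sec-3-3-ex">
        <p>The inference scheme described above (one entity type per call, definition and guidelines in the prompt, JSON output) can be sketched as follows. This is a minimal illustration, not the authors' code: the English prompt wording, the llm callable and the helper names are assumptions.</p>

```python
import json

# Hypothetical prompt paraphrasing the structure described in the text:
# definition and guidelines for a single NE type, JSON answer requested.
PROMPT_TEMPLATE = (
    "Definition: {definition}\n"
    "Guidelines: {guidelines}\n"
    "Extract all occurrences of the entity type '{ne_type}' from the text below "
    'and answer in JSON, e.g. {{"entities": ["..."]}}.\n'
    "Text: {text}"
)

def extract_entities(llm, text, ne_type, definition, guidelines):
    """One inference call for one entity type; parse the JSON answer."""
    prompt = PROMPT_TEMPLATE.format(
        definition=definition, guidelines=guidelines, ne_type=ne_type, text=text
    )
    raw = llm(prompt)  # llm: any callable mapping a prompt string to a completion
    try:
        return json.loads(raw).get("entities", [])
    except json.JSONDecodeError:
        return []  # an unparsable generation yields no extractions

def annotate(llm, text, tag_info):
    """|NE| calls per input text: one for each entity type to annotate."""
    return {
        ne: extract_entities(llm, text, ne, info["definition"], info["guidelines"])
        for ne, info in tag_info.items()
    }
```

        <p>With this per-tag design, supporting a new entity type only requires a new definition and guidelines, at the cost of one extra call per input text.</p>
      </sec>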
      <sec id="sec-3-4">
        <title>5.2. Backbones</title>
        <p>We implemented several versions of SLIMER-IT based on
different backbone models. We consider similarly sized
LLMs, all in the 7-8B parameter range. In particular, we
selected five backbones: Camoscio4 [21],
LLaMA-2-7b-chat [22], Mistral-7B-Instruct [23], LLaMA-3-8B-Instruct,
and LLaMAntino-3-ANITA-8B-Inst-DPO-ITA5 [24].</p>
        <p>LLaMA-2-7b-chat was originally used in SLIMER [13],
and LLaMA-3-8B-Instruct is its newest, improved
version. Like the LLaMA family, Mistral-7B-Instruct is a
multilingual model mainly English-oriented, but it has
demonstrated greater fluency in Italian. Camoscio and
LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, instead, are
two LLMs specifically fine-tuned on Italian instructions.</p>
        <sec id="sec-3-4-1">
          <title>5.3. Compared Models</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <p>We compare the SLIMER-IT approach, implemented with
different backbones, against other state-of-the-art
approaches for zero-shot NER. All the methods are trained
and evaluated in the defined zero-shot NER framework
for a fair comparison. We evaluate against:</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <sec id="sec-4-1">
        <p>Experiments aim to assess our approach in Italian. We study the impact of guidelines and the usage of different backbones. Then, we compare our approach against state-of-the-art alternatives.</p>
        <sec id="sec-4-1-1">
          <title>5.1. Datasets</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>We construct the zero-shot NER framework (described
in Section 3) for Italian upon the NERMuD shared task and
the Multinerd dataset. In particular, we use NERMuD to build
in-domain and OOD evaluation sets, while
Multinerd-IT is used to assess the behaviour in the unseen named
entities scenario.</p>
      </sec>
      <sec id="sec-4-3">
        <p>NERMuD. NERMuD [1] is a shared task organized at
EVALITA 2023, built based on the Kessler Italian Named-entities
Dataset (KIND) [19]. It contains annotations
for the three classic NER tags: person, organization and
location. Examples are organized in three distinct domains:
news, literature and political discourses. Unlike
NERMuD, we restrict fine-tuning to a single domain. In
such a way, we can evaluate both in-domain and
out-of-domain capabilities of the model. In particular, we
designate the WikiNews (WN) sub-set for training and
in-domain evaluation, being the most generic domain, while
the Fiction (FIC) and Alcide De Gasperi (ADG) splits are kept
for out-of-domain evaluation only.</p>
        <p>Token classification. Although certainly not
suited for zero-shot NER, due to its architectural inability
to cope with unseen tags, we decided to evaluate the most
well-known approach to NER as a baseline. As in NERMuD
[1], we use the training framework dhfbk/bert-ner6. We
fine-tune two different base models: bert-base-cased,
pre-trained on English, and dbmdz/bert-base-italian-cased7,
an Italian version.</p>
      </sec>
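      <sec id="sec-4-3-ex">
        <p>For reference, the BIO scheme used by the token-classification baseline tags each token as B-&lt;tag&gt;, I-&lt;tag&gt; or O; a small decoder turns such label sequences back into entity spans (an illustrative sketch, with a made-up Italian sentence):</p>

```python
def spans_from_bio(tokens, labels):
    """Decode BIO labels into (entity_text, tag) spans."""
    spans, cur, tag = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):            # a new entity starts
            if cur:
                spans.append((" ".join(cur), tag))
            cur, tag = [tok], lab[2:]
        elif lab.startswith("I-") and cur and lab[2:] == tag:
            cur.append(tok)                 # continue the current entity
        else:                                # O, or an inconsistent I- label
            if cur:
                spans.append((" ".join(cur), tag))
            cur, tag = [], None
    if cur:
        spans.append((" ".join(cur), tag))
    return spans

tokens = ["Alcide", "De", "Gasperi", "parla", "a", "Roma", "."]
labels = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "O"]
# spans_from_bio(tokens, labels) -> [("Alcide De Gasperi", "PER"), ("Roma", "LOC")]
```

      </sec>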
      <sec id="sec-4-4">
        <p>GNER. It is the best performing approach on the zero-shot
NER OOD English benchmark. In GNER [12], they
propose a BIO-like generation, replicating in output the
same input text along with a token-by-token BIO label.
Here, we consider LLaMAntino-3 as its backbone.</p>
        <p>Multinerd-IT. To construct the unseen NEs evaluation
set, we exploit Multinerd3 [20], a multilingual NER
dataset made of 15 tags: person, organization, location,
animal, biological entity, celestial body, disease, event, food,
instrument, media, plant, mythological entity, time and
vehicle. We keep the Italian examples only. Such a dataset
constitutes a perfect choice to assess models’ capabilities
on unseen NEs. Indeed, the data belongs to the same news
domain as the NERMuD split chosen for fine-tuning, but
it includes a broader label set. Since we want to measure
performance on never-seen-before entities, we exclude
entity types seen in training, i.e. person, organization and
location. We also remove biological entity, being
underrepresented, with a support of just 4 instances.</p>
      </sec>
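      <sec id="sec-4-4-ex">
        <p>The Multinerd-IT filtering described above amounts to dropping the tags seen during fine-tuning plus the underrepresented biological entity class; a minimal sketch (the helper and data layout are assumptions):</p>

```python
# Tags excluded from the unseen-NEs evaluation set, as described in the text.
SEEN_IN_TRAINING = {"person", "organization", "location"}
EXCLUDED = SEEN_IN_TRAINING | {"biological entity"}

def keep_unseen(annotations):
    """annotations: iterable of (mention, tag) pairs from Multinerd-IT.
    Keep only mentions whose tag was never seen during training."""
    return [(mention, tag) for mention, tag in annotations if tag not in EXCLUDED]
```

      </sec>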
      <sec id="sec-4-5">
        <p>3https://github.com/Babelscape/multinerd</p>
      </sec>
      <sec id="sec-4-6">
        <p>4https://huggingface.co/teelinsan/camoscio-7b-llama
5https://huggingface.co/swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
6https://github.com/dhfbk/bert-ner
7https://huggingface.co/dbmdz/bert-base-italian-cased</p>
        <p>[Table 1: SLIMER-IT results per backbone (Camoscio 7B, LLaMA-2-chat 7B, Mistral-Instruct 7B, LLaMA-3-Instruct 8B, LLaMAntino-3-ANITA 8B), each with and without D&amp;G.]</p>
        <p>GLiNER. Differently from all other methods, GLiNER
is based on a smaller encoder-only model, combined with
a span classifier head, able to achieve competitive
performance on the OOD English benchmark at a lower
computational cost. We fine-tune it both using its
original deberta-v3-large English backbone and the Italian
dbmdz/bert-base-italian-cased model.</p>
        <p>extremITLLaMA. Already described in Section 2, it
represents an interesting approach to compare against.
Being based on the Camoscio LLM, we compare it with the
SLIMER-IT approach implemented with the same backbone.</p>
        <sec id="sec-4-6-1">
          <title>5.4. Experimental setup</title>
          <p>We kept the same training configuration of SLIMER [13]
on English, except that we trained on all available
samples. Depending on the backbone, the
instruction-tuning prompt (see Figure 1) was adjusted
according to the structure of its template (e.g. [INST] or
&lt;|start_header_id|&gt; formats). For all the competitors, we
replicated their training setup using their scripts and
suggested hyper-parameters. For the evaluation, we use the
micro-F1 as computed in the UniNER8 implementation.</p>
        </sec>
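        <sec id="sec-4-6-ex">
          <p>Micro-F1 here is computed at the entity level; below is a sketch in the spirit of the UniNER evaluation, counting matches over (example, entity type, mention) triples. The exact matching and normalization rules of that implementation may differ.</p>

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over sets of (example_id, entity_type, mention) triples."""
    tp = len(gold & pred)          # predicted entities matching a gold entity
    fp = len(pred - gold)          # spurious predictions
    fn = len(gold - pred)          # missed gold entities
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

        </sec>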
        <sec id="sec-4-6-2">
          <title>5.5. Results</title>
        </sec>
      </sec>
      <sec id="sec-4-7">
        <p>Impact of Definition and Guidelines (D&amp;G). We
compare SLIMER-IT with a version devoid of definition
and guidelines in the prompt. To demonstrate the
robustness of the approach, we train several SLIMER-IT
instances, based on different LLM backbones. In Table
1, we report the results, highlighting the absolute
difference in performance between the model steered by
D&amp;Gs and the one not using them. Generally, definition
and guidelines yield improvements in F1. In particular,
the gap is contained when evaluating on in-domain data,
whereas it becomes significant in OOD and even more
substantial in unseen NEs. This is expected, since D&amp;G
help the most in conditions unseen during training.
Notably, LLaMA-3-based backbones benefit the most from
definition and guidelines, with improvements beyond
23 absolute F1 points, surpassing all the other models
by substantial margins in never-seen-before entity tags.
Some qualitative examples are shown in Appendix A.</p>
        <p>Impact of Backbones. Regarding the choice of the
SLIMER-IT backbone, we better illustrate results in Figure 2.
We can observe no remarkable difference in in-domain
evaluation, where most recent models outperform
older ones, as one might expect. Also globally,
Camoscio and LLaMA-2-chat obtain lower scores than
the rest of the backbones, with the only exception of the
FIC dataset, where LLaMA-3 based architectures
underperform. However, LLaMAntino-3-ANITA reaches the
best performance on 3 out of 4 datasets, with a strong gap
especially in the unseen named entities scenario, the most
challenging one. Interestingly enough, thanks to their
better understanding capabilities, backbones specialized
on Italian are particularly effective in the unseen NEs
scenario. This is the case of LLaMAntino-3-ANITA and even
Camoscio, which demonstrates higher F1 than LLaMA-2.</p>
        <p>State-of-the-art comparison. Thanks to the definition
of our zero-shot evaluation framework, we can
compare different state-of-the-art approaches fairly. Results
are outlined in Table 3. When evaluating in the same
domain where the model was trained, encoder-only
architectures obtain strong results despite being much smaller
models. This result is not surprising, given the
acknowledged performance of these architectures for supervised
NER. More unexpected is their ability to generalize well
to OOD inputs. Also GNER proves to be quite competitive,
achieving the best results in in-domain evaluation, and
in OOD on the FIC dataset. However, all these approaches
dramatically fail on never-seen-before tags, in contrast
to SLIMER-IT, which achieves almost 55 F1 points.
Compared with LLM-based approaches like GNER and
extremITLLaMA, this proves once again that without
definition and guidelines LLMs struggle in tagging novel
kinds of entities.</p>
        <p>Off-the-shelf Italian NER models. Although there
has been no prior work defining a Zero-Shot NER
evaluation framework for Italian, there exist fine-tuned,
specialized state-of-the-art zero-shot NER models for the Italian
language. In particular, we consider: GLiNER-ML [11],
a multilingual instance of GLiNER, Universal-NER-ITA9
and GLiNER-ITA-Large10, both specialized on Italian.
These models were trained on synthetic data covering a
vast number of different entity classes (up to 97k). Thus,
it is impossible to directly compare them in a pure zero-shot
framework, since there are no entity tags actually
never-seen-before during training. However, we still
report their results against SLIMER-IT. Table 2 reports the
results. Despite this advantage, SLIMER-IT outperforms
all these models by a large margin.</p>
        <p>8https://github.com/universal-ner
9https://huggingface.co/DeepMount00/universal_ner_ita
10https://huggingface.co/DeepMount00/GLiNER_ITA_LARGE</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusions</title>
      <p>In this paper, we proposed an evaluation framework for
Zero-Shot NER that we applied to Italian. Thanks to such
a framework, we can better investigate different zero-shot
properties depending on the scenario (in-domain, OOD,
unseen NEs). On top of that, we compared several
state-of-the-art approaches, with particular focus on SLIMER,
which, thanks to the usage of definition and guidelines,
is well suited to deal with novel entity types. Indeed,
SLIMER-IT, our fine-tuned model based on LLaMAntino-3,
surpasses other state-of-the-art techniques by large
margins. In the future, we plan to further extend the
zero-shot NER benchmark, and implement an input caching
mechanism for scalability to large label sets.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
    </sec>
    <sec id="sec-6">
      <title>A. SLIMER-IT on some NE tags</title>
      <p>[Table: Definition &amp; Guidelines generated for some NE tags, including Organizzazione (ADG) and Persona (FIC).]</p>
      <p>
Definizione: ’CORPO CELESTE’ si riferisce a oggetti astronomici
come pianeti, stelle, satelliti, costellazioni, galassie, comete e
asteroidi. Linee guida: Evita di etichettare come ’corpo celeste’
entità non direttamente collegate al campo dell’astronomia. Ad
esempio, ’Vergine’ potrebbe riferirsi anche a un segno astrologico,
quindi il contesto è importante. Assicurati di non includere nomi
di fenomeni non astronomici come ’alba’ o ’tramonto’. Potresti
incontrare ambiguità quando un termine è usato sia in campo
astronomico che in contesti non astronomici, ad esempio ’aurora’
che può riferirsi sia all’evento astronomico che al nome di persona.
Definizione: ’PIANTA’ si riferisce a organismi vegetali come alberi,
arbusti, erbe e altre forme di vegetazione. Linee Guida: Quando
identifichi entità ’pianta’, assicurati di etichettare solo nomi di
specie vegetali specifiche, come ’Fagus sylvatica’, ’Suaeda vera’,
’Betula pendula’, evitando generici come ’alberi’ o ’arbusti’ se non
accompagnati da una specificazione della specie.</p>
      <p>Definizione: ’MEDIA’ si riferisce a entità come nomi di giornali,
riviste, libri, album musicali, film, programmi televisivi, spettacoli
teatrali e altre opere creative e di comunicazione. Linee Guida:
Assicurati di etichettare solo nomi specifici di opere creative e di
comunicazione, evitando generici come ’musica’ o ’libro’. Presta
attenzione alle ambiguità, ad esempio ’Apple’ potrebbe riferirsi
alla società tecnologica o ad un’opera d’arte. Escludi i nomi di
artisti, autori o registi, che dovrebbero essere etichettati come
’persona’, e nomi generici di strumenti musicali o generi letterari
che non rappresentano opere specifiche.</p>
      <p>Definizione: ’LUOGO’ denota nomi propri di luoghi geografici,
comprendendo città, paesi, stati, regioni, continenti, punti di
interesse naturale, e indirizzi specifici. Linee Guida: Assicurati di non
confondere i nomi di luoghi con nomi di persone, organizzazioni o
altre entità. Ad esempio, ’Washington’, potrebbe riferirsi alla città
di Washington D.C. o al presidente George Washington, quindi
considera attentamente il contesto. Escludi nomi di periodi storici,
eventi o concetti astratti che non rappresentano luoghi fisici. Ad
esempio, ’nel Rinascimento’ è un periodo storico, non un luogo
geografico.</p>
      <p>Definizione: ’ORGANIZZAZIONE’ denota nomi propri di aziende,
istituzioni, gruppi o altre entità organizzative. Questo tipo di
entità include sia entità private che pubbliche, come società,
organizzazioni non profit, agenzie governative, università e altri gruppi
strutturati. Linee Guida: Annota solo nomi propri, evita di
annotare sostantivi comuni come ’azienda’ o ’istituzione’ a meno che
non facciano parte del nome specifico dell’organizzazione.
Assicurati di non annotare nomi di persone come organizzazioni, anche
se contengono termini che potrebbero sembrare riferimenti a
entità organizzative. Ad esempio, ’Johnson &amp; Johnson’ è un’azienda,
mentre ’Johnson’ da solo potrebbe essere il cognome di una
persona.</p>
      <p>Definizione: ’PERSONA’ denota nomi propri di individui umani.
Questo tipo di entità comprende nomi di persone reali, famose
o meno, personaggi storici, e può includere anche personaggi
di finzione. Linee Guida: Fai attenzione a non includere titoli o
ruoli professionali senza nomi propri (es. ’il presidente’ non è una
’PERSONA’, ma ’il presidente Barack Obama’ sì).
</p>
      <p>[Table columns: w/o D&amp;G F1, w/ D&amp;G F1, Δ F1 (e.g. +36.93).]</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>