<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mamma Mia! Where's My Name? De-Identifying Italian Clinical Notes with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Miranda</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sébastien Bratières</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Patarnello</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Livia Lilli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Catholic University of the Sacred Heart</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Policlinico Universitario Agostino Gemelli IRCCS</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Translated srl</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The reuse of clinical free-text data plays a pivotal role in enabling advancements in medical research, healthcare analytics, and decision support systems. However, strict regulatory frameworks such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) impose rigorous privacy requirements, particularly concerning the removal of Protected Health Information (PHI). As a result, robust de-identification systems are essential to safeguard patient confidentiality while ensuring data usability. In this work, we present an adaptation of a prompt-based de-identification pipeline, originally developed for English-language clinical texts, to the Italian medical domain. Our approach prioritizes deployability in a real-world scenario by relying exclusively on open-source large language models (LLMs), to ensure compliance with privacy constraints. Specifically, we experimented with different versions of Gemma, LLaMA, Mistral, and Phi to identify and redact sensitive entities, focusing on name, age, location, and date. Our evaluation, conducted on an open-source Italian clinical dataset, employs both a classical deterministic approach and a more modern LLM-as-a-judge framework with a voting-based aggregation mechanism, both based on comparison with a manually annotated gold standard. In the deterministic setting, the pipeline achieved promising F1 scores between 0.65 and 0.81 across entity types. These results demonstrate the potential of using open-source LLMs for clinical de-identification in low-resource language settings, offering a privacy-compliant solution for real-world hospital deployments.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>De-Identification</kwd>
        <kwd>Clinical Reports</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>of VRAM needed to run them), making it impossible to run them on-premise in most real-world scenarios, such as hospitals processing clinical data. Moreover, adapting these techniques to less-resourced languages like Italian adds another layer of complexity, as most LLMs are trained primarily on English and exhibit limited specialization for smaller languages, impacting performance in domain-specific tasks such as clinical text de-identification. In this work, we address these challenges by implementing and adapting an existing GPT-based de-identification pipeline, originally developed for English [<xref ref-type="bibr" rid="ref5">5</xref>], for the Italian clinical domain. Our approach leverages smaller open-source LLMs, which are better suited for compliance with privacy regulations and could be run on hospitals' proprietary clusters. As a first experiment, we utilize an open-source Italian clinical dataset to develop and evaluate our models, with the goal of extending the approach to proprietary datasets from other hospitals in future deployments. The evaluation was performed following two different approaches, both based on the comparison with a manually annotated gold standard: first using a deterministic assessment of the type of prediction and then leveraging the LLM-as-a-judge method. In this last implementation, a voting mechanism was integrated in order to aggregate the evaluation of multiple LLMs. The study framework is shown in Figure 1. The full code implementation is available at the GitHub repository Italian-Clinical-Note-Deidentification<sup>1</sup>.</p>
      <sec>
        <title>2. Related Works</title>
        <p>De-identification of clinical texts has long been a central concern in biomedical informatics, particularly given stringent data protection regulations such as GDPR [<xref ref-type="bibr" rid="ref1">1</xref>] and HIPAA [<xref ref-type="bibr" rid="ref2">2</xref>]. Recent efforts have embraced deep learning, particularly using Named Entity Recognition (NER) frameworks based on BiLSTM [<xref ref-type="bibr" rid="ref6">6</xref>] or Transformer [<xref ref-type="bibr" rid="ref7">7</xref>] architectures. For instance, the work by Tobia et al. [<xref ref-type="bibr" rid="ref8">8</xref>] explores the use of fine-tuned BERT models for PHI detection in Italian clinical reports, revealing that domain-specific adaptation significantly boosts performance over general-purpose models. Similar trends are observed by Tannier et al. [<xref ref-type="bibr" rid="ref9">9</xref>], who combine deep learning with rule-based heuristics in a hybrid pseudonymization pipeline, achieving high F1-scores across multiple PHI types. A notable system in the clinical de-identification landscape is also INCOGNITUS [<xref ref-type="bibr" rid="ref10">10</xref>], a modular anonymization toolbox supporting various anonymization strategies, including NER-based, rule-based, and embedding-based substitution. It emphasizes both recall and information preservation, and incorporates novel metrics to evaluate semantic loss due to anonymization.</p>
        <p>More recently, the emergence of Large Language Models (LLMs) has opened up new frontiers for clinical text anonymization. In a comparative study, Pissarra et al. [<xref ref-type="bibr" rid="ref11">11</xref>] demonstrate that open-source LLMs like LLaMA and Mistral can effectively anonymize clinical notes without relying on token-level labeling. Their approach introduces six new evaluation metrics to assess anonymization quality and utility retention, addressing the limitations of conventional frameworks, especially for generative anonymization. Finally, Liu et al. [<xref ref-type="bibr" rid="ref5">5</xref>] present a framework to systematically apply GPT-4 to HIPAA-compliant de-identification, showing significant improvements over both traditional and deep learning baselines.</p>
        <p>Recent work has also explored the use of LLMs as evaluators of natural language outputs. This paradigm, often referred to as LLM-as-a-judge, has gained traction as a scalable alternative to traditional human evaluation. [<xref ref-type="bibr" rid="ref12">12</xref>] introduced MT-Bench and Chatbot Arena to benchmark LLMs through multi-turn conversations. Their findings exposed key challenges in LLM-based evaluation, such as positional bias, verbosity bias, and self-enhancement bias, where models might favour their own responses when acting as judges. [<xref ref-type="bibr" rid="ref13">13</xref>] systematically studied whether LLMs can replace human annotators for tasks like summarization and question answering. They found that while LLMs can achieve reasonable alignment with human judgments, their reliability is sensitive to prompt design and evaluation context. To improve robustness, [<xref ref-type="bibr" rid="ref14">14</xref>] proposed replacing single LLM judges with a panel of diverse models. This ensemble approach showed an improved correlation with human evaluations by mitigating individual model biases. These studies demonstrate the promise of LLMs</p>
      </sec>
      <sec id="sec-1-1">
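        <p><sup>1</sup> https://github.com/michele17284/Italian-Clinical-Note-Deidentification</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Distribution of annotated entities across categories over the whole dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Entity category</th><th>Annotated entities</th></tr>
            </thead>
            <tbody>
              <tr><td>DATE</td><td>47</td></tr>
              <tr><td>AGE</td><td>101</td></tr>
              <tr><td>LOCATION/ADDRESS</td><td>34</td></tr>
              <tr><td>NAME</td><td>46</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>serves as a critical foundation for performing reliable and reproducible evaluations of model performance.</p>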
        <sec id="sec-1-1-1">
          <title>3.2. De-Identification</title>
          <p>in evaluation settings, while also highlighting the need
for careful prompt engineering, reference use, and model
diversity to ensure fair and consistent judgments.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>To assess Large Language Models' performance in the de-identification of Italian clinical notes, we designed a comprehensive methodological framework that harnesses the capabilities of LLMs in two complementary roles: as automated de-identification systems and as evaluative agents. This dual-role approach enabled a more nuanced analysis of model behavior and effectiveness in handling sensitive clinical data. In addition to the LLM-based evaluation, we also implemented a deterministic evaluation pipeline. This component served as a complementary baseline, providing a rule-based reference to compare against the probabilistic and generative nature of LLM outputs, thereby enhancing the robustness and reliability of our overall evaluation strategy.</p>
      <sec>
        <title>3.1. Dataset</title>
        <p>In this study, we used the CLinkaRT dataset [<xref ref-type="bibr" rid="ref15">15</xref>], which was developed as part of the Evalita 2023 campaign [16]. Originally constructed for a relation extraction task, the dataset is based on clinical cases drawn from the E3C corpus [17], a publicly available multilingual resource comprising semantically annotated clinical narratives in English, French, Italian, Spanish, and Basque. The primary objective of the original task was to identify test results and measurements within clinical texts and to link them to corresponding mentions of laboratory procedures and diagnostic assessments from which those results were derived. Accordingly, the dataset contains both the clinical narratives and a set of relational annotations linking relevant entities. For the purpose of our investigation, focused on the de-identification of Italian clinical text, we made use exclusively of the textual component of the dataset. Specifically, we took the 80 Italian-language clinical notes provided and manually annotated them to identify instances of sensitive information relevant to de-identification tasks. The annotation process was carried out according to predefined entity categories, including dates, patient age, geographic locations or addresses, and personal names. Table 1 summarizes the distribution of annotated entities across these categories over the whole dataset.</p>
        <p>Every annotation is in the format:</p>
        <preformat>{"text": "agosto del 2011", "type": "DATA"}</preformat>
        <p>Through this process, we constructed a task-specific gold standard dataset for de-identification. This resource</p>
      </sec>
      <sec id="sec-2-1">
        <title>3.2. De-Identification</title>
        <p>The de-identification process employs an LLM-based framework to automatically identify and redact PII and sensitive data from our annotated Italian notes. We built on the approach of [<xref ref-type="bibr" rid="ref5">5</xref>], where GPT-4 was used to de-identify English clinical cases based on the HIPAA definition of sensitive data. In this research, we took both HIPAA and GDPR [<xref ref-type="bibr" rid="ref1 ref2">2, 1</xref>] as a reference when prompting the models, targeting 19 specific categories of sensitive information: patient names, birth dates, tax identification codes, ages, places of birth, geographical origin, health card numbers, medical record numbers, phone numbers, email addresses, residential addresses, names of family members/caregivers, medical device identification numbers, attending physician names, exact admission/discharge dates, social security numbers, specific hospital or healthcare facility names, specific geographical locations, and any other data that could uniquely identify the patient. However, our analysis focuses on the subset of entities that appear most frequently in the dataset, as they are the most representative and relevant for assessing performance. Under both regulations, health information can also be used to identify a patient; however, removing health-related data would defeat the purpose of a clinical dataset, so we leave it untouched. De-identification is performed through a carefully crafted prompt that instructs the LLM to replace sensitive information with appropriate placeholder tags such as:</p>
        <list list-type="bullet">
          <list-item><p>[NOME] for entities regarding names. From now on we will refer to this category as NAME.</p></list-item>
          <list-item><p>[ETÀ] for entities regarding ages. From now on we will refer to this category as AGE.</p></list-item>
          <list-item><p>[DATA] for entities regarding dates in any format and level of completeness, from a full date down to just the year. From now on we will refer to this category as DATE.</p></list-item>
          <list-item><p>[LUOGO/INDIRIZZO] for entities containing information about locations and addresses. From now on we will refer to this category as LOCATION/ADDRESS.</p></list-item>
        </list>
        <p>We tested multiple prompt templates with the objective of optimizing model performance and ensuring the highest possible alignment with the expectations of a hypothetical human evaluator. Particular attention was devoted to both linguistic and structural consistency, especially in relation to the task of de-identification. Initially we also tried post-processing routines to extract clean de-identified text by removing model-generated explanations and comments, but we then found that a more structured and focused prompt alone was enough to keep the model from diverging. To maintain coherence with the input data, namely clinical notes originally written in Italian, the selected prompt template for de-identification was also formulated in Italian. This choice was intended to minimize any potential semantic drift or misinterpretation arising from language mismatches. The final prompt template integrates the clinical text, denoted as "text", which goes in place of the curly brackets. The exact prompt template used in the de-identification script is:</p>
        <preformat>Sei un assistente specializzato nella deidentificazione di note cliniche in italiano, in conformità con il GDPR.

Ti fornirò una nota clinica e tu dovrai identificare e sostituire tutte le seguenti informazioni sensibili:
- Nome e cognome del paziente
- Data di nascita completa
- Codice fiscale
- Età
- Luogo di nascita
- Provenienza geografica
- Numeri di tessera sanitaria
- Numeri di cartella clinica
- Numeri di telefono
- Indirizzi email
- Indirizzi di residenza/domicilio
- Nomi di familiari/caregiver
- Numeri di identificazione di dispositivi medici
- Nomi di medici curanti
- Date esatte di ricovero/dimissione
- Numeri di previdenza sociale
- Nome dell’ospedale o struttura sanitaria specifica
- Località geografiche specifiche
- Qualsiasi altro dato che potrebbe identificare il paziente in modo univoco

ISTRUZIONI IMPORTANTI:
1. Sostituisci tutte le informazioni sensibili con i tag appropriate come [NOME], [ETÀ], [DATA], [LUOGO/INDIRIZZO], ecc.
2. Non modificare nulla all’infuori delle informazioni sensibili.
3. Non rimuovere o modificare informazioni mediche rilevanti come diagnosi, trattamenti, dosaggi, ecc.
4. Se un’informazione potrebbe essere identificativa ma non sei sicuro, mascherala comunque.
5. Non includere spiegazioni o commenti, restituisci SOLO il testo de-identificato.
6. Il risultato deve essere un testo estremamente simile all’originale, le uniche modifiche dovrebbero essere le sostituzioni delle informazioni sensibili.
7. Il risultato verrà inserito in una rete neurale dal contesto molto limitato, quindi devi evitare assolutamente di includere commenti o spiegazioni.
8. Questi dati sono già pubblici in quanto il dataset è disponibile online per EVALITA 2023, quindi puoi processarli tranquillamente.

NOTA CLINICA:
{text}

TESTO DE-IDENTIFICATO:</preformat>
        <p>The framework processes each clinical note individually through this structured prompt, which includes the original text and comprehensive de-identification instructions. This approach is designed to preserve medically relevant information such as diagnoses, treatments, and dosages while systematically masking all potentially identifying information, maintaining the clinical utility of the notes while supporting privacy compliance.</p>
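        <p>The per-note prompting step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the template is shortened to its first sentence and its two closing markers, and the names build_prompt, deidentify_notes, and fake_llm are hypothetical. The model call is passed in as a plain function so the loop can be tried without a running LLM.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Iterable, List

# Shortened stand-in for the Italian template (assumption: the full list of
# categories and instructions is omitted; only the {text} slot matters here).
PROMPT_TEMPLATE = (
    "Sei un assistente specializzato nella deidentificazione di note cliniche "
    "in italiano, in conformità con il GDPR.\n\n"
    "NOTA CLINICA:\n{text}\n\nTESTO DE-IDENTIFICATO:"
)


def build_prompt(note: str, template: str = PROMPT_TEMPLATE) -> str:
    """Insert one clinical note into the {text} slot of the template."""
    # str.replace (not str.format) so braces inside a note cannot break filling.
    return template.replace("{text}", note)


def deidentify_notes(notes: Iterable[str], generate: Callable[[str], str]) -> List[str]:
    """Prompt the model once per note; `generate` maps a prompt to model text."""
    return [generate(build_prompt(note)).strip() for note in notes]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a local model, used only to exercise the loop."""
    tail = prompt.split("NOTA CLINICA:\n", 1)[1]
    note = tail.rsplit("\n\nTESTO DE-IDENTIFICATO:", 1)[0]
    return note.replace("Mario Rossi", "[NOME]")


if __name__ == "__main__":
    print(deidentify_notes(["Il paziente Mario Rossi riferisce dolore toracico."], fake_llm))
```
]]></preformat>
        <p>Passing the generator as an argument keeps the prompting logic independent of any particular inference backend.</p>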
        <sec id="sec-2-1-1">
          <title>3.3. Evaluation</title>
          <p>As previously explained in 3.1, we manually annotated the gold standard dataset to properly evaluate our de-identification system. The annotations consist of snippets of text carrying sensitive information that should be obfuscated, and the type of the sensitive information, which can refer to one of the four categories listed in Table 1. In order to evaluate the de-identified text, we tested two evaluation pipelines: LLM as a Judge, which is in line with recent trends, and a more classical Deterministic Evaluation. In both cases, the idea is to compute Precision, Recall and F1-score, based on the following definitions:</p>
          <list list-type="bullet">
            <list-item><p>True Positives (annotated entities correctly obfuscated)</p></list-item>
            <list-item><p>False Positives (non-annotated entities incorrectly obfuscated)</p></list-item>
            <list-item><p>False Negatives (annotated entities that were missed and not obfuscated)</p></list-item>
          </list>
          <sec>
            <title>3.3.1. LLM as a Judge</title>
            <p>To evaluate the quality of the de-identification process, we employed an LLM-as-a-Judge methodology that leverages large language models to automatically assess the correctness of entity redaction. This approach was inspired by [18], in which the authors use several LLMs to evaluate an LLM output and then reach a final decision through majority voting. The original approach is devised for binary outputs (true/false), so it was necessary to adapt the method to our setting. Our technique compares three inputs for each clinical note: the original text, the de-identified version, and the manually annotated gold standard entities. The judge model analyzes whether the annotated sensitive information has been correctly identified and replaced with appropriate placeholder tags for each entity category (NOME, ETÀ, LUOGO/INDIRIZZO, DATA) separately. The system classifies each entity into one of three categories: True Positives (TP) when gold standard entities are correctly anonymized with proper tags, False Negatives (FN) when gold standard entities remain unredacted in the output, and False Positives (FP) when non-sensitive text is incorrectly replaced with anonymization tags. The judge model receives a structured prompt containing detailed instructions and examples for each entity type, ensuring consistent evaluation criteria across all assessments. The LLM generates structured JSON output conforming to a predefined schema, facilitating automated processing and metric calculation. This approach provides a scalable alternative to manual evaluation while maintaining fine-grained analysis of de-identification performance across different types of sensitive information. The evaluation process is executed independently three times using different judge models to ensure robust and reliable assessment, with results subsequently processed through a majority voting mechanism to determine final entity classifications.</p>
          </sec>
          <sec>
            <title>3.3.2. Majority Voting</title>
            <p>To ensure robust and reliable evaluation results, we implemented a majority voting mechanism that aggregates judgments from multiple LLM judges for each entity classification decision. The system collects all individual judgments (True Positive, False Positive, False Negative) for each unique entity across the three judge models and applies a voting threshold to determine the final classification. For each entity, the algorithm counts the votes for each classification type and determines whether a clear majority exists based on a configurable threshold (default 0.5, meaning more than 50% agreement is required, which in our case means at least 2 of the 3 judges). Only entities with a clear majority consensus are included in the final metric calculations, while entities without sufficient agreement are discarded to maintain evaluation quality. This approach effectively handles disagreements between judge models and reduces the impact of individual model biases or errors, as seen in [<xref ref-type="bibr" rid="ref14">14</xref>]. The majority voting process operates on entity-level classifications, where each unique entity (identified by its text content and type) receives votes from all available judges. The final precision, recall, F1-score, and accuracy metrics are computed using only the entities where a majority consensus was reached, providing more reliable evaluation results than any single judge model alone. Additionally, the system tracks and reports the number of discarded entities, offering transparency into cases where judge models disagreed significantly, which can indicate particularly challenging or ambiguous de-identification scenarios.</p>
          </sec>
          <sec>
            <title>3.3.3. Deterministic Evaluation</title>
            <p>In addition to the LLM-as-a-Judge evaluation, we implemented a deterministic evaluation methodology that provides a direct, rule-based assessment of de-identification quality without relying on LLMs' judgments. This approach compares the original clinical notes with their de-identified counterparts using exact string matching and pattern recognition techniques. This means that the system does not handle partial matches, hence there is no span to check: when the entity integrity is lower than 100%, the entity is not matched. For each entity in the gold standard annotations, the system counts occurrences in both the original and de-identified texts to determine how many instances were successfully removed. True Positives are calculated as the number of annotated entities that were correctly replaced with appropriate placeholder tags, while False Negatives represent annotated entities that remain unredacted in the output text. False Positives are identified by detecting placeholder patterns ([NOME], [ETÀ], [LUOGO/INDIRIZZO], [DATA]) that exceed the number of corresponding gold standard entities for each category, indicating over-redaction of non-sensitive information. For a practical example of how this works, refer to Section 4.3. As in the LLM-as-a-judge evaluation, this evaluation processes each entity category independently, computing precision, recall, and F1-scores both per category and overall. This deterministic approach provides a complementary evaluation perspective that is fully reproducible and transparent, offering exact quantitative measures without the potential variability introduced by LLM-based judgments. The method is particularly valuable for identifying systematic patterns in de-identification performance and ensuring consistent evaluation across different model outputs.</p>
          </sec>
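          <sec>
            <p>The counting rules of the deterministic evaluation can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: gold annotations are assumed to be dictionaries with "text" and "type" keys whose types are the four Italian tags, overlapping entity strings are not handled, and the function names evaluate_note and precision_recall_f1 are hypothetical.</p>
            <preformat><![CDATA[
```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# One regex per placeholder tag named in the text.
PLACEHOLDER_PATTERNS = {
    "NOME": r"\[NOME\]",
    "ETÀ": r"\[ETÀ\]",
    "LUOGO/INDIRIZZO": r"\[LUOGO/INDIRIZZO\]",
    "DATA": r"\[DATA\]",
}


def evaluate_note(original: str, deidentified: str, gold: List[dict]) -> Dict[str, Counter]:
    """Count TP/FN/FP per category using exact string matching only."""
    scores = {etype: Counter(tp=0, fn=0, fp=0) for etype in PLACEHOLDER_PATTERNS}
    expected = Counter()  # gold occurrences per category in the original note
    # Group duplicate annotations by (text, type) so each string is counted once.
    for text, etype in {(g["text"], g["type"]) for g in gold}:
        before = original.count(text)
        after = min(deidentified.count(text), before)
        scores[etype]["tp"] += before - after  # occurrences that disappeared
        scores[etype]["fn"] += after           # occurrences still readable
        expected[etype] += before
    for etype, pattern in PLACEHOLDER_PATTERNS.items():
        n_tags = len(re.findall(pattern, deidentified))
        # Placeholders beyond the gold count are treated as over-redaction.
        scores[etype]["fp"] += max(n_tags - expected[etype], 0)
    return scores


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
]]></preformat>
            <p>Because false positives are inferred only from surplus placeholder counts, a spurious tag can be masked by a missed entity of the same category; the sketch inherits that limitation of the count-based rule.</p>
          </sec>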
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>In this section, we describe in detail the experimental setup used, including models and frameworks.</title>
        <sec id="sec-3-1-1">
          <title>4.1. De-Identification</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>The de-identification experiments were conducted using six different large language models:</p>
        <list list-type="bullet">
          <list-item><p>llama3.2 3b [19]</p></list-item>
          <list-item><p>gemma3 [20] in sizes 1b, 4b, 12b</p></list-item>
          <list-item><p>mistral 7b [21]</p></list-item>
          <list-item><p>phi4 [22] 14b</p></list-item>
        </list>
        <p>It should be noted that we also tried llama3.2 1b, but we do not report any result for this model because it refused to handle the "sensitive" data, although we clearly specified that the data is already public and there should be no issue in processing it. All models were deployed locally using ollama-python<sup>2</sup> for inference. The generation parameters were set to reduce randomness and get a focused output: a temperature of 0.7 (standard) and a maximum token limit of 8,192 per generation. All experiments were executed on a single NVIDIA RTX 3090 GPU with 24GB VRAM. Each clinical note was individually prompted using the structured de-identification template described previously in 3.2. Output was generated in JSON Lines format, containing the original input text, the de-identified output, and optionally the full prompt for debugging purposes.</p>
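        <p>A run of this kind might be wired together as in the sketch below. It assumes the ollama-python package and a local Ollama server for the real call; the record field names ("model", "input", "output", "prompt") and the helper names are hypothetical, since the actual output schema is not given here. The model call is injectable so that the record-building logic can be exercised without a GPU.</p>
        <preformat><![CDATA[
```python
import json
from typing import Callable, Iterable, Iterator

# Generation settings reported in the text: temperature 0.7, up to 8,192 tokens.
# `num_predict` is Ollama's option name for the maximum number of generated tokens.
GENERATION_OPTIONS = {"temperature": 0.7, "num_predict": 8192}


def ollama_generate(model: str, prompt: str) -> str:
    """Single completion from a locally served model via ollama-python."""
    import ollama  # lazy import: requires `pip install ollama` and a running server

    response = ollama.generate(model=model, prompt=prompt, options=GENERATION_OPTIONS)
    return response["response"]


def iter_records(
    notes: Iterable[str],
    model: str,
    template: str,
    generate: Callable[[str, str], str] = ollama_generate,
    keep_prompt: bool = False,
) -> Iterator[dict]:
    """Yield one record per note; field names are illustrative assumptions."""
    for note in notes:
        prompt = template.replace("{text}", note)
        record = {"model": model, "input": note, "output": generate(model, prompt)}
        if keep_prompt:
            record["prompt"] = prompt  # optional, for debugging
        yield record


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write records in JSON Lines format, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```
]]></preformat>
        <p>With a server running, something like write_jsonl(iter_records(notes, "gemma3:12b", template), "out.jsonl") would produce one line per clinical note.</p>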
        <sec id="sec-3-2-1">
          <title>4.2. LLM-based Evaluation</title>
          <p>4.2.1. LLM as a judge</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <p>The LLM-as-a-Judge evaluation employed three substantially larger language models requiring distributed inference across two NVIDIA RTX 3090 GPUs:</p>
        <list list-type="bullet">
          <list-item><p>gemma3 [20] 27b</p></list-item>
          <list-item><p>mistral-small [23] 24b</p></list-item>
          <list-item><p>deepseek-r1 [24] 32b</p></list-item>
        </list>
        <p>All judge models were deployed using the Ollama framework with tensor parallelism enabled across both GPUs to handle the increased memory requirements of these larger models. The evaluation process was conducted with a temperature setting of 0.7 to allow for slight variability in judgments while maintaining consistency, and structured JSON output was enforced using Pydantic<sup>3</sup> schema validation to ensure reliable parsing of model responses. Each judge model received a comprehensive evaluation prompt in Italian that detailed the task requirements, entity categories, and classification criteria. For the complete prompt, refer to Appendix A. The prompt specifically instructed the models to compare original clinical notes with their de-identified versions against gold standard annotations. The evaluation was conducted independently for each of the four entity categories (NOME, ETÀ, LUOGO/INDIRIZZO, DATA) across all six de-identification models, resulting in 72 individual evaluation runs in total (6 models × 4 categories × 3 judges = 72 evaluations), that is, 24 runs per judge model.</p>
        <p><sup>2</sup> https://github.com/ollama/ollama-python</p>
        <p><sup>3</sup> https://github.com/pydantic/pydantic</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2.2. Voting</title>
        <p>The majority voting mechanism was implemented through a systematic aggregation process that collected all individual judgments from the three judge models for each unique entity across the evaluation dataset. Thanks to Ollama and Pydantic, the models were forced to output structured text, which allowed automatic parsing of the answers. The system utilized a configurable voting threshold set to 0.5, requiring strict majority consensus (more than 50% agreement) among the three judges for an entity classification to be accepted into the final metrics calculation. The voting algorithm operated on entity-level classifications, and entities failing to achieve majority consensus were systematically discarded and tracked separately to maintain transparency in the evaluation process. Figure 2 shows the overall distribution of discarded entities per de-identification run and per entity category. In most cases, the disagreement involves only between 1 and 4 entities, with some rare exceptions reaching up to 12 discarded entities.</p>
        <p>Results were computed using exact vote counting without weighted averaging, ensuring that each judge model contributed equally to the final decision. To illustrate, suppose that, for the annotated gold entity</p>
        <preformat>{text: 1 Agosto, type: DATA}</preformat>
        <p>the judgements are:</p>
        <preformat>gemma3:27b:
{text:1 Agosto,type:DATA,counted_as:FN}
mistral-small:24b:
{text:1 Agosto,type:DATA,counted_as:FN}
deepseek-r1:32b:
{text:1 Agosto,type:DATA,counted_as:TP}</preformat>
        <p>In this case, the majority of judges agree on counting this case as a False Negative (and they are right, since the text in the output is not obfuscated), so the annotation is counted as a False Negative. If the three judges had all disagreed (say mistral had counted the sample as a False Positive), no agreement would have been reached, and the entity would not have been considered in the final count.</p>
      </sec>
      <sec>
        <title>4.3. Deterministic Evaluation</title>
        <p>The deterministic evaluation system was implemented using exact regex matching algorithms to provide rule-based assessment of de-identification quality. The evaluation process loaded gold standard annotations and model outputs, ensuring data alignment through text content verification between original and de-identified versions. The system grouped annotations by unique entity text and type combinations, enabling efficient processing of duplicate entities across clinical notes. True Positive calculation utilized occurrence counting algorithms that compared entity frequencies between original and de-identified texts, determining successful redaction by measuring the reduction in entity instances. False Negative detection identified annotated entities that remained present in de-identified output through direct string presence verification. False Positive identification employed regex pattern matching against predefined placeholder patterns to detect over-redaction by counting placeholders exceeding gold standard entity counts per category:</p>
        <preformat>r'\[NOME\]', r'\[ETÀ\]', r'\[LUOGO/INDIRIZZO\]', r'\[DATA\]'</preformat>
        <p>comparison against the LLM-based evaluation methodology while ensuring computational efficiency and transparency in the assessment process.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.4. Results and Discussion</title>
        <p>De-identification results for both evaluation methods are shown in Table 2, where the performance of the de-identifiers is reported using F1-Score values, across the two evaluation scenarios and for each entity. Furthermore, Figure 3 illustrates the F1-score distribution over the entities and models, comparing the deterministic and majority voting evaluation methods across all the de-identification models. The visualization also enables identification of the best-performing model and evaluation approach for each entity, aided by the individual data points (displayed as a strip plot) alongside the box plots.</p>
        <p>As Table 2 shows, the deterministic evaluation yielded generally higher F1 scores than the LLM-as-a-Judge approach, with the highest F1-Scores ranging from 0.65 to 0.88 for the NAME, LOCATION/ADDRESS and DATE entities, obtained with the gemma3:12b model. The same finding is shown in Figure 3, where the F1-Score distribution for this model in the deterministic scenario has a higher interquartile range (IQR), specifically a higher median and third quartile, than the other experiments. The same model, under majority voting evaluation, shows lower performance for these entities, with F1-Score values from 0.40 for NAME to 0.64 for LOCATION/ADDRESS. However, the F1-Score of 0.57 from gemma3:4b is the highest score returned for the AGE entity across all the experiments. The disparity in performance between the two evaluation criteria suggests that the deterministic method may be less strict in certain classifications, while the LLM-based evaluation provides more stringent assessments of de-identification quality.</p>
        <p>Looking at Table 2 and Figure 3, the LOCATION/ADDRESS and NAME entities in deterministic evaluation demonstrated the highest scores over all the experiments. In particular, the LOCATION/ADDRESS entity (green data point in the plot) shows the highest F1-Score value of 0.88 with gemma3:12b. The same entity also shows a high score of 0.75 with gemma3:4b, again in the deterministic scenario. The NAME entity (violet data point in the plot) presents an F1-Score of 0.73, 0.81 and 0.77 with gemma3:4b, gemma3:12b and mistral:7b respectively. Looking at the Majority Voting performance, the highest score is returned by gemma3:12b, with a value of 0.64 for LOCATION/ADDRESS. Furthermore, the gemma3:1b model presents its highest result under this evaluation criterion, with a score of 0.56 for the AGE entity. In general, the highest results of LOCATION/ADDRESS and NAME entities across all the</p>
      </sec>
      <sec id="sec-3-6">
        <title>To make things clearer, let’s make an example: if the in</title>
        <p>put sample has 2 annotated NAME entities (which could
even be the same one repeated twice) and the text of the
entity is found only once in the output, this last one is
the counter for False Negatives, True Positives are 2-1=
1. Then if we find 3 tags [NOME] in the output text, False
Positives are 3-2=1, because the redactions exceed the
original annotations by 1.</p>
        <p>While the de-identification was done in a single run
(per model) for all PII categories, the evaluation processed
all four entity categories independently, computing
precision, recall, and F1-scores for every entity type. Results
were aggregated across all clinical notes. This
implementation provided completely reproducible evaluation
results without stochastic elements, serving as a baseline
experiments suggest that these categories are easier to the deterministic method remains the most reliable
(exbe detected in LLM implementation where no context is cept for the DATE entity), but they also reveal promising
given in the input prompts. capabilities of LLMs as evaluators, which merit deeper</p>
        <p>Date-related entities revealed interesting evaluation investigation in future studies.
disparities, with the majority of models performing better
under LLM-based assessment. Specifically we are
talking about gemma3 1b (0.12 vs 0.47), gemma3 4b (0.27 vs 5. Conclusions
0.40), mistral 7b (0.37 vs 0.38, the smallest improvement)
and phi4 14b (0.33 vs 0.41). This improvement suggests This study demonstrates the feasibility of using
openthat LLM judges may better recognize contextual date source LLMs for the de-identification of clinical text in
patterns and partial date redactions that the determinis- Italian, a lower-resourced language within the
biomeditic method treats as failures. Considering how variable cal NLP domain. While the results are far from perfect,
the format of a date can be, it is not surprising to see they are quite promising in this context, especially
conthe LLM-based method perform better, as it is definitely sidering how many diferent ways exist to express
senmore flexible. sitive information, ways that deterministic methods are</p>
        <p>The substantial diferences between evaluation meth- often unable to include exahustively. Without requiring
ods can be attributed to several factors, that should be any specific domain adaptation or fine-tuning, models
further investigated. Nonetheless, the LLM-as-a-judge such as Gemma3, Llama3, Mistral, and Phi4 achieved
evaluation, with its capability to handle the evaluation of solid performance in identifying and redacting key PII
variables with diferent formats, represents great poten- entities, with F1 scores ranging from 0.65 to 0.81 in
detertial. Further exploration of this method could be valuable, ministic evaluations. These results highlight the strong
especially by refining its implementation, such as revis- generalization capabilities of modern LLMs, even when
ing the evaluation prompts or selecting more suitable applied to specialized tasks in unfamiliar domains and
language models. For instance, choosing models specif- languages and also suggest that, with proper adaptation,
ically pre-trained on the Italian language (as Minerva performance would be even better. Among the evaluation
[25]) or on the medical domain (as MedGemma [20]) may strategies explored, the deterministic approach, based on
lead to improved performances. direct comparison with a gold standard, proved to be</p>
        <p>Finally, our work highlights the significant potential the most stable and informative. This may be due to
of leveraging LLMs for de-identification tasks, even in a current limitations in the LLM-as-a-judge method,
particzero-shot learning scenario where no model fine-tuning ularly in how prompts are structured and how reference
was applied. This suggests that incorporating few-shot annotations are formatted. While LLM-based judgment
prompting or instruction tuning could further enhance holds promise as a flexible evaluation tool, future work
performance, potentially making the approach more ro- should focus on improving prompt engineering and
rebust. Moreover, our decision to compare a deterministic ifning the representation of the gold standard to ensure
evaluation method with an LLM-based approach aimed to more consistent and accurate assessments. A future
diassess LLMs not only in information extraction but also rection could involve comparing performance across
difas tools for evaluation. Preliminary results indicate that ferent formulations of the same evaluation prompt (e.g.,
entity-by-entity prompts vs. full-document evaluations) prompt-based de-identification, we did not perform a
and assessing how this impacts consistency across judge comparative evaluation against other established
techmodels. Additionally, another future direction could be niques, such as fine-tuned transformer models like
BERTadjusting the pattern matching to make it more sophisti- based Named Entity Recognition (NER) systems.
Includcated, thereby improving the robustness of the evaluation. ing such baselines in future studies would help clarify the
Overall, our findings support the use of prompt-based de- trade-ofs in terms of accuracy, resource requirements,
identification pipelines built on open-source LLMs as a and deployment constraints, ultimately guiding the
selecprivacy-compliant and resource-eficient solution for real- tion of the most efective approach for diferent clinical
world hospital deployments. It is important to emphasize settings. Finally, further investigations on the LLM
capathat this study is not a definitive solution, but rather bilities for evaluation should be done, in order to make
shows the potential for both efective de-identification the LLM as a judge framework more robust and reliable.
and its evaluation. Future eforts will aim to extend this
work to proprietary datasets and explore lightweight
domain adaptation techniques to further enhance perfor- References
mance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Limitations</title>
      <sec id="sec-4-1">
        <title>While the results of this study are promising, sev</title>
        <p>eral limitations must be acknowledged. First, our
deidentification pipeline targets only a limited subset of
PII entity types—specifically names, locations, and dates.
A more comprehensive de-identification system would
need to address additional categories such as contact
information, institutional identifiers, and clinical IDs to
meet the full requirements of privacy regulations. Second,
the evaluation was conducted on a small open-source
Italian clinical dataset, which may not fully reflect the
complexity, variability, and noise present in real-world
clinical records. As such, the generalizability of the
approach needs to be validated on proprietary datasets from
healthcare institutions to assess its practical utility and
robustness in production environments. Additionally,
although this work explores the capabilities of LLMs for</p>
      </sec>
    </sec>
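    <p>The deterministic counting rule of Section 4.3 and its worked example (2 annotated NAME entities, one left in the output, 3 [NOME] tags) can be reproduced with a short sketch. It is not the authors' implementation: the function names, the sample text, and the choice to compare remaining occurrences against the number of gold annotations are assumptions made for illustration. Only the placeholder patterns come from the paper.</p>

```python
import re
from collections import Counter

# Placeholder patterns quoted in the paper, keyed by entity category.
PLACEHOLDERS = {
    "NOME": r"\[NOME\]",
    "ETÀ": r"\[ETÀ\]",
    "LUOGO/INDIRIZZO": r"\[LUOGO/INDIRIZZO\]",
    "DATA": r"\[DATA\]",
}


def score_category(deidentified, gold, category):
    """Return (TP, FN, FP) for one category in one note.

    gold is a list of (text, type) annotations (hypothetical layout).
    """
    # Group duplicate annotations by their entity text.
    annotated = Counter(text for text, kind in gold if kind == category)
    tp = fn = 0
    for text, n_annotated in annotated.items():
        # Occurrences still readable in the output are missed redactions.
        remaining = min(deidentified.count(text), n_annotated)
        fn += remaining
        tp += n_annotated - remaining
    # Placeholders beyond the number of gold entities are over-redaction.
    n_tags = len(re.findall(PLACEHOLDERS[category], deidentified))
    fp = max(0, n_tags - sum(annotated.values()))
    return tp, fn, fp


def precision_recall_f1(tp, fn, fp):
    """Standard precision, recall and F1; 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up note mirroring the worked example in Section 4.3.
gold = [("Mario Rossi", "NOME"), ("Mario Rossi", "NOME")]
output = "Paziente [NOME], visitato da [NOME]. Mario Rossi torna il [NOME]."
counts = score_category(output, gold, "NOME")
scores = precision_recall_f1(*counts)
```

    <p>On this sample the counts are TP = 1, FN = 1 and FP = 1, matching the example in the text, which gives precision, recall and F1 of 0.5 each.</p>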
    <sec id="sec-5">
      <title>A. LLM as a Judge evaluation prompt</title>
      <p>Ti fornirò:
- Il testo originale di un referto medico (testo_originale)
- La sua versione anonimizzata (testo_anonimizzato)
- Una lista di entità sensibili annotate manualmente (entità_gold)</p>
      <p>Le possibili categorie sono:
- NOME
- ETÀ
- LUOGO/INDIRIZZO
- DATA</p>
      <p>Il tuo compito è confrontare le entità del gold standard con quelle effettivamente anonimizzate nel testo.</p>
      <p>Per ciascuna entità del gold, verifica:
- Se è stata correttamente anonimizzata, il testo dell’entità gold è stato sostituito con il tag corrispondente alla categoria: mettila in annotations_deidentified con counted_as: "TP"
ESEMPIO:
- Entità gold: "Mario Rossi"
- Entità deidentified: "[NOME]"
- Output: "Mario Rossi", "NOME", "TP"
- Se non è stata correttamente anonimizzata, il testo dell’entità gold è rimasto invariato: mettila in annotations_deidentified con counted_as: "FN"
ESEMPIO:
- Entità gold: "Mario Rossi"
- Entità deidentified: "Mario Rossi"
- Output: "Mario Rossi", "NOME", "FN"</p>
      <p>È possibile che compaiano entità anonimizzate che non sono presenti nel gold standard. Questo vuol dire che è stato anonimizzato un testo che non conteneva entità sensibili. In questo caso, mettila in annotations_deidentified con counted_as: "FP"
ESEMPIO:
- Entità deidentified: "[NOME]"
- Output: "[NOME]", "NOME", "FP"</p>
      <p>IMPORTANTE: Ogni elemento in annotations_deidentified DEVE avere esattamente questi campi:
- text: il testo dell’entità
- type: il tipo dell’entità
- counted_as: deve essere esattamente "TP", "FN", o "FP"</p>
      <p>NOTA: Ogni entità gold deve in qualche modo essere presente nel testo anonimizzato e sarà contata come "TP" se è stata anonimizzata correttamente, "FN" se non è stata anonimizzata. Questo significa che la cardinalità di annotations_deidentified deve essere maggiore o uguale alla cardinalità di annotations_gold.</p>
      <p>ATTENZIONE:
- Ogni output deve essere un JSON valido, verrà poi processato con json.loads().
- Non aggiungere altro testo oltre al JSON, altrimenti verrà considerato un errore.
- Assicurati di mettere tra virgolette TUTTI i valori di testo, inclusi i tag come [NOME], [ETÀ], etc.
- Non usare virgole al posto dei due punti nelle coppie chiave-valore.</p>
      <p>ESEMPI:
--NOME
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "Mario Rossi", "type": "NOME"}, {"text": "Giuseppe Bianchi", "type": "NOME"}], "annotations_deidentified": [{"text": "Mario Rossi", "type": "NOME", "counted_as": "FN"}, {"text": "[NOME]", "type": "NOME", "counted_as": "TP"}, {"text": "[NOME]", "type": "NOME", "counted_as": "FP"}]}
--ETÀ
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "25", "type": "ETÀ"}, {"text": "30", "type": "ETÀ"}], "annotations_deidentified": [{"text": "25", "type": "ETÀ", "counted_as": "FN"}, {"text": "[ETÀ]", "type": "ETÀ", "counted_as": "TP"}, {"text": "[ETÀ]", "type": "ETÀ", "counted_as": "FP"}]}
--LUOGO/INDIRIZZO
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "Pakistan", "type": "LUOGO/INDIRIZZO"}, {"text": "Bologna", "type": "LUOGO/INDIRIZZO"}], "annotations_deidentified": [{"text": "[LUOGO/INDIRIZZO]", "type": "LUOGO/INDIRIZZO", "counted_as": "TP"}, {"text": "Bologna", "type": "LUOGO/INDIRIZZO", "counted_as": "FN"}, {"text": "[LUOGO/INDIRIZZO]", "type": "LUOGO/INDIRIZZO", "counted_as": "FP"}]}
--DATA
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "2021-01-01", "type": "DATA"}, {"text": "4 Maggio", "type": "DATA"}], "annotations_deidentified": [{"text": "2021-01-01", "type": "DATA", "counted_as": "FN"}, {"text": "[DATA]", "type": "DATA", "counted_as": "TP"}, {"text": "[DATA]", "type": "DATA", "counted_as": "FP"}]}</p>
      <p>Declaration on Generative AI</p>
      <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI), Grammarly, Other, and Cursor in order to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
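      <p>The format rules stated in the judge prompt of this appendix (valid JSON, exactly the fields text, type and counted_as, a label in "TP"/"FN"/"FP", and at least as many de-identified annotations as gold ones) could be checked before scoring along the following lines. This is a sketch under assumptions: the function name and the decision to reject a malformed reply by returning None are hypothetical, and the paper only states that replies are processed with json.loads().</p>

```python
import json

ALLOWED_LABELS = {"TP", "FN", "FP"}
REQUIRED_FIELDS = {"text", "type", "counted_as"}


def parse_judge_output(raw):
    """Return the parsed reply, or None if it breaks the prompt's rules."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # extra text around the JSON, or malformed JSON
    if not isinstance(data, dict):
        return None
    gold = data.get("annotations_gold")
    deid = data.get("annotations_deidentified")
    if not isinstance(gold, list) or not isinstance(deid, list):
        return None
    for item in deid:
        # Each element must carry exactly the three required fields.
        if not isinstance(item, dict) or set(item) != REQUIRED_FIELDS:
            return None
        if item["counted_as"] not in ALLOWED_LABELS:
            return None
    # The prompt requires at least one judged entry per gold entity.
    if len(gold) > len(deid):
        return None
    return data


# Hypothetical judge reply following the NOME example of the prompt.
reply = (
    '{"report_id": "1", '
    '"annotations_gold": [{"text": "Mario Rossi", "type": "NOME"}], '
    '"annotations_deidentified": '
    '[{"text": "Mario Rossi", "type": "NOME", "counted_as": "FN"}]}'
)
parsed = parse_judge_output(reply)
```

      <p>Replies that fail the check return None here, so a caller could log them or retry the judge instead of letting them distort the counts.</p>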
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          European Parliament, Council of the European Union,
          <article-title>Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation)</article-title>
          ,
          <source>Official Journal of the European Union L119</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          U.S. Department of Health and Human Services,
          <article-title>45 CFR § 164.514 - De-identification of health information</article-title>
          , Health Information Privacy. [Online]. Available: https://www.law.cornell.edu/cfr/text/45/164.514 [Accessed: Dec. 2,
          <year>2024</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>J.</given-names> <surname>Achiam</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Adler</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Ahmad</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Akkaya</surname></string-name>,
          <string-name><given-names>F. L.</given-names> <surname>Aleman</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Almeida</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Altenschmidt</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Altman</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Anadkat</surname></string-name>, et al.,
          <source>Gpt-4 technical report</source>
          , arXiv preprint arXiv:2303.08774 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Cao</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Dai</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shu</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Zeng</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>,
          <article-title>Deid-gpt: Zero-shot medical text de-identification by gpt-4</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.11032. arXiv:2303.11032.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>A.</given-names> <surname>Graves</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Fernández</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>,
          <article-title>Bidirectional lstm networks for improved phoneme classification and recognition</article-title>
          , in:
          <source>International conference on artificial neural networks</source>
          , Springer,
          <year>2005</year>
          , pp.
          <fpage>799</fpage>
          -
          <lpage>804</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Gomez</surname></string-name>,
          <string-name><given-names>Ł.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>G. P.</given-names> <surname>Tobia</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Patarnello</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Masciocchi</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Nero</surname></string-name>,
          <string-name><given-names>M. C.</given-names> <surname>Passarotti</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Moretti</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Marchetti</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Arcuri</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Lilli</surname></string-name>,
          <article-title>Privacy in italian clinical reports: A nlp-based anonymization approach</article-title>
          , in:
          <source>2025 IEEE 13th International Conference on Healthcare Informatics (ICHI)</source>
          , IEEE,
          <year>2025</year>
          , pp.
          <fpage>630</fpage>
          -
          <lpage>635</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>X.</given-names> <surname>Tannier</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wajsbürt</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Calliger</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Dura</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mouchet</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Hilka</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Bey</surname></string-name>,
          <article-title>Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse</article-title>
          ,
          <source>Methods of Information in Medicine</source>
          <volume>63</volume>
          (
          <year>2024</year>
          )
          <fpage>021</fpage>
          -
          <lpage>034</lpage>
          .
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          <string-name><given-names>M.</given-names> <surname>Speranza</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Zanoli</surname></string-name>,
          <article-title>Clinkart at evalita 2023: Overview of the task on linking a lab result to its test event in the clinical domain</article-title>
          ,
          <source>EVALITA</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [16]
          <string-name><given-names>M.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Menini</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Polignano</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Russo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Sprugnoli</surname></string-name>,
          <article-title>Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [17]
          <string-name><given-names>B.</given-names> <surname>Magnini</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Altuna</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Lavelli</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Speranza</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Zanoli</surname></string-name>,
          <article-title>The e3c project: European clinical case corpus</article-title>
          ,
          <source>Language</source>
          <volume>1</volume>
          (
          <year>2021</year>
          ) L3.
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [18]
          <string-name><given-names>S.</given-names> <surname>Badshah</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Sajjad</surname></string-name>,
          <article-title>Reference-guided verdict: Llms-as-judges in automatic evaluation of free-form text</article-title>
          ,
          <source>arXiv preprint arXiv:2408.09235</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [19]
          <string-name><given-names>A.</given-names> <surname>Grattafiori</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Dubey</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Jauhri</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pandey</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kadian</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Al-Dahle</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Letman</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mathur</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Schelten</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Vaughan</surname></string-name>, et al.,
          <article-title>The llama 3 herd of models</article-title>
          ,
          <source>arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [20]
          <string-name><given-names>G.</given-names> <surname>Team</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kamath</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Ferret</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Pathak</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Vieillard</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Merhej</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Perrin</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Matejovicova</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramé</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Rivière</surname></string-name>, et al.,
          <article-title>Gemma 3 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [21]
          <string-name><given-names>A. Q.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sablayrolles</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Mensch</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Bamford</surname></string-name>,
          <string-name><given-names>D. S.</given-names> <surname>Chaplot</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>de las Casas</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Bressand</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Lengyel</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Lample</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Saulnier</surname></string-name>,
          <string-name><given-names>L. R.</given-names> <surname>Lavaud</surname></string-name>,
          <string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Stock</surname></string-name>,
          <string-name><given-names>T. L.</given-names> <surname>Scao</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Lavril</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>,
          <string-name><given-names>W. E.</given-names> <surname>Sayed</surname></string-name>,
          <article-title>Mistral 7b</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [22]
          <string-name><given-names>M.</given-names> <surname>Abdin</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Aneja</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Behl</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bubeck</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Eldan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gunasekar</surname></string-name>,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hewett</surname>
          </string-name>
          , M. Java-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rolla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos</surname>
          </string-name>
          , Incognitus: A tool- heripi, P. Kaufmann, et al.,
          <source>Phi-4 technical report, box for automated clinical notes anonymization</source>
          ,
          <source>in: arXiv preprint arXiv:2412.08905</source>
          (
          <year>2024</year>
          ).
          <source>Proceedings of the 17th Conference of the Euro- [23] Mistral Small</source>
          <volume>3</volume>
          |
          <string-name>
            <surname>Mistral</surname>
            <given-names>AI</given-names>
          </string-name>
          <article-title>- mistral.ai, https:// pean Chapter of the Association for Computational mistral</article-title>
          .ai/news/mistral-small-
          <volume>3</volume>
          , ???? [Accessed 13- Linguistics: System Demonstrations,
          <year>2023</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>06</lpage>
          -2025].
          <volume>194</volume>
          . [24]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Song, R. Zhang,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pissarra</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Curioso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , S. Ma,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          , et al.,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Souper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gomes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rolla</surname>
          </string-name>
          , Deepseek-r1:
          <article-title>Incentivizing reasoning capability in Unlocking the potential of large language mod- llms via reinforcement learning, arXiv preprint els for clinical text anonymization: A comparative arXiv</article-title>
          :
          <volume>2501</volume>
          .12948 (
          <year>2025</year>
          ). study, in
          <source>: Proceedings of the Fifth Workshop on</source>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-L. H.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <source>Privacy in Natural Language Processing</source>
          ,
          <year>2024</year>
          ,
          <string-name>
            <given-names>pp. E.</given-names>
            <surname>Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Orlandini</surname>
          </string-name>
          , G. Fiameni,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <fpage>Min74</fpage>
          -
          <lpage>84</lpage>
          .
          <article-title>erva llms: The first family of large language models</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , et al.,
          <article-title>Judging llm-as-a-judge with mt- trained from scratch on italian data, in: Proceedings bench and chatbot arena</article-title>
          , in: NeurIPS,
          <year>2023</year>
          . URL: of the 10th Italian Conference on Computational https://arxiv.org/abs/2306.05685.
          <string-name>
            <surname>Linguistics (</surname>
          </string-name>
          CLiC-it
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>C.-H. Chiang</surname>
          </string-name>
          , H.-y. Lee,
          <article-title>Can large language models be an alternative to human evaluations?</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2305.12042.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Verga</surname>
          </string-name>
          , et al.,
          <article-title>Replacing judges with juries: Evaluating llm generations with a panel of diverse models</article-title>
          ,
          <source>arXiv preprint arXiv:2403.16950</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Altuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karunakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>