
Mamma Mia! Where's My Name? De-Identifying Italian Clinical Notes with Large Language Models

Michele Miranda

2 3

Sébastien Bratières

Stefano Patarnello

Livia Lilli

0 Catholic University of the Sacred Heart, Rome, Italy
1 Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
2 Sapienza University of Rome, Rome, Italy
3 Translated srl, Rome, Italy

2025

The reuse of clinical free-text data plays a pivotal role in enabling advancements in medical research, healthcare analytics, and decision support systems. However, strict regulatory frameworks such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) impose rigorous privacy requirements, particularly concerning the removal of Protected Health Information (PHI). As a result, robust de-identification systems are essential to safeguard patient confidentiality while ensuring data usability. In this work, we present an adaptation of a prompt-based de-identification pipeline, originally developed for English-language clinical texts, to the Italian medical domain. Our approach prioritizes deployability in a real-world scenario by relying exclusively on open-source large language models (LLMs), to ensure compliance with privacy constraints. Specifically, we experimented with different versions of Gemma, LLaMA, Mistral, and Phi to identify and redact sensitive entities, focusing on name, age, location, and date. Our evaluation, conducted on an open-source Italian clinical dataset, employs both a classical deterministic approach and a more modern LLM-as-a-judge framework with a voting-based aggregation mechanism, both based on comparison with a manually annotated gold standard. In the deterministic setting, the pipeline achieved promising F1 scores between 0.65 and 0.81 across entity types. These results demonstrate the potential of using open-source LLMs for clinical de-identification in low-resource language settings, offering a privacy-compliant solution for real-world hospital deployments.

Keywords: Large Language Models (LLMs), De-Identification, Clinical Reports

1. Introduction

of VRAM needed to run them), making it impossible to run them on-premise in most real-world scenarios, such as the case of hospitals for processing clinical data. Moreover, adapting these techniques to less-resourced languages like Italian adds another layer of complexity, as most LLMs are trained primarily on English and exhibit limited specialization for smaller languages, impacting performance in domain-specific tasks such as clinical text de-identification. In this work, we address these challenges by implementing and adapting an existing GPT-based de-identification pipeline, originally developed for English [5], for the Italian clinical domain. Our approach leverages smaller open-source LLMs, which are better suited for compliance with privacy regulations and could be run on hospitals' proprietary clusters. As a first experiment, we utilize an open-source Italian clinical dataset to develop and evaluate our models, with the goal of extending the approach to proprietary datasets from other hospitals in future deployments. The evaluation was performed following two different approaches, both based on the comparison with a manually annotated gold standard: first using a deterministic assessment of the type of prediction and then leveraging the LLM-as-a-judge method. In this last implementation, a voting mechanism was integrated in order to aggregate the evaluation of multiple LLMs. The study framework is shown in Figure 1. The full code implementation is available at the GitHub repository Italian-Clinical-Note-Deidentification (footnote 1).

1 https://github.com/michele17284/Italian-Clinical-Note-Deidentification

2. Related Works

De-identification of clinical texts has long been a central concern in biomedical informatics, particularly given the stringent data protection regulations such as GDPR [1] and HIPAA [2]. Recent efforts have embraced deep learning, particularly using Named Entity Recognition (NER) frameworks based on BiLSTM [6] or Transformer [7] architectures. For instance, the work by Tobia et al. [8] explores the use of fine-tuned BERT models for PHI detection in Italian clinical reports, revealing that domain-specific adaptation significantly boosts performance over general-purpose models. Similar trends are observed by Tannier et al. [9], who combine deep learning with rule-based heuristics in a hybrid pseudonymization pipeline, achieving high F1-scores across multiple PHI types. A notable system in the clinical de-identification landscape is also INCOGNITUS [10], a modular anonymization toolbox supporting various anonymization strategies, including NER-based, rule-based, and embedding-based substitution. It emphasizes both recall and information preservation, and incorporates novel metrics to evaluate semantic loss due to anonymization.

More recently, the emergence of Large Language Models (LLMs) has opened up new frontiers for clinical text anonymization. In a comparative study, Pissarra et al. [11] demonstrate that open-source LLMs like LLaMA and Mistral can effectively anonymize clinical notes without relying on token-level labeling. Their approach introduces six new evaluation metrics to assess anonymization quality and utility retention, addressing the limitations of conventional frameworks, especially for generative anonymization. Finally, Liu et al. [5] present a framework to systematically apply GPT-4 to HIPAA-compliant de-identification, showing significant improvements over both traditional and deep learning baselines.

Recent work has explored the use of LLMs also as evaluators of natural language outputs. This paradigm, often referred to as LLM-as-a-judge, has gained traction as a scalable alternative to traditional human evaluation. [12] introduced MT-Bench and Chatbot Arena to benchmark LLMs through multi-turn conversations. Their findings exposed key challenges in LLM-based evaluation, such as positional bias, verbosity bias, and self-enhancement bias, where models might favour their own responses when acting as judges. [13] systematically studied whether LLMs can replace human annotators for tasks like summarization and question answering. They found that while LLMs can achieve reasonable alignment with human judgments, their reliability is sensitive to prompt design and evaluation context. To improve robustness, [14] proposed replacing single LLM judges with a panel of diverse models. This ensemble approach showed an improved correlation with human evaluations by mitigating individual model biases. These studies demonstrate the promise of LLMs in evaluation settings, while also highlighting the need for careful prompt engineering, reference use, and model diversity to ensure fair and consistent judgments.

3. Methods

To assess Large Language Models' performance in the de-identification of Italian clinical notes, we designed a comprehensive methodological framework that harnesses the capabilities of LLMs in two complementary roles: as automated de-identification systems and as evaluative agents. This dual-role approach enabled a more nuanced analysis of model behavior and effectiveness in handling sensitive clinical data. In addition to the LLM-based evaluation, we also implemented a deterministic evaluation pipeline. This component served as a complementary baseline, providing a rule-based reference to compare against the probabilistic and generative nature of LLM outputs, thereby enhancing the robustness and reliability of our overall evaluation strategy.

3.1. Dataset

In this study, we decided to use the CLinkaRT dataset [15], which was developed as part of the Evalita 2023 campaign [16]. Originally constructed for a relation extraction task, the dataset is based on clinical cases drawn from the E3C corpus [17], a publicly available multilingual resource comprising semantically annotated clinical narratives in English, French, Italian, Spanish, and Basque. The primary objective of the original task was to identify test results and measurements within clinical texts and to link them to corresponding mentions of laboratory procedures and diagnostic assessments from which those results were derived. Accordingly, the dataset contains both the clinical narratives and a set of relational annotations linking relevant entities. For the purpose of our investigation, focused on the de-identification of Italian clinical text, we made use exclusively of the textual component of the dataset. Specifically, we employed the 80 Italian-language clinical notes provided and manually annotated them to identify instances of sensitive information relevant to de-identification tasks. The annotation process was carried out according to predefined entity categories, including dates, patient age, geographic locations or addresses, and personal names. Table 1 summarizes the distribution of annotated entities across these categories over the whole dataset.

Table 1: Distribution of annotated entities across categories over the whole dataset.
Entity category      Annotated entities
DATE                 47
AGE                  101
LOCATION/ADDRESS     34
NAME                 46

Every annotation is in the format:

{"text": "agosto del 2011", "type": "DATA"}

Through this process, we constructed a task-specific gold standard dataset for de-identification. This resource serves as a critical foundation for performing reliable and reproducible evaluations of model performance.

3.2. De-Identification

The de-identification process employs an LLM-based framework to automatically identify and redact PII and sensitive data from our Italian annotated notes. We leveraged the approach of [5], where GPT-4 was used to de-identify English clinical cases based on the HIPAA definition of sensitive data. In this research, we took as a reference both HIPAA and GDPR [2, 1] when prompting the models, targeting 19 specific categories of sensitive information, including patient names, birth dates, tax identification codes, ages, places of birth, geographical origin, health card numbers, medical record numbers, phone numbers, email addresses, residential addresses, names of family members/caregivers, medical device identification numbers, attending physician names, exact admission/discharge dates, social security numbers, specific hospital or healthcare facility names, specific geographical locations, and any other data that could uniquely identify the patient. However, our analysis focuses on a subset of entities that appear most frequently in the dataset, as they are the most representative and relevant for assessing performance. According to the above two laws, health information can also be used for patient identification, but it does not make sense for us to remove health-related data, since this is a clinical dataset. De-identification is performed through a carefully crafted prompt that instructs the LLM to replace sensitive information with appropriate placeholder tags such as:

• [NOME] for entities regarding names. From now on we will refer to this category as NAME.
• [ETÀ] for entities regarding ages. From now on we will refer to this category as AGE.
• [DATA] for entities regarding dates in any format and level of completeness. This means that we include everything from entities naming a full date to entities naming just the year. From now on we will refer to this category as DATE.
• [LUOGO/INDIRIZZO] for entities containing info about locations and addresses. From now on we will refer to this category as LOCATION/ADDRESS.

We tested multiple prompt templates with the objective of optimizing model performance and ensuring the highest possible alignment with the expectations of a hypothetical human evaluator. Particular attention was devoted to both linguistic and structural consistency, especially in relation to the task of de-identification. Initially we also tried post-processing routines to extract clean de-identified text by removing model-generated explanations and comments, but then we managed to ensure that the model would not diverge by only using a more structured and focused prompt. To maintain coherence with the input data (clinical notes originally written in Italian), the selected prompt template for de-identification was also formulated in Italian. This choice was intended to minimize any potential semantic drift or misinterpretation arising from language mismatches. The final prompt template integrates the clinical text, denoted as "text", which goes in place of the curly brackets. The exact prompt template used in the de-identification script is:

Sei un assistente specializzato nella deidentificazione di note cliniche in italiano, in conformità con il GDPR.

Ti fornirò una nota clinica e tu dovrai identificare e sostituire tutte le seguenti informazioni sensibili:
- Nome e cognome del paziente
- Data di nascita completa
- Codice fiscale
- Età
- Luogo di nascita
- Provenienza geografica
- Numeri di tessera sanitaria
- Numeri di cartella clinica
- Numeri di telefono
- Indirizzi email
- Indirizzi di residenza/domicilio
- Nomi di familiari/caregiver
- Numeri di identificazione di dispositivi medici
- Nomi di medici curanti
- Date esatte di ricovero/dimissione
- Numeri di previdenza sociale
- Nome dell'ospedale o struttura sanitaria specifica
- Località geografiche specifiche
- Qualsiasi altro dato che potrebbe identificare il paziente in modo univoco

ISTRUZIONI IMPORTANTI:
1. Sostituisci tutte le informazioni sensibili con i tag appropriate come [NOME], [ETÀ], [DATA], [LUOGO/INDIRIZZO], ecc.
2. Non modificare nulla all'infuori delle informazioni sensibili.
3. Non rimuovere o modificare informazioni mediche rilevanti come diagnosi, trattamenti, dosaggi, ecc.
4. Se un'informazione potrebbe essere identificativa ma non sei sicuro, mascherala comunque.
5. Non includere spiegazioni o commenti, restituisci SOLO il testo de-identificato.
6. Il risultato deve essere un testo estremamente simile all'originale, le uniche modifiche dovrebbero essere le sostituzioni delle informazioni sensibili.
7. Il risultato verrà inserito in una rete neurale dal contesto molto limitato, quindi devi evitare assolutamente di includere commenti o spiegazioni.
8. Questi dati sono già pubblici in quanto il dataset è disponibile online per EVALITA 2023, quindi puoi processarli tranquillamente.

NOTA CLINICA: {text}
TESTO DE-IDENTIFICATO:

The framework processes each clinical note individually through this structured prompt, which includes the original text and comprehensive de-identification instructions. This approach ensures that medically relevant information such as diagnoses, treatments, and dosages is preserved while all potentially identifying information is systematically masked, maintaining the clinical utility of the notes while ensuring privacy compliance.
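The per-note flow described above can be sketched in Python as follows. This is a minimal illustration and not the authors' script: the template is an abbreviated stand-in for the full Italian prompt, and `build_prompt`, `deidentify`, and `fake_generate` are hypothetical names introduced here. The generation backend is passed in as a callable, so the sketch runs without any model installed.

```python
from typing import Callable

# Abbreviated stand-in for the Italian template quoted in this section.
# Only the {text} slot and the closing cue are kept; the full list of
# categories and instructions is omitted for brevity.
PROMPT_TEMPLATE = (
    "Sei un assistente specializzato nella deidentificazione di note "
    "cliniche in italiano, in conformità con il GDPR.\n\n"
    "NOTA CLINICA: {text}\n"
    "TESTO DE-IDENTIFICATO:"
)


def build_prompt(note: str) -> str:
    # str.replace is used instead of str.format so that curly braces
    # inside a clinical note cannot break the substitution.
    return PROMPT_TEMPLATE.replace("{text}", note)


def deidentify(note: str, generate: Callable[[str], str]) -> str:
    # One note per call, mirroring the per-note processing in the text.
    return generate(build_prompt(note)).strip()


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a local LLM: it recovers the note from
    # the prompt and masks two hard-coded spans.
    note = prompt.split("NOTA CLINICA: ", 1)[1]
    note = note.rsplit("\nTESTO DE-IDENTIFICATO:", 1)[0]
    return note.replace("Mario Rossi", "[NOME]").replace("67 anni", "[ETÀ]")


redacted = deidentify(
    "Il paziente Mario Rossi, 67 anni, riferisce dolore toracico.",
    fake_generate,
)
print(redacted)  # Il paziente [NOME], [ETÀ], riferisce dolore toracico.
```

Keeping the backend behind a callable means the same function can be pointed at any locally served model without changing the prompt logic.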

3.3. Evaluation

As previously explained in 3.1, we manually annotated the gold standard dataset to properly evaluate our de-identification system. The annotations consist of snippets of text carrying sensitive information that should be obfuscated, and the type of the sensitive information, which can refer to one of the four categories previously mentioned in Table 1. In order to evaluate the de-identified text, we tested two evaluation pipelines: LLM as a Judge, which is in line with recent trends, and a more classical Deterministic Evaluation. In both cases, the idea is to compute Precision, Recall and F1-score, based on the following definitions:

• True Positives (annotated entities correctly obfuscated)
• False Positives (non-annotated entities incorrectly obfuscated)
• False Negatives (annotated entities that were missed and not obfuscated)

3.3.1. LLM as a Judge

To evaluate the quality of the de-identification process, we employed an LLM-as-a-Judge methodology that leverages large language models to automatically assess the correctness of entity redaction. This approach was inspired by [18], in which the authors use several LLMs to evaluate an LLM output and then get to a final decision through majority voting. The original approach is devised for binary outputs (true/false), so it was necessary to change the method in order to adapt it to our setting. Our technique compares three inputs for each clinical note: the original text, the de-identified version, and the manually annotated gold standard entities. The judge model analyzes whether the annotated sensitive information has been correctly identified and replaced with appropriate placeholder tags for each entity category (NOME, ETÀ, LUOGO/INDIRIZZO, DATA) separately. The system classifies each entity into one of three categories: True Positives (TP) when gold standard entities are correctly anonymized with proper tags, False Negatives (FN) when gold standard entities remain unredacted in the output, and False Positives (FP) when non-sensitive text is incorrectly replaced with anonymization tags. The judge model receives a structured prompt containing detailed instructions and examples for each entity type, ensuring consistent evaluation criteria across all assessments. The LLM generates structured JSON output conforming to a predefined schema, facilitating automated processing and metric calculation. This approach provides a scalable alternative to manual evaluation while maintaining fine-grained analysis of de-identification performance across different types of sensitive information. The evaluation process is executed independently three times using different judge models to ensure robust and reliable assessment, with results subsequently processed through a majority voting mechanism to determine final entity classifications.

3.3.2. Majority Voting

To ensure robust and reliable evaluation results, we implemented a majority voting mechanism that aggregates judgments from multiple LLM judges for each entity classification decision. The system collects all individual judgments (True Positive, False Positive, False Negative) for each unique entity across the three judge models and applies a voting threshold to determine the final classification. For each entity, the algorithm counts the votes for each classification type and determines whether a clear majority exists based on a configurable threshold (default 0.5, meaning more than 50% agreement is required, which in our case means at least 2/3). Only entities with a clear majority consensus are included in the final metric calculations, while entities without sufficient agreement are discarded to maintain evaluation quality. This approach effectively handles disagreements between judge models and reduces the impact of individual model biases or errors, as seen in [14]. The majority voting process operates on entity-level classifications, where each unique entity (identified by its text content and type) receives votes from all available judges. The final precision, recall, F1-score, and accuracy metrics are computed using only the entities where a majority consensus was reached, providing more reliable evaluation results than any single judge model alone. Additionally, the system tracks and reports the number of discarded entities, offering transparency into cases where judge models disagreed significantly, which can indicate particularly challenging or ambiguous de-identification scenarios.

3.3.3. Deterministic Evaluation

In addition to the LLM-as-a-Judge evaluation, we implemented a deterministic evaluation methodology that provides a direct, rule-based assessment of de-identification quality without relying on LLMs' judgments. This approach compares the original clinical notes with their de-identified counterparts using exact string matching and pattern recognition techniques. This means that the system does not handle partial matches, hence there is no span to check. In this system, when the entity integrity is lower than 100%, it is not matched. For each entity in the gold standard annotations, the system counts occurrences in both the original and de-identified texts to determine how many instances were successfully removed. True Positives are calculated as the number of annotated entities that were correctly replaced with appropriate placeholder tags, while False Negatives represent annotated entities that remain unredacted in the output text. False Positives are also identified by detecting placeholder patterns ([NOME], [ETÀ], [LUOGO/INDIRIZZO], [DATA]) that exceed the number of corresponding gold standard entities for each category, indicating over-redaction of non-sensitive information. For a practical example of how this works, refer to Section 4.3. Like in the LLM-as-a-judge evaluation, this evaluation processes each entity category independently, computing precision, recall, and F1-scores both per category and overall. This deterministic approach provides a complementary evaluation perspective that is fully reproducible and transparent, offering exact quantitative measures without the potential variability introduced by LLM-based judgments. The method is particularly valuable for identifying systematic patterns in de-identification performance and ensuring consistent evaluation across different model outputs.
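The counting rules of the deterministic evaluation can be illustrated with a short Python sketch. It is a simplified reading of the description above and not the authors' implementation: the function names and the sample note are hypothetical, and the number of gold annotations is used as the original occurrence count instead of re-counting occurrences in the source text.

```python
import re
from collections import Counter

# Placeholder patterns for the four evaluated categories.
PLACEHOLDERS = {
    "NOME": r"\[NOME\]",
    "ETÀ": r"\[ETÀ\]",
    "LUOGO/INDIRIZZO": r"\[LUOGO/INDIRIZZO\]",
    "DATA": r"\[DATA\]",
}


def deterministic_counts(deidentified, gold, category):
    """Return (TP, FP, FN) for one note and one category.

    gold is a list of {"text": ..., "type": ...} annotations.
    """
    annotated = Counter(g["text"] for g in gold if g["type"] == category)
    n_gold = sum(annotated.values())
    # FN: annotated strings still present verbatim in the output
    # (exact match only, no partial spans).
    fn = sum(min(deidentified.count(text), n) for text, n in annotated.items())
    tp = n_gold - fn
    # FP: placeholder tags in excess of the gold entities for the category.
    n_tags = len(re.findall(PLACEHOLDERS[category], deidentified))
    fp = max(0, n_tags - n_gold)
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, recall and F1, with 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Two annotated NAME entities; one survives in the output, three tags appear.
gold = [
    {"text": "Mario Rossi", "type": "NOME"},
    {"text": "Mario Rossi", "type": "NOME"},
]
output = "[NOME] è stato visitato. Mario Rossi torna con [NOME] e [NOME]."
tp, fp, fn = deterministic_counts(output, gold, "NOME")
print(tp, fp, fn)       # 1 1 1
print(prf(tp, fp, fn))  # (0.5, 0.5, 0.5)
```

Because false positives are derived from tag counts alone, this scheme cannot tell which span was over-redacted, only that the number of tags exceeds the number of annotations.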

4. Experiments

In this section, we describe in detail the experimental setup used, including models and frameworks.

4.1. De-Identification

The de-identification experiments were conducted using six different large language models:

• llama3.2 3b [19]
• gemma3 [20] in sizes 1b, 4b, 12b
• mistral 7b [21]
• phi4 [22] 14b

It should be noted that we also tried using llama3.2 1b, but we did not report any result for this model because it refused to handle the "sensitive" data, although we clearly specified that the data is already public and there should be no issue in processing it. All models were deployed locally using ollama-python (footnote 2) for local inference. The generation parameters were set to reduce randomness and get a focused output: temperature of 0.7 (standard) and a maximum token limit of 8,192 per generation. All experiments were executed on a single NVIDIA RTX 3090 GPU with 24GB VRAM. Each clinical note was individually prompted using the structured de-identification template described previously in 3.2. Output was generated in JSON Lines format, containing the original input text, the de-identified output, and optionally the full prompt for debugging purposes.
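A rough Python sketch of this setup is given below. Several details are assumptions rather than facts from the paper: the mapping of the "maximum token limit" onto Ollama's `num_predict` option, the JSON field names `input`, `output`, and `prompt`, and the helper names `ollama_generate` and `run_model`. The prompt is abbreviated, and the `ollama` package is imported lazily so the JSON Lines logic can be exercised with a stub.

```python
import json
from typing import Callable, Iterable

# Assumed mapping of the reported settings onto Ollama option names.
GENERATION_OPTIONS = {"temperature": 0.7, "num_predict": 8192}


def ollama_generate(model: str, prompt: str) -> str:
    # Lazy import: requires the ollama package and a running Ollama server.
    import ollama

    response = ollama.generate(
        model=model, prompt=prompt, options=GENERATION_OPTIONS
    )
    return response["response"]


def run_model(
    notes: Iterable[str],
    generate: Callable[[str], str],
    include_prompt: bool = False,
) -> str:
    """Return one JSON object per note, serialized as JSON Lines."""
    lines = []
    for note in notes:
        # Abbreviated prompt; the full Italian template is given in 3.2.
        prompt = f"NOTA CLINICA: {note}\nTESTO DE-IDENTIFICATO:"
        record = {"input": note, "output": generate(prompt)}
        if include_prompt:
            record["prompt"] = prompt  # optional, for debugging
        # ensure_ascii=False keeps accented characters such as "À" readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stubbed run, so no model is needed to check the output format.
jsonl = run_model(["nota uno", "nota due"], lambda p: "[NOME]", include_prompt=True)
print(jsonl)
```

With a local server available, the stub could be replaced by something like `lambda p: ollama_generate("gemma3:12b", p)`.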

4.2. LLM-based Evaluation

4.2.1. LLM as a judge

The LLM-as-a-Judge evaluation employed three substantially larger language models requiring distributed inference across two NVIDIA RTX 3090 GPUs:

• gemma3 [20] 27b
• mistral-small [23] 24b
• deepseek-r1 [24] 32b

All judge models were deployed using the Ollama framework with tensor parallelism enabled across both GPUs to handle the increased memory requirements of these larger models. The evaluation process was conducted with a temperature setting of 0.7 to allow for slight variability in judgments while maintaining consistency, and structured JSON output was enforced using Pydantic (footnote 3) schema validation to ensure reliable parsing of model responses. Each judge model received a comprehensive evaluation prompt in Italian that detailed the task requirements, entity categories, and classification criteria. For the complete prompt, refer to Appendix A. The prompt specifically instructed the models to compare original clinical notes with their de-identified versions against gold standard annotations. The evaluation was conducted independently for each of the four entity categories (NOME, ETÀ, LUOGO/INDIRIZZO, DATA) across all six de-identification models, resulting in 72 individual evaluation runs in total (6 models × 4 categories × 3 judges = 72 evaluations), that is, 24 runs per judge model.

2 https://github.com/ollama/ollama-python
3 https://github.com/pydantic/pydantic

4.2.2. Voting

The majority voting mechanism was implemented through a systematic aggregation process that collected all individual judgments from the three judge models for each unique entity across the evaluation dataset. Thanks to Ollama and Pydantic, the models were forced to output structured text, which allowed automatic parsing of the answers. The system utilized a configurable voting threshold set to 0.5, requiring strict majority consensus (>50% agreement) among the three judges for an entity classification to be accepted into the final metrics calculation. The voting algorithm operated on entity-level classifications, and entities failing to achieve majority consensus were systematically discarded and tracked separately to maintain transparency in the evaluation process. Figure 2 shows the overall distribution of discarded entities per de-identification run and per entity category. In most cases, the disagreement only involves between 1 and 4 entities, with some rare exceptions reaching up to 12 discarded entities.
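The voting rule can be sketched in a few lines of Python. This is an illustrative reading of the description, not the released code: `majority_vote` and `aggregate` are hypothetical names, the sample judgments are invented, and the agreement share is computed over the votes actually cast for an entity, which is an assumption.

```python
from collections import Counter


def majority_vote(votes, threshold=0.5):
    """Return the winning label, or None when no label exceeds the threshold."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # Strict inequality: with three judges, at least two must agree.
    return label if count / len(votes) > threshold else None


def aggregate(judgments, threshold=0.5):
    """Group judgments by (text, type) and vote on each entity.

    judgments maps a judge name to a list of
    {"text": ..., "type": ..., "counted_as": "TP" | "FP" | "FN"} records.
    """
    votes = {}
    for entities in judgments.values():
        for e in entities:
            votes.setdefault((e["text"], e["type"]), []).append(e["counted_as"])
    accepted, discarded = {}, []
    for key, labels in votes.items():
        winner = majority_vote(labels, threshold)
        if winner is None:
            discarded.append(key)  # tracked separately, excluded from metrics
        else:
            accepted[key] = winner
    return accepted, discarded


judgments = {
    "gemma3:27b": [
        {"text": "agosto del 2011", "type": "DATA", "counted_as": "FN"},
        {"text": "Roma", "type": "LUOGO/INDIRIZZO", "counted_as": "TP"},
    ],
    "mistral-small:24b": [
        {"text": "agosto del 2011", "type": "DATA", "counted_as": "FN"},
        {"text": "Roma", "type": "LUOGO/INDIRIZZO", "counted_as": "FP"},
    ],
    "deepseek-r1:32b": [
        {"text": "agosto del 2011", "type": "DATA", "counted_as": "TP"},
        {"text": "Roma", "type": "LUOGO/INDIRIZZO", "counted_as": "FN"},
    ],
}
accepted, discarded = aggregate(judgments)
print(accepted)   # {('agosto del 2011', 'DATA'): 'FN'}
print(discarded)  # [('Roma', 'LUOGO/INDIRIZZO')]
```

With three judges and a threshold of 0.5, a two-to-one split is accepted and a three-way split is discarded, which matches the "at least 2/3" reading given in Section 3.3.2.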

Results were computed using exact vote counting without weighted averaging, ensuring that each judge model contributed equally to the final decision. To explain this in more detail, consider an example. Suppose that, for the annotated gold entity {text: 1 Agosto, type: DATA}, the judgements are:

gemma3:27b: {text:1 Agosto,type:DATA,counted_as:FN}
mistral-small:24b: {text:1 Agosto,type:DATA,counted_as:FN}
deepseek-r1:32b: {text:1 Agosto,type:DATA,counted_as:TP}

In this case, the majority of judges agree on counting this case as a False Negative (and they are right, since the text in the output is not obfuscated), so the annotation is actually counted as a False Negative. If the three judges disagreed (let's say mistral counted the sample as a False Positive), then no agreement would have been reached, and the entity would not have been considered in the final count.

4.3. Deterministic Evaluation

The deterministic evaluation system was implemented using exact regex matching algorithms to provide rule-based assessment of de-identification quality. The evaluation process loaded gold standard annotations and model outputs, ensuring data alignment through text content verification between original and de-identified versions. The system grouped annotations by unique entity text and type combinations, enabling efficient processing of duplicate entities across clinical notes. True Positive calculation utilized occurrence counting algorithms that compared entity frequencies between original and de-identified texts, determining successful redaction by measuring the reduction in entity instances. False Negative detection identified annotated entities that remained present in de-identified output through direct string presence verification. False Positive identification employed regex pattern matching against predefined placeholder patterns to detect over-redaction by counting placeholders exceeding gold standard entity counts per category: r'\[NOME\]', r'\[ETÀ\]', r'\[LUOGO/INDIRIZZO\]', r'\[DATA\]'.

To make things clearer, consider an example: if the input sample has 2 annotated NAME entities (which could even be the same one repeated twice) and the text of the entity is found only once in the output, this last count is the number of False Negatives, and True Positives are 2-1=1. Then, if we find 3 [NOME] tags in the output text, False Positives are 3-2=1, because the redactions exceed the original annotations by 1.

While the de-identification was done in a single run (per model) for all PII categories, the evaluation processed all four entity categories independently, computing precision, recall, and F1-scores for every entity type. Results were aggregated across all clinical notes. This implementation provided completely reproducible evaluation results without stochastic elements, serving as a baseline comparison against the LLM-based evaluation methodology while ensuring computational efficiency and transparency in the assessment process.

4.4. Results and Discussion

De-identification results for both evaluation methods are shown in Table 2, where the performance of the de-identifiers is reported using F1-Score values, across the two evaluation scenarios and for each entity. Furthermore, Figure 3 illustrates the F1-score distribution over the entities and models, comparing the deterministic and majority voting evaluation methods across all the de-identification models. The visualization also enables identification of the best-performing model and evaluation approach for each entity, aided by the individual data points (displayed as a strip plot) alongside the box plots.

From Table 2, the deterministic evaluation yielded generally higher F1 scores compared to the LLM-as-a-Judge approach, with the highest F1-Score ranging from 0.65 to 0.88 for NAME, LOCATION/ADDRESS and DATE entities, with the gemma3:12b model. The same finding is shown in Figure 3, where the F1-Score distribution for this model in the deterministic scenario has a higher interquartile range (IQR), specifically in terms of median and third quartile, compared to other experiments. The same model, under majority voting evaluation, shows lower performance for these entities, with F1-Score values from 0.40 for NAME to 0.64 for LOCATION/ADDRESS. However, the F1-Score of 0.57 from gemma3:4b is the highest score returned for the AGE entity across all the experiments. The disparity in performance between the two evaluation criteria suggests that the deterministic method may be less strict in certain classifications, while the LLM-based evaluation provides more stringent assessments of de-identification quality. Looking at Table 2 and Figure 3, the LOCATION/ADDRESS and NAME entities in deterministic evaluation demonstrated the highest scores over all the experiments.

In particular, the LOCATION/ADDRESS entity (green data point in the plot) shows the highest F1-Score value of 0.88 with gemma3:12b. The same entity also shows a high score of 0.75 with gemma3:4b, again in the deterministic scenario. The NAME entity (violet data point in the plot) presents an F1-Score of 0.73, 0.81 and 0.77 with gemma3:4b, gemma3:12b and mistral:7b respectively. Looking at the Majority Voting performance, the highest score is returned by gemma3:12b, with a value of 0.64 for LOCATION/ADDRESS. Furthermore, the gemma3:1b model presents its highest result under this evaluation criterion, with a score of 0.56 for the AGE entity. In general, the highest results of LOCATION/ADDRESS and NAME entities across all the experiments suggest that these categories are easier to detect in an LLM implementation where no context is given in the input prompts.

Date-related entities revealed interesting evaluation disparities, with the majority of models performing better under LLM-based assessment. Specifically, this holds for gemma3 1b (0.12 vs 0.47), gemma3 4b (0.27 vs 0.40), mistral 7b (0.37 vs 0.38, the smallest improvement) and phi4 14b (0.33 vs 0.41). This improvement suggests that LLM judges may better recognize contextual date patterns and partial date redactions that the deterministic method treats as failures. Considering how variable the format of a date can be, it is not surprising to see the LLM-based method perform better, as it is definitely more flexible.

The substantial differences between evaluation methods can be attributed to several factors that should be further investigated. Nonetheless, the LLM-as-a-judge evaluation, with its capability to handle the evaluation of variables with different formats, represents great potential. Further exploration of this method could be valuable, especially by refining its implementation, such as revising the evaluation prompts or selecting more suitable language models. For instance, choosing models specifically pre-trained on the Italian language (as Minerva [25]) or on the medical domain (as MedGemma [20]) may lead to improved performance.

Finally, our work highlights the significant potential of leveraging LLMs for de-identification tasks, even in a zero-shot learning scenario where no model fine-tuning was applied. This suggests that incorporating few-shot prompting or instruction tuning could further enhance performance, potentially making the approach more robust. Moreover, our decision to compare a deterministic evaluation method with an LLM-based approach aimed to assess LLMs not only in information extraction but also as tools for evaluation. Preliminary results indicate that the deterministic method remains the most reliable (except for the DATE entity), but they also reveal promising capabilities of LLMs as evaluators, which merit deeper investigation in future studies.

5. Conclusions

This study demonstrates the feasibility of using open-source LLMs for the de-identification of clinical text in Italian, a lower-resourced language within the biomedical NLP domain. While the results are far from perfect, they are quite promising in this context, especially considering how many different ways exist to express sensitive information, ways that deterministic methods are often unable to include exhaustively. Without requiring any specific domain adaptation or fine-tuning, models such as Gemma3, Llama3, Mistral, and Phi4 achieved solid performance in identifying and redacting key PII entities, with F1 scores ranging from 0.65 to 0.81 in deterministic evaluations. These results highlight the strong generalization capabilities of modern LLMs, even when applied to specialized tasks in unfamiliar domains and languages, and also suggest that, with proper adaptation, performance would be even better. Among the evaluation strategies explored, the deterministic approach, based on direct comparison with a gold standard, proved to be the most stable and informative. This may be due to current limitations in the LLM-as-a-judge method, particularly in how prompts are structured and how reference annotations are formatted. While LLM-based judgment holds promise as a flexible evaluation tool, future work should focus on improving prompt engineering and refining the representation of the gold standard to ensure more consistent and accurate assessments. A future direction could involve comparing performance across different formulations of the same evaluation prompt (e.g., entity-by-entity prompts vs. full-document evaluations) and assessing how this impacts consistency across judge models. Additionally, another future direction could be adjusting the pattern matching to make it more sophisticated, thereby improving the robustness of the evaluation. Overall, our findings support the use of prompt-based de-identification pipelines built on open-source LLMs as a privacy-compliant and resource-efficient solution for real-world hospital deployments. It is important to emphasize that this study is not a definitive solution, but rather shows the potential for both effective de-identification and its evaluation. Future efforts will aim to extend this work to proprietary datasets and explore lightweight domain adaptation techniques to further enhance performance.

[...] prompt-based de-identification, we did not perform a comparative evaluation against other established techniques, such as fine-tuned transformer models like BERT-based Named Entity Recognition (NER) systems. Including such baselines in future studies would help clarify the trade-offs in terms of accuracy, resource requirements, and deployment constraints, ultimately guiding the selection of the most effective approach for different clinical settings. Finally, further investigations on the LLM capabilities for evaluation should be done, in order to make the LLM-as-a-judge framework more robust and reliable.

References

6. Limitations While the results of this study are promising, sev

eral limitations must be acknowledged. First, our deidentification pipeline targets only a limited subset of PII entity types—specifically names, locations, and dates. A more comprehensive de-identification system would need to address additional categories such as contact information, institutional identifiers, and clinical IDs to meet the full requirements of privacy regulations. Second, the evaluation was conducted on a small open-source Italian clinical dataset, which may not fully reflect the complexity, variability, and noise present in real-world clinical records. As such, the generalizability of the approach needs to be validated on proprietary datasets from healthcare institutions to assess its practical utility and robustness in production environments. Additionally, although this work explores the capabilities of LLMs for
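One way to reduce the dependence of judge verdicts on a single model is the voting-based aggregation mentioned in the abstract. The exact rule used in this work is not restated in this section, so the following is only a minimal sketch under stated assumptions: simple majority voting per entity, with ties resolved conservatively as "FN". The function names, judge names, and entity keys are hypothetical.

```python
from collections import Counter


def majority_vote(verdicts, tie_break="FN"):
    """Return the most frequent label; fall back to tie_break on a tie.

    Assumption: ties count as "FN" so that a disputed entity is treated
    as a missed redaction rather than a success.
    """
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]


def aggregate(per_judge):
    """Combine verdicts from several judges.

    per_judge maps a judge name to {entity key: label}, where each label
    is "TP", "FN" or "FP". Returns {entity key: voted label}.
    """
    entity_keys = set().union(*(labels.keys() for labels in per_judge.values()))
    voted = {}
    for key in entity_keys:
        # Only judges that actually reported this entity take part in the vote.
        verdicts = [labels[key] for labels in per_judge.values() if key in labels]
        voted[key] = majority_vote(verdicts)
    return voted


# Hypothetical verdicts from three judge models on two gold entities.
judges = {
    "judge_a": {"Mario Rossi": "TP", "Bologna": "FN"},
    "judge_b": {"Mario Rossi": "TP", "Bologna": "TP"},
    "judge_c": {"Mario Rossi": "FN", "Bologna": "FN"},
}
voted = aggregate(judges)
print(voted)
```

An odd number of judges avoids most ties; the tie-break only matters when judges disagree evenly or some of them omit an entity.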

Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI), Grammarly, Other, and Cursor in order to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

A. LLM as a Judge evaluation prompt

Ti fornirò:
- Il testo originale di un referto medico (testo_originale)
- La sua versione anonimizzata (testo_anonimizzato)
- Una lista di entità sensibili annotate manualmente (entità_gold)

Le possibili categorie sono:
- NOME
- ETÀ
- LUOGO/INDIRIZZO
- DATA

Il tuo compito è confrontare le entità del gold standard con quelle effettivamente anonimizzate nel testo.

Per ciascuna entità del gold, verifica:
- Se è stata correttamente anonimizzata, il testo dell’entità gold è stato sostituito con il tag corrispondente alla categoria: mettila in annotations_deidentified con counted_as: "TP"
  ESEMPIO:
  - Entità gold: "Mario Rossi"
  - Entità deidentified: "[NOME]"
  - Output: "Mario Rossi", "NOME", "TP"
- Se non è stata correttamente anonimizzata, il testo dell’entità gold è rimasto invariato: mettila in annotations_deidentified con counted_as: "FN"
  ESEMPIO:
  - Entità gold: "Mario Rossi"
  - Entità deidentified: "Mario Rossi"
  - Output: "Mario Rossi", "NOME", "FN"

È possibile che compaiano entità anonimizzate che non sono presenti nel gold standard. Questo vuol dire che è stato anonimizzato un testo che non conteneva entità sensibili. In questo caso, mettila in annotations_deidentified con counted_as: "FP"
ESEMPIO:
- Entità deidentified: "[NOME]"
- Output: "[NOME]", "NOME", "FP"

IMPORTANTE: Ogni elemento in annotations_deidentified DEVE avere esattamente questi campi:
- text: il testo dell’entità
- type: il tipo dell’entità
- counted_as: deve essere esattamente "TP", "FN", o "FP"

NOTA: Ogni entità gold deve in qualche modo essere presente nel testo anonimizzato e sarà contata come "TP" se è stata anonimizzata correttamente, "FN" se non è stata anonimizzata. Questo significa che la cardinalità di annotations_deidentified deve essere maggiore o uguale alla cardinalità di annotations_gold.

ATTENZIONE:
- Ogni output deve essere un JSON valido, verrà poi processato con json.loads().
- Non aggiungere altro testo oltre al JSON, altrimenti verrà considerato un errore.
- Assicurati di mettere tra virgolette TUTTI i valori di testo, inclusi i tag come [NOME], [ETÀ], etc.
- Non usare virgole al posto dei due punti nelle coppie chiave-valore.

ESEMPI:

--NOME
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "Mario Rossi", "type": "NOME"}, {"text": "Giuseppe Bianchi", "type": "NOME"}], "annotations_deidentified": [{"text": "Mario Rossi", "type": "NOME", "counted_as": "FN"}, {"text": "[NOME]", "type": "NOME", "counted_as": "TP"}, {"text": "[NOME]", "type": "NOME", "counted_as": "FP"}]}

--ETÀ
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "25", "type": "ETÀ"}, {"text": "30", "type": "ETÀ"}], "annotations_deidentified": [{"text": "25", "type": "ETÀ", "counted_as": "FN"}, {"text": "[ETÀ]", "type": "ETÀ", "counted_as": "TP"}, {"text": "[ETÀ]", "type": "ETÀ", "counted_as": "FP"}]}

--LUOGO/INDIRIZZO
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "Pakistan", "type": "LUOGO/INDIRIZZO"}, {"text": "Bologna", "type": "LUOGO/INDIRIZZO"}], "annotations_deidentified": [{"text": "[LUOGO/INDIRIZZO]", "type": "LUOGO/INDIRIZZO", "counted_as": "TP"}, {"text": "Bologna", "type": "LUOGO/INDIRIZZO", "counted_as": "FN"}, {"text": "[LUOGO/INDIRIZZO]", "type": "LUOGO/INDIRIZZO", "counted_as": "FP"}]}

--DATA
Esempio di output:
{"report_id": "1", "annotations_gold": [{"text": "2021-01-01", "type": "DATA"}, {"text": "4 Maggio", "type": "DATA"}], "annotations_deidentified": [{"text": "2021-01-01", "type": "DATA", "counted_as": "FN"}, {"text": "[DATA]", "type": "DATA", "counted_as": "TP"}, {"text": "[DATA]", "type": "DATA", "counted_as": "FP"}]}
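The output contract stated in the judge prompt of Appendix A (valid JSON parsed with json.loads(), exactly three fields per judged entity, and at least as many judged entities as gold entities) can be checked mechanically before any score is computed. The following is a minimal sketch of such a check together with the standard precision, recall, and F1 definitions. The function names are hypothetical and this is not necessarily how the pipeline described in this paper processes judge outputs.

```python
import json

ALLOWED_LABELS = {"TP", "FN", "FP"}
REQUIRED_FIELDS = {"text", "type", "counted_as"}


def parse_judge_output(raw):
    """Parse one judge response and enforce the constraints stated in the prompt."""
    # json.loads raises json.JSONDecodeError if the model adds text around the JSON.
    data = json.loads(raw)
    gold = data["annotations_gold"]
    judged = data["annotations_deidentified"]
    for item in judged:
        if set(item) != REQUIRED_FIELDS:
            raise ValueError(f"unexpected fields: {sorted(item)}")
        if item["counted_as"] not in ALLOWED_LABELS:
            raise ValueError(f"invalid counted_as: {item['counted_as']!r}")
    # NOTA in the prompt: every gold entity must show up as either TP or FN.
    if len(judged) < len(gold):
        raise ValueError("fewer judged entities than gold entities")
    return data


def score(data):
    """Count TP/FN/FP labels and derive precision, recall and F1."""
    counts = {label: 0 for label in ALLOWED_LABELS}
    for item in data["annotations_deidentified"]:
        counts[item["counted_as"]] += 1
    tp, fn, fp = counts["TP"], counts["FN"], counts["FP"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"TP": tp, "FN": fn, "FP": fp,
            "precision": precision, "recall": recall, "f1": f1}


# The NOME example from the prompt, written as a single JSON string.
example = (
    '{"report_id": "1", '
    '"annotations_gold": [{"text": "Mario Rossi", "type": "NOME"}, '
    '{"text": "Giuseppe Bianchi", "type": "NOME"}], '
    '"annotations_deidentified": ['
    '{"text": "Mario Rossi", "type": "NOME", "counted_as": "FN"}, '
    '{"text": "[NOME]", "type": "NOME", "counted_as": "TP"}, '
    '{"text": "[NOME]", "type": "NOME", "counted_as": "FP"}]}'
)
result = score(parse_judge_output(example))
print(result)
```

Rejecting malformed responses up front keeps parsing failures separate from genuine disagreements between the judge and the gold standard, which matters when comparing judge models.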

References

[1] European Parliament, Council of the European Union, Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation), Official Journal of the European Union L119 (2016) 1-88.

[2] U.S. Department of Health and Human Services, 45 CFR § 164.514 - De-identification of health information, Health Information Privacy. [Online]. Available: https://www.law.cornell.edu/cfr/text/45/164.514 [Accessed: Dec. 2, 2024].

[3] Brown, Mann, Ryder, Subbiah, J. D. Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877-1901.

[4] Achiam, Adler, Agarwal, Ahmad, Akkaya, F. L. Aleman, Almeida, Altenschmidt, Altman, Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).

[5] Liu, Huang, Yu, Zhang, Wu, C. Cao, H. Dai, L. Zhao, Y. Li, P. Shu, F. Zeng, L. Sun, W. Liu, D. Shen, Q. Li, T. Liu, D. Zhu, X. Li, DeID-GPT: Zero-shot medical text de-identification by GPT-4, 2023. URL: https://arxiv.org/abs/2303.11032. arXiv:2303.11032.

[6] Graves, Fernández, Schmidhuber, Bidirectional LSTM networks for improved phoneme classification and recognition, in: International Conference on Artificial Neural Networks, Springer, 2005, pp. 799-804.

[7] Vaswani, Shazeer, Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

[8] G. P. Tobia, Patarnello, Masciocchi, C. Nero, M. C. Passarotti, Moretti, Marchetti, G. Arcuri, L. Lilli, Privacy in Italian clinical reports: A NLP-based anonymization approach, in: 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), IEEE, 2025, pp. 630-635.

[9] Tannier, Wajsbürt, Calliger, Dura, A. Mouchet, M. Hilka, R. Bey, Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse, Methods of Information in Medicine 63 (2024) 021-034.

[10] Ribeiro, Rolla, Santos, Incognitus: A toolbox for automated clinical notes anonymization, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2023, pp. 187-194.

[11] Pissarra, I. Curioso, Alveira, Pereira, Ribeiro, Souper, Gomes, Carreiro, Rolla, Unlocking the potential of large language models for clinical text anonymization: A comparative study, in: Proceedings of the Fifth Workshop on Privacy in Natural Language Processing, 2024, pp. 74-84.

[12] Zheng, et al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: NeurIPS, 2023. URL: https://arxiv.org/abs/2306.05685.

[13] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023. URL: https://arxiv.org/abs/2305.12042.

[14] Verga, et al., Replacing judges with juries: Evaluating LLM generations with a panel of diverse models, arXiv preprint arXiv:2403.16950 (2024).

[15] Altuna, Karunakaran, Lavelli, Magnini, M. Speranza, Zanoli, CLinkaRT at EVALITA 2023: Overview of the task on linking a lab result to its test event in the clinical domain, EVALITA (2023).

[16] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian (2023).

[17] Magnini, Altuna, Lavelli, M. Speranza, R. Zanoli, The E3C project: European clinical case corpus, Language 1 (2021) L3.

[18] Badshah, Sajjad, Reference-guided verdict: LLMs-as-judges in automatic evaluation of free-form text, arXiv preprint arXiv:2408.09235 (2024).

[19] Grattafiori, Dubey, Jauhri, Pandey, Kadian, A. Al-Dahle, Letman, Mathur, A. Schelten, Vaughan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).

[20] Gemma Team, Kamath, Ferret, Pathak, N. Vieillard, R. Merhej, Perrin, Matejovicova, A. Ramé, M. Rivière, et al., Gemma 3 technical report, arXiv preprint arXiv:2503.19786 (2025).

[21] A. Q. Jiang, Sablayrolles, Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, Bressand, G. Lengyel, Lample, Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.

[22] Abdin, Aneja, Behl, Bubeck, R. Eldan, S. Gunasekar, Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al., Phi-4 technical report, arXiv preprint arXiv:2412.08905 (2024).

[23] Mistral Small 3 | Mistral - mistral.ai, https://mistral.ai/news/mistral-small-3 [Accessed 13-06-2025].

[24] Guo, Yang, Zhang, J. Song, R. Zhang, Xu, Zhu, S. Ma, Wang, Bi, et al., DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, arXiv preprint arXiv:2501.12948 (2025).

[25] Orlando, Moroni, P.-L. H. Cabot, Conia, E. Barba, Orlandini, G. Fiameni, Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), 2024, pp. 707-719.