<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Protecting the Privacy in Velvet with Model Editing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giancarlo A. Xompero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Giannone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Favalli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raniero Romagnoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Almawave S.p.A., Via di Casal Boccone</institution>
          ,
          <addr-line>188-190 00137, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Centric ART, University of Rome Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) showed impressive generation abilities and are now integrated in many real-world applications. However, LLMs also tend to memorize information, including Personally Identifiable Information (PII), which can be learned and generated during inference, posing a risk to users' privacy. In this context, Model Editing techniques have recently been proposed to prevent the leakage of private information by modifying LLMs' parameters directly while preserving their generation capabilities. In this work, we show an application of Model Editing for privacy protection in the context of Italian data on Velvet, a recently released multilingual LLM. In particular, we focus on protection against Training Data Extraction (TDE) attacks. Empirical results from the experiments show that model editing techniques can be effective in mitigating privacy leakage in LLMs, even for Italian data, while preserving their multilingual generation capabilities.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Model Editing</kwd>
        <kwd>Privacy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) showed impressive generation capabilities in managing various tasks, and they are now integrated into many real-world applications. Given the popularity and potential of these models, several open-weight LLMs have been released to the public in the last years, including multilingual ones. Following this trend, LLMs that support the Italian language have also been made available [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], thus allowing to manage tasks even in Italian.
      </p>
      <p>
        However, since LLMs are now employed in many services, they can be affected by some well-known issues, such as toxicity or privacy leakage [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which can have an important negative impact on model performance. These problems raised concerns about privacy due to the possible presence of undetected private information in training data. Prior research showed that these models tend to memorize training data [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5, 6</xref>
        ], thus they are prone to memorizing Personal Identifiable Information (PII), which might be disclosed during text generation. Italian LLMs can also be affected, as the data used for training these models is often scraped from public web pages [7, 8], and although processes to identify and remove private information are used to clean the data, PII could still be present.
      </p>
      <p>
        Privacy is critical for LLMs deployed as services, raising concerns about privacy leakage and thus requiring attention. Carlini et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that extracting private information from an LLM is possible by prompting textual sequences from training data. The success of these attacks is evidence that the privacy of real individuals is at risk, so methods to prevent the leakage of PII are necessary. Recently, many solutions have been proposed to mitigate this phenomenon, such as machine unlearning [9, 10]. Alternatively, Model Editing approaches showed promising effects for protecting the privacy of users [11, 12, 13]. The application of these methods allows us to modify the knowledge encoded in the LLMs by breaking the association between some memorized prompts and the corresponding PII. Among these methods, Private Memorization Editing (PME) [13] is an approach that exploits the memorization mechanism of transformers to modify the association between a prompt and its related private information, showing its effectiveness in protecting LLMs from TDE attacks.
      </p>
      <p>
        In this work, we show an application of PME [13] to protect users' privacy for Italian data in LLMs. We focus on Velvet-2B, a recent multilingual LLM for the English and Italian languages. Even though the training data has been curated to remove PII, the model may learn some information during training. Our main objective is to understand whether model editing can be extended and used to protect the privacy of users whose PII might be included in training data obtained from public datasets. With PME, we can define an automatic process for obscuring private information and making Velvet robust to external attacks.
      </p>
      <p>
        We evaluate the effectiveness of our approach through an experimental process to make Velvet more robust against external attacks aimed at prompting the LLM to generate memorized PII. We obtain Training Data Extraction (TDE) attacks from a subset of documents in Italian used to train Velvet to induce the model to leak PII; in particular, we focus on email addresses (Section 4.1). Then, we adapt PME to Velvet and edit the model to protect the LLM against the identified TDE attacks (Section 4.2). Finally, we measure the effectiveness of our approach by observing the behaviour of Velvet against TDE attacks, and we evaluate the preservation of post-edit Velvet's multilingual generation capabilities to ensure the edit had no negative impact on the model (Section 4.3). Results show that model editing can be adapted to Italian data and make Velvet more robust against TDE attacks by notably reducing the accuracy of the attacks (Section 5.1). In addition, the evaluation of post-edit Velvet suggests that the edit does not affect multilingual capabilities for both the English and Italian languages (Section 5.2).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Given the large amount of data that is necessary to train
an LLM, the risks connected to privacy violations have
been largely investigated (Section 2.1). We describe what
mechanisms in LLMs have been identified to control
model predictions (Section 2.2), and how these insights
allow editing some undesired predictions without the
need of re-training the model (Section 2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. LLMs &amp; Privacy</title>
        <p>As LLMs require large amounts of data for training, some undesirable information may have been included in the training material inadvertently: a person's name, address, email address, social security number, phone number, as well as any other data that, when combined, could lead to the identification of individuals, are considered private information and should not be further disseminated during inference by an LLM. This kind of information, defined as Personal Identifiable Information (PII), can in fact be used to identify a specific individual, and threatens their privacy if disseminated.</p>
        <p>
          However, once a PII is included in the training material, an LLM can leak it during inference. In fact, LLMs may memorize that information [14, 15, 16] and consequently cause privacy leaks at inference time. A number of attacks have been designed to exploit this tendency and extract private information from LLMs [2, 17, 18]. For LLMs, even in black-box access the right prompt may be sufficient to obtain private information. While some attacks require the attacker to craft an adversarial input for the model [19, 20], other attacks do not even rely on potentially harmful prompts [
          <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 6, 4, 5</xref>
          ].
        </p>
        <p>Developing techniques for the preservation of individuals' privacy is central to making LLMs more robust and trustworthy.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Knowledge Mechanism of Transformers</title>
        <p>Transformer-based Language Model Predictions. We consider the forward pass of a Transformer-based decoder-only model ℳ of L layers and describe it in terms of its sub-components on a prompt x. Given the tokenized prompt x = [x_1, ..., x_T] and the corresponding input embeddings h^(0), a model builds the prediction for the next token x_{T+1} with an iterative refinement across layers. At a given layer l, given the Attention block Attn, the layer normalization LN and the Feed Forward block FFN, the output h^(l) of that layer is computed as:</p>
        <p>
          ∀ l ∈ {1, ..., L}:
          a^(l) = Attn(LN(h^(l-1)))
          h̃^(l) = h^(l-1) + a^(l)
          m^(l) = FFN(LN(h̃^(l)))
          h^(l) = h̃^(l) + m^(l)
          (1)
        </p>
        <p>On the last position T, at the last layer L, the hidden representation h_T^(L) is projected by a matrix E ∈ R^(d×|V|) onto the vocabulary space V. The scores obtained, normalized by a softmax function σ, predict the next token:</p>
        <p>ℳ(x) = argmax σ(E^T h_T^(L)) = x_{T+1}</p>
        <p>We aim to understand which mechanisms control the generation of the next token, and whether it is possible to alter them to modify the predictions when the model leaks private information.</p>
        <p>
          FFN Layers as Knowledge Memories. Feed Forward blocks FFN play a crucial role in the generation mechanism of the model, and not only because they account for most of the parameters of the network. The interpretation of the Feed Forward block in a Transformer model is that it implements a mapping of paired keys to values [
          <xref ref-type="bibr" rid="ref6">21, 22</xref>
          ]. Geva et al. [21] notice that, with the exception of the activation function, which is usually a ReLU rather than a softmax, the equation for the Feed Forward layer resembles the one that describes a neural memory [
          <xref ref-type="bibr" rid="ref7">23</xref>
          ]. The Feed Forward block is in fact composed of two matrices W_K^(l) ∈ R^(d_m×d) and W_V^(l) ∈ R^(d×d_m), and an activation function f that processes each position i ∈ [1, ..., T] of the input independently from the others. The output m_i^(l) of the Feed Forward block at the i-th position of the input is computed as follows:
        </p>
        <p>m_i^(l) = W_V^(l) f(W_K^(l) h̃_i^(l))   (2)</p>
        <p>where h̃_i^(l) is the sum of the output of the Attention block and the output of the previous layer, as in Equation 1. The keys of the memory are produced by the output of W_K^(l) and the non-linear function f, while the values are the corresponding columns of W_V^(l).</p>
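<p>
The key-value reading of Equation 2 can be made concrete with a toy NumPy sketch (hypothetical dimensions, not Velvet's actual weights): the non-linear activations act as memory coefficients that select slots, and the output is the corresponding combination of value columns.
</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
d, d_m = 8, 32                      # toy hidden size and FFN inner size (assumptions)

W_K = rng.normal(size=(d_m, d))     # rows act as "key" patterns
W_V = rng.normal(size=(d, d_m))     # columns are the paired "value" vectors

def ffn(h_tilde):
    # m = W_V f(W_K h) as in Equation 2: the ReLU activations are the
    # memory coefficients, and the output is a weighted sum of value columns
    coeffs = np.maximum(W_K @ h_tilde, 0.0)
    return W_V @ coeffs

h_tilde = rng.normal(size=d)
m = ffn(h_tilde)
assert m.shape == (d,)
```
</preformat>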
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Editing Knowledge of LLMs</title>
        <p>In the last years, there was a major interest around alternative methods to modify specific behaviors of LLMs without retraining the entire model from scratch. Based on the insights about the knowledge mechanism of transformers, the research area of knowledge editing has been flourishing, with the number of methods and approaches growing further.</p>
        <p>
          Currently, knowledge editing methods can be roughly divided in two categories: parameter-preserving and parameter-editing methods [
          <xref ref-type="bibr" rid="ref8">24</xref>
          ]. While parameter-preserving methods rely on external adapters or memories to intervene whenever there is a specific situation requiring a different response, parameter-editing methods are based on the theory about the knowledge mechanism of transformers and modify the parameters of the LLM directly, without the need of external modules.
        </p>
        <p>
          We focus on parameter-editing methods: given an LLM ℳ with parameters θ, parameter-editing methods aim at finding a shift in parameters Δθ to obtain a new model ℳ_(θ+Δθ), which allows to modify a specific prediction while preserving the non-target generation capabilities. ROME [
          <xref ref-type="bibr" rid="ref9">25</xref>
          ] and MEMIT [
          <xref ref-type="bibr" rid="ref10">26</xref>
          ], in particular, are parameter-editing approaches designed to edit the LLMs' parameters in a localized manner and are based on the interpretation of Feed Forward layers as memories, as introduced in Section 2.2. Under this interpretation, the matrix W_V^(l) optimizes the mapping between keys and values, that is:
        </p>
        <p>W_V^(l) = argmin_Ŵ Σ_(k_0, v_0) ||Ŵ k_0 − v_0||^2   (3)</p>
        <p>
          with k_0 ∈ K_0 being a set of keys to memorize and v_0 ∈ V_0 the corresponding values [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">25, 26, 27</xref>
          ]. Given the linearity of the system in Equation 3, the optimal solution can be computed as:
        </p>
        <p>W_V^(l) = V_0 K_0^T (K_0 K_0^T)^(-1)   (4)</p>
        <p>
          Additionally, a closed-form equation can be found to calculate the edit that introduces new keys and values into the mapping [
          <xref ref-type="bibr" rid="ref10 ref9">25, 26</xref>
          ]. Given a representation of the keys K_0 and values V_0 stored in that matrix, and the representations for the new keys K_* and values V_* to store:
        </p>
        <p>ΔW^(l) = (V_* − W^(l) K_*) K_*^T (K_0 K_0^T + K_* K_*^T)^(-1)   (5)</p>
        <p>The term V_* − W^(l) K_* represents the residual between the new desired values V_* and the old values currently stored in W^(l) for the new keys K_*. Since we have K_* ⊆ K_0, because the new keys are representations already stored in W^(l), and the new values V_0* satisfy V_0* ⊆ V_0, we can define W^(l) K_* = V_0*. The equation for ΔW^(l) can then be written as:</p>
        <p>ΔW^(l) = (V_* − V_0*) K_*^T (K_0 K_0^T + K_* K_*^T)^(-1)   (6)</p>
        <p>
          We will use the matrix ΔW^(l) to edit the memorized mapping in layer l, without retraining. Since we do not have access to K_0, Meng et al. [
          <xref ref-type="bibr" rid="ref10">26</xref>
          ] assume that this representation can be modeled with a random sample of inputs, so K_0 K_0^T can be defined as follows:
        </p>
        <p>C_0^(l) = λ · E[k k^T] ≜ K_0 K_0^T   (7)</p>
        <p>where λ · E[k k^T] is an uncentered covariance statistic computed on an empirical sample of vector inputs to the layer. In this paper, we refer to it with C_0 rather than C_0^(l) for simplicity.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Model Editing for Privacy Preservation</title>
        <p>In recent studies, model editing techniques have been applied to the context of privacy protection.</p>
        <p>Wu et al. [11] propose DEPN, a method that locates neurons associated with private information and then edits their corresponding activations to remove their contribution to the prediction.</p>
        <p>
          Patil et al. [
          <xref ref-type="bibr" rid="ref12">28</xref>
          ] showed an application of ROME [
          <xref ref-type="bibr" rid="ref9">25</xref>
          ] and MEMIT [
          <xref ref-type="bibr" rid="ref10">26</xref>
          ] to remove private information from the FFN layers of transformers. This approach exploits the association mechanism to break the associations leading to the leakage of private information.
        </p>
        <p>Venditti et al. [12] propose PAE, a data-driven approach based on the editing mechanism of MEMIT, aiming to break the association between an individual and their corresponding PIIs. The method uses prompt templates filled with the information about an individual and their corresponding PII to replace the private information with a dummy PII, thus preventing the leakage of the real PII.</p>
        <p>Ruzzetti et al. [13] propose PME, an automatic approach taking advantage of the memorization mechanism in LLMs. This approach uses memorized prompts inducing privacy violations to remove the associated PIIs. Unlike other locate-and-edit methods, PME distributes the residual for the editing among all the FFN layers of the transformer. The main advantage of this method is that it can be used automatically on collected prompts without the need of further manual analysis to determine the source of the knowledge, allowing for an automatic algorithm for privacy protection.</p>
        <p>In this paper, we apply PME because of its advantages, in particular the fact that it does not rely on assumptions such as which layers to modify or which part of a text retrieves the critical information, thus allowing for an automated process.</p>
      </sec>
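<p>
Equations 4 to 7 can be exercised end to end on random toy matrices. The NumPy sketch below (assumed toy dimensions, not tied to any real model) fits W by least squares as in Equation 4, forms the residual of Equation 5, and applies the update of Equation 6, with the covariance estimate of Equation 7 standing in for K_0 K_0^T.
</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n_old, n_new = 16, 8, 64, 4   # toy dimensions (assumptions)

K0 = rng.normal(size=(d_in, n_old))        # keys already stored (as columns)
V0 = rng.normal(size=(d_out, n_old))       # their paired values
W = V0 @ K0.T @ np.linalg.inv(K0 @ K0.T)   # Equation 4: least-squares mapping

K_new = K0[:, :n_new]                      # new keys are representations already stored
V_new = rng.normal(size=(d_out, n_new))    # desired privacy-preserving values
V0_new = W @ K_new                         # old values for those keys (residual term, Eq. 5)

C0 = K0 @ K0.T                             # Eq. 7: uncentered covariance estimate of K0 K0^T
delta = (V_new - V0_new) @ K_new.T @ np.linalg.inv(C0 + K_new @ K_new.T)  # Eq. 6

W_edited = W + delta                       # edited mapping, no retraining involved
```
</preformat>
<p>
With W fitted by least squares, W_edited coincides with the joint least-squares solution over the old and new key-value pairs, which is what the closed form of Equation 6 encodes.
</p>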
    </sec>
    <sec id="sec-3">
      <title>3. Application and Method</title>
      <sec id="sec-3-1">
        <title>3.1. PII Leakage via Training Data Extraction Attacks</title>
        <sec id="sec-3-2-1">
          <p>
            PII is private information that may have been inadvertently included in the training dataset and can be extracted from an LLM using Training Data Extraction (TDE) attacks [
            <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5, 6</xref>
            ]. In the initial formulation of TDE attacks, Carlini et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] demonstrate that black-box access to an LLM can be sufficient to extract memorized information from a model: when prompted with a context that has been included in the training material, the target LLM tends to generate verbatim the continuation of the original document. Among the generated verbatim memorized content, a model may also generate private information that should not be disseminated.
          </p>
          <p>
            Formally, given a model ℳ, a string s is k-extractable if there exists a context string c of k tokens such that the concatenation [c || s] is contained in the training material for ℳ and ℳ generates s exactly when prompted with c in greedy decoding. When the context exactly matches a sequence of the training material, the success of the attack is maximized [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], and since this is the most informative setting that the attacker can obtain, this is the worst-case scenario.
          </p>
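<p>
The definition above can be operationalized as a simple check. In this sketch, generate_greedy is a hypothetical stand-in for any greedy-decoding call to the target model; the toy model below "memorizes" a single training sequence and continues it verbatim.
</p>
<preformat>
```python
def is_k_extractable(generate_greedy, context_tokens, secret_tokens):
    """A string s is k-extractable if the model, prompted with the k-token
    context c that precedes s in the training data, emits s verbatim
    under greedy decoding."""
    generated = generate_greedy(context_tokens, max_new_tokens=len(secret_tokens))
    return generated[: len(secret_tokens)] == secret_tokens

# toy "model" that has memorized one training sequence (hypothetical example PII)
TRAINING_DOC = "contact us at mario.rossi@example.it for info".split()

def toy_generate(context, max_new_tokens):
    # if the context matches a span of the memorized document,
    # continue it verbatim, as a greedily-decoded memorized completion would
    for i in range(len(TRAINING_DOC) - len(context) + 1):
        if TRAINING_DOC[i : i + len(context)] == list(context):
            j = i + len(context)
            return TRAINING_DOC[j : j + max_new_tokens]
    return []

assert is_k_extractable(toy_generate, ["contact", "us", "at"], ["mario.rossi@example.it"])
```
</preformat>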
        </sec>
        <sec id="sec-3-2-2">
          <p>
            The success of the attack increases as the attacker gets more information regarding the training material: one crucial aspect is the length k of the context that the model is fed with [
            <xref ref-type="bibr" rid="ref4 ref5">5, 4</xref>
            ]: the longer the context, the larger the probability of emission of verbatim memorized information.
          </p>
        </sec>
        <p>
          Since LLMs have been shown to memorize PII rather than associating it with an individual identity [5, 12, 2], those attacks represent one of the crucial challenges to protect individuals whose information has been inadvertently added to the training material of an LLM.
        </p>
        <p>
          Hence, we initially perform TDE attacks against our target model: we simulate an informed attacker who has some background knowledge regarding the training material, with increasing levels of information. For a given PII, we collect the context that precedes it in the training material, and produce 50-, 100-, and 200-token-long sequences (see Section 4.1 for further details), as we expect that a more informed attacker may obtain a larger volume of information. The model is then prompted to generate the subsequent 100 tokens: the attack succeeds if, in greedy decoding, the generated PII matches the original PII in the training material. The evaluation is rigorous, since a strict match between the generated PII and the one found in the training material is required.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2. PME for Automatic Privacy Mitigation</title>
        <p>
          To address the threats posed by TDE attacks, we adopt Private Memorization Editing (PME) [13], a model editing strategy that leverages the memorization tendencies of LLMs as a defense. The objective of the method is to reduce the success of TDE attacks by replacing the memorized PII with semantically equivalent, but privacy-preserving, information. PME applies the editing on the Feed Forward layers of the model, similarly to other model editing techniques like ROME [
          <xref ref-type="bibr" rid="ref9">25</xref>
          ] and MEMIT [
          <xref ref-type="bibr" rid="ref10">26</xref>
          ].
        </p>
        <p>As discussed in Section 2.3, once one knows the correct representations for the keys and values that W_V^(l) encodes, it is possible to apply the closed-form solution in Equation 6 to perform the update. To compute the correct representations for keys and values, PME directly exploits training material verbatim memorized by the model.</p>
        <p>When the model is prompted with a context c that is included in the training material and that causes the generation of a PII, PME edits the model to obtain a privacy-preserving output instead. In each layer, the keys are the hidden representations that the model computes for the context prompt as in Equation 2, so k^(l) = f(W_K^(l) h̃^(l)).</p>
        <p>For the values, the new privacy-preserving value should be encoded with an appropriate vector representation. For this reason, PME initially optimizes a hidden representation z_* in the last layer of the model: using Gradient Descent, PME optimizes z_* so that, once decoded with the projection matrix on the vocabulary, it gives the highest probability of generating a dummy, privacy-preserving value.</p>
        <p>
          The underlying hypothesis in PME is then that each layer should contribute to the generation of this last-layer representation z_*. PME mimics the generation of the PII: with a forward pass on the memorized context, the method quantifies how much each layer contributes to the generation of the memorized PII. Instead of relying on Causal Mediation Analysis as in MEMIT [
          <xref ref-type="bibr" rid="ref10">26</xref>
          ], or on other localization techniques that have been shown to not inform the edit [
          <xref ref-type="bibr" rid="ref13 ref14">29, 30</xref>
          ], for identifying a restricted number of contributing layers, a contribution coefficient is computed for each layer following a geometric approach.
        </p>
        <p>
          Since the computation of a Transformer model can be described as a sum of its sub-components at each layer [
          <xref ref-type="bibr" rid="ref15">31, 32</xref>
          ], PME computes the contribution coefficient of each layer as the projection of that layer's Feed Forward output onto the last-layer Feed Forward representation: the larger the projection, the larger the impact of that layer on the overall sum. This contribution coefficient, rescaled so that the coefficients sum to one across the different layers, is then used to represent a fraction of z_* proportional to the contribution coefficient α^(l) of that layer, that is, at each layer the value v^(l) = α^(l) z_*.
        </p>
        <p>Once the correct representations for the keys and the privacy-preserving values are computed, the edit can be performed as in Equation 6, and the post-edit model should not generate the target PII under TDE attacks.</p>
      </sec>
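<p>
The geometric distribution of the residual described above can be sketched as follows (NumPy, toy shapes, not Velvet's real activations): each layer's Feed Forward output is projected onto the last-layer representation, the coefficients are rescaled to sum to one, and each layer receives its fraction of z_*.
</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 6, 16                       # toy number of layers and hidden size (assumptions)

ffn_out = rng.normal(size=(L, d))  # m^(l): per-layer FFN output at the PII position
z_star = rng.normal(size=d)        # optimized last-layer privacy-preserving representation

last = ffn_out[-1]
proj = ffn_out @ last / (last @ last)  # scalar projection of each m^(l) onto m^(L)
alpha = proj / proj.sum()              # rescale so the coefficients sum to one
values = alpha[:, None] * z_star       # v^(l) = alpha^(l) * z_*, one target value per layer

assert np.isclose(alpha.sum(), 1.0)
assert np.allclose(values.sum(axis=0), z_star)
```
</preformat>
<p>
Because the coefficients sum to one, the per-layer values recompose z_* exactly when summed across layers, which is the property the residual distribution relies on.
</p>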
      <sec id="sec-3-5">
        <title>4. Experimental Setting</title>
        <p>In this section, we discuss the experimental setting we use to assess the effectiveness of our approach. Specifically, we define: (1) the process for data preparation to obtain the TDE attacks and the related leaked information (Section 4.1), (2) how PME is adapted and applied to Velvet (Section 4.2), and (3) how we evaluate the effectiveness of our privacy protection approach and the post-edit preservation of Velvet's capabilities (Section 4.3). For these experiments, we focus on email addresses in Italian data as PII, and on Velvet-2B as our target LLM.</p>
        <sec id="sec-3-5-1">
          <title>4.1. Data Preparation</title>
          <p>Training Data Extraction Attacks. As we discussed in Section 3.1, Training Data Extraction attacks are based on documents and prompts that the LLM has seen during training, which induce a target LLM to complete the given prompts with text verbatim memorized by the model. Since LLMs are prone to leaking PII during generation due to the possible contamination of training data with PII, we prepare Training Data Extraction attacks by analyzing a subset of the training data used for Velvet. We focus on the Italian subset of CulturaY [33], one of the public datasets seen by Velvet during the pre-training phase.</p>
          <p>We focus on potentially harmful prompts, since our main objective is to study the feasibility of protecting against TDE rather than assessing their accuracy. To do that, we define the following protocol. We filter all documents in the dataset that contain at least one email address. Then, once we obtain only documents containing PII, we prepare batches of potential TDE attack prompts of different lengths k ∈ {50, 100, 200}, by selecting the k tokens preceding the identified PII. After obtaining a set of potential attacks, we deduplicate similar prompts. In order to select effective attacks, we prompt Velvet-2B with the collected attacks and induce the model to generate 100 tokens: if the email address generated by the model for a given prompt is the one expected as in the training data, we add it to the set of TDE attacks.</p>
          <p>Sample for Computing PME Editing Statistics. An important step required by PME to perform the desired edit is the uncentered covariance statistic C_0^(l) described in Eq. 7. This is an estimation of the keys stored in the corresponding l-th FFN layer, so we need to build an empirical sample of vector inputs for the layer, which are obtained by feeding the LLM with sample texts. Since we are dealing with a multilingual LLM trained on both English and Italian texts, we prepare two samples of 100k documents each from the English and Italian Wikipedia subsets of the pre-training data used for Velvet-2B. The purpose of these samples is to understand the effects on the editing performance of C_0 computed on different languages.</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>4.2. Application of PME</title>
          <p>Mitigating Privacy Leakage. Our strategy is to prevent Velvet from generating memorized PIIs during inference by applying PME to Velvet on the identified TDE attacks reported in Section 4.1. PME allows editing the knowledge of a PII associated with multiple memorized prompts by modifying the LLM's parameters directly.</p>
          <p>
            The main advantage of this method is that we can edit on the TDE attacks directly and there is no need to specify which layers are the target of the edit, unlike methods such as MEMIT [
            <xref ref-type="bibr" rid="ref10">26</xref>
            ]. Based on this, for every attack (c, s) with s = ℳ(c), c the attack prompt and s the leaked PII, we use PME to edit the knowledge encoded in Velvet's FFN layers to force the new association (c, s'), where s' is the new dummy PII mail@domain.com, which is semantically similar to the original PII. With this method, our objective is to reduce the accuracy of the attacks, modifying the prediction of the LLM to prevent the generation of the leaked information.
          </p>
          <p>We perform the editing process with an approach called sequential batch editing [12, 13], in which several prompts are edited in multiple steps, with a batch of multiple examples edited at each step. For our experiments, we fixed the batch size to 16.</p>
          <p>
            Computing Multilingual C_0 for PME. PME [13], ROME [
            <xref ref-type="bibr" rid="ref9">25</xref>
            ] and MEMIT [
            <xref ref-type="bibr" rid="ref10">26</xref>
            ] require a representation of the keys K_0 stored in the l-th FFN layer to apply the formula defined in Eq. 6, which can be modeled as the quantity C_0^(l) defined in Eq. 7. This quantity is obtained by computing an uncentered covariance statistic on an empirical sample of vector inputs to the layer when parsing a sample of documents. For our experiments, we prepare three types of C_0 for PME on the text samples described in Section 4.1:
            • IT: computed on the Italian sample;
            • EN: computed on the English sample;
            • multi: computed on the English and Italian samples combined.
          </p>
          <p>
            We compute these statistics for all the FFN layers of Velvet following the same procedure carried out by Meng et al. [
            <xref ref-type="bibr" rid="ref10">26</xref>
            ].
          </p>
          <p>This statistic plays a crucial role in Eq. 6, as it allows us to determine the interaction between the new keys and the knowledge stored in that layer. An effective computation of this statistic is necessary to obtain effective edits, and we empirically explore how different estimates of C_0 may affect the edit in a multilingual setting.</p>
        </sec>
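<p>
A minimal sketch of the statistic in Eq. 7: accumulate the uncentered second moment of the vectors entering an FFN layer over a text sample. Here layer_inputs is a random stand-in for hidden vectors collected from the model, not an actual Velvet API, and the sizes are toy assumptions.
</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(3)
d, n_samples, lam = 16, 1000, 1.0   # toy hidden size, sample count, scaling factor

# stand-in for the hidden vectors entering one FFN layer over a document sample
layer_inputs = rng.normal(size=(n_samples, d))

# C0 = lam * E[k k^T]: the uncentered second-moment statistic of Eq. 7
C0 = lam * (layer_inputs.T @ layer_inputs) / n_samples

assert C0.shape == (d, d)
assert np.allclose(C0, C0.T)        # an uncentered covariance is symmetric
```
</preformat>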
      </sec>
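<p>
Sequential batch editing, as used here with batch size 16, reduces to a plain loop in which each step edits one batch of attacks starting from the already-edited model. In this sketch, apply_pme_edit is a hypothetical stand-in for one PME update over a batch.
</p>
<preformat>
```python
def sequential_batch_editing(model, attacks, apply_pme_edit, batch_size=16):
    """Edit several (prompt, PII) attacks in multiple steps: at each step a
    batch of examples is edited, and the next batch starts from the
    already-edited model."""
    for start in range(0, len(attacks), batch_size):
        batch = attacks[start : start + batch_size]
        model = apply_pme_edit(model, batch)   # one sequential editing step
    return model

# toy check: count how many editing steps a dummy "model" goes through
steps = sequential_batch_editing(0, list(range(40)), lambda m, b: m + 1)
assert steps == 3   # 40 attacks with batch size 16 take 3 steps
```
</preformat>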
      <sec id="sec-3-6">
        <title>4.3. Evaluation</title>
        <p>Post-Edit Attack Accuracy PME effectively protects
the privacy in Velvet if the parameter edit reduces the
number of successful TDE attacks against the model.
Therefore, the effectiveness of our approach is assessed
by measuring the post-edit privacy leakage effects and
comparing them with the ones of the pre-edit model.</p>
        <p>We adopted the same measure used by Ruzzetti et al.
[13], that is, the Attack Accuracy for memorization
attacks. After we edit Velvet for TDE attacks of context
lengths in {50, 100, 200}, we measure the Attack
Accuracy of post-edit models and compare their scores
with the ones of the pre-edit version of Velvet. We feed
the TDE prompts to both the post-edit and pre-edit
versions of Velvet, and then let them generate 100 tokens:
if the generated text for an attack contains the expected
PII, then the attack is considered successful.</p>
        <p>Post-Edit Multilingual Generation Capabilities An
important aspect of model editing methods is that they
are designed to modify specific knowledge of LLMs, while
preserving the non-related generative capabilities of the
model. For this reason, we need to determine whether
the editing had a negative impact on the multilingual
generative capabilities of our LLM, thus affecting its skills
in non-related tasks.</p>
        <p>We adopt an automatic evaluation strategy similar to
the one used by Venditti et al. [12] to measure the
reliability of our post-edit models. We compare the generation
capabilities of the post-edit and pre-edit versions of
Velvet by measuring the similarity of generated texts on a
sample of prompts in terms of BLEU [34] and METEOR
[35] scores. For comparison, we consider the subsequent
50 tokens generated by each model after receiving in
input the first 100 tokens of each prompt of our sample.</p>
        <p>We perform the evaluation on a sample of 500 prompts
for the English and Italian languages, which is defined
as follows:</p>
        <p>• English sample: 100 prompts from the Books3,
Wikipedia-en, and Pile-CC subsets of the Pile,
respectively;
• Italian sample: 100 prompts from Clean-C4 and
Wikipedia-it, respectively.</p>
        <p>The composition of this sample allows us to have an
indication of the impact of PME editing on the post-edit
language capabilities of Velvet.</p>
        <p>We also extend the utility evaluation by measuring the
post-edit accuracy of Velvet on LAMBADA [36], one of
the tasks included in the EleutherAI Language Model
Evaluation Harness [37]. LAMBADA is used to measure the
accuracy of a model in generating the missing target word
from a passage given in input. For the evaluation, we
focus on the full test split of the dataset to measure the
reliability of the edit. Since we are interested in evaluating
the preservation of the post-edit multilingual capabilities
of the model, we use both the English and Italian versions
of the dataset.</p>
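        <p>The Attack Accuracy measure described above can be sketched as follows; model_generate is a hypothetical stand-in for prompting the model and decoding 100 tokens, not the authors' actual interface:</p>
```python
# Hedged sketch of Attack Accuracy: the fraction of TDE attacks for which
# the model's continuation contains the expected PII.

def attack_accuracy(model_generate, attacks):
    """attacks: list of (prompt, expected_pii); model_generate: prompt -> text."""
    successes = sum(
        1 for prompt, pii in attacks
        if pii in model_generate(prompt)  # substring check on the generation
    )
    return successes / len(attacks)

# Toy model that "memorized" one address.
memorized = {"Contact me at": "alice@example.com"}
fake_generate = lambda p: memorized.get(p, "no leak here")
attacks = [("Contact me at", "alice@example.com"),
           ("Write to", "bob@example.com")]
print(attack_accuracy(fake_generate, attacks))  # 0.5
```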
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>5.1. Editing reduces Privacy Risks</title>
        <sec id="sec-4-1-1">
          <p>As we observed during the extraction and filtering phase
of TDE attacks (see Sec. 4.1), Velvet memorized some PII
contained in the pre-training data. For different context
lengths in {50, 100, 200}, we obtained 83, 380, and
34 leaked email addresses, respectively, with the same
number of memorized prompts. Surprisingly, contexts of
200 tokens yielded fewer leaked PII than shorter prompts.
In this phase, we observe that a slightly different prompt
composition might affect the results: so in pre- and post-edit
we adopt the same batch size and batch composition,
to ensure the reproducibility of the results.</p>
          <p>The results reported in Table 1 show that PME is
effective in reducing the risks of privacy leakage. The
post-edit versions of Velvet for contexts 50 and 100 are
more robust than the pre-edit model, leaking fewer than
9 and 16 PII with respect to the 75 and 341 leaked by the
pre-edit Velvet. The effect is similar for all the versions
of C₀ used by PME for editing, with minimal differences
among them: in fact, the difference is of 4 more leaked
PII at best for context 100.</p>
          <p>The number of leaked email addresses is reduced even
for context-200 attacks, where post-edit Velvet leaked
17 PII instead of the 31 of the pre-edit model. However, the
reduction here is lower compared with the other attacks,
probably due to the lower number of PII extracted during
the data processing phase.</p>
          <p>Note that the results also show that the model tends to
generate a large number of email addresses in general,
which are different from the correct ones. These different
email addresses could be the model's hallucinations, or
email addresses that follow the original one in the pre-training
corpus. However, results in terms of successfully
leaked PII suggest that PME is still sufficiently effective
in preserving privacy on edited prompts.</p>
          <p>Finally, we observe that the different statistics
computed as an approximation of C₀ do not greatly affect the
post-edit attack accuracy, with a rather similar number
of leaked PII in each configuration.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Generation Capabilities are Preserved</title>
        <p>The results reported in Table 2 show that BLEU and
METEOR scores are high in general for all the different
versions of C₀ and attacks used for editing, and the same
observation holds for both English and Italian generation
capabilities. The overall high scores suggest that
the generations of post-edit models are quite similar to
the generated texts of the pre-edit model. This aspect, as
discussed in [12], suggests that the edit is robust, because
it does not interfere with multilingual capabilities in either
the English or the Italian language.</p>
        <p>Interestingly, the scores show that there is no real
consensus on the type of statistics that is the best for the
English language, since the highest scores are shared
between the EN and multi C₀. However, we note that
the IT version of C₀ obtains lower scores than the other
two versions in general, suggesting that the IT statistics
leads to a less effective preservation of Velvet's generation
capabilities for English.</p>
        <p>Observing the evaluation results for Italian, we notice
that the IT version of C₀ achieves higher BLEU and
METEOR scores, suggesting that this version is necessary to
preserve the generation capabilities of Velvet for Italian.
Also, we note that the EN version of C₀ tends to achieve
lower scores with respect to the other types, indicating
that this C₀ is less effective for preserving the abilities
for Italian.</p>
        <p>In general, the observed results indicate that using
versions of C₀ computed on a language different from the
target one is less effective for preserving the generative
capabilities of the target language in post-edit. In fact, the
IT version of C₀ obtained lower scores for the English
language, and the EN version of C₀ was less effective for
the Italian language. Thus, these experiments suggest
that C₀ should be computed on samples containing texts
in the target languages.</p>
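        <p>As a toy illustration of this similarity check between pre-edit and post-edit continuations, a simplified unigram-precision proxy is shown below; the actual evaluation uses BLEU [34] and METEOR [35] on 50-token continuations:</p>
```python
# Minimal sketch of the utility check: compare the pre-edit and post-edit
# continuations of the same prompt. A simple unigram-precision proxy is
# used here for self-containment; it is not BLEU or METEOR.

def unigram_precision(reference, hypothesis):
    """Fraction of hypothesis tokens that also appear in the reference."""
    ref_tokens = hypothesis_tokens = None
    ref_tokens = reference.split()
    hypothesis_tokens = hypothesis.split()
    if not hypothesis_tokens:
        return 0.0
    matches = sum(1 for t in hypothesis_tokens if t in ref_tokens)
    return matches / len(hypothesis_tokens)

pre_edit = "the cat sat on the mat"    # continuation by the pre-edit model
post_edit = "the cat sat on a mat"     # continuation by the post-edit model
score = unigram_precision(pre_edit, post_edit)
print(round(score, 3))  # 5 of 6 tokens overlap, i.e. 0.833
```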
        <p>About task performance, the results of the LAMBADA
benchmark reported in Table 2 corroborate the utility
preservation already observed with the previous evaluation
analysis. The accuracy scores of post-edit models are
comparable with the pre-edit ones, suggesting that the
edits performed by PME do not considerably affect the
capabilities of the model. The same observation holds for
both the English and Italian versions of LAMBADA.
Differently from the previous analysis, there are no noticeable
losses in terms of performance with respect to the
version of C₀ used for the editing, except for the Italian
score of context-100 editing with EN C₀, which is lower
than the pre-edit score (42.1 vs. 45.2). Hence, this result
indicates that the edits performed by PME are reliable in
general, allowing privacy protection of Velvet for Italian data
without loss of task performance.</p>
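        <p>The LAMBADA-style accuracy referenced above can be illustrated with the following sketch; the toy predictor is hypothetical and stands in for the model's completion of the missing final word:</p>
```python
# Sketch of LAMBADA-style scoring (our illustration, not the Evaluation
# Harness code): the model must produce the missing final word of a passage.

def lambada_accuracy(predict_last_word, examples):
    """examples: list of (passage_without_last_word, target_word)."""
    correct = sum(1 for ctx, target in examples if predict_last_word(ctx) == target)
    return correct / len(examples)

examples = [("She opened the door and saw her", "dog"),
            ("He poured coffee into his", "cup")]
toy_model = lambda ctx: "dog" if "door" in ctx else "mug"  # hypothetical predictor
print(lambada_accuracy(toy_model, examples))  # 0.5
```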
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions and Future Work</title>
      <sec id="sec-5-1">
        <p>In this work, we show an application of model editing for protecting the privacy of Italian data on Velvet-2B, a multilingual model trained on both Italian and English data.</p>
        <p>Our method is based on a recent model editing
technique named Private Memorization Editing (PME), which
prevents LLMs from generating memorized PII that might be
included in the training data. The results of our experiments
on privacy protection for email addresses show that
model editing is effective in reducing the privacy risks
of Velvet, thus reducing the success of Training Data
Extraction (TDE) attacks: harmful prompts obtained from
the training data that are effective for extracting private
information from the original model. In addition, we
show that our approach mitigates the privacy risks while
preserving the model's multilingual generation
capabilities.</p>
        <p>In conclusion, our approach shows that we can adapt
and apply model editing techniques for privacy
protection in multilingual LLMs for Italian data.</p>
        <p>For future work, several aspects could further improve
this work. First, our approach
should be extended to types of PII other than
email addresses, and further investigation is necessary to
understand the effects of the approach with different PII.</p>
        <p>Another aspect to consider is how well PME scales with
larger models such as Velvet-14B: this model requires
additional investigation, because it manages languages
other than English and Italian, and the magnitude of the
data used for its training is larger than the one used
for Velvet-2B. Finally, the evaluation of Velvet's
post-edit capabilities should be extended to other tasks
of the Language Model Evaluation Harness [37] or other
benchmarks, and include human evaluation to have a
better perspective on the overall quality of post-edit models
instead of relying exclusively on automatic metrics.</p>
        <p>[6] M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, K. Lee, Scalable extraction of training data from (production) language models, arXiv preprint arXiv:2311.17035 (2023).</p>
        <p>[7] T. Nguyen, C. V. Nguyen, V. D. Lai, H. Man, N. T. Ngo, F. Dernoncourt, R. A. Rossi, T. H. Nguyen, CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 4226–4237. URL: https://aclanthology.org/2024.lrec-main.377.</p>
        <p>[8] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, C. Leahy, The Pile: An 800GB dataset of diverse text for language modeling, 2020. arXiv:2101.00027.</p>
        <p>[9] Y. Yao, X. Xu, Y. Liu, Large language model unlearning, 2024. URL: https://arxiv.org/abs/2310.10683. arXiv:2310.10683.</p>
        <p>[10] A. Kassem, O. Mahmoud, S. Saad, Preserving privacy through dememorization: An unlearning technique for mitigating memorization risks in language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 4360–4379. URL: https://aclanthology.org/2023.emnlp-main.265. doi:10.18653/v1/2023.emnlp-main.265.</p>
        <p>[11] X. Wu, J. Li, M. Xu, W. Dong, S. Wu, C. Bian, D. Xiong, DEPN: Detecting and editing privacy neurons in pretrained language models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 2875–2886. URL: https://aclanthology.org/2023.emnlp-main.174. doi:10.18653/v1/2023.emnlp-main.174.</p>
        <p>[12] D. Venditti, E. S. Ruzzetti, G. A. Xompero, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Enhancing data privacy in large language models through private association editing, 2024. URL: https://arxiv.org/abs/2406.18221. arXiv:2406.18221.</p>
        <p>[13] E. S. Ruzzetti, G. A. Xompero, D. Venditti, F. M. Zanzotto, Private memorization editing: Turning memorization into a defense to strengthen data privacy in large language models, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 16572–16592. URL: https://aclanthology.org/2025.acl-long.810/.</p>
        <p>[14] S. Biderman, U. Prashanth, L. Sutawika, H. Schoelkopf, Q. Anthony, S. Purohit, E. Raff, Emergent and predictable memorization in large language models, Advances in Neural Information Processing Systems 36 (2023) 28072–28090.</p>
        <p>[15] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827/. doi:10.18653/v1/2024.findings-acl.827.</p>
        <p>[16] H. Kiyomaru, I. Sugiura, D. Kawahara, S. Kurohashi, A comprehensive analysis of memorization in large language models, in: S. Mahamood, N. L. Minh, D. Ippolito (Eds.), Proceedings of the 17th International Natural Language Generation Conference, Association for Computational Linguistics, Tokyo, Japan, 2024, pp. 584–596. URL: https://aclanthology.org/2024.inlg-main.45/.</p>
        <p>[17] B. Yan, K. Li, M. Xu, Y. Dong, Y. Zhang, Z. Ren, X. Cheng, On protecting the data privacy of large language models (LLMs): A survey, arXiv preprint arXiv:2403.05156 (2024).</p>
        <p>[18] A. Verma, S. Krishna, S. Gehrmann, M. Seshadri, A. Pradhan, T. Ault, L. Barrett, D. Rabinowitz, J. Doucette, N. Phan, Operationalizing a threat model for red-teaming large language models (LLMs), arXiv preprint arXiv:2407.14937 (2024).</p>
        <p>[19] F. Perez, I. Ribeiro, Ignore previous prompt: Attack techniques for language models, in: NeurIPS ML Safety Workshop, 2022.</p>
        <p>[20] X. Shen, Z. Chen, M. Backes, Y. Shen, Y. Zhang, "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models, in: Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, 2024, pp. 1671–1685.</p>
        <p>[21] M. Geva, R. Schuster, J. Berant, O. Levy, Transformer feed-forward layers are key-value memories, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 5484–5495. URL: https://aclanthology.org/2021.emnlp-main.446. doi:10.18653/v1/2021.emnlp-main.446.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used Grammarly in order to: Grammar and
spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Conia</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Barba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Orlandini</surname>
            , G. Fiameni,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Navigli</surname>
          </string-name>
          , Minerva LLMs:
          <article-title>The first family of large language models trained from scratch on Italian data</article-title>
          , in: F.
          <string-name>
            <surname>Dell'Orletta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Montemagni</surname>
          </string-name>
          , R. Sprugnoli (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics (CLiCit</source>
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .clicit-
          <volume>1</volume>
          .77/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Ruzzetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bratières</surname>
          </string-name>
          , E. Rodolà,
          <article-title>Preserving privacy in large language models: A survey on current threats and solutions</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2025</year>
          ). URL: https://openreview.net/forum? id=
          <fpage>Ss9MTTN7OL</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramer</surname>
          </string-name>
          , E. Wallace,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jagielski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          , D. Song,
          <string-name>
            <given-names>U.</given-names>
            <surname>Erlingsson</surname>
          </string-name>
          , et al.,
          <article-title>Extracting training data from large language models</article-title>
          ,
          <source>in: 30th USENIX Security Symposium (USENIX Security 21)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2633</fpage>
          -
          <lpage>2650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jagielski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <source>Quantifying memorization across neural language models</source>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2202</volume>
          .
          <fpage>07646</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] J. Huang, H. Shao, K. C.-C. Chang, Are large pre-trained language models leaking your personal information?, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 2038–2047. URL: https://aclanthology.org/2022.findings-emnlp.148. doi:10.18653/v1/2022.findings-emnlp.148.
          [32] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A primer on the inner workings of transformer-based language models, 2024. URL: https://arxiv.org/abs/2405.00208. arXiv:2405.00208.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22] M. Geva, A. Caciularu, K. Wang, Y. Goldberg, Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 30–45. URL: https://aclanthology.org/2022.emnlp-main.3. doi:10.18653/v1/2022.emnlp-main.3.
          [33] H. N. Thuat Nguyen, T. Nguyen, CulturaY: A large cleaned multilingual dataset of 75 languages, 2024.
          [34] T. Glushkova, C. Zerva, A. F. T. Martins, BLEU meets COMET: Combining lexical and neural metrics towards robust machine translation evaluation, in: M. Nurminen, J. Brenner, M. Koponen, S. Latomaa, M. Mikhailov, F. Schierl, T. Ranasinghe, E. Vanmassenhove, S. A. Vidal, N. Aranberri, M. Nunziatini, C. P. Escartín, M. Forcada, M. Popovic, C. Scarton, H. Moniz (Eds.), Proceedings of the 24th Annual Conference of the European Association for Machine Translation, European Association for Machine Translation, Tampere, Finland, 2023, pp. 47–58. URL: https://aclanthology.org/2023.eamt-1.6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23] S. Sukhbaatar, J. Weston, R. Fergus, et al., End-to-end memory networks, Advances in Neural Information Processing Systems 28 (2015).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, H. Chen, N. Zhang, Editing large language models: Problems, methods, and opportunities, 2023. arXiv:2305.13172.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25] K. Meng, D. Bau, A. Andonian, Y. Belinkov, Locating and editing factual associations in GPT, 2023. arXiv:2202.05262.
          [35] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–72. URL: https://aclanthology.org/W05-0909.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [26] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, D. Bau, Mass-editing memory in a transformer, 2023. arXiv:2210.07229.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohonen</surname>
          </string-name>
          ,
          <article-title>Correlation matrix memories</article-title>
          , IEEE rization, Association for Computational LinguisTransactions on Computers C-
          <volume>21</volume>
          (
          <year>1972</year>
          )
          <fpage>353</fpage>
          -
          <lpage>359</lpage>
          . tics, Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: URL: https://api.semanticscholar.org/CorpusID: https://aclanthology.org/W05-0909.
          <fpage>21483100</fpage>
          . [36]
          <string-name>
            <given-names>D.</given-names>
            <surname>Paperno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kruszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. Q.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>V.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          , Can sensitive in- Pham,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Boleda, formation be deleted from llms? objectives R. Fernández, The LAMBADA dataset: Word predicfor defending against extraction attacks, 2023. tion requiring a broad discourse context</article-title>
          , in: K. Erk, arXiv:
          <fpage>2309</fpage>
          .17410. N. A.
          <string-name>
            <surname>Smith</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 54th An-</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [29]
          <string-name>
            <surname>T.-Y. Chang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Thomason</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Do localiza- nual Meeting of the Association for Computational tion methods actually localize memorized data in Linguistics (Volume 1: Long Papers), Association LLMs? a tale of two benchmarks</article-title>
          , in: K. Duh, for Computational Linguistics, Berlin, Germany,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Proceedings of the 2016</source>
          , pp.
          <fpage>1525</fpage>
          -
          <lpage>1534</lpage>
          . URL: https://aclanthology.org/ 2024 Conference of the North American Chapter P16-1144. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P16</fpage>
          -1144. of the Association for Computational Linguistics: [37]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <article-title>Human Language Technologies (Volume 1: Long A</article-title>
          . DiPofi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Le Noac'h</surname>
          </string-name>
          , Papers), Association for Computational Linguis
          <string-name>
            <surname>- H. Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>McDonell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Muennighof</surname>
          </string-name>
          , C. Ociepa, tics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>3190</fpage>
          -
          <lpage>3211</lpage>
          . J.
          <string-name>
            <surname>Phang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schoelkopf</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Skowron</surname>
          </string-name>
          , URL: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>176</volume>
          /. L.
          <string-name>
            <surname>Sutawika</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Thite</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , doi:10.18653/v1/
          <year>2024</year>
          .
          <article-title>naacl-long.176. A. Zou, A framework for few-shot language model</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghandeharioun</surname>
          </string-name>
          , evaluation,
          <year>2024</year>
          . URL: https://zenodo.org/records/ Does localization inform editing? surprising dif-
          <volume>12608602</volume>
          . doi:
          <volume>10</volume>
          .5281/zenodo.12608602.
          <article-title>ferences in causality-based localization vs. knowledge editing in language models</article-title>
          ,
          <year>2023</year>
          . URL: https: //arxiv.org/abs/2301.04213. arXiv:
          <volume>2301</volume>
          .
          <fpage>04213</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mickus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paperno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <article-title>How to dissect a Muppet: The structure of transformer embedding spaces</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>981</fpage>
          -
          <lpage>996</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .tacl-
          <volume>1</volume>
          .
          <fpage>57</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>