<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Scalena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Fersini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Groningen</institution>
          ,
          <addr-line>CLCG</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milano - Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performance comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape, where models achieve high Italian performance even when not explicitly trained in this language.</p>
      </abstract>
      <kwd-group>
        <kwd>Italian steering</kwd>
        <kwd>Language adaptation</kwd>
        <kwd>Activation steering</kwd>
        <kwd>Instruction Tuning</kwd>
        <kwd>Reasoning benchmarks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The strong rise in capabilities of the latest large language models (LLMs) has brought significant improvements in a wide variety of downstream tasks. These abilities mainly derive from the instruction-tuning procedure (IT), i.e., model fine-tuning on instruction datasets, and enable the models to follow user-prompted instructions.</p>
      <p>Most LLMs, however, are mainly pre-trained and fine-tuned in English, and while other high-resource languages are included in the training data, they are not present to the extent needed to achieve out-of-the-box performances comparable to English. A strategy to address this has been, in the past few years, to fine-tune models with language-specific instructions, such as the Stanford Alpaca dataset [1], which has been automatically translated into multiple languages – the Italian version of it has been used to train the Llama 2-based Camoscio model [2]. A combination of ∼240k training instances from three automatically translated instruction datasets was used to train the latest Llamantino [3], the most recent Llama 3-based instruction-tuned model for Italian.</p>
      <p>This approach has proven effective, but using large amounts of machine-translated text is far from optimal: although the translation is generally good for high-resource languages, the language's unique linguistic and cultural aspects are often not represented by the training data. In addition, one must consider the usual substantial (computational) costs associated with large datasets.</p>
      <p>With recent developments in interpretability research, new approaches are arising to localize and steer different language model aspects. These techniques mainly work with an inference-time injection, allowing for targeted interventions during the generation phase without incurring the high costs associated with any additional training. Such techniques, relying on the assumption that models are already capable of performing specific tasks, aim at enhancing some of the internal activations leading to specific solutions, thereby also increasing overall performance. They have proved successful towards specific tasks, such as model detoxification, but also toward more generalist and wide-ranging tasks [4, 5].</p>
      <p>We explore the potential of steering for instructing a pre-trained LLM in Italian as an alternative to fine-tuning, adopting a steering technique based on contrastive examples. We observe that this approach, with much less data (≪ 100 instances instead of 240k) and no additional training required, enables performances comparable to standard fine-tuning approaches and yields high-quality Italian generations.</p>
      <p>The latest LLMs are pre-trained on data which often includes not only English but also (small percentages of) other languages [6, 7]. After the initial pre-training phase, models are further trained to follow instructions given by users. Due to the nature of most instruction-tuning data, performance in and on English is still overwhelmingly dominant.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
<p>Italian adaptation Over time, the most widely adopted solution to improve model performance on the Italian language has been to perform further Instruction-Tuning with Italian data (IT-ITA) on existing models. Examples of this type are Camoscio [2] and Llamantino 2 [3] (both based on the Llama 2 model family), and ANITA [9] (based on Llama 3 models). Generally, instruction fine-tuning is performed on the original model, already in its instructed version, using additional data which is machine-translated from instructions originally in English. Taking ANITA as an example, this goes as follows: starting from the instructed Llama 3, fine-tuning is performed with ∼100k instruction prompts in English and, after an additional optimization step with ∼40k examples, another 100k prompts machine-translated into Italian are used for the language adaptation task. This large amount of data, combined with the size of the models, naturally leads to large computational costs.</p>
      <p>Steering vectors Following the linear representation hypothesis, high-level concepts are represented as directions in the activation space of LLMs [10]. A single direction can be found through the use of examples designed to elicit opposite behaviors in the model's output [5, 4, 11], or by using the difference between models fine-tuned for specific tasks and their original versions [12]. The effectiveness of these techniques lies in isolating specific properties, such as the language or the style used, to emphasize them during inference. In this work, we test the potential of steering vectors to improve performance on several NLP tasks by facilitating the generation of Italian, a language for which the models were not originally explicitly trained.</p>
      <sec id="sec-2-1">
        <title>Steering vector extraction At every generation step</title>
        <p>
          = 1, . . . ,  a LLM  generates a sequence of tokens
Steering vectors Following the linear representation based on the prompt version and previously generated
hypothesis, high-level concepts are represented as di- tokens 1, . . . , − 1. We collect the activations of the last
rections in the activation space of LLMs [
          <xref ref-type="bibr" rid="ref7">10</xref>
          ]. A single token from each attention head output ( ,ℎ ∈ Rhead )1
direction can be found through the use of examples de- and average them over a series of  = 30 prompts.
signed to elicit opposite behaviors in output to the model
[5, 4, 11] or by using the diference between fine-tuned 
Tmhoedeefelsctfiovrenspesescioficf ttahseksse atencdhtnhiqeiureosrliigeisnianlivsoerlastioinng[sp1e2-]. version = 1 ∑=︁1  ,ℎ(version, &lt;) (1)
cific properties, such as the language or the style used,
to emphasize it during inference. In this work, we test
the potential of steering vectors to improve performance
on several NLP tasks by facilitating the process of
generating the Italian language for which the models were not
originally explicitly trained.
        </p>
        <p>where version ∈ R||×| |× head . The prompts version
are supposed to push the model towards the desired
behavior using a 5-shot setting and an instruction explicitly
asking the model to respond in a specific language (either
Italian for ITA and ITA-full or English for ENG; further
details are in Appendix A).</p>
        <p>To obtain the final steering vector towards the ITA or
ITA-full behavior we compute the diference between the
previously calculated activations as follows:</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
<p>We build on the assumption that during the training process the model already sees a small amount of the target language (Italian in our case). However, as anticipated, reasoning behavior is mainly developed through the use of the English language, especially during instruction tuning. We aim to push the internal components promoting the language switch, so as to achieve better results in a language different from English.</p>
<p>Steering through contrastive prompts The first step to extract the Italian steering vector is to build contrastive prompts that will highlight the differences between the activations when prompting the model with different languages [4, 5]. To this end, we use the Stanford Alpaca dataset [1], consisting of question-answering style prompts, both in its original English and in its machine-translated Italian version (Appendix A shows some random example instances). We edit the original Alpaca dataset and obtain three different versions:</p>
      <p>• ENG: the original dataset; both question and answer are in English;</p>
      <p>• ITA-full: the machine-translated Alpaca dataset; both question and answer are in Italian;</p>
      <p>• ITA: questions in English, answers in Italian. The aim is to emphasize the language switch task, pushing the model to respond in Italian even to an English prompt.</p>
      <p>By using contrastive examples between the original English and the Italian responses, we extract the difference in activations between the models prompted in different languages.</p>
      <p>Steering vector extraction At every generation step $t = 1, \dots, T$ an LLM $M$ generates a sequence of tokens based on the prompt $\text{version}_i$ and the previously generated tokens $t_1, \dots, t_{t-1}$. We collect the activations of the last token from each attention head output ($a_{\ell,h} \in \mathbb{R}^{d_{\text{head}}}$) and average them over a series of $N = 30$ prompts:</p>
      <p>$\bar{A}^{\text{version}}_{\ell,h} = \frac{1}{N} \sum_{i=1}^{N} a_{\ell,h}(\text{version}_i, t_{&lt;t}) \qquad (1)$</p>
      <p>where $\bar{A}^{\text{version}} \in \mathbb{R}^{|L| \times |H| \times d_{\text{head}}}$. (The extraction is made on every layer $\ell \in L$ and for each attention head $h \in H$, where $|L|$ and $|H|$ are the total numbers of layers and attention heads in the LLM, respectively.) The prompts $\text{version}_i$ are supposed to push the model towards the desired behavior, using a 5-shot setting and an instruction explicitly asking the model to respond in a specific language (either Italian for ITA and ITA-full, or English for ENG; further details are in Appendix A).</p>
      <p>To obtain the final steering vector towards the ITA or ITA-full behavior, we compute the difference between the previously calculated activations as follows:</p>
      <p>$\Delta A^{\text{ITA-full}} = \bar{A}^{\text{ITA-full}} - \bar{A}^{\text{ENG}}, \qquad \Delta A^{\text{ITA}} = \bar{A}^{\text{ITA}} - \bar{A}^{\text{ENG}}$</p>
      <p>This allows us to get the language direction coming from the difference in polarity between the activations, eventually steering the original LLM towards Italian.</p>
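      <p>As an illustration of the extraction step, the following sketch collects last-token attention-head activations with forward pre-hooks on each head output projection and averages them over the contrastive prompt sets as in Equation (1). This is our minimal reconstruction using PyTorch and HuggingFace transformers, not the authors' released code; for brevity it assumes a Llama-style architecture and takes a single forward pass per prompt.</p>
      <preformat>
# Minimal sketch of the contrastive extraction (Eq. 1); illustrative names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def last_token_head_acts(prompt):
    """Return a (n_layers, n_heads, d_head) tensor of last-token head outputs."""
    acts = []

    def hook(_module, args):
        # o_proj input holds the concatenated head outputs: (batch, seq, n_heads*d_head)
        acts.append(args[0][0, -1].detach())

    handles = [layer.self_attn.o_proj.register_forward_pre_hook(hook)
               for layer in model.model.layers]
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in handles:
        h.remove()
    n_heads = model.config.num_attention_heads
    return torch.stack(acts).view(len(acts), n_heads, -1)

def mean_acts(prompts):
    # Average over the N contrastive prompts (N = 30 in the paper)
    return torch.stack([last_token_head_acts(p) for p in prompts]).mean(0)

# ita_prompts / eng_prompts: 5-shot Alpaca prompts of the two versions
# delta = mean_acts(ita_prompts) - mean_acts(eng_prompts)  # steering vector
      </preformat>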
      <p>(2)
where  regulates the steering intensity, starting with
valmax and linearly diminishing to 0 for each -th
generated token:
  = valmax ·
︂(
1
−</p>
      <p>− 1 )︂
 − 1
where  indicates the maximum number of tokens to
be generated.</p>
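      <p>A matching sketch of the injection step: at every generated token the steering vector is added to the same per-head activations, scaled by the linearly decaying coefficient of Equations (2)-(3). Again, this is a minimal illustration under the same assumptions as above (Llama-style hooks, no KV cache), not the authors' implementation.</p>
      <preformat>
# Minimal sketch of steering injection (Eqs. 2-3); illustrative names.
import torch

def generate_steered(model, tok, prompt, delta, val_max=1.5, max_new_tokens=256):
    T = max_new_tokens
    step = {"t": 1}

    def alpha(t):
        return val_max * (1.0 - (t - 1) / (T - 1))  # Eq. 3: val_max down to 0

    def make_hook(layer_idx):
        def hook(_module, args):
            x = args[0].clone()  # (batch, seq, n_heads*d_head), input of o_proj
            # steer only the position generating the current token
            x[:, -1] += alpha(step["t"]) * delta[layer_idx].reshape(-1).to(x.dtype)
            return (x,)
        return hook

    handles = [layer.self_attn.o_proj.register_forward_pre_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]
    ids = tok(prompt, return_tensors="pt").input_ids
    out = ids
    try:
        for t in range(1, T + 1):
            step["t"] = t
            with torch.no_grad():
                logits = model(out).logits
            nxt = logits[0, -1].argmax()  # greedy decoding, as in the paper
            out = torch.cat([out, nxt.view(1, 1)], dim=-1)
            if nxt.item() == tok.eos_token_id:
                break
    finally:
        for h in handles:
            h.remove()
    return tok.decode(out[0, ids.shape[1]:])
      </preformat>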
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>We select two different models as bases to test the effectiveness of our steering approach. The first is the smallest (8B parameters) model from the Llama 3 family in its Instructed version (meta-llama/Meta-Llama-3-8B-Instruct via HuggingFace). The second is the smallest (3.8B parameters) Phi 3 model (microsoft/Phi-3-mini-4k-instruct via HuggingFace) in its English-instructed version. For a comparison of steering with the more commonly used Instruction Tuning approach, we also re-run on the selected benchmarks the latest model Instruction-Tuned with Italian data (IT-ITA), ANITA [9], which is based on the same Llama 3 model we use.</p>
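      <p>For reference, the two base checkpoints named above can be loaded as follows; the loading code is our own illustration, not from the paper.</p>
      <preformat>
# Base checkpoints used in the paper; loading details are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = [
    "meta-llama/Meta-Llama-3-8B-Instruct",  # Llama 3, 8B, instructed
    "microsoft/Phi-3-mini-4k-instruct",     # Phi 3 mini, 3.8B, instructed
]
models = {c: AutoModelForCausalLM.from_pretrained(c) for c in CHECKPOINTS}
tokenizers = {c: AutoTokenizer.from_pretrained(c) for c in CHECKPOINTS}
      </preformat>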
<p>Since all of these models have some training data in different languages, even if not specifically meant to be multilingual, we also test the original models on the Italian benchmarks to get a baseline in terms of model capabilities and to better capture the differences between the IT-ITA procedure and the different steering techniques. (Another obvious baseline would be a native Italian model, such as the recent Minerva [14], which is pre-trained on Italian+English data. While some instructed versions of Minerva are available on HuggingFace, they are completely undocumented and have unclear ownership, so we cannot get any reliable indicator about their training.)</p>
<sec id="sec-4-1">
        <title>4.1. Selected benchmarks</title>
        <p>We test the models on three different standard benchmarks included in the Italian LLM leaderboard (Open ITA LLM leaderboard via HuggingFace):</p>
        <p>• MMLU [15] is a multitask question-answering benchmark consisting of multiple-choice questions from various expert-level knowledge branches. The usual setup for this benchmark is a 5-shot prompt to help the model during the reasoning task. The test set consists of ∼14k instances with four possible responses each.</p>
        <p>• HellaSwag [16] is a benchmark meant to measure grounded commonsense inference. The model is supposed to indicate the correct continuation after reading the initial prompt, containing procedure steps from ActivityNet and wikiHow. The employed setting is a 0-shot prompt over all the ∼10k test instances.</p>
        <p>• ARC challenge [17] is a collection of over 1k instances of school-level multiple-choice science questions, aimed at measuring the knowledge retrieval capabilities of an LLM. The employed setting is a 0-shot prompt where the model must select the most likely answer to each question.</p>
        <p>We also test the ability of the models to generate full Italian responses (rather than non-Italian ones). To this end, we use a popular language identification tool, lang-detect, and take the probability of the Italian language as the scoring metric; a minimal scoring sketch follows.</p>
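        <p>The sketch below uses the langdetect package's detect_langs API; the helper name is our own.</p>
        <preformat>
# Score a generation by the probability langdetect assigns to Italian ("it").
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make detection deterministic

def italian_prob(text):
    """Probability that `text` is Italian according to langdetect (0.0 if absent)."""
    return next((l.prob for l in detect_langs(text) if l.lang == "it"), 0.0)

print(italian_prob("La capitale della Francia è Parigi."))  # close to 1.0
        </preformat>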
      <sec id="sec-4-1">
        <title>4.2. Steering vs the rest</title>
        <sec id="sec-4-1-1">
          <title>General results</title>
          <p>each benchmark.7 Among the two proposed steering
approaches, ITA generally proves to be more efective
in steering the LLM outputs. Additionally, the steering
approach often surpasses both the original and IT-ITA
models’ performances. The most significant advantage,
however, is the reduced time and computational
resources needed to enhance a model’s performance
in a new language. The Italian Llama 3 ANITA [9]
typ</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.1. Selected benchmarks</title>
        <p>We test the models on three diferent standard
benchmarks included in the Italian LLM leaderboard5:
• MMLU [15] is a multitask question-answering
benchmark consisting of multiple-choice questions from
various expert-level knowledge branches. The usual setup
for this benchmark is a 5-shot prompt to help the model
during the reasoning task. The test set consists of
∼ 14k instances with four possible responses each.</p>
        <p>across most benchmarks with significantly less data —
only 30 demonstrative examples in our case.</p>
        <p>Approaches matter It may be useful to look at how
steering and Instruction Tuning techniques difer in
improving model responses. Figure 1 shows the overlap (or
lack thereof) of correct responses of the four approaches
based on Llama 3-Instruct. The Instruction Tuning
process allows ANITA to learn to answer questions that the
original model was not able to. This likely occurs due to
the fine-tuning process, where the model absorbs new
information from the utilized data, expanding its set of
correct answers. At the same time, however, IT-ITA also
runs into the loss of previous capabilities on some
quesian benchmarks to get a baseline in terms of model ca- ically outperforms its original version but has required
pabilities and better capture the diferences between the
IT-ITA procedure and the diferent steering techniques. 4 ing technique achieves comparable or better performance
ifne-tuning on over 240k examples. In contrast, the
steer2meta-llama/Meta-Llama-3-8B-Instruct via HuggingFace
3microsoft/Phi-3-mini-4k-instruct via HuggingFace
6lang-detect package
4Another obvious baseline would be a native Italian model, such as
7Please note that our results difer from those shown in the Italian
the recent Minerva [14] which is pre-trained on Italian+English
data. While some instructed versions of Minerva are available on
Huggingface, they are completely undocumented and have unclear
ownership, so we cannot get any reliable indicator about its training.
5Open ITA LLM leaderboard via HuggingFace.</p>
        <p>LLM leaderboard since we employ a regex-based approach to
evaluate the responses instead of using the response likelihood of the
model as per [18], which would require four times more runs. This
is further explained in Appendix B.</p>
        <p>8For the sake of clarity, only cardinalities &gt; 25 are shown in writing.</p>
        <sec id="sec-4-2-1">
          <title>Meta Llama 3 8B - Instruct</title>
          <p>Original
+ IT-ITA (ANITA [9])
+ Steering ITA-full
+ Steering ITA</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Microsoft Phi 3 mini 4k - Instruct</title>
          <p>Original
+ Steering ITA-full
+ Steering ITA
54.21
55.01
55.73
55.95
59.65
59.92
60.65</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>Generation quality According to langdetect (last</title>
          <p>column in Table 1), which measures the probability of
a sentence being Italian, the Italian fine-tuned ANITA
has lower consistency over the used benchmarks (0.715).</p>
          <p>Qualitatively, we also observe that with diferent
system prompts, ANITA sometimes generates non-sensical
output or uses languages other than the expected
Italtions, a behavior similar to the so-called catastrophic ian. Some examples can be seen in Table 2, where we
forgetting [19] when learning new information. report some random examples from the ARC challenge</p>
          <p>On the other hand, the steering technique is based benchmark, where the model might still able to solve the
on improving only language capabilities, without task but fails to continue the generation properly. This
the model learning anything new from the data. problem could be traced back to the instability of the
This leads to the theoretical disadvantage of an upper fine-tuning process which can lead to excessive variance
bound whereby it is dificult to improve the model’s per- in results depending on the used data or diferent
hyperformance. Experimentally, however, steering gives mod- parameters employed during the training process [20].
els better language/reasoning-specific capabilities, which The steering approach, instead, appears to provide a
prestill allow a slight increase in performance, without neces- cise direction toward the expected language, generally
sarily forgetting much of the information and/or knowl- achieving better results in terms of language consistency.
edge stored in the original model. To further get an intuition of the ability to generate free
Italian text of the diferent models, we qualitatively test
their outputs on a series of random prompts and report
these generations in Table 7 for the Llama 3 models and
in Table 8 for the Phi 3 model.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. On SOTA models performance improvements</title>
        <p>The gap in performance that we have observed between
the original model and the steered/instruction-tuned
verMeta Llama 2 7B - Instruct
Original
+ IT-ITA (LLaMAntino 2 [3])
+ Steering ITA-full
+ Steering ITA
32.84
34.98
41.06
38.24
sion is present in some benchmarks although not as
substantial. One obvious observation is that the original
already has substantial abilities in Italian, in spite of not
having been specifically instructed for that. Llama 3
Instruct was trained on more than 15T tokens which,
together with several other techniques, must allow it to
achieve impressive performance even on diferent
languages. In order to possibly see a bigger impact of
steering and fine-tuning over their respective original model,
we replicate our experiments on the previous version of
the same model (Llama 2 - Instruct)9, looking only at the
ARC challenge results. We also use the IT-ITA version of
Llama 2-Instruct10 from [3] for comparison.</p>
        <p>From Table 3 we can see that the increase in
performance over the original model is more substantial than
what observed for Llama 3. This is especially true for
the steering techniques, which increase the performance
of Llama 2 by ∼ 20% and ∼ 25% (for ITA and ITA-full,
respectively), yielding a larger improvement than what
achieved by the fine-tuned model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Take home message and outlook</title>
<p>To instruct a pre-trained LLM in a specific language, steering is computationally much less expensive than fine-tuning with hundreds of thousands of (automatically translated) examples. We observe that for Italian this strategy achieves comparable or better performance on existing benchmarks than fine-tuning; generations are also fluent and comparable to those of fine-tuned models. The advantage of fine-tuning is that new data, and thus new knowledge, is injected into the model via training on new examples. At the same time, this might also trigger so-called catastrophic forgetting, yielding degradation in the output.</p>
<p>We suggest that in the context of creating a new language-specific instructed LLM, this advantage makes sense only insofar as culturally relevant and native data is used in the fine-tuning phase, so that the model can truly be enriched with language-specific knowledge, both grammatically and pragmatically. If translated data must be used, then it is far more effective to use steering, which requires much fewer examples (less than 0.5%) and a simple inference-time injection, making this an accessible method for virtually any language. Using native examples for the steering procedure, and possibly style-specific examples, might also yield interesting results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
<p>The work of Daniel Scalena and Elisabetta Fersini has been partially funded by MUR under the grant ReGAInS, Dipartimenti di Eccellenza 2023-2027 of the Department of Informatics, Systems and Communication at the University of Milano-Bicocca.</p>
      <p>Daniel Scalena is also partially supported by the
graduate school of the Faculty of Arts of the University of
Groningen.</p>
<p>The work of Elisabetta Fersini has also been partially funded by the European Union – NextGenerationEU under the National Research Centre for HPC, Big Data and Quantum Computing - Spoke 9 - Digital Society and Smart Cities (PNRR-MUR).</p>
      <p>We also thank the Center for Information Technology
of the University of Groningen for providing access to
the Hábrók high-performance computing cluster.</p>
    </sec>
    <sec id="sec-7">
<title>B. Evaluation technique</title>
      <p>The most widely used approach for model comparison in the above leaderboards is to evaluate the likelihood of a given response by appending each response to the prompt [18]. This technique is employed in the lm-eval toolkit (lm-evaluation-harness via GitHub); in our setting, however, it would require four times more runs. We therefore employed a standard regex to evaluate the generation from the model:</p>
      <preformat>
r"(Risposta:|risposta e')\s*\(?([ABCD])\)?"
r"(:|e')\s*\(?([ABCD])\)?\b"
      </preformat>
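      <p>A minimal harness around these two patterns (only the regexes themselves are from the paper; the wrapper function is our own illustration):</p>
      <preformat>
# Extract the chosen option (A-D) from a model generation using the paper's
# two regexes, tried in order. The wrapper function is illustrative.
import re

PATTERNS = [
    re.compile(r"(Risposta:|risposta e')\s*\(?([ABCD])\)?"),
    re.compile(r"(:|e')\s*\(?([ABCD])\)?\b"),
]

def extract_choice(generation):
    """Return the first matched A-D option, or None if nothing matches."""
    for pattern in PATTERNS:
        m = pattern.search(generation)
        if m:
            return m.group(2)
    return None

assert extract_choice("La risposta corretta: (B)") == "B"
      </preformat>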
    </sec>
    <sec id="sec-8">
<title>A. Prompts and instructions</title>
<p>When extracting the behavior from the models, we employ different versions of Alpaca. Examples of the three versions listed above (ENG, ITA-full and ITA) can be observed in Table 4. As highlighted in Section 5, it is important to use datasets that are original in the target language or, alternatively, carefully translated and reviewed by expert subjects. Looking at the examples in Table 4, in some cases the translation does not carry with it the cultural and diverse aspects of the new language, effectively degrading the actual performance of the model when the dataset is employed for instruction fine-tuning. This aspect, on the other hand, is partially negligible when steering techniques are applied, whose sole purpose is to identify which internal activations contribute to the generation of a language and push them accordingly.</p>
<p>Each of the Alpaca prompts used for the contrastive approach is also paired with the system instruction Answer the following questions. The same instruction is translated into Italian (Rispondi alle seguenti domande) when using the ITA-full and ITA versions of the dataset.</p>
<p>We also list in Table 6 the instructions used as system prompts for each proposed benchmark. Each prompt follows the standard chat template on which the already-instructed model is trained; a sketch of this construction follows. Some examples from the different benchmarks are shown in Table 5.</p>
<p>Moreover, we do not use any sampling parameters, relying on greedy decoding and stopping generation when the EOS token is produced.</p>
      <sec id="sec-8-1">
        <title>Question &amp; Answer</title>
        <p>Alpaca (ENG version)
Alpaca (ITAfull version)</p>
        <p>The capital of France is Paris.</p>
        <p>The boiling point of water is 100 degrees Celsius or 212
degrees Fahrenheit at standard atmospheric pressure.</p>
        <p>I am always assisting users.
1. Ohio 2. Indiana 3. Michigan 4. Illinois 5. Wisconsin
To find the LCM of 3, 4=22ˆ, 6=2*3, and 15=3*5, take the
highest power of each prime that appears and multiply:
2**2*3*5=60.</p>
        <p>Qual è la capitale della Francia?
Guarda il punto di ebollizione dell’acqua.</p>
        <p>Scrivi una frase di cinque parole che descriva la tua
giornata.</p>
        <p>Compila una lista di 5 stati americani situati nel Midwest.</p>
        <p>Qual è il minimo comune multiplo di 3, 4, 6 e 15?</p>
        <p>La capitale della Francia è Parigi.</p>
        <p>Il punto di ebollizione dell’acqua è di 100 gradi Celsius o
212 gradi Fahrenheit a pressione atmosferica standard.</p>
        <p>Sono sempre qui per aiutare gli utenti.
1. Ohio 2. Indiana 3. Michigan 4. Illinois 5. Wisconsin
Per trovare l’LCM di 3, 4=22ˆ, 6=2*3, e 15=3*5, prendi
la potenza più alta di ogni numero primo che appare e
moltiplica: 2**2*3*5=60.</p>
        <p>What is the capital of France?
Look up the boiling point of water.</p>
        <p>Compose a five word sentence describing your day.</p>
        <p>Compile a list of 5 US states located in the Mid West.</p>
        <p>What is the least common multiple of 3, 4, 6 and 15?</p>
        <p>Alpaca (ITA version)</p>
        <p>La capitale della Francia è Parigi.</p>
        <p>Il punto di ebollizione dell’acqua è di 100 gradi Celsius o
212 gradi Fahrenheit a pressione atmosferica standard.</p>
        <p>Sono sempre qui per aiutare gli utenti.
1. Ohio 2. Indiana 3. Michigan 4. Illinois 5. Wisconsin
Per trovare l’LCM di 3, 4=22ˆ, 6=2*3, e 15=3*5, prendi
la potenza più alta di ogni numero primo che appare e
moltiplica: 2**2*3*5=60.
Per quale di questi due scenari il protagonista (che usa
Io/me/mio) fa qualcosa di moralmente sbagliato, secondo
gli standard morali americani del 2020? Scenario 1 | Ho
afittato un ragazzo elegante e sexy per il pomeriggio.</p>
        <p>Scenario 2 | Ho trasferito i rifugiati nelle case.</p>
        <p>(A) Sbagliato, Sbagliato
(B) Sbagliato, Non sbagliato
(C) Non sbagliato, Sbagliato
(D) Non sbagliato, Non sbagliato
Una donna in tailleur rosso sta parlando dietro una
scrivania. un uomo
Anna tiene un cubetto di ghiaccio. Perché si scioglie il
cubetto di ghiaccio nella sua mano?</p>
        <p>HellaSwag (it)
ARC challenge (it)
(A) prende le palle da un attaccapanni sul muro.
(B) sta passando l’aspirapolvere su una sedia di plastica
nera.
(C) spruzza acqua in un piccolo lavandino bianco.
(D) sta in piedi tenendo il piede dietro la schiena.
(A) Il calore si sposta dalla sua mano al cubetto di
ghiaccio.
(B) Il freddo si sposta dalla sua mano al cubetto di
ghiaccio.
(C) Il calore si sposta dal cubetto di ghiaccio alla sua
mano.
(D) Il freddo si sposta dal cubetto di ghiaccio alla sua
mano.
MMLU
HellaSwag
ARC</p>
        <p>Instruction
Scegli la continuazione
più adatta tra A, B, C e
D
Scegli la risposta corretta
tra A, B, C e D
-shot
5
0
0
La risposta è no, non necessariamente. La costruzione di una corsia in più in una strada non è suficiente per ridurre il
trafico in modo significativo. Ecco alcune ragioni per cui:
1. La corsia in più non è necessariamente una misura per ridurre il trafico. La costruzione di una corsia in più può
attirare più trafico nella zona (domanda indotta), poiché gli automobilisti coglieranno l’opportunità di poter usare la
nuova corsia [...]
La risposta è no, non necessariamente riduco il trafico.</p>
        <p>Costruire una corsia in più in una strada può avere diversi efetti sul trafico, a seconda del contesto e delle condizioni.</p>
        <p>Ecco alcuni possibili efetti:
1. **Riduzione del trafico**: se la corsia in più è destinata ai mezzi pubblici o ai pedoni e ciclisti, può ridurre il trafico
generale sulla strada, [...]
Rispondi alle domande dell’utente
Riassumi il seguente testo: "Il Monte Bianco (Mont Blanc in francese e in arpitano), con un’altitudine di 4805,59 m
s.l.m., è la montagna più alta delle Alpi, d’Italia, di Francia e anche d’Europa, secondo una difusa convenzione
geografica, da cui i soprannomi di tetto d’Europa e di Re delle Alpi, condividendo assieme al monte Elbrus nel
Caucaso un posto tra le cosiddette Sette Vette (Seven Summits) del pianeta."
Il testo descrive il Monte Bianco, la montagna più alta delle Alpi, d’Italia, di Francia e d’Europa, con un’altitudine di
4805,59 metri. È noto anche come "tetto d’Europa" e "Re delle Alpi". Inoltre, è considerato una delle Sette Vette del
pianeta, insieme al monte Elbrus nel Caucaso.</p>
        <p>Il testo descrive il Monte Bianco, la montagna più alta dÉuropa, delle Alpi, d’Italia e di Francia, con un’altitudine di
4805,59 metri, noto anche come "tetto d’Europa" e "Re delle Alpi".</p>
        <p>Il testo descrive il Monte Bianco, la montagna più alta delle Alpi, d’Italia, di Francia e d’Europa, con un’altitudine
di 4805,59 m s.l.m. È noto anche come "Tetto d’Europa" e "Re delle Alpi" e fa parte delle cosiddette "Sette Vette" del
pianeta, insieme al Monte Elbrus nel Caucaso.</p>
        <p>Il testo descrive il Monte Bianco, la montagna più alta delle Alpi, d’Italia, di Francia e d’Europa, con un’altitudine
di 4805,59 metri. Inoltre, menziona i soprannomi della montagna, ovvero "Tetto d’Europa" e "Re delle Alpi", e la sua
inclusione tra le "Sette Vette" del pianeta, insieme al monte Caucaso
Table 7
Example generations on random prompts for the Llama 3 - Instruct model in all previously proposed versions. Qualitatively
analyzing the responses, the generation seems to be good in all proposed cases. Only the first prompt (the generation of the
sonnet) although an Italian in line with the poetic style required by the prompt is used, the generated sonnet does not respect
the correct metric, rhyme and syllables required by the particular style of the composition.
System
Prompt
Original
ITA
Segui le istruzioni dell’utente
Scrivi la prima quartina di un sonetto sull’estate</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          05457. arXiv:
          <year>1803</year>
          .
          <volume>05457</volume>
          . [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>models</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.14782.
          <article-title>The most widely used approach, for model compari-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>arXiv:2405</source>
          .14782.
          <article-title>son in the above leaderboards, is to evaluate the likeli</article-title>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rabinowitz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <article-title>Ve- hood of a given response by appending each response</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>ness</surname>
            , G. Desjardins,
            <given-names>A. A.</given-names>
          </string-name>
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Milan</surname>
          </string-name>
          , J. Quan, to the prompt [
          <volume>18</volume>
          ].
          <article-title>This technique is employed in the</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Ramalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grabska-Barwinska</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Hassabis, lm-eval11 toolkit, which provides a useful tool to eval-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>ences 114</source>
          (
          <year>2017</year>
          )
          <fpage>3521</fpage>
          -
          <lpage>3526</lpage>
          . URL: http://dx.doi.
          <article-title>employed a standard regex to evaluate the generation</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>org/10</source>
          .1073/pnas.1611835114. doi:
          <volume>10</volume>
          .1073/pnas. from the model:
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          1611835114.
          <string-name>
            <surname>r</surname>
          </string-name>
          <article-title>" ( R i s p o s t a : | r i s p o s t a e ' ) \ s ∗ \ ( ? ( [</article-title>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Measuring the instability of fine- ABCD ] ) \ ) ?
          <fpage>"</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>tuning</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.07778.
          <string-name>
            <surname>r</surname>
          </string-name>
          <article-title>" ( : | e ' ) \ s ∗ \ ( ? ( [ ABCD ] ) \ ) ? \ b "</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>arXiv:2302</source>
          .
          <fpage>07778</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>