<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Scalena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Fersini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Groningen</institution>
          ,
          <addr-line>CLCG</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milano - Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performance comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape, where models achieve high Italian performance even when not explicitly trained in this language.</p>
      </abstract>
      <kwd-group>
        <kwd>Italian steering</kwd>
        <kwd>Language adaptation</kwd>
        <kwd>Activation steering</kwd>
        <kwd>Instruction Tuning</kwd>
        <kwd>Reasoning benchmarks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The strong rise in capabilities of the latest large language models (LLMs) has brought significant improvements in a wide variety of downstream tasks. These abilities mainly derive from the instruction-tuning procedure (IT), i.e., model fine-tuning on instruction datasets, and enable the models to follow user-prompted instructions.</p>
      <p>Most LLMs, however, are mainly pre-trained and fine-tuned in English, and while other high-resource languages are included in the training data, they are not present to the extent needed to achieve out-of-the-box performances comparable to English. A strategy to address this has been, in the past few years, to fine-tune models with language-specific instructions, such as the Stanford Alpaca dataset [1], which has been automatically translated into multiple languages – the Italian version of it has been used to train the Llama 2-based Camoscio model [2]. A combination of ∼240k training instances from three automatically translated instruction datasets was used to train the latest Llamantino [3], the most recent Llama 3-based instruction-tuned model for Italian.</p>
      <p>This approach has proven effective, but using large amounts of machine-translated text is far from optimal: although the translation is generally good for high-resource languages, the language's unique linguistic and cultural aspects are often not represented by the training data. In addition, one must consider the usual substantial (computational) costs associated with large datasets.</p>
      <p>With recent developments in interpretability research, new approaches are arising to localize and steer different language model aspects. These techniques mainly work with an inference-time injection, allowing for targeted interventions during the generation phase without incurring the high costs associated with any additional training. Such techniques, relying on the assumption that models are already capable of performing specific tasks, aim at enhancing some of the internal activations leading to specific solutions, thereby also increasing overall performance. They have proved successful towards specific tasks, such as model detoxification, but also toward more generalist and wide-ranging tasks [4, 5].</p>
      <p>We explore the potential of steering for instructing a pre-trained LLM in Italian as an alternative to fine-tuning, adopting a steering technique based on contrastive examples. We observe that this approach, with much less data (≪ 100 instances instead of 240k) and no additional training required, enables performances comparable to standard fine-tuning approaches and yields high-quality Italian generations.</p>
      <p>The latest LLMs are pre-trained on data which often includes not only English but also (small percentages of) other languages [6, 7]. After the initial pre-training phase, models are further trained to follow instructions given by users. Due to the nature of most instruction-tuning data, performance in and on English is still overwhelmingly dominant.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
<p>Italian adaptation Over time, the most widely adopted solution to improve model performance on the Italian language has been to perform further Instruction-Tuning with Italian data (IT-ITA) on existing models. Examples of this type are Camoscio [2] and Llamantino 2 [3] (both based on the Llama 2 model family), and ANITA [9] (based on Llama 3 models). Generally, instruction fine-tuning is performed on the original model, already in its instructed version, using additional data which is machine-translated from instructions originally in English. Taking ANITA as an example, this goes as follows: starting from the instructed Llama 3, fine-tuning is performed with ∼100k instruction prompts in English and, after an additional optimization step with ∼40k examples, another 100k prompts machine-translated into Italian are used for the language adaptation task. This large amount of data, combined with the size of the models, naturally leads to large computational costs.</p>
      <p>Steering vectors Following the linear representation hypothesis, high-level concepts are represented as directions in the activation space of LLMs [10]. A single direction can be found through the use of examples designed to elicit opposite behaviors in the model's output [5, 4, 11], or by using the difference between models fine-tuned for specific tasks and their original versions [12]. The effectiveness of these techniques lies in isolating specific properties, such as the language or the style used, to emphasize them during inference. In this work, we test the potential of steering vectors to improve performance on several NLP tasks by facilitating the generation of Italian, a language for which the models were not originally explicitly trained.</p>
      <sec id="sec-2-1">
        <title>Steering vector extraction At every generation step</title>
        <p>
          = 1, . . . ,  a LLM  generates a sequence of tokens
Steering vectors Following the linear representation based on the prompt version and previously generated
hypothesis, high-level concepts are represented as di- tokens 1, . . . , − 1. We collect the activations of the last
rections in the activation space of LLMs [
          <xref ref-type="bibr" rid="ref7">10</xref>
          ]. A single token from each attention head output ( ,ℎ ∈ Rhead )1
direction can be found through the use of examples de- and average them over a series of  = 30 prompts.
signed to elicit opposite behaviors in output to the model
[5, 4, 11] or by using the diference between fine-tuned 
Tmhoedeefelsctfiovrenspesescioficf ttahseksse atencdhtnhiqeiureosrliigeisnianlivsoerlastioinng[sp1e2-]. version = 1 ∑=︁1  ,ℎ(version, &lt;) (1)
cific properties, such as the language or the style used,
to emphasize it during inference. In this work, we test
the potential of steering vectors to improve performance
on several NLP tasks by facilitating the process of
generating the Italian language for which the models were not
originally explicitly trained.
        </p>
        <p>where version ∈ R||×| |× head . The prompts version
are supposed to push the model towards the desired
behavior using a 5-shot setting and an instruction explicitly
asking the model to respond in a specific language (either
Italian for ITA and ITA-full or English for ENG; further
details are in Appendix A).</p>
        <p>To obtain the final steering vector towards the ITA or
ITA-full behavior we compute the diference between the
previously calculated activations as follows:</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
<p>We build on the assumption that during the training process the model already sees a small amount of the target language (Italian in our case). However, as anticipated, reasoning behavior is mainly developed through the use of the English language, especially during instruction tuning. We aim to push the internal components promoting the language switch, so as to achieve better results in a language different from English.</p>
<p>Steering through contrastive prompts The first step to extract the Italian steering vector is to build contrastive prompts that will highlight the differences between the activations when prompting the model with different languages [4, 5]. To this end, we use the Stanford Alpaca dataset [1], consisting of question-answering style prompts, both in its original English and in its machine-translated Italian version (Appendix A shows some random example instances). We edit the original Alpaca dataset and obtain three different versions:</p>
      <p>• ENG: the original dataset; both question and answer are in English;</p>
      <p>• ITA-full: the machine-translated Alpaca dataset; both question and answer are in Italian;</p>
      <p>• ITA: questions in English, answers in Italian. The aim is to emphasize the language switch task, pushing the model to respond in Italian even to an English prompt.</p>
      <p>By using contrastive examples between the original English and the Italian responses, we extract the difference in activations between the models prompted in different languages.</p>
      <p>Steering vector extraction At every generation step $t = 1, \dots, T$ an LLM $M$ generates a sequence of tokens based on the prompt $\text{version}_i$ and the previously generated tokens $t_1, \dots, t_{t-1}$. We collect the activations of the last token from each attention head output ($a_{\ell,h} \in \mathbb{R}^{d_{\text{head}}}$) and average them over a series of $N = 30$ prompts:</p>
      <p>$\bar{A}^{\text{version}}_{\ell,h} = \frac{1}{N} \sum_{i=1}^{N} a_{\ell,h}(\text{version}_i, t_{&lt;t}) \qquad (1)$</p>
      <p>where $\bar{A}^{\text{version}} \in \mathbb{R}^{|L| \times |H| \times d_{\text{head}}}$. (The extraction is made on every layer $\ell \in L$ and for each attention head $h \in H$, where $|L|$ and $|H|$ are the total numbers of layers and attention heads in the LLM, respectively.) The prompts $\text{version}_i$ are supposed to push the model towards the desired behavior, using a 5-shot setting and an instruction explicitly asking the model to respond in a specific language (either Italian for ITA and ITA-full, or English for ENG; further details are in Appendix A).</p>
      <p>To obtain the final steering vector towards the ITA or ITA-full behavior, we compute the difference between the previously calculated activations as follows:</p>
      <p>$\Delta A^{\text{ITA-full}} = \bar{A}^{\text{ITA-full}} - \bar{A}^{\text{ENG}}, \qquad \Delta A^{\text{ITA}} = \bar{A}^{\text{ITA}} - \bar{A}^{\text{ENG}}$</p>
      <p>This allows us to get the language direction coming from the difference in polarity between the activations, eventually steering the original LLM towards Italian.</p>
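      <p>As an illustration of the extraction step, the following sketch collects last-token attention-head activations with forward pre-hooks on each head output projection and averages them over the contrastive prompt sets as in Equation (1). This is our minimal reconstruction using PyTorch and HuggingFace transformers, not the authors' released code; for brevity it assumes a Llama-style architecture and takes a single forward pass per prompt.</p>
      <preformat>
# Minimal sketch of the contrastive extraction (Eq. 1); illustrative names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def last_token_head_acts(prompt):
    """Return a (n_layers, n_heads, d_head) tensor of last-token head outputs."""
    acts = []

    def hook(_module, args):
        # o_proj input holds the concatenated head outputs: (batch, seq, n_heads*d_head)
        acts.append(args[0][0, -1].detach())

    handles = [layer.self_attn.o_proj.register_forward_pre_hook(hook)
               for layer in model.model.layers]
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    for h in handles:
        h.remove()
    n_heads = model.config.num_attention_heads
    return torch.stack(acts).view(len(acts), n_heads, -1)

def mean_acts(prompts):
    # Average over the N contrastive prompts (N = 30 in the paper)
    return torch.stack([last_token_head_acts(p) for p in prompts]).mean(0)

# ita_prompts / eng_prompts: 5-shot Alpaca prompts of the two versions
# delta = mean_acts(ita_prompts) - mean_acts(eng_prompts)  # steering vector
      </preformat>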
      <p>(2)
where  regulates the steering intensity, starting with
valmax and linearly diminishing to 0 for each -th
generated token:
  = valmax ·
︂(
1
−</p>
      <p>− 1 )︂
 − 1
where  indicates the maximum number of tokens to
be generated.</p>
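      <p>A matching sketch of the injection step: at every generated token the steering vector is added to the same per-head activations, scaled by the linearly decaying coefficient of Equations (2)-(3). Again, this is a minimal illustration under the same assumptions as above (Llama-style hooks, no KV cache), not the authors' implementation.</p>
      <preformat>
# Minimal sketch of steering injection (Eqs. 2-3); illustrative names.
import torch

def generate_steered(model, tok, prompt, delta, val_max=1.5, max_new_tokens=256):
    T = max_new_tokens
    step = {"t": 1}

    def alpha(t):
        return val_max * (1.0 - (t - 1) / (T - 1))  # Eq. 3: val_max down to 0

    def make_hook(layer_idx):
        def hook(_module, args):
            x = args[0].clone()  # (batch, seq, n_heads*d_head), input of o_proj
            # steer only the position generating the current token
            x[:, -1] += alpha(step["t"]) * delta[layer_idx].reshape(-1).to(x.dtype)
            return (x,)
        return hook

    handles = [layer.self_attn.o_proj.register_forward_pre_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]
    ids = tok(prompt, return_tensors="pt").input_ids
    out = ids
    try:
        for t in range(1, T + 1):
            step["t"] = t
            with torch.no_grad():
                logits = model(out).logits
            nxt = logits[0, -1].argmax()  # greedy decoding, as in the paper
            out = torch.cat([out, nxt.view(1, 1)], dim=-1)
            if nxt.item() == tok.eos_token_id:
                break
    finally:
        for h in handles:
            h.remove()
    return tok.decode(out[0, ids.shape[1]:])
      </preformat>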
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
<p>We select two different models as bases to test the effectiveness of our steering approach. The first is the smallest (8B parameters) model from the Llama 3 family in its Instructed version (meta-llama/Meta-Llama-3-8B-Instruct via HuggingFace). The second is the smallest (3.8B parameters) Phi 3 model (microsoft/Phi-3-mini-4k-instruct via HuggingFace) in its English-instructed version. For a comparison of steering with the more commonly used Instruction Tuning approach, we also re-run on the selected benchmarks the latest model Instruction-Tuned with Italian data (IT-ITA), ANITA [9], which is based on the same Llama 3 model we use.</p>
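      <p>For reference, the two base checkpoints named above can be loaded as follows; the loading code is our own illustration, not from the paper.</p>
      <preformat>
# Base checkpoints used in the paper; loading details are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINTS = [
    "meta-llama/Meta-Llama-3-8B-Instruct",  # Llama 3, 8B, instructed
    "microsoft/Phi-3-mini-4k-instruct",     # Phi 3 mini, 3.8B, instructed
]
models = {c: AutoModelForCausalLM.from_pretrained(c) for c in CHECKPOINTS}
tokenizers = {c: AutoTokenizer.from_pretrained(c) for c in CHECKPOINTS}
      </preformat>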
<p>Since all of these models have some training data in different languages, even if not specifically meant to be multilingual, we also test the original models on the Italian benchmarks to get a baseline in terms of model capabilities and to better capture the differences between the IT-ITA procedure and the different steering techniques. (Another obvious baseline would be a native Italian model, such as the recent Minerva [14], which is pre-trained on Italian+English data. While some instructed versions of Minerva are available on HuggingFace, they are completely undocumented and have unclear ownership, so we cannot get any reliable indicator about their training.)</p>
<sec id="sec-4-1">
        <title>4.1. Selected benchmarks</title>
        <p>We test the models on three different standard benchmarks included in the Italian LLM leaderboard (Open ITA LLM leaderboard via HuggingFace):</p>
        <p>• MMLU [15] is a multitask question-answering benchmark consisting of multiple-choice questions from various expert-level knowledge branches. The usual setup for this benchmark is a 5-shot prompt to help the model during the reasoning task. The test set consists of ∼14k instances with four possible responses each.</p>
        <p>• HellaSwag [16] is a benchmark meant to measure grounded commonsense inference. The model is supposed to indicate the correct continuation after reading the initial prompt, containing procedure steps from ActivityNet and wikiHow. The employed setting is a 0-shot prompt over all the ∼10k test instances.</p>
        <p>• ARC challenge [17] is a collection of over 1k instances of school-level multiple-choice science questions, aimed at measuring the knowledge retrieval capabilities of an LLM. The employed setting is a 0-shot prompt where the model must select the most likely answer to each question.</p>
        <p>We also test the ability of the models to generate full Italian responses (rather than non-Italian ones). To this end, we use a popular language identification tool, lang-detect, and take the probability of the Italian language as the scoring metric; a minimal scoring sketch follows.</p>
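        <p>The sketch below uses the langdetect package's detect_langs API; the helper name is our own.</p>
        <preformat>
# Score a generation by the probability langdetect assigns to Italian ("it").
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make detection deterministic

def italian_prob(text):
    """Probability that `text` is Italian according to langdetect (0.0 if absent)."""
    return next((l.prob for l in detect_langs(text) if l.lang == "it"), 0.0)

print(italian_prob("La capitale della Francia è Parigi."))  # close to 1.0
        </preformat>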
      <sec id="sec-4-1">
        <title>4.2. Steering vs the rest</title>
        <sec id="sec-4-1-1">
          <title>General results</title>
          <p>each benchmark.7 Among the two proposed steering
approaches, ITA generally proves to be more efective
in steering the LLM outputs. Additionally, the steering
approach often surpasses both the original and IT-ITA
models’ performances. The most significant advantage,
however, is the reduced time and computational
resources needed to enhance a model’s performance
in a new language. The Italian Llama 3 ANITA [9]
typ</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.1. Selected benchmarks</title>
        <p>We test the models on three diferent standard
benchmarks included in the Italian LLM leaderboard5:
• MMLU [15] is a multitask question-answering
benchmark consisting of multiple-choice questions from
various expert-level knowledge branches. The usual setup
for this benchmark is a 5-shot prompt to help the model
during the reasoning task. The test set consists of
∼ 14k instances with four possible responses each.</p>
        <p>across most benchmarks with significantly less data —
only 30 demonstrative examples in our case.</p>
        <p>Approaches matter It may be useful to look at how
steering and Instruction Tuning techniques difer in
improving model responses. Figure 1 shows the overlap (or
lack thereof) of correct responses of the four approaches
based on Llama 3-Instruct. The Instruction Tuning
process allows ANITA to learn to answer questions that the
original model was not able to. This likely occurs due to
the fine-tuning process, where the model absorbs new
information from the utilized data, expanding its set of
correct answers. At the same time, however, IT-ITA also
runs into the loss of previous capabilities on some
quesian benchmarks to get a baseline in terms of model ca- ically outperforms its original version but has required
pabilities and better capture the diferences between the
IT-ITA procedure and the diferent steering techniques. 4 ing technique achieves comparable or better performance
ifne-tuning on over 240k examples. In contrast, the
steer2meta-llama/Meta-Llama-3-8B-Instruct via HuggingFace
3microsoft/Phi-3-mini-4k-instruct via HuggingFace
6lang-detect package
4Another obvious baseline would be a native Italian model, such as
7Please note that our results difer from those shown in the Italian
the recent Minerva [14] which is pre-trained on Italian+English
data. While some instructed versions of Minerva are available on
Huggingface, they are completely undocumented and have unclear
ownership, so we cannot get any reliable indicator about its training.
5Open ITA LLM leaderboard via HuggingFace.</p>
        <p>LLM leaderboard since we employ a regex-based approach to
evaluate the responses instead of using the response likelihood of the
model as per [18], which would require four times more runs. This
is further explained in Appendix B.</p>
        <p>8For the sake of clarity, only cardinalities &gt; 25 are shown in writing.</p>
        <sec id="sec-4-2-1">
          <title>Meta Llama 3 8B - Instruct</title>
          <p>Original
+ IT-ITA (ANITA [9])
+ Steering ITA-full
+ Steering ITA</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Microsoft Phi 3 mini 4k - Instruct</title>
          <p>Original
+ Steering ITA-full
+ Steering ITA
54.21
55.01
55.73
55.95
59.65
59.92
60.65</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>Generation quality According to langdetect (last</title>
          <p>column in Table 1), which measures the probability of
a sentence being Italian, the Italian fine-tuned ANITA
has lower consistency over the used benchmarks (0.715).</p>
          <p>Qualitatively, we also observe that with diferent
system prompts, ANITA sometimes generates non-sensical
output or uses languages other than the expected
Italtions, a behavior similar to the so-called catastrophic ian. Some examples can be seen in Table 2, where we
forgetting [19] when learning new information. report some random examples from the ARC challenge</p>
          <p>On the other hand, the steering technique is based benchmark, where the model might still able to solve the
on improving only language capabilities, without task but fails to continue the generation properly. This
the model learning anything new from the data. problem could be traced back to the instability of the
This leads to the theoretical disadvantage of an upper fine-tuning process which can lead to excessive variance
bound whereby it is dificult to improve the model’s per- in results depending on the used data or diferent
hyperformance. Experimentally, however, steering gives mod- parameters employed during the training process [20].
els better language/reasoning-specific capabilities, which The steering approach, instead, appears to provide a
prestill allow a slight increase in performance, without neces- cise direction toward the expected language, generally
sarily forgetting much of the information and/or knowl- achieving better results in terms of language consistency.
edge stored in the original model. To further get an intuition of the ability to generate free
Italian text of the diferent models, we qualitatively test
their outputs on a series of random prompts and report
these generations in Table 7 for the Llama 3 models and
in Table 8 for the Phi 3 model.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. On SOTA models performance improvements</title>
        <p>The gap in performance that we have observed between
the original model and the steered/instruction-tuned
verMeta Llama 2 7B - Instruct
Original
+ IT-ITA (LLaMAntino 2 [3])
+ Steering ITA-full
+ Steering ITA
32.84
34.98
41.06
38.24
sion is present in some benchmarks although not as
substantial. One obvious observation is that the original
already has substantial abilities in Italian, in spite of not
having been specifically instructed for that. Llama 3
Instruct was trained on more than 15T tokens which,
together with several other techniques, must allow it to
achieve impressive performance even on diferent
languages. In order to possibly see a bigger impact of
steering and fine-tuning over their respective original model,
we replicate our experiments on the previous version of
the same model (Llama 2 - Instruct)9, looking only at the
ARC challenge results. We also use the IT-ITA version of
Llama 2-Instruct10 from [3] for comparison.</p>
        <p>From Table 3 we can see that the increase in
performance over the original model is more substantial than
what observed for Llama 3. This is especially true for
the steering techniques, which increase the performance
of Llama 2 by ∼ 20% and ∼ 25% (for ITA and ITA-full,
respectively), yielding a larger improvement than what
achieved by the fine-tuned model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Take home message and outlook</title>
<p>To instruct a pre-trained LLM in a specific language, steering is computationally much less expensive than fine-tuning with hundreds of thousands of (automatically translated) examples. We observe that for Italian this strategy achieves comparable or better performance on existing benchmarks than fine-tuning; generations are also fluent and comparable to those of fine-tuned models. The advantage of fine-tuning is that new data, and thus new knowledge, is injected into the model via training on new examples. At the same time, this might also trigger so-called catastrophic forgetting, yielding degradation in the output.</p>
<p>We suggest that in the context of creating a new language-specific instructed LLM, this advantage makes sense only insofar as culturally relevant and native data is used in the fine-tuning phase, so that the model can truly be enriched with language-specific knowledge, both grammatically and pragmatically. If translated data must be used, then it is far more effective to use steering, which requires much fewer examples (less than 0.5%) and a simple inference-time injection, making this an accessible method for virtually any language. Using native examples for the steering procedure, and possibly style-specific examples, might also yield interesting results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
<p>The work of Daniel Scalena and Elisabetta Fersini has been partially funded by MUR under the grant ReGAInS, Dipartimenti di Eccellenza 2023-2027 of the Department of Informatics, Systems and Communication at the University of Milano-Bicocca.</p>
      <p>Daniel Scalena is also partially supported by the
graduate school of the Faculty of Arts of the University of
Groningen.</p>
<p>The work of Elisabetta Fersini has also been partially funded by the European Union – NextGenerationEU under the National Research Centre for HPC, Big Data and Quantum Computing - Spoke 9 - Digital Society and Smart Cities (PNRR-MUR).</p>
      <p>We also thank the Center for Information Technology
of the University of Groningen for providing access to
the Hábrók high-performance computing cluster.</p>
    </sec>
    <sec id="sec-7">
<title>B. Evaluation technique</title>
      <p>The most widely used approach for model comparison in the above leaderboards is to evaluate the likelihood of a given response by appending each response to the prompt [18]. This technique is employed in the lm-eval toolkit (lm-evaluation-harness via GitHub); in our setting, however, it would require four times more runs. We therefore employed a standard regex to evaluate the generation from the model:</p>
      <preformat>
r"(Risposta:|risposta e')\s*\(?([ABCD])\)?"
r"(:|e')\s*\(?([ABCD])\)?\b"
      </preformat>
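      <p>A minimal harness around these two patterns (only the regexes themselves are from the paper; the wrapper function is our own illustration):</p>
      <preformat>
# Extract the chosen option (A-D) from a model generation using the paper's
# two regexes, tried in order. The wrapper function is illustrative.
import re

PATTERNS = [
    re.compile(r"(Risposta:|risposta e')\s*\(?([ABCD])\)?"),
    re.compile(r"(:|e')\s*\(?([ABCD])\)?\b"),
]

def extract_choice(generation):
    """Return the first matched A-D option, or None if nothing matches."""
    for pattern in PATTERNS:
        m = pattern.search(generation)
        if m:
            return m.group(2)
    return None

assert extract_choice("La risposta corretta: (B)") == "B"
      </preformat>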
    </sec>
    <sec id="sec-8">
<title>A. Prompts and instructions</title>
<p>When extracting the behavior from the models, we employ different versions of Alpaca. Examples of the three versions listed above (ENG, ITA-full and ITA) can be observed in Table 4. As highlighted in Section 5, it is important to use datasets that are original in the target language or, alternatively, carefully translated and reviewed by expert subjects. Looking at the examples in Table 4, in some cases the translation does not carry with it the cultural and diverse aspects of the new language, effectively degrading the actual performance of the model when the dataset is employed for instruction fine-tuning. This aspect, on the other hand, is partially negligible when steering techniques are applied, whose sole purpose is to identify which internal activations contribute to the generation of a language and push them accordingly.</p>
<p>Each of the Alpaca prompts used for the contrastive approach is also paired with the system instruction Answer the following questions. The same instruction is translated into Italian (Rispondi alle seguenti domande) when using the ITA-full and ITA versions of the dataset.</p>
<p>We also list in Table 6 the instructions used as system prompts for each proposed benchmark. Each prompt follows the standard chat template on which the already-instructed model is trained; a sketch of this construction follows. Some examples from the different benchmarks are shown in Table 5.</p>
<p>Moreover, we do not use any sampling parameters, relying on greedy decoding and stopping generation when the EOS token is produced.</p>
      <sec id="sec-8-1">
        <title>Question &amp; Answer</title>
        <p>Alpaca (ENG version)
Alpaca (ITAfull version)</p>
        <p>The capital of France is Paris.</p>
        <p>The boiling point of water is 100 degrees Celsius or 212
degrees Fahrenheit at standard atmospheric pressure.</p>
        <p>I am always assisting users.
1. Ohio 2. Indiana 3. Michigan 4. Illinois 5. Wisconsin
To find the LCM of 3, 4=22ˆ, 6=2*3, and 15=3*5, take the
highest power of each prime that appears and multiply:
2**2*3*5=60.</p>
        <p>Qual è la capitale della Francia?
Guarda il punto di ebollizione dell’acqua.</p>
        <p>Scrivi una frase di cinque parole che descriva la tua
giornata.</p>
        <p>Compila una lista di 5 stati americani situati nel Midwest.</p>
        <p>Qual è il minimo comune multiplo di 3, 4, 6 e 15?</p>
        <p>La capitale della Francia è Parigi.</p>
        <p>Il punto di ebollizione dell’acqua è di 100 gradi Celsius o
212 gradi Fahrenheit a pressione atmosferica standard.</p>
        <p>Sono sempre qui per aiutare gli utenti.
1. Ohio 2. Indiana 3. Michigan 4. Illinois 5. Wisconsin
Per trovare l’LCM di 3, 4=22ˆ, 6=2*3, e 15=3*5, prendi
la potenza più alta di ogni numero primo che appare e
moltiplica: 2**2*3*5=60.</p>
        <p>What is the capital of France?
Look up the boiling point of water.</p>
        <p>Compose a five word sentence describing your day.</p>
        <p>Compile a list of 5 US states located in the Mid West.</p>
        <p>What is the least common multiple of 3, 4, 6 and 15?</p>
        <p>Alpaca (ITA version)</p>
        <p>La capitale della Francia è Parigi.</p>
        <p>Il punto di ebollizione dell’acqua è di 100 gradi Celsius o
212 gradi Fahrenheit a pressione atmosferica standard.</p>
        <p>Sono sempre qui per aiutare gli utenti.
1. Ohio 2. Indiana 3. Michigan 4. Illinois 5. Wisconsin
Per trovare l’LCM di 3, 4=22ˆ, 6=2*3, e 15=3*5, prendi
la potenza più alta di ogni numero primo che appare e
moltiplica: 2**2*3*5=60.
Per quale di questi due scenari il protagonista (che usa
Io/me/mio) fa qualcosa di moralmente sbagliato, secondo
gli standard morali americani del 2020? Scenario 1 | Ho
afittato un ragazzo elegante e sexy per il pomeriggio.</p>
        <p>Scenario 2 | Ho trasferito i rifugiati nelle case.</p>
        <p>(A) Sbagliato, Sbagliato
(B) Sbagliato, Non sbagliato
(C) Non sbagliato, Sbagliato
(D) Non sbagliato, Non sbagliato
Una donna in tailleur rosso sta parlando dietro una
scrivania. un uomo
Anna tiene un cubetto di ghiaccio. Perché si scioglie il
cubetto di ghiaccio nella sua mano?</p>
        <p>HellaSwag (it)
ARC challenge (it)
(A) prende le palle da un attaccapanni sul muro.
(B) sta passando l’aspirapolvere su una sedia di plastica
nera.
(C) spruzza acqua in un piccolo lavandino bianco.
(D) sta in piedi tenendo il piede dietro la schiena.
(A) Il calore si sposta dalla sua mano al cubetto di
ghiaccio.
(B) Il freddo si sposta dalla sua mano al cubetto di
ghiaccio.
(C) Il calore si sposta dal cubetto di ghiaccio alla sua
mano.
(D) Il freddo si sposta dal cubetto di ghiaccio alla sua
mano.
MMLU
HellaSwag
ARC</p>
        <p>Instruction
Scegli la continuazione
più adatta tra A, B, C e
D
Scegli la risposta corretta
tra A, B, C e D
-shot
5
0
0
La risposta è no, non necessariamente. La costruzione di una corsia in più in una strada non è suficiente per ridurre il
trafico in modo significativo. Ecco alcune ragioni per cui:
1. La corsia in più non è necessariamente una misura per ridurre il trafico. La costruzione di una corsia in più può
attirare più trafico nella zona (domanda indotta), poiché gli automobilisti coglieranno l’opportunità di poter usare la
nuova corsia [...]
La risposta è no, non necessariamente riduco il trafico.</p>
        <p>Costruire una corsia in più in una strada può avere diversi efetti sul trafico, a seconda del contesto e delle condizioni.</p>
        <p>Ecco alcuni possibili efetti:
1. **Riduzione del trafico**: se la corsia in più è destinata ai mezzi pubblici o ai pedoni e ciclisti, può ridurre il trafico
generale sulla strada, [...]
Rispondi alle domande dell’utente
Riassumi il seguente testo: "Il Monte Bianco (Mont Blanc in francese e in arpitano), con un’altitudine di 4805,59 m
s.l.m., è la montagna più alta delle Alpi, d’Italia, di Francia e anche d’Europa, secondo una difusa convenzione
geografica, da cui i soprannomi di tetto d’Europa e di Re delle Alpi, condividendo assieme al monte Elbrus nel
Caucaso un posto tra le cosiddette Sette Vette (Seven Summits) del pianeta."
Il testo descrive il Monte Bianco, la montagna più alta delle Alpi, d’Italia, di Francia e d’Europa, con un’altitudine di
4805,59 metri. È noto anche come "tetto d’Europa" e "Re delle Alpi". Inoltre, è considerato una delle Sette Vette del
pianeta, insieme al monte Elbrus nel Caucaso.</p>
        <p>Il testo descrive il Monte Bianco, la montagna più alta dÉuropa, delle Alpi, d’Italia e di Francia, con un’altitudine di
4805,59 metri, noto anche come "tetto d’Europa" e "Re delle Alpi".</p>
        <p>Il testo descrive il Monte Bianco, la montagna più alta delle Alpi, d’Italia, di Francia e d’Europa, con un’altitudine
di 4805,59 m s.l.m. È noto anche come "Tetto d’Europa" e "Re delle Alpi" e fa parte delle cosiddette "Sette Vette" del
pianeta, insieme al Monte Elbrus nel Caucaso.</p>
        <p>Il testo descrive il Monte Bianco, la montagna più alta delle Alpi, d’Italia, di Francia e d’Europa, con un’altitudine
di 4805,59 metri. Inoltre, menziona i soprannomi della montagna, ovvero "Tetto d’Europa" e "Re delle Alpi", e la sua
inclusione tra le "Sette Vette" del pianeta, insieme al monte Caucaso
Table 7
Example generations on random prompts for the Llama 3 - Instruct model in all previously proposed versions. Qualitatively
analyzing the responses, the generation seems to be good in all proposed cases. Only the first prompt (the generation of the
sonnet) although an Italian in line with the poetic style required by the prompt is used, the generated sonnet does not respect
the correct metric, rhyme and syllables required by the particular style of the composition.
System
Prompt
Original
ITA
Segui le istruzioni dell’utente
Scrivi la prima quartina di un sonetto sull’estate</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          05457. arXiv:
          <year>1803</year>
          .
          <volume>05457</volume>
          . [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>models</surname>
          </string-name>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.14782.
          <article-title>The most widely used approach, for model compari-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>arXiv:2405</source>
          .14782.
          <article-title>son in the above leaderboards, is to evaluate the likeli</article-title>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rabinowitz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <article-title>Ve- hood of a given response by appending each response</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>ness</surname>
            , G. Desjardins,
            <given-names>A. A.</given-names>
          </string-name>
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Milan</surname>
          </string-name>
          , J. Quan, to the prompt [
          <volume>18</volume>
          ].
          <article-title>This technique is employed in the</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Ramalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grabska-Barwinska</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Hassabis, lm-eval11 toolkit, which provides a useful tool to eval-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>ences 114</source>
          (
          <year>2017</year>
          )
          <fpage>3521</fpage>
          -
          <lpage>3526</lpage>
          . URL: http://dx.doi.
          <article-title>employed a standard regex to evaluate the generation</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>org/10</source>
          .1073/pnas.1611835114. doi:
          <volume>10</volume>
          .1073/pnas. from the model:
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          1611835114.
          <string-name>
            <surname>r</surname>
          </string-name>
          <article-title>" ( R i s p o s t a : | r i s p o s t a e ' ) \ s ∗ \ ( ? ( [</article-title>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Measuring the instability of fine- ABCD ] ) \ ) ?
          <fpage>"</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>tuning</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.07778.
          <string-name>
            <surname>r</surname>
          </string-name>
          <article-title>" ( : | e ' ) \ s ∗ \ ( ? ( [ ABCD ] ) \ ) ? \ b "</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>arXiv:2302</source>
          .
          <fpage>07778</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>