<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Extending Italian Large Language Models for vision-language tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asia Beatrice Uboldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Germani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fastweb SpA</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National PhD in Artificial Intelligence, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the growing evolution of Large Language Models, there has also been a rising interest in extending these models to incorporate non-textual signals. Specifically, Large Vision-Language Models have been developed, which extend Large Language Models to understand and process visual signals. This allows them to solve complex vision-language tasks, further extending their inherent abilities in text-only task resolution. However, for the Italian language, most works still focus on text-only solutions without extending them to multimodality. In this work, we extend Large Language Models for the Italian language to multimodality and benchmark the performance of these models when trained using the same experimental setting.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Large Vision-Language Models</kwd>
        <kwd>Multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last years, interest in Large Language Models (LLMs) has been growing steadily. The ability of these models to solve complex tasks, even when they have not been trained with that specific objective, makes them extremely useful for any natural language processing task. However, as often occurs in the Natural Language Processing research field, the abundance of English data meant that the first openly released LLMs only supported the English language (e.g. LLaMA 2 [1]), limiting the applicability of these models to other languages. To cover this gap, several LLMs were trained to directly support the Italian language, using either a monolingual or a multilingual strategy. Whichever the selected strategy, these models were obtained using one of the following methodologies: fine-tuning pre-existing models or training from scratch on datasets consisting mainly of Italian data. This trend allowed extending LLMs not only to multiple underrepresented languages but also to new modalities. An example is represented by Large Vision-Language Models (LVLMs), that are LLMs extended with a technique enabling them to process visual inputs together with textual ones. Also in this case, there are training procedures that allow leveraging existing LLMs instead of training from scratch for vision-language inputs. This makes the process both more efficient, since the pre-training phase is skipped, and more effective, as the textual knowledge of the model is leveraged to learn how to perform vision-language tasks. Despite this, many open LLMs supporting the Italian language have not been extended to support multimodality. This is due to the limited availability of training data for vision-language tasks in Italian, whereas English training data often comprises multiple diverse and rich tasks. Furthermore, with the proliferation of Italian LLMs, like Minerva [2] and Velvet, it becomes increasingly important to test their capabilities in the multimodal domain. This raises the question of whether it is possible to extend current LLMs trained for the Italian language to multimodality. Do these models perform well when extended to support it? In this work, we propose a study on the multimodal performance of Italian LLMs extended to LVLMs using a state-of-the-art approach.</p>
      <p>Specifically, this work extends current literature as follows:
• We train several LLMs supporting the Italian language to extend them to LVLMs;
• We benchmark these models using datasets that are natively in Italian;
• We study the effect of different prompt formatting at inference time and showcase the length bias in the response of LVLMs.</p>
      <p>Finally, we want to underline that we are forced to use machine translation for training data due to the scarcity of large-scale multimodal data for non-English languages. However, we focus our evaluation on natively Italian multimodal datasets. Therefore, if a large-scale multimodal dataset natively in Italian were to be released, we could expect further improvements in performance, since fewer machine translation errors would be present.</p>
      <p>Furthermore, we release code and resources related to this study (https://github.com/swapUniba/Extending-LLMs-VL-ITA).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>For LVLMs, several methodologies have been designed to adapt LLMs. One of the most prominent approaches is the one introduced in LLaVA [3], where visual embeddings extracted from a Vision Transformer [4] are projected into the latent space of an LLM. This strategy has been further refined in LLaVA 1.5 [5], where the projection matrix is replaced with a Multi-Layer Perceptron, and LLaVA-OneVision [6], a LLaVA-based model enhanced to also perform multi-image and video tasks. Other approaches include the one used in BLIP-2 [7], leveraging a QFormer module to extract the most relevant features of images, and Flamingo [8], where cross-attention layers are added to the LLM and relevant visual tokens are extracted using a Perceiver Resampler module. Additionally, there is also LLaVA-NeXT [5] (also known as LLaVA 1.5 HD), which introduces a technique to process high-resolution images. The idea is to resize the image to a higher resolution than the one supported by the underlying vision encoder and split it into multiple images. Embeddings are then extracted for each image, as well as for a version of the image resized to the resolution supported by the vision encoder to incorporate global details, and flattened into a single vector.</p>
      <p>For Italian LLMs, several models have been released which incorporate a great quantity of natively Italian training data. Minerva [2] is the first family of models trained from scratch on an open data mixture consisting of only English and Italian data. It has several checkpoints with different parameter counts, namely 1B, 3B and 7B. The 7B model was trained on a total of 2.48 trillion tokens of Italian, English and code. EuroLLM [9] is a family of LLMs developed in Europe to support all the 24 official European Union languages. Its two available checkpoints have 1.7B and 9B parameters. The models are pre-trained on a total of 4 trillion tokens, where 50% of the data is in English, 5% is code, and the remaining 45% are other languages (including Italian). Velvet is a family of LLMs trained on a balanced mixture of six languages, with particular emphasis on Italian (which makes up 23% of the data). The two available checkpoints for Velvet have 2B and 14B parameters. FastwebMIIA (Italian Artificial Intelligence Model, https://huggingface.co/Fastweb/FastwebMIIA-7B) is a 7-billion-parameter autoregressive model developed by Fastweb. Based on a decoder-only architecture with rotary positional embeddings, it has been trained on about 3 trillion tokens, with a strong focus on Italian. It uses a custom tokenizer optimised for Italian, English and programming languages, with a vocabulary of 50,000 tokens. It supports a context window of 16k tokens and has been trained in a distributed pipeline on NVIDIA H100 GPUs via MLDE and LLMFoundry.</p>
      <p>Furthermore, at the time of writing, LLaVA-NDiNO [10] is the only family of multimodal models extensively trained for the Italian language only, further showcasing the need for a more in-depth investigation of the current landscape of Italian LLMs and their extension to LVLMs. For LLM evaluation in Italian, many efforts have been carried out to extensively evaluate Italian LLMs. For example, Bacciu et al. [11] introduced an open LLM leaderboard for the Italian language, Moroni et al. [12] released ITA-Bench, a comprehensive evaluation suite for Italian LLMs consisting of both machine-translated and natively Italian benchmarks, and Attanasio et al. [13] released CALAMITA, a dynamic and growing benchmark for the Italian language. Finally, we also highlight that there are novel works that showcase how non-trivial it is to evaluate LLMs. For example, Wang et al. [14] found mismatches between the generated output and the output obtained using log-likelihood for next token prediction. Additionally, several works started to use an LLM-as-a-judge approach, where an LLM is used as a model for evaluation [15].</p>
      <sec id="sec-2-1">
        <title>As mentioned in the introduction, our aim is to extend</title>
        <p>existing Italian LLMs with multimodal capabilities. We</p>
      </sec>
      <sec id="sec-2-2">
        <title>2https://github.com/swapUniba/Extending-LLMs-VL-ITA</title>
      </sec>
      <sec id="sec-2-3">
        <title>3https://huggingface.co/Fastweb/FastwebMIIA-7B</title>
      </sec>
      <sec id="sec-2-4">
        <title>Question Answering, Visual Grounding, ...), which al</title>
        <p>Listing 1: Mistral chat template used for base models. lows the model to learn to correctly solve this type of
{user} and {assistant} are placeholders for the task, while the latter is a multi-turn dataset generated by
user and assistant messages respectively. prompting GPT-4. Thanks to this training mixture, the
&lt;s&gt;[INST] {user} [/INST] {assistant}&lt;/s&gt; model learns to both solve tasks and provide meaningful
and complete responses to user prompts. For
MultiInstruct, we perform some additional processing operations.</p>
        <p>
          Instructions are manually translated, therefore only the
chose Minerva, EuroLLM, Velvet and FastwebMIIA, data instances (e.g. questions and answers in a visual
since they are among the most recently released LLMs question-answer task) are machine-translated. For tasks
supporting the Italian language and clearly define the that use bounding boxes, we normalize the bounding box
amount of Italian data used in training. For each model, values to the [
          <xref ref-type="bibr" rid="ref4">0, 1</xref>
          ] range so that the values are
consiswe evaluate both its base and instruct variants at their tent with the reference images and independent of their
largest available parameter scale. The only exception resolution. For tasks that provide options to choose from
is represented by Velvet, for which only the instruct within the instruction, we format them as an ordered list
version is available. using either numbers, uppercase or lowercase letters, or
        </p>
        <p>For the vision backbone, we use the vision transformer plain text. In such cases, we also replace the target text to
of the CLIP [16] model, specifically, we focus on the large be predicted with the corresponding identifier (e.g. if the
checkpoint with patch size 14 and image size 336.4 We option is a number, the target text is also converted to a
use this model since it is often used in the state-of-the-art number). Finally, we append a string to guide model
reresearch as the visual backbone for LVLMs [3]. sponses, depending on the type of output that is expected:</p>
        <p>To train the models, we use the methodology of LLaVA "Rispondi solamente con il numero dell’opzione corretta
NeXT, because of both its performance and its open code- dalle scelte date." ("Answer with the option’s number
base, which allows for easier reproducibility of this study. from the given choices directly." in English) when the
opThis methodology is made of two steps: pre-training to tions are identified by numbers, "Rispondi solamente con
warm up the multi-layer perceptron projector and visual la lettera dell’opzione corretta dalle scelte date." ("Answer
instruction tuning to teach the model how to solve vision- with the option’s letter from the given choices directly."
language tasks. For both steps, training is performed in English) when the options are identified by letters,
using the next token prediction objective, implemented "Rispondi usando una zona di delimitazione." ("Answer
as cross-entropy loss. We report hyperparameters used using a bounding box." in English) when the target text is
for both steps in Table 1. For base models, we apply the a bounding box and, finally, "Rispondi usando una singola
Mistral chat template reported in Listing 1, since they do parola o frase." ("Answer the question using a single word
not have a chat template associated with them, while for or phrase." in English) for all other cases. In total, the
instruct models we apply their own chat template. training mixture combining these two datasets consists
of 172,335 instances.</p>
        <sec id="sec-2-4-1">
          <title>3.1. Training Mixture</title>
        </sec>
        <sec id="sec-2-4-2">
          <title>3.2. Hardware and Software</title>
        </sec>
        <sec id="sec-2-4-3">
          <title>Configuration</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>For both training steps, we use a state-of-the-art machine</title>
        <p>translation model to translate popular vision-language
        <p>For pre-training, we use the same dataset as LLaVA, translated to the Italian language. During pre-training, the whole model is kept frozen, except for the multi-layer perceptron. Thanks to this approach, the multi-layer perceptron weights are initialized so that the vision embeddings are correctly projected into the LLM's space. For visual instruction tuning, we consider a combination of two datasets: MultiInstruct [18] and the conversational split of the LLaVA-Instruct [3] dataset. The former is a collection of diverse vision-language tasks (e.g. Visual Question Answering, Visual Grounding, ...), which allows the model to learn to correctly solve this type of task, while the latter is a multi-turn dataset generated by prompting GPT-4. Thanks to this training mixture, the model learns both to solve tasks and to provide meaningful and complete responses to user prompts.</p>
        <p>For MultiInstruct, we perform some additional processing operations, sketched in the listing below. Instructions are manually translated, therefore only the data instances (e.g. questions and answers in a visual question-answer task) are machine-translated. For tasks that use bounding boxes, we normalize the bounding box values to the [0, 1] range so that the values are consistent with the reference images and independent of their resolution. For tasks that provide options to choose from within the instruction, we format them as an ordered list using either numbers, uppercase or lowercase letters, or plain text. In such cases, we also replace the target text to be predicted with the corresponding identifier (e.g. if the option is a number, the target text is also converted to a number). Finally, we append a string to guide model responses, depending on the type of output that is expected: "Rispondi solamente con il numero dell'opzione corretta dalle scelte date." ("Answer with the option's number from the given choices directly." in English) when the options are identified by numbers, "Rispondi solamente con la lettera dell'opzione corretta dalle scelte date." ("Answer with the option's letter from the given choices directly." in English) when the options are identified by letters, "Rispondi usando una zona di delimitazione." ("Answer using a bounding box." in English) when the target text is a bounding box and, finally, "Rispondi usando una singola parola o frase." ("Answer the question using a single word or phrase." in English) for all other cases. In total, the training mixture combining these two datasets consists of 172,335 instances.</p>
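        <p>A minimal sketch of these processing operations follows; the function names and task-type labels are hypothetical, chosen only for illustration.</p>
        <preformat>
def normalize_bbox(bbox, image_width, image_height):
    """Rescale absolute pixel coordinates (x1, y1, x2, y2) to the
    [0, 1] range, making them independent of the image resolution."""
    x1, y1, x2, y2 = bbox
    return (round(x1 / image_width, 3), round(y1 / image_height, 3),
            round(x2 / image_width, 3), round(y2 / image_height, 3))

def guide_string(task_type: str) -> str:
    """Select the Italian string appended to guide model responses."""
    if task_type == "options_numbers":
        return "Rispondi solamente con il numero dell'opzione corretta dalle scelte date."
    if task_type == "options_letters":
        return "Rispondi solamente con la lettera dell'opzione corretta dalle scelte date."
    if task_type == "bounding_box":
        return "Rispondi usando una zona di delimitazione."
    return "Rispondi usando una singola parola o frase."

print(normalize_bbox((50, 30, 250, 180), image_width=500, image_height=300))
# result: (0.1, 0.1, 0.5, 0.6)
        </preformat>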
      <sec id="sec-2-6">
        <title>4https://huggingface.co/openai/clip-vit-large-patch14-336</title>
      </sec>
      <sec id="sec-2-7">
        <title>5Fastweb Announcement</title>
        <p>6https://www.hpe.com/us/en/software/marketplace/
hpe-ml-development-environment.html
nodes. The software stack was based on open-source li- 4. Experiments
braries, including Transformers from Hugging Face [19],
which provides seamless integration with PyTorch [20] 4.1. Experimental Setting
ammneodndDtealelse.inpSepficeieendtl[y21h]a.nTdhliinsgsolaftrwgeardeastatascektshaansdbecoenmipnlsetxru- Telos,ewvaeluuastee
tthhreeevidsaiotans-elatsn:guGaQgAe-aIbTil[it2y2,o2f3t]h,eMseTmVQodAtaddhnuaetTdcaichb.fooiiIsmltrithaptyrala,sarrsoidacnwtariielnvaafleegrbceiatll-asinstroayaglfyebtaw-snrsiodascaraoedelficfeeicmremonnunocalfigytdtui,ieporwllasnehtaoimiolcnnhoeIfodteareanrtllesiataucornrrcweu-hldacaitirnraedeglcsptufuoaraorrge-es
i[ssum2wtaa4laenl]gyr,cieenEtsdrgX.aaAdntMaasMstlTeaaSVtst.-eeQVTdtAho[sen2pi5sdlni]ataa.ttatmuGosreQaaItntlApauslr-ciaoIaelTvlnnyi,edissaec.nsoaWnnsvpsoeiiltssicatutsotiaennfldogsqritdouesefeexrvst3-tie,tic0rosea0nnml0tlaaarinninnc---sovereign AI infrastructure, ensuring data localisation, guages, in this work we focus on the Italian split, which
traFnosrpatrraeinnciyngantdheremguoldaetlosr,ywceomuspeli2anGcPe.Us. The whole cMoTnVsiQstAs -oIfT8.8E4XqAuMesSt-iVonis-aancsowlleerctpiaoinrso.fWmeurlteifpelre-tcohioticaes
training procedure takes about 24 hours for each model. school exam questions in multiple languages. In this case,
we focus on the Italian split as well, which consists of
1,645 question-answer pairs. We refer to it as
EXAMS-VIT</p>
        <p>To take into account the effect of using different prompts for the same model, we test all models and all datasets using four different styles of formatting. Specifically, to evaluate these models, an additional string is added to the prompt to limit the generated output. In English, the string that is used depends on the model and the formatting of its training mixture; however, the original LLaVA, and most other models following its setup, used "Answer the question using a single word or phrase." for open-ended tasks and "Answer with the option's letter from the given choices directly." for closed-ended ones.</p>
        <p>Thanks to this, it is possible to use exact match as metric, where the generated output is compared directly to the ground truth (i.e. hard syntactic match), since the model is instructed to generate only the text that is relevant w.r.t. the label. Due to this, we want to understand if and how the model performance is affected by this string. If we change this string to one with a similar meaning, does the model generate outputs consistently? Does the position of the string matter? To answer these questions, we apply four different formattings to the datasets (a sketch of how they are applied follows the list):
• Post: "\nRispondi utilizzando una sola parola o frase." (or "\nRispondi utilizzando direttamente la lettera dell'opzione corretta tra quelle date." for closed-ended tasks) appended to the end of the instruction;
• Pre-Swap: "Rispondi utilizzando una sola parola o frase.\n" (or "Rispondi utilizzando direttamente la lettera dell'opzione corretta tra quelle date.\n" for closed-ended tasks) appended to the beginning of the instruction;
• Post-Swap: "\nRispondi in modo breve e diretto." (or "\nRispondi con la lettera." for closed-ended tasks) appended to the end of the instruction;
• Pre: "Rispondi in modo breve e diretto.\n" (or "Rispondi con la lettera.\n" for closed-ended tasks) appended to the beginning of the instruction.</p>
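        <p>The following sketch shows how the four variants can be derived from a single instruction; it is our illustration of the procedure described above, with the open-ended strings shown.</p>
        <preformat>
ORIG = "Rispondi utilizzando una sola parola o frase."
SWAP = "Rispondi in modo breve e diretto."

def format_prompt(instruction: str, style: str) -> str:
    """Apply one of the four formattings (open-ended variant)."""
    if style == "post":       # original string, end of the instruction
        return f"{instruction}\n{ORIG}"
    if style == "pre-swap":   # original string, beginning of the instruction
        return f"{ORIG}\n{instruction}"
    if style == "post-swap":  # paraphrased string, end of the instruction
        return f"{instruction}\n{SWAP}"
    if style == "pre":        # paraphrased string, beginning of the instruction
        return f"{SWAP}\n{instruction}"
    raise ValueError(f"unknown style: {style}")

for style in ("post", "pre-swap", "post-swap", "pre"):
    print(format_prompt("È nuvoloso?", style))
        </preformat>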
        <p>A model that performs well for all four formattings can be considered a consistent model, capable of answering user queries regardless of the syntax used in the request. Finally, all results are obtained using greedy decoding as sampling strategy at inference time, which removes randomness in generation and guarantees improved reproducibility of the obtained results. For all tasks, we use the question and answer pairs provided by the task itself. The only exception is EXAMS-V-IT where, since the question and choices are embedded within the image itself, we use the following string as question: "Fornisci una risposta alla domanda presente nell'immagine." ("Provide an answer to the question in the image" in English). All models are evaluated using the lmms-eval framework (https://github.com/EvolvingLMMs-Lab/lmms-eval), loaded in float16 as dtype, and inference is performed with a batch size of 1, ensuring reproducibility of the results. Finally, we lowercase text and ground truth and ignore whitespaces when evaluating using exact match.</p>
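        <p>A sketch of this inference configuration is shown below. The checkpoint name is illustrative (a public LLaVA-NeXT model), and this is not the exact lmms-eval invocation; the point is the greedy decoding (do_sample=False), the float16 loading and the batch size of 1.</p>
        <preformat>
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "[INST] &lt;image&gt;\nÈ nuvoloso? Rispondi usando una singola parola o frase. [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Greedy decoding: no sampling, so repeated runs give the same output.
output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
        </preformat>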
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results Discussion</title>
        <p>We report the results of the experiments in Table 2. For the sake of comparison against already existing models, we also report the results of LLaVA-NeXT 8B [26], a LLaVA-NeXT model trained from the LLaMA 3 Instruct 8B checkpoint, on these benchmarks. Overall, models trained on Italian perform well w.r.t. LLaVA-NeXT 8B. Remarkably, the base version of EuroLLM has the best average performance in GQA-IT, while the base version of FastwebMIIA has the best average performance in EXAMS-V-IT. In MTVQA-IT, Italian models tend to perform poorly w.r.t. LLaVA-NeXT 8B. We believe this is due to the low quantity of text-centric vision-language instances in the training mixture, since MultiInstruct tasks focus more on natural scenes and everyday images. We can reasonably expect an improvement in performance for text-centric tasks when integrating this type of task in the training mixture.</p>
        <p>Additionally, we showcase that the models are very sensitive to the formatting of the prompt. For example, while the base version of EuroLLM achieves the best average performance on GQA-IT, it performs well on only two out of the four formattings. This pattern can also be seen in the other models in our evaluation: in most cases, the models tend to perform better in a limited subset of formattings. After manually analyzing the generated outputs, we find that there are cases where the models generated the correct answer, but with additional contextual text. For example, for the question "È nuvoloso?" ("Is it cloudy?" in English) with label "Sì" ("Yes" in English), Minerva instruct answered "Sì" in the Post formatting, while it answered "Sì, è nuvoloso nell'immagine." ("Yes, it is cloudy in the image" in English) in the Post-Swap formatting. In both cases, the answer is correct, but the exact match metric fails to consider the second case as correct, since there is no hard syntactic match between the generated output and the label. In light of this, we propose further evaluation to study the relationship between performance and the length of the generated response.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluating for Response Length</title>
        <p>To further understand whether the models provide outputs that are relevant, we evaluate them by performing an approximate match between the label and the generated output. That is, we check that the label is a substring of the generated output. This allows us to cover cases where the model keeps generating contextual text together with the task answer. For example, for the question "C'è una palla da calcio nell'immagine?" ("Is there a football ball in the image?" in English) with label "Sì" ("Yes" in English), the model may generate "Sì, c'è una palla da calcio nell'immagine.". This case is considered incorrect by the exact match metric, since the generated output is not the same as the ground truth label. However, the answer is correct, and the ground truth label is in the generated string itself. Our approach allows covering these corner cases; note, however, that this strategy suffers from false positives. For example, for the question "C'è una mano nell'immagine?" ("Is there a hand in the image?" in English) with label "No", the model may generate "Sì, c'è una mano nell'immagine" ("Yes, there is a hand in the image" in English), and it would be considered a correct answer since "no" is a substring of "mano". We showcase some examples in Figure 1. To assess the performance of the models regardless of the response length, we consider the formatting where each model has performed the worst. We retrieve the generated outputs and corresponding ground truth labels and evaluate them using an approximate match. We expect an improvement in performance w.r.t. exact match. Note that we do not perform this evaluation for EXAMS-V since the task is closed-ended and the answers are the identifiers of the options (e.g. "A", "B"), making it impossible to evaluate the task using this strategy.</p>
        <p>Results for evaluation performed using this approach are reported in Table 3. As expected, we can appreciate a great improvement in performance for most models. For example, for the base version of EuroLLM-9B, performance rises from .0973 to .4807, and a similar trend can be seen in the instruct version of the model. For most models, we can observe an increase in performance in approximate match, except for Velvet, where the performance remains the same. To further validate this finding, we also evaluate under the same setting the formatting where the models performed best. Results for approximate match evaluation of the best formatting are reported in Table 4. Overall, the results are a lot more stable, and the degree of improvement is smaller than for the worst formatting evaluated using approximate match. This highlights that the models in their best formatting performed well because they were able to generate the expected output directly and consistently, without adding additional contextual text to the answer. However, we emphasize that the worst formatting evaluated with approximate match actually showcases better performance w.r.t. the best formatting evaluated with approximate match. For example, the base version of EuroLLM achieves an approximate match of .4807 on GQA-IT for its worst formatting, while it achieves an approximate match of .4497 for its best one. This pattern can be seen for all models, including LLaVA-NeXT, the only exception being Velvet, where performance is consistent for both formattings. This finding highlights that LLMs tend to provide better answers when they are able to provide a longer response.</p>
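        <p>A minimal sketch of the two matching strategies, as described above (our paraphrase, including the lowercasing and whitespace normalization used for exact match):</p>
        <preformat>
import re

def exact_match(generated: str, label: str) -> bool:
    """Hard syntactic match: lowercase and ignore whitespace."""
    norm = lambda s: re.sub(r"\s+", "", s.lower())
    return norm(generated) == norm(label)

def approximate_match(generated: str, label: str) -> bool:
    """The label must appear as a substring of the generated output."""
    return label.lower() in generated.lower()

print(exact_match("Sì, c'è una palla da calcio nell'immagine.", "Sì"))        # False
print(approximate_match("Sì, c'è una palla da calcio nell'immagine.", "Sì"))  # True
# Known false positive: "no" is a substring of "mano".
print(approximate_match("Sì, c'è una mano nell'immagine", "No"))              # True, but wrong
        </preformat>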
      </sec>
      <sec id="sec-2-10">
        <title>Finally, we also test the ability of the LVLMs in solving</title>
        <p>Italian text-only tasks, rather than vision-language ones.
This aims to determine whether the models retain the
knowledge they learned during their original text-only
training procedure. Since the models didn’t see text-only
data during vision-language training, we expect their
performance to be lower with respect to their original LLM
version. Since we only want to have a general estimate of
their performance, we consider a relatively small subset
of Italian tasks available through the lm-eval-harness8
framework. Namely, we consider Global-MMLU [27],
specifically its LITE subset. The dataset is a balanced
collection of culturally sensitive and culturally agnostic
MMLU tasks (a massive multitask test dataset consisting
of multiple-choice questions from various branches of
knowledge), where only languages with human
translation and post-edits are included. Results are reported in
Table 5. Surprisingly, there are models which perform
better after the visual instruction-tuning step. For
example, the base version of Minerva-7B performs better
on four out of the six categories of the dataset. Similar
behaviour is also showcased by other models, for
example, the instruct version of EuroLLM-9B also performs
better on four out of the six categories, while the base
version of FastwebMIIA performs better on five of them.
This showcases that a vision-language training procedure
8https://github.com/EleutherAI/lm-evaluation-harness
may also enhance the language-only performance of the
model. However, there is an outlier to this pattern, that
is Velvet-14B, where the original version of the model
performs better on all categories. Furthermore, for the
other models, there is no consistent improvement across
all categories. This highlights that, while
multimodality has helped improve the inherent knowledge of these
models, it is not guaranteed, and text-only evaluation is
still relevant for multimodal models.</p>
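        <p>As a sketch of how this text-only evaluation can be run programmatically (the Global-MMLU Lite task identifier below is an assumption; the exact name should be checked against the task registry of the installed lm-eval-harness version):</p>
        <preformat>
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Hypothetical path to one of the extended models.
    model_args="pretrained=path/to/extended-model,dtype=float16",
    tasks=["global_mmlu_lite_it"],  # assumed task name, verify in the registry
    batch_size=1,
)
print(results["results"])
        </preformat>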
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <sec id="sec-3-1">
        <title>In this work, we have expanded the current landscape of</title>
        <p>LVLMs for the Italian language. We have collected a pool
of LLMs supporting the Italian language, which only
process textual inputs. Then, we have extended them
to LVLMs, by employing a state-of-the-art approach,
namely LLaVA-NeXT, and a machine-translated corpus
of vision-language tasks in Italian. Additionally, we
evaluated them using only benchmarks that are natively in
Italian and also studied the efect on the length of the
generated response in evaluation. Finally, we also
benchmarked these models on an Italian text-only benchmark
to understand if the performance for text-only tasks was
worse after the visual instruction-tuning step. As future
work, we plan to further extend the training mixture
so that it also considers text-centric tasks in Italian,
improving model performance on this type of task that is
currently missing in the training mixture. Specifically,
we plan to incorporate multimodal document data to
enhance these models in document visual question
an</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</title>
        <p>swering. We also plan to further extend the evaluation
and to improve the approximate match strategy, which
soundness currently sufers from the possibility of false
positives.
Declaration on Generative AI
During the preparation of this work, the author(s) used Grammarly in order to: Grammar and
spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>large audited dataset</article-title>
          ,
          <source>Advances in Neural Informa- tion answering, arXiv preprint arXiv:2405.11985</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>tion Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>67284</fpage>
          -
          <lpage>67296</lpage>
          . (
          <year>2024</year>
          ). [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Multiinstruct: Improving [25]
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D. I.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
          </string-name>
          , I. Koy-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>ing, in: Proceedings of the 61st Annual Meeting tilingual multimodal exam benchmark for eval-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>11445</fpage>
          -
          <lpage>11465</lpage>
          . arXiv:
          <volume>2403</volume>
          .10378 (
          <year>2024</year>
          ). [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , C. De- [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>towicz</surname>
            , J. Davison,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma, knowledge,
          <year>2024</year>
          . URL: https://llava-vl.github.io/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          , S. Gugger, blog/2024-01-30
          <string-name>
            <surname>-</surname>
          </string-name>
          llava-next/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers: [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Adelani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. G.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Proceedings of the 2020 Conference on Empirical sio</source>
          ,
          <string-name>
            <given-names>W. Q.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Susanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ng</surname>
          </string-name>
          , S. Longpre,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https:// B. Ermis,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          , Global mmlu: Understanding
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          www.aclweb.org/anthology/2020.
          <article-title>emnlp-demos.6. and addressing cultural and linguistic</article-title>
          biases in mul[20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ansel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Jain, tilingual evaluation,
          <year>2024</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Voznesensky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          , D. Berard, abs/2412.03304. arXiv:
          <volume>2412</volume>
          .
          <fpage>03304</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>P. Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Chintala</surname>
          </string-name>
          , Pytorch 2: Faster machine
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>formation and graph compilation</article-title>
          ,
          <source>in: 29th ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>tems</surname>
          </string-name>
          , Volume
          <volume>2</volume>
          (
          <issue>ASPLOS</issue>
          '24), ACM,
          <year>2024</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          https://pytorch.org/assets/pytorch2-
          <fpage>2</fpage>
          .pdf. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <volume>1145</volume>
          /3620665.3640366. [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>ifcient data sampling and routing, 2024</article-title>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          //arxiv.org/abs/2212.03597. arXiv:
          <volume>2212</volume>
          .
          <fpage>03597</fpage>
          . [22]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Gqa: A new
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6700</fpage>
          -
          <lpage>6709</lpage>
          . [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          , et al.,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          2021 Proceedings of the Eighth Italian
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          3033,
          <year>2021</year>
          . [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>