<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Eighth Workshop on Natural Language for Artificial Intelligence, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via E. Orabona, 4 - 70125 Bari</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Italian National PhD Program in Artificial Intelligence, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>6</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Since their inception, large language models have undergone many innovations. One of these innovations concerns multimodality. Several adaptation strategies have been developed to expand LLMs to process multimodal signals. However, in the current literature the training procedure for these multimodal models is performed on English-only vision-language datasets, limiting their capabilities for other languages. This work proposes the first family of LMMs for the Italian language. We trained them using state-of-the-art backbone models and datasets, translated into Italian using the most up-to-date machine translation model available. In support of open science, we publicly release the data, models, and code used to develop these models.</p>
      </abstract>
      <kwd-group>
        <kwd>NLP</kwd>
        <kwd>Multimodality</kwd>
        <kwd>LLM</kwd>
        <kwd>LMM</kwd>
        <kwd>LVLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) have been rising in research interest due to their generalization
capabilities, which allow them to solve tasks never seen during training. However, their capabilities are limited
to the textual domain. In light of this, researchers have started proposing solutions to bridge the gap
between the textual world and the others (e.g. visual or aural). Specifically, instead of pre-training a new
model with multimodal capabilities from scratch, these solutions leverage a pre-trained decoder-only
LLM. This is both cost-efficient, avoiding the expensive training procedures of full multimodal training,
and effective, as many of these solutions reported strong results.</p>
      <p>
        In this work, we focus on the vision-language world, specifically Large Vision Language
Models (LVLMs). These models are often trained following a traditional two-step approach: pre-training
followed by fine-tuning. However, one notable issue is that the vision-language training mixture often
consists of curated and selected datasets that predominantly feature English text, as seen in models like
LLaVA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This further propagates an inherent problem of these large models, where the pre-training
corpus mainly consists of English data. For example, LLaMA 2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], an LLM by Meta, was pre-trained
on a corpus that is 89.70% English and 8.38% unknown language (e.g. programming code). As
a result, even the developers of the model explicitly state that it is intended for English use
cases only.
      </p>
      <p>Furthermore, there is a significant gap due to the absence of large-scale, multitask and multilingual
datasets. While the English vision-language datasets are conceptually diverse and rich (e.g., scientific
question answering, OCR), non-English datasets tend to be limited in scope, focusing on specific
high-level tasks (e.g., image captioning, visual question answering).</p>
      <p>For these reasons, there are currently very few LVLMs in the state of the art for non-English
languages. While some models support multilingual and multimodal data, they often fall behind their
English counterparts in terms of architecture performance and training data quality. The reasons behind
this are twofold: new LLMs are constantly being released, and training data lacks quality, focusing
only on high-level tasks due to the lack of data. Furthermore, current multilingual and multimodal
benchmarks are not as conceptually rich as English ones, making evaluation of these models more
difficult for non-English languages.</p>
      <p>Therefore, in this work, we propose an approach to train and evaluate an LVLM for the Italian language.
We also release LLaVA-NDiNO (Large Language and Vision Assistant: New Domain integration for
Natural Observations), the first family of openly available Italian LVLMs trained and evaluated by following
the proposed approach. While this approach heavily relies on the use of machine translation, we show
that even when using machine-translated datasets at train time it is possible to achieve remarkable
performance during evaluation on datasets that are natively in the Italian language. Specifically, the
contributions of this work are the following:
• We apply a vision-language adaptation step designed to improve the performance of the model
for a specific language. We compare the performance of a model trained using this additional
step w.r.t. one without this step;
• We propose a new evaluation suite based on both machine-translated and natively Italian data
from state-of-the-art benchmarks;
• We openly release code, data and models that have been obtained from our experiments, in the
hope of boosting research in this field and in support of open science.<sup>1</sup></p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        LVLMs have begun to see widespread success following the release of GPT-4V [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the OpenAI model
which supported vision-language inputs. However, since the model is proprietary, possibilities for
research are relatively limited. Because of this, many works proposed open-source solutions, trying to
match the performance obtained by GPT-4V on state-of-the-art benchmarks. One of the most popular
solutions in this field of research is LLaVA [
        <xref ref-type="bibr" rid="ref2 ref5">5, 2</xref>
        ]. The model uses a projection module (either a projection
matrix in its first version or a Multi-Layer Perceptron in version 1.5) to project the visual embeddings
extracted from a visual encoder into the latent space of the LLM. This approach is simple and efficient,
since it only relies on a single projection module. However, the original LLaVA architecture, as well as
other LVLMs, struggled with high-resolution image tasks due to the requirements imposed by vision
encoders. This is because vision encoders, like the Vision Transformer (ViT) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], are trained on a fixed
image size. Therefore, during inference or embedding extraction, the same image size is expected as
input. To overcome this limitation, LLaVA-NeXT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was developed. In this model, the image is split
into grids of fixed size and the embeddings for each grid are extracted and concatenated. Finally, the
original image is resized and its embeddings are extracted and concatenated to the previous output.
This technique allows the model to better understand the overall visual characteristics of the input
images.
      </p>
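      <p>As a rough illustration of the splitting scheme described above, the sketch below represents an image as a nested list of pixel values, cuts it into fixed-size tiles and appends a downsized global view. The function names, the nearest-neighbour resize and the tile size are illustrative assumptions, not the LLaVA-NeXT code, which operates on image tensors and passes every view through the vision encoder.</p>
      <preformat>
```python
def split_into_tiles(image, tile):
    """Cut a 2-D image (list of rows) into non-overlapping tile x tile blocks, row-major."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            tiles.append([row[left:left + tile] for row in image[top:top + tile]])
    return tiles


def resize_nearest(image, size):
    """Downsize a 2-D image to size x size with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def build_views(image, tile):
    """Return the local tiles followed by one resized global view of the whole image."""
    return split_into_tiles(image, tile) + [resize_nearest(image, tile)]


# Toy 4x4 "image" whose pixel value encodes its position.
image = [[4 * r + c for c in range(4)] for r in range(4)]
views = build_views(image, tile=2)
```
      </preformat>
      <p>In this toy setting each of the five views has the fixed size expected by the encoder; in the real model the per-view embeddings, rather than the pixels, are concatenated.</p>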
      <p>
        However, all of the LLaVA models were trained on English-only vision-language data. Specifically, an
instruction-tuning approach over a rich set of vision-language tasks was performed. Therefore, while
the LLaVA models perform well on English tasks, the lack of curated multilingual vision-language
instruction-tuning datasets makes it challenging to train multilingual LVLMs on a set of conceptually
diverse tasks. In light of this, some works focus on multilingual training procedures for LVLMs. Geigle
et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] released mBlip, a version of the BLIP 2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] model trained on an English vision-text dataset
machine-translated to 95 different languages. To do so, the authors used a neural machine translation
model, namely nllb-200-distilled-1.3B [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. There is also Pali-X [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where the vision and language
components are jointly scaled, following the work done in Pali [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The model is pre-trained on a rich
range of datasets, among which there is WebLI [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], a rich corpus consisting of images with alt-texts
from the web and OCR annotations obtained from the Google Cloud Vision API, covering a total of
100 languages. Finally, there is X-LLaVA [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], where the authors adapted LLaVA 1.5 by expanding
its dictionary for English and Korean and performing a language adaptation step based on the one
performed by Conneau and Lample [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], that is pre-training on a data corpus extracted from Wikipedia.
        <fn><label>1</label><p>https://github.com/swapUniba/LLaVA-NDiNO</p></fn>
      </p>
      <p>
        Regarding the datasets used to train these models, for LLaVA 1.5 a mixture of English-only
vision-language datasets was used. Specifically, the mixture contained 158,000 GPT-generated multimodal
instruction-following data instances, 450,000 academic-task-oriented visual question answering data
instances and 40,000 ShareGPT data instances. Laurençon et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] released The Cauldron, a
collection of 50 different datasets pre-formatted for instruction-tuning. This dataset was used to train
the Idefics 2 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] model. The dataset consists of state-of-the-art vision-language datasets and covers
a wide array of conceptual tasks. Specifically, the authors identify the following categories: general
visual question answering, captioning, OCR, document understanding, text transcription, chart/figure
understanding, table understanding, reasoning, logic, maths, textbook/academic questions, differences
between two images, screenshot to code.
      </p>
      <p>Despite all this, best practices regarding language adaptation of LVLMs are still unclear.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We define three different steps in our methodology:
• Italian vision-language pre-training: training the model to optimize its general understanding
of the Italian language;
• Italian vision-language instruction-tuning: fine-tuning the model on task-specific
vision-language data to improve its performance in following instructions;
• Italian vision-language long instruction-tuning: fine-tuning the model to produce long
outputs in response to instructions.</p>
      <p>
        We adapt a pre-trained decoder LLM and a pre-trained encoder vision transformer to the Italian
language by performing an Italian vision-language pre-training approach. This is based on an approach
used for LLMs, which consists in further training the model on a wide corpus of generic data of a
specific language [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In this step, we perform the same approach but using vision-text data instead.
Specifically, we directly use an English pre-trained decoder LLM and an English pre-trained vision
encoder and perform joint language adaptation on both of them, as well as the adaptation module, on a
collection of image-text pairs natively in Italian. We expect the model pre-trained on Italian data to
perform better in Italian vision-language tasks, thanks to the additional knowledge it has gained.
      </p>
      <p>Furthermore, while instruction-tuning datasets are often unavailable in multiple languages,
vision-language pre-training data is. Thanks to this, the data quality during pre-training is guaranteed, since
the text is natively in Italian. However, the situation is different for instruction-tuning. Due to
the lack of Italian instruction-tuning datasets, we must rely on machine translation. While the data
quality will suffer from this, this approach is the only one that allows us to obtain the large quantity
of data needed to achieve the generalization capabilities of LVLMs. Finally, we also perform further
instruction-tuning for long response generation. This is because humans tend to prefer long and
descriptive answers when interacting with LLMs and LVLMs. We decided to use the LLaVA-NeXT
architecture since it is one of the most recent LVLMs available in the state of the art. We detail all the
steps we carried out, from data collection to evaluation.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Creation</title>
        <p>
          For the Italian language pre-training dataset, following the best practices by Laurençon et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], we
set up three conceptually different datasets: interleaved image-text documents, image-text pairs
and PDF documents. For interleaved image-text documents and image-text pairs, we use the WIT
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] dataset, a collection of images and their associated text sections obtained from Wikipedia pages in
multiple languages. Specifically, after collecting the Italian portion of the dataset, we use the text of a
section where an image appears as an interleaved image-text document and the caption of the image as
an image-text pair. Note that for interleaved image-text documents we only use a single image-text
section pair, rather than multiple sections from the same Wikipedia page. For PDF documents, there are
no multilingual datasets fitting these criteria in the literature. In particular, there are no handwritten
datasets of this type, but only typewritten ones. Therefore, we decided to use MultiEURLEX [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], a corpus
containing European laws in 23 languages. While this corpus is typewritten only, we prefer to include it
in the pre-train dataset rather than not covering OCR at all. We retrieve the Italian PDF files associated
with the corresponding CELEX_ID and extract the text from each document using Tesseract [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. We
also filter the dataset to control the distribution of these different sets. The pre-train dataset consists
of 250,000 instances, of which 168,000 are interleaved image-text documents, 72,000 are image-text
pairs, and 10,000 are PDF documents.
        </p>
        <p>
          For the Italian language instruction-tuning dataset, we use The Cauldron [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], a collection of 50
vision-language datasets already formatted for instruction-tuning. Since the dataset is in English, we
use machine translation to Italian. Details regarding the machine translation procedure are discussed
in Section 3.2. However, we first perform a filtering step on the 50 available tasks. This is because many
tasks would lose their meaning when translated from English to another language (e.g. extraction of
information from the image of a table where the text is in English). Because of this, we remove all tasks
which focus on images containing English text (e.g. docvqa or ocrvqa). After performing this manual
filtering step, we have a total of 15 tasks. For each task, we select the first 10,000 rows of the dataset
and perform machine translation on each instance in each row (more than one text-vision pair can be
present for each row). Additionally, we also add the train sets of MTVQA and V-EXAMS, datasets that
are natively in Italian. This increases both the quality of the instruction-tuning dataset, as the datasets
are not machine translated, and its concept distribution, since two new tasks are added. MTVQA is
the only dataset containing Italian visual text extraction and V-EXAMS is the only dataset containing
Italian academic visual question answering. In total, the instruction-tuning dataset consists of 260,302
instances.
        </p>
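        <p>The row selection described above can be sketched as follows. The record layout (an "images" list plus a "texts" list of "user"/"assistant" turns) and the function name are assumptions made for illustration; the point is only that keeping the first 10,000 rows can yield more than 10,000 instances, because one row may hold several question-answer pairs.</p>
        <preformat>
```python
from itertools import islice


def select_pairs(rows, max_rows=10_000):
    """Keep the first max_rows rows and flatten every question-answer pair they contain."""
    pairs = []
    for row in islice(rows, max_rows):
        for turn in row["texts"]:
            pairs.append(
                {"images": row["images"], "user": turn["user"], "assistant": turn["assistant"]}
            )
    return pairs


# Two toy rows: the first holds two question-answer pairs about the same image.
rows = [
    {"images": ["img_0"], "texts": [
        {"user": "What colour is the cube?", "assistant": "Red."},
        {"user": "How many spheres are there?", "assistant": "Two."},
    ]},
    {"images": ["img_1"], "texts": [{"user": "Is there a dog?", "assistant": "Yes."}]},
]
pairs = select_pairs(rows, max_rows=10_000)
```
        </preformat>
        <p>Each flattened pair is then a candidate for machine translation.</p>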
        <p>
          For the Italian language long instruction-tuning dataset, we use LLaVA Conversation 58k [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
a subset of the LLaVA Instruct 150K dataset. It consists of 58k conversations generated
using GPT-4V for conversational purposes. Again, since the dataset is in English, we perform machine
translation.
        </p>
        <p>Finally, for evaluation, we collect the OK-VQA, SeedBench and POPE datasets, which are popular
benchmarks used in the literature for English LVLMs. We machine translate them to the Italian language
as well. We also collect the test sets of MTVQA, V-EXAMS and GQA-it.</p>
        <p>We provide an overview of the 15 datasets from The Cauldron used for the instruction-tuning step
in Table 1. We also provide the same details for the natively Italian datasets in Table 2 and evaluation
datasets in Table 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Translation</title>
        <p>To translate the data, we use one of the newest openly available machine translation models, that is
MADLAD-400 3B [36]. To accomplish this task, we use a cluster equipped with multiple NVIDIA A16
16GB VRAM GPUs. We use 4 GPUs in parallel and perform inference with a batch size per device of 4.</p>
        <p>
          To translate the data from The Cauldron, we directly use the formatted instruction pairs present
in the dataset. By doing so, the answer is translated with the context given by the question, reducing
the possibility of a translation error. We do the same for closed-ended tasks, where a list of options
is given in the question. However, this translation procedure may cause the model to translate text
inaccurately. Therefore, some options for closed-ended tasks may not be translated correctly. For
example, during translation, some closed-ended options might not align correctly with the original
content, causing errors like having more options than in the original text. To avoid this issue, we check
via regex matching that: 1) the question or instruction is present at the beginning; 2) the number of
options is the same before and after translation; 3) the answer is present at the end of the translated
string. Whenever a check is not passed, the translated instance is removed from the dataset.
We follow this same procedure to translate evaluation benchmarks. Because of this, some of these
translated datasets may have a different cardinality w.r.t. the original ones.
        </p>
        <p>For LLaVA Conversation 58k we directly translate the user question and the system response. By
testing the model, we noticed that translation errors are frequent when a newline character is present in
the input. Therefore, we split inputs when two consecutive newline characters are present and further
split the output when a single newline character is present. The obtained strings are translated and the
original newline characters are progressively added for each translated instance, effectively recreating
the original formatting of the string but in another language.</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Dataset</th><th># Train Translated</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>A-OKVQA [<xref ref-type="bibr" rid="ref19">19</xref>]</td><td>10,107</td><td>VQA dataset requiring world knowledge and common sense for a correct answer.</td></tr>
              <tr><td>CLEVR [<xref ref-type="bibr" rid="ref20">20</xref>]</td><td>92,670</td><td>VQA dataset designed for visual reasoning regarding objects in images.</td></tr>
              <tr><td>COCO-QA [<xref ref-type="bibr" rid="ref21">21</xref>]</td><td>16,167</td><td>VQA dataset containing descriptive and rich question-answer pairs.</td></tr>
              <tr><td>Geomverse [22]</td><td>3,324</td><td>VQA dataset regarding geometric reasoning.</td></tr>
              <tr><td>IconQA [23]</td><td>10,980</td><td>VQA dataset regarding abstract diagram understanding.</td></tr>
              <tr><td>InterGPS [24]</td><td>1,498</td><td>VQA dataset regarding geometric reasoning, annotated in a formal language.</td></tr>
              <tr><td>Localized Narratives [25]</td><td>9,178</td><td>VQA dataset designed to provide rich descriptions of image contents.</td></tr>
              <tr><td>Mimic CGD [26]</td><td>16,807</td><td>VQA dataset designed to enhance the performance of vision language models in real-life scenarios.</td></tr>
              <tr><td>NLVR2 [27]</td><td>18,363</td><td>VQA dataset regarding truthfulness of a natural language sentence about a pair of photographs.</td></tr>
              <tr><td>Raven [28]</td><td>9,216</td><td>VQA dataset regarding Raven’s Progressive Matrices.</td></tr>
              <tr><td>Spot the Difference [29]</td><td>9,187</td><td>VQA dataset regarding differences between two images.</td></tr>
              <tr><td>TallyQA [30]</td><td>14,024</td><td>VQA dataset regarding complex counting questions of objects in images.</td></tr>
              <tr><td>Visual7w [31]</td><td>43,228</td><td>VQA dataset regarding object-level grounding, using questions that start with one of what, where, when, who, why, how and which.</td></tr>
              <tr><td>VQArad [32]</td><td>739</td><td>VQA dataset regarding radiology images.</td></tr>
              <tr><td>VQAv2 [33]</td><td>1,563</td><td>VQA dataset requiring understanding of vision, language and commonsense knowledge to answer.</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>MTVQA</td><td>VQA dataset of multilingual text scenes. The dataset is manually labelled.</td></tr>
              <tr><td>V-EXAMS</td><td>VQA dataset of multilingual school exam questions. The dataset is obtained from real exam questions for each language.</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
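      <p>The two safeguards of Section 3.2 (the regex checks on translated instruction pairs and the newline-preserving translation of conversations) can be sketched as below. This is a minimal sketch under stated assumptions: the "Domanda:" and "Risposta:" prefixes, the option marker pattern and the function names are hypothetical, and the translate argument is any callable standing in for the machine translation model.</p>
      <preformat>
```python
import re

# Hypothetical markers: the real instruction template is not specified here.
OPTION = re.compile(r"^\s*[A-Z][.)]\s", re.MULTILINE)   # option lines such as "A. " or "B) "
QUESTION = re.compile(r"^\s*Domanda:")                   # assumed translated question prefix
ANSWER = re.compile(r"Risposta:\s*\S[^\n]*\s*$")        # assumed answer prefix on the last line


def is_valid_translation(source, translated):
    """Apply the three checks: question first, same option count, answer last."""
    return (
        QUESTION.search(translated) is not None
        and len(OPTION.findall(source)) == len(OPTION.findall(translated))
        and ANSWER.search(translated) is not None
    )


def translate_preserving_newlines(text, translate):
    """Translate line by line, then restore single and double newlines as in the input."""
    translated_blocks = []
    for block in text.split("\n\n"):
        lines = block.split("\n")
        translated_blocks.append(
            "\n".join(translate(line) if line.strip() else line for line in lines)
        )
    return "\n\n".join(translated_blocks)


source = "Question: Which animal is shown?\nA. cat\nB. dog\nAnswer: A"
good = "Domanda: Quale animale viene mostrato?\nA. gatto\nB. cane\nRisposta: A"
bad = "Domanda: Quale animale viene mostrato?\nA. gatto\nB. cane\nC. cane\nRisposta: A"
kept = [t for t in (good, bad) if is_valid_translation(source, t)]

# str.upper stands in for the translation model in this toy run.
restored = translate_preserving_newlines("Hello.\n\nFirst line\nSecond line", str.upper)
```
      </preformat>
      <p>Instances failing any check are simply dropped, which is why translated sets can end up smaller than the originals.</p>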
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Training Details</title>
        <p>
          We distinguish between four training steps:
• MLP pre-training: the weights of the MLP module are initialized, following the strategy
described by Liu et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ];
• Italian language pre-training: we optimize the model to the Italian language by further training
the English backbones on a mixture of native Italian text-vision data;
• Italian language instruction-tuning: we optimize performance of the model in providing
meaningful responses by performing instruction-tuning;
• Italian language long instruction-tuning: we optimize performance of the model in providing
meaningful and descriptive responses by performing instruction-tuning.
        </p>
        <p>
          For the Multi-Layer Perceptron (MLP) pre-training step, we use the same dataset as Liu et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ],
that is LCS-558K. It is a subset of the LAION/CC/SBU dataset, filtered with a more balanced concept
coverage distribution, and augmented with BLIP synthetic captions. We follow the procedure described
in LLaVA 1.5 for this step.
        </p>
        <p>Then, we perform our training using the translated Cauldron dataset on LLaMA 3 8B base [37] as
LLM and CLIP ViT large-patch14-336 [38] as vision encoder. This follows the configuration
used by LLaVA-NeXT, except for the LLM. We decided to use the base version instead of the
instruct one: since we have to perform pre-training, we found the base version of the model to be
more fitting for this purpose.</p>
        <p>We train all models for a direct response in a single-round user-system conversational setting.
Specifically, we use two prompt formats: plain for the MLP and Italian pre-training, and the LLaMA 3
instruct format without system prompt for instruction-tuning. These prompt formats are shown in
Listings 1 and 2.</p>
        <p>A diagram presenting an overview of the entire training pipeline is shown in Figure 1.</p>
        <p>For all models, we perform full-parameter training. The hyperparameters used are reported
in Table 3. The training was run on a cluster with 4 NVIDIA A100 64 GB GPUs
per node. Specifically, we use 2 nodes for a total of 8 GPUs. For evaluation, we use a server with 8 NVIDIA A16 16 GB
GPUs, running the procedure on 4 of them.</p>
        <p>Listing 1: Plain Format, {text} is the text associated with the image.</p>
        <p>Listing 2: LLaMA 3 Format, {user_message} is the message sent by the user, while {system_message}
is the model response.</p>
        <p>Table 3: Hyperparameters (parameter names only): batch size, lr, vision tower lr, lr schedule, lr warmup ratio,
weight decay, epochs, optimizer, max length, DeepSpeed stage.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Instruction-tuning and Evaluation</title>
          <p>To assess the performance of the pre-trained model, we perform two different training procedures:
• LLaVA-NDiNO IT: only MLP pre-training and instruction-tuning have been performed;
• LLaVA-NDiNO PT + IT: MLP pre-training, Italian language pre-training and instruction-tuning
have been performed.</p>
          <p>We evaluate the models on two sets of benchmarks:
• Machine-translated state-of-the-art benchmarks: we use some of the most popular
benchmarks for evaluation of LVLMs translated to the Italian language;
• Natively Italian benchmarks: we use benchmarks that include Italian text-vision data instances
where the text is originally written in Italian.</p>
          <p>For evaluation, we use lmms-eval<sup>3</sup> [44], a fork of lm-eval-harness<sup>4</sup> (a library for the evaluation of LLMs)
designed for LVLMs. We create custom tasks to evaluate the models on Italian datasets.</p>
          <p>The first set of benchmarks allows us to have somewhat comparable conceptual coverage compared
to the state-of-the-art since the datasets that we consider cover the diverse skills of the models. We
provide an overview of the tasks alongside their cardinality in Table 4.</p>
          <p>The second set of benchmarks, instead, allows us to understand if training on machine-translated data
severely affects performance. This is because these datasets are natively in the Italian language. For
this purpose we use the test sets of the previously presented MTVQA and V-EXAMS datasets, keeping
only the Italian instances of these multilingual datasets.
            <fn><label>3</label><p>https://github.com/EvolvingLMMs-Lab/lmms-eval</p></fn>
            <fn><label>4</label><p>https://github.com/EleutherAI/lm-evaluation-harness</p></fn>
          </p>
          <p>
            To understand if our trained models excel in the Italian language, we compare our results with the
mBlip T0 [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] model, a multilingual vision-language model which includes Italian as one of the training
languages. For the evaluation metrics, in all cases we use exact match for open-ended tasks and accuracy
for closed-ended ones. The only exception is POPE, for which we report the F1 score. All metrics reflect
common best practices used for the original datasets in the English language. We followed the same
evaluation design for MTVQA and V-EXAMS as well.
          </p>
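          <p>For reference, the metrics named above can be sketched as follows. The lowercasing and trimming before comparison and the function names are assumptions of this sketch; the normalization actually applied by the evaluation library may differ. Accuracy on closed-ended tasks corresponds to the same exact-match ratio computed on option labels.</p>
          <preformat>
```python
def exact_match(predictions, references):
    """Share of predictions equal to the reference after lowercasing and trimming."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)


def f1_score(predictions, references, positive="yes"):
    """F1 of the positive class for yes/no style answers, as used for POPE-like tasks."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


em = exact_match([" Roma", "due"], ["roma", "tre"])
f1 = f1_score(["yes", "yes", "no", "no"], ["yes", "no", "yes", "no"])
```
          </preformat>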
          <p>Analyzing the results, both our models perform better w.r.t. the baseline in all tasks. Remarkably, while
the mBlip model performs very poorly on the MTVQA dataset, both our models show improvements.
GQA-it
[39, 40]
OK-VQA</p>
          <p>[41]</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>SeedBench [42] POPE [43]</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>LLaVABench [5] 12,578 5,050</title>
          <p>18,000
9,000
60
5,046
2,496
9,000
60</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Description</title>
        <sec id="sec-4-2-1">
          <title>Open-ended VQA dataset regarding compositional questions of real-world images, specifically regarding objects, attributes and relations in the images.</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>Open-ended VQA dataset regarding questions where the model needs to have external knowledge in order to answer.</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>Closed-ended VQA multiple-choice dataset re</title>
          <p>garding temporal and spatial questions.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>Open-ended VQA dataset regarding object hallucination (answer is expected to be either ’Yes’ or ’No’).</title>
        </sec>
        <sec id="sec-4-2-5">
          <title>Open-ended VQA dataset to test the abilities of the models in solving challenging tasks, thanks to a highly-detailed and manually-curated description and a proper selection of questions for each instance.</title>
          <p>However, for both LLaVA-NDiNO models, average results are fairly similar regardless of the pre-training
step. In light of this, we perform statistical testing using McNemar’s test. The test reveals that for
most tasks, the p-value is greater than 0.05; therefore, there are no statistically significant differences between the
two setups. We believe this is due to the nature of the evaluation tasks, since the model only needs to
pick the correct option or to generate a single word or short phrase. These tasks are therefore of limited use for evaluating
the quality of the pre-training step. For this reason, we perform an additional experiment to assess the
models’ performance on longer and richer textual descriptions.</p>
          <p>4.1.2. Instruction-tuning and Evaluation for Long Output Generation</p>
          <p>For this step, we further train our models for long response generation. Specifically, we use data
taken from LLaVA Conversation 58k, extracting user question and system answer pairs to use as
single-round interactions. After extracting the single-round instances, we perform training following
the same procedure used for instruction-tuning.</p>
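          <p>As a rough illustration of the paired comparison with McNemar’s test described above, the sketch below computes an exact (binomial) McNemar test from per-instance correctness flags of two models scored on the same evaluation set. The function name mcnemar_exact and the toy inputs are hypothetical, and the exact variant is an assumption: the text does not state which form of the test was used.</p>
          <preformat><![CDATA[
```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-instance correctness flags.

    Returns (b, c, p_value), where b counts instances only model A answered
    correctly and c counts instances only model B answered correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same instances")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if other and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two models agree on every instance

    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Toy flags (hypothetical): True means the model answered correctly.
    setup_one = [True, True, False, True, False, True, True, False]
    setup_two = [True, False, False, True, True, True, True, False]
    b, c, p_value = mcnemar_exact(setup_one, setup_two)
    print(f"b={b}, c={c}, p={p_value:.3f}")
```
]]></preformat>
          <p>A p-value above 0.05 from such a test would be read, as in the text, as no discernible difference between the two setups on that task.</p>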
          <p>Figure 2 example. Source: https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg</p>
          <p>Short Answer Question: Quante persone ci sono in questa immagine? Rispondi brevemente.
English Translation: How many people are there in the image? Answer briefly.</p>
          <p>LLaVA-NDiNO PT + IT Answer: 1.</p>
          <p>English Translation: 1.</p>
          <p>LLaVA-NDiNO PT + IT + LONG-IT Answer: C’è una persona in questa immagine.</p>
          <p>English Translation: There is one person in this image.</p>
          <p>Long Answer Question: Cosa c’è di strano in questa immagine?
English Translation: What is strange about this image?</p>
          <p>LLaVA-NDiNO PT + IT Answer: Un uomo è seduto su una sedia a rotelle che lava i panni.
English Translation: A man is sitting in a wheelchair washing clothes.</p>
          <p>LLaVA-NDiNO PT + IT + LONG-IT Answer: L’immagine è strana perché mostra un uomo che asciuga
le camicie mentre è in piedi sulla parte superiore di un camion giallo, che è un modo insolito e non
convenzionale per asciugare le camicie.</p>
          <p>English Translation: The image is strange because it shows a man drying shirts while standing on top of a
yellow truck, which is an unusual and unconventional way to dry shirts.</p>
          <p>We perform four different training procedures:
• LLaVA-NDiNO LONG-IT: only MLP pre-training and long instruction-tuning have been
performed;
• LLaVA-NDiNO PT + LONG-IT: MLP pre-training, Italian language pre-training and long Italian
language instruction-tuning have been performed;
• LLaVA-NDiNO IT + LONG-IT: MLP pre-training, Italian language instruction-tuning and long
Italian language instruction-tuning have been performed;
• LLaVA-NDiNO PT + IT + LONG-IT: MLP pre-training, Italian language pre-training, Italian
language instruction-tuning and long Italian language instruction-tuning have been performed.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Datasets</title>
        <table-wrap>
          <table>
            <thead>
              <tr>
                <th>Model</th>
                <th>MTVQA-IT ↑</th>
                <th>V-EXAMS-IT ↑</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>mBlip T0 XL [<xref ref-type="bibr" rid="ref8">8</xref>]</td>
                <td>0.04</td>
                <td>0.20</td>
              </tr>
              <tr>
                <td>LLaVA-NDiNO IT</td>
                <td>0.15</td>
                <td>0.25</td>
              </tr>
              <tr>
                <td>LLaVA-NDiNO PT + IT</td>
                <td>0.17</td>
                <td>0.24</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>To evaluate the quality of long output generation, we use both the LLaVA-Bench and the MTVQA
datasets. LLaVA-Bench is selected for its inclusion of GPT-4V responses, allowing us to evaluate models
on long and descriptive answers. Meanwhile, MTVQA is used to extend the previous evaluation on
instruction-tuned models.</p>
        <p>In this case, we use Perplexity as the metric, to understand how certain a model is of the actual answer.
The question-answer pairs of the datasets are formatted using the previously presented prompts in the LLaMA
3 instruct format. We compute the perplexity of the model on the expected answer only, but conditioned
on the context of the question (that is, the loss is only computed on the answer tokens). Instances where
the Perplexity exceeds 1,000 are treated as outliers and skipped. We expect models trained on multiple
steps to have an overall lower Perplexity. The results of this evaluation step, shown in Table
7, align with the expectations: models subjected to long instruction-tuning have better performance on
LLaVA-Bench, while instruction-tuned models perform better on MTVQA. Furthermore, while in the
previous evaluation step there were no significant differences on the MTVQA dataset, these results
show that the instruction-tuned models have learned a different language distribution. This is
important since using a generation strategy different from greedy decoding can lead to notably different
outputs.</p>
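        <p>The answer-only perplexity described above can be sketched as follows. This is a minimal illustration and not the evaluation code used for the reported results: it assumes that per-token log-probabilities for the full prompt-plus-answer sequence have already been obtained from a model, the helper names answer_perplexity and mean_perplexity are hypothetical, and averaging the retained instances is an assumption, since the text does not say how per-instance values are aggregated.</p>
        <preformat><![CDATA[
```python
import math

# Threshold taken from the text: instances above it are treated as outliers.
PPL_OUTLIER_THRESHOLD = 1000.0


def answer_perplexity(token_logprobs, answer_mask):
    """Perplexity over answer tokens only, conditioned on the question.

    token_logprobs: natural-log probability of each token given its prefix.
    answer_mask: True for tokens of the expected answer, False for the prompt.
    """
    selected = [lp for lp, is_answer in zip(token_logprobs, answer_mask) if is_answer]
    if not selected:
        raise ValueError("the instance has no answer tokens")
    # Mean negative log-likelihood of the answer tokens, then exponentiate.
    return math.exp(-sum(selected) / len(selected))


def mean_perplexity(perplexities, threshold=PPL_OUTLIER_THRESHOLD):
    """Average per-instance perplexities, skipping those above the threshold."""
    kept = [p for p in perplexities if p <= threshold]
    if not kept:
        return float("nan")  # every instance was an outlier
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Toy instance (hypothetical values): two prompt tokens, two answer tokens.
    logprobs = [-5.0, -5.0, math.log(0.5), math.log(0.5)]
    mask = [False, False, True, True]
    print(answer_perplexity(logprobs, mask))
    print(mean_perplexity([2.0, 4.0, 5000.0]))
```
]]></preformat>
        <p>Masking the prompt tokens keeps the question as conditioning context while ensuring that only the expected answer contributes to the score.</p>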
        <p>Finally, we showcase two different examples to further illustrate the difference between models
trained on long output generation and the others. In Figure 2, we compare two of our models on
two different questions (one expecting a short answer, the other a long one) for the same image.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We introduce and release a family of LMMs trained for the Italian language. Specifically, we train the
models considering three different possible steps: Italian adaptation, Italian instruction-tuning and Italian
instruction-tuning for long responses. To train the models, we gather a large collection of state-of-the-art
datasets for the English language. Specifically, we use The Cauldron and LLaVA Conversation 58k for
instruction-tuning, and GQA, OK-VQA, SeedBench, POPE and LLaVA-Bench for evaluation. These
datasets are then translated using MADLAD, one of the most recent neural machine translation models.
We also collect natively Italian data to boost the quality of both training and evaluation. Specifically, we
collect MTVQA and V-EXAMS for both instruction-tuning and evaluation, as well as a rich pre-training
corpus consisting of image-text pairs from WiT and MultiEURLEX.</p>
      <p>We train several models in different configurations, that is, with multiple training steps using different
datasets. An extensive evaluation procedure compared our results with a popular multilingual and
multimodal model, namely mBlip. Results are promising against the baseline, but we noticed that for
most tasks there were no significant differences among the results of the instruction-tuned models. However,
we find relevant differences when evaluating the models using Perplexity.</p>
      <p>As future work, we plan to compare a model instruction-tuned
for both short and long answer generation in Italian at the same time against the proposed pipeline. We
also aim to study conversational multi-round multimodal models since, in this work, we focused on
single-round conversations.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6
Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.
Models are built on the Leonardo supercomputer with the support of CINECA-Italian Super Computing
Resource Allocation, class C project IscrC_LLMM (HP10CLKWTP).</p>
      <p>[29] H. Jhamtani, T. Berg-Kirkpatrick, Learning to describe differences between pairs of similar images,
in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2018.</p>
      <p>[30] M. Acharya, K. Kafle, C. Kanan, Tallyqa: Answering complex counting questions, in: AAAI, 2019.</p>
      <p>[31] Y. Zhu, O. Groth, M. Bernstein, L. Fei-Fei, Visual7W: Grounded Question Answering in Images, in:
IEEE Conference on Computer Vision and Pattern Recognition, 2016.</p>
      <p>[32] J. J. Lau, S. Gayen, A. Ben Abacha, D. Demner-Fushman, A dataset of clinically generated visual
questions and answers about radiology images, Scientific data 5 (2018) 1–10.</p>
      <p>[33] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, VQA: Visual Question
Answering, in: International Conference on Computer Vision (ICCV), 2015.</p>
      <p>[34] J. Tang, Q. Liu, Y. Ye, J. Lu, S. Wei, C. Lin, W. Li, M. F. F. B. Mahmood, H. Feng, Z. Zhao, Y. Wang,
Y. Liu, H. Liu, X. Bai, C. Huang, Mtvqa: Benchmarking multilingual text-centric visual question
answering, 2024. arXiv:2405.11985.</p>
      <p>[35] R. J. Das, S. E. Hristov, H. Li, D. I. Dimitrov, I. Koychev, P. Nakov, Exams-v: A
multi-discipline multilingual multimodal exam benchmark for evaluating vision language models, 2024.
arXiv:2403.10378.</p>
      <p>[36] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, O. Firat,
Madlad-400: A multilingual and document-level large audited dataset, Advances in Neural
Information Processing Systems 36 (2024).</p>
      <p>[37] A. Dubey, et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783.
arXiv:2407.21783.</p>
      <p>[38] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, G. Krueger, I. Sutskever, Learning transferable visual models from natural language
supervision, 2021. URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020.</p>
      <p>[39] D. Croce, L. C. Passaro, A. Lenci, R. Basili, Gqa-it: Italian question answering on image scene
graphs, Computational Linguistics CliC-it 2021 (2022) 92.</p>
      <p>[40] D. A. Hudson, C. D. Manning, Gqa: A new dataset for real-world visual reasoning and compositional
question answering, Conference on Computer Vision and Pattern Recognition (CVPR) (2019).</p>
      <p>[41] K. Marino, M. Rastegari, A. Farhadi, R. Mottaghi, Ok-vqa: A visual question answering benchmark
requiring external knowledge, in: Conference on Computer Vision and Pattern Recognition
(CVPR), 2019.</p>
      <p>[42] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, Y. Shan, Seed-bench: Benchmarking multimodal llms with
generative comprehension, arXiv preprint arXiv:2307.16125 (2023).</p>
      <p>[43] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, J.-R. Wen, Evaluating object hallucination in large
vision-language models, in: The 2023 Conference on Empirical Methods in Natural Language
Processing, 2023. URL: https://openreview.net/forum?id=xozJw0kZXF.</p>
      <p>[44] B. Li, P. Zhang, K. Zhang, F. Pu, X. Du, Y. Dong, H. Liu, Y. Zhang, G. Zhang, C. Li, Z. Liu,
Lmms-eval: Accelerating the development of large multimoal models, 2024. URL:
https://github.com/EvolvingLMMs-Lab/lmms-eval.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Hromei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stranisci</surname>
          </string-name>
          ,
          <article-title>Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</article-title>
          ,
          <source>in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>26296</fpage>
          -
          <lpage>26306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2307.09288. arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , et al.,
          <source>Gpt-4 technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sharir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zelnik-Manor</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words, what is a video worth?</article-title>
          ,
          <source>arXiv preprint arXiv:2103.13915</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2407.07895</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Geigle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Timofte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Glavaš</surname>
          </string-name>
          ,
          <article-title>mBLIP: Efficient bootstrapping of multilingual vision-LLMs</article-title>
          , in: J. Gu, T.-J. R. Fu, D. Hudson, A. Celikyilmaz, W. Wang (Eds.),
          <source>Proceedings of the 3rd Workshop on Advances in Language and Vision Research</source>
          (ALVR),
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>25</lpage>
          . URL: https://aclanthology.org/2024.alvr-1.2.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Team</surname>
          </string-name>
          , et al.,
          <article-title>No language left behind: Scaling human-centered machine translation</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2207.04672. arXiv:2207.04672.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Djolonga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Padlewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Changpinyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          , et al.,
          <article-title>Pali-x: On scaling up a multilingual vision and language model</article-title>
          ,
          <source>arXiv preprint arXiv:2305.18565</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Changpinyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piergiovanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Padlewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Salz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Grycner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          , et al.,
          <article-title>Pali: A jointly-scaled multilingual language-image model</article-title>
          ,
          <source>arXiv preprint arXiv:2209.06794</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Won</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>X-LLaVA: Optimizing bilingual large vision-language alignment</article-title>
          , in: K. Duh,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          . URL: https://aclanthology.org/2024.findings-naacl.158. doi:10.18653/v1/2024.findings-naacl.158.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>Cross-lingual language model pretraining</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Laurençon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tronchon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <article-title>What matters when building vision-language models?</article-title>
          ,
          <year>2024</year>
          . arXiv:2405.02246.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Raman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:2103.01913</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>Multieurlex - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2109.00904.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Antonova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Adapting the tesseract open source ocr engine for multilingual ocr</article-title>
          , in: V. Govindaraju, P. Natarajan, S. Chaudhury, D. P. Lopresti (Eds.),
          <source>MOCR '09: Proceedings of the International Workshop on Multilingual OCR</source>
          , ACM International Conference Proceeding Series, ACM,
          <year>2009</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/35248.pdf. doi:10.1145/1577802.1577804.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Marino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mottaghi</surname>
          </string-name>
          ,
          <article-title>A-okvqa: A benchmark for visual question answering using world knowledge</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning</article-title>
          , in:
          <source>CVPR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Exploring models and data for image question answering</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kazemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Alvari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <article-title>GeomVerse: A systematic evaluation of large models for geometric reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2312.12241</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>IconQA: A new benchmark for abstract diagram understanding and visual language reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2110.13214</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning</article-title>
          , in:
          <source>The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pont-Tuset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uijlings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Changpinyo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <article-title>Connecting vision and language with localized narratives</article-title>
          , in:
          <source>ECCV</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>MIMIC-IT: Multi-modal in-context instruction tuning</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.05425. arXiv:2306.05425.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Suhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>A corpus of natural language for visual reasoning</article-title>
          , in:
          <source>Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          . URL: https://api.semanticscholar.org/CorpusID:19435386.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>RAVEN: A dataset for relational and analogical visual reasoning</article-title>
          , in:
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>