<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Elio</forename><surname>Musacchio</surname></persName>
							<email>elio.musacchio@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Italian National PhD Program in Artificial Intelligence</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<settlement>Bari</settlement>
									<country key="IT">ITALY</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Siciliani</surname></persName>
							<email>lucia.siciliani@uniba.it</email>
							<affiliation key="aff1">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">ITALY</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>pierpaolo.basile@uniba.it</email>
							<affiliation key="aff1">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">ITALY</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Semeraro</surname></persName>
							<email>giovanni.semeraro@uniba.it</email>
							<affiliation key="aff1">
								<orgName type="department">Dept. of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">ITALY</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7A0053348845102B9695B587AA785BCE</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>NLP</term>
					<term>Multimodality</term>
					<term>LLM</term>
					<term>LMM</term>
					<term>LVLM</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Since their inception, large language models have undergone many innovations, one of which concerns multimodality. Several adaptation strategies have been developed to extend LLMs to process multimodal signals. However, in the current literature the training procedure for these multimodal models is performed on English-only vision-language datasets, limiting their capabilities in other languages. This work proposes the first family of LMMs for the Italian language. We trained them using state-of-the-art backbone models and datasets, translated into Italian using the most up-to-date machine translation model available. In support of open science, we publicly release the data, models, and code used to develop these models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) have attracted increasing research interest due to their generalization capabilities, which allow them to solve tasks never seen during training. However, their capabilities are limited to the textual domain. In light of this, researchers have started proposing solutions to bridge the gap between the textual world and the others (e.g. visual or aural). Specifically, instead of pre-training a new model with multimodal capabilities from scratch, these solutions leverage a pre-trained decoder-only LLM. This is both cost-efficient, avoiding the expensive procedures of full multimodal training, and effective, as many of these solutions report excellent results.</p><p>In this work, we focus on the vision-language world, specifically Large Vision Language Models (LVLMs). These models are often trained following a traditional two-step approach: pre-training followed by fine-tuning. However, one notable issue is that the vision-language training mixture often consists of curated and selected datasets that predominantly feature English text, as seen in models like LLaVA <ref type="bibr" target="#b1">[2]</ref>. This further propagates an inherent problem of these large models, whose pre-training corpus mainly consists of English data. For example, LLaMA 2 <ref type="bibr" target="#b2">[3]</ref>, an LLM by Meta, was pre-trained on a corpus that is 89.70% English and 8.38% unknown languages (e.g. programming code). As a result, even the developers of the models explicitly state that their usage is intended for English use cases only.</p><p>Furthermore, there is a significant gap due to the absence of large-scale, multitask and multilingual datasets. 
While the English vision-language datasets are conceptually diverse and rich (e.g., scientific question answering, OCR), non-English datasets tend to be limited in scope, focusing on specific high-level tasks (e.g., image captioning, visual question answering).</p><p>For these reasons, there are currently very few LVLMs in the state-of-the-art for non-English languages. While some models support multilingual and multimodal data, they often fall behind their English counterparts in terms of architecture performance and training data quality. The reasons are twofold: new LLMs are constantly being released, and the available training data is of lower quality, covering only high-level tasks due to the scarcity of data. Furthermore, current multilingual and multimodal benchmarks are not as conceptually rich as English ones, making evaluation of these models more difficult for non-English languages.</p><p>Therefore, in this work, we propose an approach to train and evaluate an LVLM for the Italian language. We also release LLaVA-NDiNO (Large Language and Vision Assistant: New Domain integration for Natural Observations), the first family of openly-available Italian LVLMs trained and evaluated by following the proposed approach. While this approach heavily relies on the use of machine translation, we show that even when using machine-translated datasets at training time it is possible to achieve remarkable performance during evaluation on datasets that are natively in the Italian language. Specifically, the contributions of this work are the following:</p><p>• We apply a vision-language adaptation step designed to improve the performance of the model for a specific language. We compare the performance of a model trained using this additional step w.r.t. one trained without it. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>LVLMs began to see widespread success following the release of GPT-4V <ref type="bibr" target="#b3">[4]</ref>, the OpenAI model supporting vision-language inputs. However, since the model is proprietary, possibilities for research are relatively limited. Because of this, many works proposed open-source solutions, trying to match the performance obtained by GPT-4V on state-of-the-art benchmarks. One of the most popular solutions in this field of research is LLaVA <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b1">2]</ref>. The model uses a projection module (either a projection matrix in its first version or a Multi-Layer Perceptron in version 1.5) to project the visual embeddings extracted from a visual encoder into the latent space of the LLM. This approach is simple and efficient, since it only relies on a single projection module. However, the original LLaVA architecture, as well as other LVLMs, struggled with high-resolution image tasks due to the requirements imposed by vision encoders. This is because vision encoders, like the Vision Transformer (ViT) <ref type="bibr" target="#b5">[6]</ref>, are trained on a fixed image size. Therefore, during inference or embedding extraction, the same image size is expected as input. To overcome this limitation, LLaVA-NeXT <ref type="bibr" target="#b6">[7]</ref> was developed. In this model, the image is split into grids of fixed size and the embeddings for each grid are extracted and concatenated. Finally, the original image is resized and its embeddings are extracted and concatenated to the previous output. This technique allows the model to better understand the overall visual characteristics of the input images. However, all of the LLaVA models were trained on English-only vision-language data. Specifically, an instruction-tuning approach over a rich set of vision-language tasks was performed. 
Therefore, while the LLaVA models perform well on English tasks, the lack of curated multilingual vision-language instruction-tuning datasets makes it challenging to train multilingual LVLMs on a set of conceptually diverse tasks. In light of this, some works focus on multilingual training procedures for LVLMs. Geigle et al. <ref type="bibr" target="#b7">[8]</ref> released mBlip, a version of the BLIP 2 <ref type="bibr" target="#b8">[9]</ref> model trained on an English vision-text dataset machine-translated to 95 different languages. To do so, the authors used the nllb-200-distilled-1.3B <ref type="bibr" target="#b9">[10]</ref> neural machine translation model. There is also Pali-X <ref type="bibr" target="#b10">[11]</ref>, where the vision and language components are jointly scaled, following the work done in Pali <ref type="bibr" target="#b11">[12]</ref>. The model is pre-trained on a rich range of datasets, among which there is WebLI <ref type="bibr" target="#b11">[12]</ref>, a rich corpus consisting of images with alt-texts from the web and OCR annotations obtained from the Google Cloud Vision API, covering a total of 100 languages. Finally, there is X-LLaVA <ref type="bibr" target="#b12">[13]</ref>, where the authors adapted LLaVA 1.5 by expanding its dictionary for English and Korean and performing a language adaptation step based on the one performed by Conneau and Lample <ref type="bibr" target="#b13">[14]</ref>, that is, pre-training on a data corpus extracted from Wikipedia.</p><p>Regarding the datasets used to train these models, for LLaVA 1.5 a mixture of English-only vision-language datasets was used. Specifically, the mixture contained 158,000 GPT-generated multimodal instruction-following data instances, 450,000 academic-task-oriented visual question answering data instances and 40,000 ShareGPT data instances. Laurençon et al. 
<ref type="bibr" target="#b14">[15]</ref> released The Cauldron, a collection of 50 different datasets pre-formatted for instruction-tuning. This dataset was used to train the Idefics 2 <ref type="bibr" target="#b14">[15]</ref> model. The dataset consists of state-of-the-art vision-language datasets and covers a wide array of conceptual tasks. Specifically, the authors identify the following categories: general visual question answering, captioning, OCR, document understanding, text transcription, chart/figure understanding, table understanding, reasoning, logic, maths, textbook/academic questions, differences between two images, screenshot to code.</p><p>Despite all this, best practices regarding language adaptation of LVLMs are still unclear.</p></div>
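The LLaVA-NeXT grid strategy described above can be illustrated with a small helper that computes the fixed-size tiles whose embeddings are extracted and concatenated. This is a simplified sketch: the 336-pixel cell size matches common ViT inputs, while the actual LLaVA-NeXT grid selection and resizing logic is more involved.

```python
def grid_cells(width: int, height: int, cell: int = 336):
    """Tile an image's bounding box into fixed-size cells (left, top, right,
    bottom). Each cell would be encoded separately by the vision encoder; the
    resized full image is encoded and appended afterwards (not shown here)."""
    boxes = []
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            boxes.append((left, top,
                          min(left + cell, width), min(top + cell, height)))
    return boxes
```

For a 672x672 image this yields a 2x2 grid of 336-pixel tiles, so the encoder never sees an input larger than its fixed training resolution.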
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>We define three different steps in our methodology:</p><p>• Italian vision-language pre-training: training the model to optimize its general understanding of the Italian language; • Italian vision-language instruction-tuning: fine-tuning the model on task-specific vision-language data to improve its performance in following instructions; • Italian vision-language long instruction-tuning: fine-tuning the model to produce long outputs in response to instructions.</p><p>We adapt a pre-trained decoder LLM and a pre-trained encoder vision transformer to the Italian language by performing an Italian vision-language pre-training approach. This is based on an approach used for LLMs, which consists of further training the model on a wide corpus of generic data in a specific language <ref type="bibr" target="#b13">[14]</ref>. In this step, we follow the same approach but use vision-text data instead. Specifically, we directly take an English pre-trained decoder LLM and an English pre-trained vision encoder and perform joint language adaptation on both of them, as well as on the adaptation module, using a collection of image-text pairs natively in Italian. We expect the model pre-trained on Italian data to perform better on Italian vision-language tasks, thanks to the additional knowledge it has gained.</p><p>Furthermore, while instruction-tuning datasets are often unavailable in multiple languages, vision-language pre-training data is. Thanks to this, data quality during pre-training is guaranteed, since the text is natively in Italian. However, the situation is different for instruction-tuning. Due to the lack of Italian instruction-tuning datasets, we must rely on machine translation. While data quality will suffer from this, this approach is the only one that allows us to obtain the large quantity of data needed to achieve the generalization capabilities of LVLMs. 
Finally, we also perform further instruction-tuning for long response generation. This is because humans tend to prefer long and descriptive answers when interacting with LLMs and LVLMs. We decided to use the LLaVA-NeXT architecture since it is one of the most recent LVLMs available in the state-of-the-art. We detail all the steps we carried out, from data collection to evaluation.</p></div>
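The three training steps above can be summarized as a minimal ordered configuration. The structure itself is ours; the stage names follow Section 3 and the dataset labels follow Section 3.1.

```python
# Ordered sketch of the training recipe: each stage and the data it consumes.
ITALIAN_LVLM_PIPELINE = [
    {"stage": "Italian vision-language pre-training",
     "data": "native Italian image-text data (WIT, MultiEURLEX)"},
    {"stage": "Italian vision-language instruction-tuning",
     "data": "The Cauldron (machine-translated) + MTVQA and V-EXAMS train sets"},
    {"stage": "Italian vision-language long instruction-tuning",
     "data": "LLaVA Conversation 58k (machine-translated)"},
]

def stage_order(pipeline):
    """Return the ordered stage names of the pipeline."""
    return [step["stage"] for step in pipeline]
```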
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset Creation</head><p>For the Italian language pre-training dataset, following the best practices by Laurençon et al. <ref type="bibr" target="#b14">[15]</ref>, we set up three conceptually different datasets: interleaved image-text documents, image-text pairs and PDF documents. For interleaved image-text documents and image-text pairs, we use the WIT <ref type="bibr" target="#b15">[16]</ref> dataset, a collection of images and their associated text sections obtained from Wikipedia pages in multiple languages. Specifically, after collecting the Italian portion of the dataset, we use the text of a section where an image appears as an interleaved image-text document and the caption of the image as an image-text pair. Note that for interleaved image-text documents we only use a single image-text section pair, rather than multiple sections from the same Wikipedia page. For PDF documents, there are no multilingual datasets in the literature meeting these criteria. In particular, there are no handwritten datasets of this type, only typewritten ones. Therefore, we decided to use MultiEURLEX <ref type="bibr" target="#b16">[17]</ref>, a corpus containing European laws in 23 languages. While this corpus is typewritten only, we prefer to include it in the pre-training dataset rather than not covering OCR at all. We retrieve the Italian PDF files associated with the corresponding CELEX_ID and extract the text from each document using Tesseract <ref type="bibr" target="#b17">[18]</ref>. We also filter the dataset to control the distribution of these different sets. 
The pre-train dataset consists of 250,000 instances, of which 168,000 are interleaved image-text documents, 72,000 are image-text pairs, and 10,000 are PDF documents.</p><p>For the Italian language instruction-tuning dataset, we use The Cauldron <ref type="bibr" target="#b14">[15]</ref>, a collection of 50 vision-language datasets already formatted for instruction-tuning. Since the dataset is in English, we use machine translation to Italian. Details regarding the machine translation procedure are discussed in Section 3.2. However, we first perform a filtering step over the 50 available tasks. This is because many tasks would lose their meaning when translated from English to another language (e.g. extraction of information from the image of a table where the text is in English). Because of this, we remove all tasks which focus on images containing English text (e.g. docvqa or ocrvqa). After performing this manual filtering step, we have a total of 15 tasks. For each task, we select the first 10,000 rows of the dataset and perform machine translation on each instance in each row (more than one text-vision pair can be present for each row). Additionally, we add the train sets of MTVQA and V-EXAMS, datasets that are natively in Italian. This increases both the quality of the instruction-tuning dataset, as these datasets are not machine translated, and its concept distribution, since two new tasks are added. MTVQA is the only dataset containing Italian visual text extraction and V-EXAMS is the only dataset containing Italian academic visual question answering. In total, the instruction-tuning dataset consists of 260,302 instances.</p><p>For the Italian language long instruction-tuning dataset, we use LLaVA Conversation 58k <ref type="bibr" target="#b4">[5]</ref>, a subset of the LLaVA Instruct 150K dataset consisting of 58k conversations generated using GPT-4V for conversational purposes. 
Again, since the dataset is in English, we perform machine translation.</p><p>Finally, for evaluation, we collect the OK-VQA, SeedBench and POPE datasets, which are popular benchmarks for English LVLMs in the literature. We machine-translate them into Italian as well. We also collect the test sets of MTVQA, V-EXAMS and GQA-it.</p><p>We provide an overview of the 15 datasets from The Cauldron used for the instruction-tuning step in Table <ref type="table">1</ref>. We also provide the same details for the natively Italian datasets in Table <ref type="table">2</ref> and for the evaluation datasets in Table <ref type="table">4</ref>.</p></div>
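The task filtering and subsampling of The Cauldron can be illustrated as follows. The task table and row counts below are purely illustrative placeholders, not the actual dataset statistics; the English-text flags mirror the manual filtering decision described above.

```python
# Hypothetical task table: name -> (images contain English text?, row count).
CAULDRON_TASKS = {
    "aokvqa": (False, 17_056),
    "clevr":  (False, 70_000),
    "docvqa": (True, 10_194),   # dropped: English text in the images
    "ocrvqa": (True, 165_746),  # dropped: English text in the images
}

def select_tasks(tasks, cap=10_000):
    """Drop tasks whose images contain English text, then keep at most the
    first `cap` rows of each remaining task, as in Section 3.1."""
    return {name: min(rows, cap)
            for name, (english_text, rows) in tasks.items()
            if not english_text}
```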
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Translation</head><p>To translate the data, we use one of the most recent openly available machine translation models, MADLAD-400 3B<ref type="foot" target="#foot_1">2</ref> <ref type="bibr" target="#b35">[36]</ref>. To accomplish this task, we use a cluster equipped with multiple NVIDIA A16 16GB VRAM GPUs. We use 4 GPUs in parallel and perform inference with a batch size per device of 4.</p><p>To translate the data from The Cauldron, we directly use the formatted instruction pairs present in the dataset. By doing so, the answer is translated with the context given by the question, reducing the possibility of a translation error. We do the same for closed-ended tasks, where a list of options is given in the question. However, this procedure may introduce translation errors: for example, the translated options of a closed-ended task might not align with the original content, yielding more options than in the original text. To avoid this issue, we check via regex matching that: 1) the question or instruction is present at the beginning; 2) the number of</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dataset # Train Translated Description</head><p>A-OKVQA <ref type="bibr" target="#b18">[19]</ref> 10,107</p><p>VQA dataset requiring world knowledge and common sense for a correct answer.</p><p>CLEVR <ref type="bibr" target="#b19">[20]</ref> 92,670 VQA dataset designed for visual reasoning regarding objects in images.</p><p>COCO-QA <ref type="bibr" target="#b20">[21]</ref> 16,167 VQA dataset containing descriptive and rich question-answer pairs.</p><p>Geomverse <ref type="bibr" target="#b21">[22]</ref> 3,324 VQA dataset regarding geometric reasoning.</p><p>IconQA <ref type="bibr" target="#b22">[23]</ref> 10,980 VQA dataset regarding abstract diagram understanding.</p><p>InterGPS <ref type="bibr" target="#b23">[24]</ref> 1,498</p><p>VQA dataset regarding geometric reasoning, annotated in a formal language. Localized Narratives <ref type="bibr" target="#b24">[25]</ref> 9,178 VQA dataset designed to provide rich descriptions of image contents.</p><p>Mimic CGD <ref type="bibr" target="#b25">[26]</ref> 16,807</p><p>VQA dataset designed to enhance the performance of vision language models in real-life scenarios.</p><p>NLVR2 <ref type="bibr" target="#b26">[27]</ref> 18,363</p><p>VQA dataset regarding truthfulness of a natural language sentence about a pair of photographs.</p><p>Raven <ref type="bibr" target="#b27">[28]</ref> 9,216 VQA dataset regarding Raven's Progressive Matrices. 
Spot the Difference <ref type="bibr" target="#b28">[29]</ref> 9,187 VQA dataset regarding differences between two images.</p><p>TallyQA <ref type="bibr" target="#b29">[30]</ref> 14,024 VQA dataset regarding complex counting questions of objects in images.</p><p>Visual7w <ref type="bibr" target="#b30">[31]</ref> 43,228</p><p>VQA dataset regarding object-level grounding, using questions that start with one of what, where, when, who, why, how and which.</p><p>VQArad <ref type="bibr" target="#b31">[32]</ref> 739 VQA dataset regarding radiology images.</p><p>VQAv2 <ref type="bibr" target="#b32">[33]</ref> 1,563</p><p>VQA dataset requiring understanding of vision, language and commonsense knowledge to answer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Overview of all datasets from The Cauldron used during the instruction-tuning procedure of our models. # Train Translated is the total number of translated instances obtained from the original first 10k rows of each dataset.</p><p>options is the same before and after translation; 3) the answer is present at the end of the translated string. Whenever a check is not passed, the translated instance is removed from the dataset. We follow the same procedure to translate the evaluation benchmarks. Because of this, some of these translated datasets may have a different cardinality w.r.t. the original ones.</p><p>For LLaVA Conversation 58k we directly translate the user question and the system response. By testing the model, we noticed that translation errors are frequent when a newline character is present in the input. Therefore, we split inputs when two consecutive newline characters are present and further split the output when a single newline character is present. The obtained strings are translated and the original newline characters are progressively re-inserted for each translated instance, effectively recreating the original formatting of the string in the target language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dataset</head><p>V-EXAMS: VQA dataset of multilingual school exam questions. The dataset is obtained from real exam questions for each language.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Overview of all datasets natively in Italian used during the instruction-tuning procedure of our models.</p></div>
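The three regex-based sanity checks on translated instances can be sketched as follows. The option-marker format ("A." or "A)" at the start of a line) is our assumption for illustration; the paper's exact patterns may differ.

```python
import re

def count_options(text: str) -> int:
    """Count multiple-choice option markers such as 'A.' or 'B)' at line
    starts (illustrative marker format, not the paper's exact regex)."""
    return len(re.findall(r"(?m)^[A-Z][.)]\s", text))

def keep_translation(original: str, translated: str,
                     question_it: str, answer_it: str) -> bool:
    """Apply the three sanity checks: instances failing any check are
    dropped from the dataset."""
    if not translated.startswith(question_it):                 # check 1
        return False
    if count_options(original) != count_options(translated):   # check 2
        return False
    if not translated.rstrip().endswith(answer_it):            # check 3
        return False
    return True
```

A dropped option or a hallucinated extra option changes the option count, so the instance is discarded rather than kept with a corrupted answer space.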
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Training Details</head><p>We distinguish between four training steps:</p><p>• MLP pre-training: the weights of the MLP module are initialized, following the strategy described by Liu et al. <ref type="bibr" target="#b1">[2]</ref>; • Italian language pre-training: we adapt the model to the Italian language by further training the English backbones on a mixture of native Italian text-vision data; • Italian language instruction-tuning: we optimize the performance of the model in providing meaningful responses by performing instruction-tuning; • Italian language long instruction-tuning: we optimize the performance of the model in providing meaningful and descriptive responses by performing instruction-tuning.</p><p>For the Multi-Layer Perceptron (MLP) pre-training step, we use the same dataset as Liu et al. <ref type="bibr" target="#b1">[2]</ref>, that is, LCS-558K. It is a subset of the LAION/CC/SBU dataset, filtered for a more balanced concept coverage distribution and augmented with BLIP synthetic captions. We follow the procedure described in LLaVA 1.5 for this step.</p><p>Then, we perform our training using the translated Cauldron dataset on LLaMA 3 8B base <ref type="bibr" target="#b36">[37]</ref> as LLM and CLIP ViT large-patch14-336 <ref type="bibr" target="#b37">[38]</ref> as vision encoder. This follows the configuration used by LLaVA-NeXT, except for the LLM. We decided to use the base version instead of the instruct one since, having to perform pre-training, we found the base version of the model more fitting for this purpose.</p><p>We train all models for a direct response in a single-round user-system conversational setting. Specifically, we use two prompt formats: plain for the MLP and Italian pre-training, and the LLaMA 3 instruct format without system prompt for instruction-tuning. 
These prompt formats are shown in Listing 1 and 2.</p><p>A diagram presenting an overview of the entire training pipeline is shown in Figure <ref type="figure" target="#fig_0">1</ref>. For all models, we perform full-parameter training. Regarding additional technical details, we report hyperparameters used in Table <ref type="table" target="#tab_3">3</ref>. The training was run on a cluster with 4 NVIDIA A100 64 GB GPUs per node. Specifically, we use 2 nodes for a total of 8 GPUs. We use a server with 8 NVIDIA A16 16 GB GPUs for evaluation, running the procedure on 4 GPUs.</p></div>
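The two prompt formats can be sketched as follows. The special-token strings follow the published LLaMA 3 instruct chat template; the exact plain format used for pre-training is a simplifying assumption (image placeholder followed by the raw text).

```python
def plain_prompt(caption: str) -> str:
    """'Plain' format for the MLP and Italian pre-training steps: the image
    placeholder followed by raw text (a simplified sketch)."""
    return "<image>\n" + caption

def llama3_instruct_prompt(user_msg: str, assistant_msg: str) -> str:
    """LLaMA 3 instruct chat format without a system prompt, as used for the
    instruction-tuning steps (token names from the LLaMA 3 template)."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        + user_msg + "<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        + assistant_msg + "<|eot_id|>"
    )
```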
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Instruction-tuning and Evaluation</head><p>To assess the performance of the pre-trained model, we perform two different training procedures:</p><p>• LLaVA-NDiNO IT: only MLP pre-training and instruction-tuning have been performed; • LLaVA-NDiNO PT + IT: MLP pre-training, Italian vision-language pre-training and instruction-tuning have been performed.</p><p>To evaluate the models, we distinguish between two different benchmarks:</p><p>• Machine-translated state-of-the-art benchmarks: we use some of the most popular benchmarks for evaluation of LVLMs translated to the Italian language; • Natively Italian benchmarks: we use benchmarks that include Italian text-vision data instances where the text is originally written in Italian.</p><p>For evaluation, we use lmms-eval<ref type="foot" target="#foot_2">3</ref> <ref type="bibr" target="#b43">[44]</ref>, a fork of lm-eval-harness<ref type="foot" target="#foot_3">4</ref>, a library for the evaluation of LLMs, adapted to LVLMs. We create custom tasks to evaluate the models on Italian datasets.</p><p>The first set of benchmarks gives us conceptual coverage broadly comparable to the state-of-the-art, since the datasets that we consider cover the diverse skills of the models. We provide an overview of the tasks alongside their cardinality in Table <ref type="table">4</ref>.</p><p>Instead, the second set of benchmarks allows us to understand whether training on machine-translated data severely affects performance, as these datasets are natively in the Italian language. For this purpose we use the test sets of the previously presented MTVQA and V-EXAMS datasets, keeping only the Italian instances of these multilingual datasets. To understand whether our trained models excel in the Italian language, we compare our results with the mBlip T0 <ref type="bibr" target="#b7">[8]</ref> model, a multilingual vision-language model which includes Italian among its training languages. 
For the evaluation metrics, in all cases we use exact match for open-ended tasks and accuracy for closed-ended ones. The only exception is POPE, for which we report the F1 score. All metrics reflect common best practices used for the original datasets in the English language. We followed the same evaluation design for MTVQA and V-EXAMS as well.</p><p>Analyzing the results, both our models outperform the baseline in all tasks. Remarkably, while the mBlip model performs very poorly on the MTVQA dataset, both our models show improvements.</p></div>
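The exact-match metric used for open-ended tasks can be sketched as follows. The normalization rules (lowercasing, ASCII punctuation stripping, whitespace collapsing) are a common VQA convention and an assumption on our part, not the paper's exact implementation.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace before
    comparison (assumed normalization; accented characters are preserved)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, gold: str) -> bool:
    """Open-ended answers count as correct only if they match exactly after
    normalization."""
    return normalize(prediction) == normalize(gold)
```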
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Dataset # Original # IT MT Description</head><p>GQA-it <ref type="bibr" target="#b38">[39,</ref><ref type="bibr" target="#b39">40]</ref> 12,578 -Open-ended VQA dataset regarding compositional questions of real-world images, specifically regarding objects, attributes and relations in the images.</p><p>OK-VQA <ref type="bibr" target="#b40">[41]</ref> 5,050 5,046</p><p>Open-ended VQA dataset regarding questions where the model needs to have external knowledge in order to answer. SeedBench <ref type="bibr" target="#b41">[42]</ref> 18,000 2,496 Closed-ended VQA multiple-choice dataset regarding temporal and spatial questions.</p><p>POPE <ref type="bibr" target="#b42">[43]</ref> 9,000 9,000</p><p>Open-ended VQA dataset regarding object hallucination (answer is expected to be either 'Yes' or 'No').</p><p>LLaVA-Bench <ref type="bibr" target="#b4">[5]</ref> 60 60</p><p>Open-ended VQA dataset to test the abilities of the models in solving challenging tasks, thanks to a highly-detailed and manually-curated description and a proper selection of questions for each instance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Overview of all datasets machine translated to the Italian language used for evaluation. # Original and # IT MT are the number of instances in the original dataset and in the machine-translated one, respectively. For GQA-it we report the original cardinality.</p><p>Results obtained for evaluation datasets machine translated to the Italian language. &lt;DATASET_NAME&gt;-IT refers to the machine-translated version of the original dataset. For GQA-IT, OK-VQA-IT and SeedBench-IT the metric is exact match; for POPE-IT the metric is Accuracy. ↑ indicates that the greater the value, the better the performance. The asterisk indicates statistical significance between the two LLaVA-NDiNO model results for that dataset.</p><p>However, for both LLaVA-NDiNO models, average results are fairly similar regardless of the pre-training step. In light of this, we perform statistical testing using McNemar's test. The test reveals that for most tasks the p-value is greater than 0.05; therefore, there are no discernible differences between the two setups. We believe this is due to the nature of the evaluation tasks, since the model only needs to pick the correct option or generate a simple word or phrase. Such tasks are not useful for evaluating the quality of the pre-training. We therefore perform an additional experiment to assess the models' performance on longer and richer textual descriptions.</p></div>
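McNemar's test mentioned above can be computed directly from the discordant prediction counts of the two model variants on the same evaluation instances. A minimal exact (binomial, two-sided) version is shown below; the paper does not specify which variant of the test was used, so this is one standard choice.

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value. b = instances the first model gets
    right and the second wrong; c = the reverse. Concordant pairs (both right
    or both wrong) do not enter the statistic."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the models are indistinguishable
    # Two-sided exact binomial test with p = 0.5 over the discordant pairs.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

With balanced discordant counts the p-value is 1.0 (no difference), while a strongly one-sided split drives it below the 0.05 threshold used in the analysis above.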
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Instruction-tuning and Evaluation for Long Output Generation</head><p>For this step, we further train our models for long response generation. Specifically, we use data from LLaVA Conversation 58k, extracting user-question and system-answer pairs to use as single-round interactions. After extracting the single-round instances, we perform training following the same procedure used for instruction-tuning. We perform four different training procedures:</p><p>Short Answer Question: Quante persone ci sono in questa immagine? Rispondi brevemente. English Translation: How many people are there in this image? Answer briefly.</p><p>LLaVA-NDiNO PT + IT Answer: 1. English Translation: 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLaVA-NDiNO PT + IT + LONG-IT Answer: C'è una persona in questa immagine.</head><p>English Translation: There is one person in this image.</p><p>Long Answer Question: Cosa c'è di strano in questa immagine? English Translation: What is strange about this image?</p><p>LLaVA-NDiNO PT + IT Answer: Un uomo è seduto su una sedia a rotelle che lava i panni. English Translation: A man is sitting in a wheelchair washing clothes.</p><p>LLaVA-NDiNO PT + IT + LONG-IT Answer: L'immagine è strana perché mostra un uomo che asciuga le camicie mentre è in piedi sulla parte superiore di un camion giallo, che è un modo insolito e non convenzionale per asciugare le camicie. English Translation: The image is strange because it shows a man drying shirts while standing on top of a yellow truck, which is an unusual and unconventional way to dry shirts. </p></div>
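The single-round extraction from LLaVA Conversation 58k described in Section 4.1.2 can be sketched as follows, assuming the usual LLaVA JSON layout where each record holds a "conversations" list of alternating "human"/"gpt" turns (the field names are assumptions based on the public LLaVA release, not stated in the paper):

```python
def extract_single_rounds(records):
    """Split multi-round conversations into single-round (question, answer) pairs.

    Each record is assumed to carry a "conversations" list of turns shaped like
    {"from": "human" | "gpt", "value": "..."}, as in the LLaVA data release.
    """
    pairs = []
    for record in records:
        turns = record["conversations"]
        # Pair each human turn with the model turn that follows it.
        for question, answer in zip(turns[::2], turns[1::2]):
            if question["from"] == "human" and answer["from"] == "gpt":
                pairs.append((question["value"], answer["value"]))
    return pairs
```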
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model</head><p>Table 6 reports the results for the natively Italian datasets (MTVQA-IT ↑ and V-EXAMS-IT ↑): mBlip T0 XL <ref type="bibr" target="#b7">[8]</ref> obtains 0.04 and 0.20, LLaVA-NDiNO IT 0.15 and 0.25, and LLaVA-NDiNO PT + IT 0.17 and 0.24. Table 7 reports the results obtained for the Perplexity evaluation of the models. &lt;DATASET_NAME&gt;-IT refers to the machine-translated version of the original dataset for LLaVA-Bench and to the filtered version with only Italian instances for MTVQA-IT. ↓ indicates that a lower value for the metric of that dataset means better performance.</p><p>In cases with ◇, Perplexity was always greater than the fixed threshold.</p><p>To evaluate the quality of long output generation, we use both the LLaVA-Bench and the MTVQA datasets. LLaVA-Bench is selected for its inclusion of GPT-4V responses, allowing us to evaluate models on long and descriptive answers. Meanwhile, MTVQA is used to extend the previous evaluation of instruction-tuned models.</p><p>In this case, we use Perplexity as the metric, to measure how confident a model is in the expected answer. The question-answer pairs of the datasets are formatted using the previously presented LLaMA 3 instruct format. We compute the perplexity of the model on the expected answer only, conditioned on the context of the question (that is, the loss is computed only on the answer tokens). Instances where the Perplexity exceeds 1,000 are treated as outliers and skipped. We expect models trained on multiple steps to have overall lower Perplexity. The results of this evaluation step, shown in Table <ref type="table" target="#tab_6">7</ref>, align with the expectations: models subjected to long instruction-tuning have better performance on LLaVA-Bench, while instruction-tuned models perform better on MTVQA. 
Furthermore, while in the previous evaluation step there were no significant differences on the MTVQA dataset, these results show that the instruction-tuned models have learned a different language distribution. This matters because a generation strategy other than greedy decoding can lead to notably different outputs.</p><p>Finally, we showcase two examples to further illustrate the difference between models trained on long output generation and the others. In Figure <ref type="figure" target="#fig_1">2</ref>, we compare two of our models answering two different questions (one expecting a short answer, the other a long one) about the same image.</p></div>
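The Perplexity computation described above, with loss on the answer tokens only, conditioned on the question, and with the 1,000 outlier cut-off, can be sketched as follows. It operates on per-token log-probabilities already produced by the model; the helper names are ours:

```python
import math

PPL_THRESHOLD = 1000.0  # instances above this are treated as outliers and skipped

def answer_perplexity(token_logprobs, answer_start):
    """Perplexity over the answer span only.

    token_logprobs holds the model's log-probability for every token of the
    formatted question-answer pair; answer_start is the index of the first
    answer token. Question tokens condition the model but are excluded from
    the loss, mirroring the masking described in the text.
    """
    answer_lps = token_logprobs[answer_start:]
    mean_nll = -sum(answer_lps) / len(answer_lps)
    return math.exp(mean_nll)

def mean_answer_perplexity(instances):
    """Average per-instance perplexity, skipping outliers above the threshold."""
    ppls = [answer_perplexity(lps, start) for lps, start in instances]
    kept = [p for p in ppls if p <= PPL_THRESHOLD]
    return sum(kept) / len(kept)
```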
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>We introduce and release a family of LMMs trained for the Italian language. Specifically, we train the models considering three different possible steps: Italian adaptation, Italian instruction-tuning and Italian instruction-tuning for long responses. To train the models, we gather a large collection of state-of-the-art datasets for the English language: The Cauldron and LLaVA Conversation 58k for instruction-tuning, and GQA, OK-VQA, SeedBench, POPE and LLaVA-Bench for evaluation. These datasets are then translated using MADLAD, one of the most recent neural machine translation models. We also collect natively Italian data to boost the quality of both training and evaluation: MTVQA and V-EXAMS for both instruction-tuning and evaluation, as well as a rich pre-training corpus consisting of image-text pairs from WiT and MultiEURLEX.</p><p>We train several models in different configurations, that is, with multiple training steps using different datasets. An extensive evaluation procedure compares our results with a popular multilingual and multimodal model, mBlip. Results are promising against the baseline, but for most tasks we found no significant differences between the results of the instruction-tuned models. However, we find relevant differences when evaluating the models using Perplexity.</p><p>As future work, we plan to investigate the performance difference between a model instruction-tuned simultaneously for both short and long answer generation in Italian and the proposed pipeline. We also aim to study conversational multi-round multimodal models since, in this work, we focused on single-round conversations.</p></div>
There are four total steps: English MLP Pre-Train, Italian Pre-Train, Italian Instruction-Tuning and Italian Long Instruction-Tuning. In this figure, all steps of the pipeline are applied.</figDesc><graphic coords="8,184.82,65.60,225.64,520.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example comparing the answers of two different models to two different questions.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>t. one without this step; • We propose a new evaluation suite based on both machine-translated and natively Italian data from state-of-the-art benchmarks; • We openly release code, data and models that have been obtained from our experiments, in the hope of boosting research in this field and in support of open science. 1</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>•</head><label></label><figDesc>LLaVA-NDiNO PT + IT: MLP pre-training, Italian language pre-training and instruction-tuning have been performed. LLaMA 3 Format, {user_message} is the message sent by the user, while {system_message} is the model response.</figDesc><table><row><cell cols="4">&lt;|begin _ of _ text|&gt;&lt;image&gt;{text}&lt;|end _ of _ text|&gt;</cell><cell></cell></row><row><cell cols="5">Listing 1: Plain Format, {text} is the text associated with the image</cell></row><row><cell cols="4">&lt;|begin _ of _ text|&gt;&lt;|start _ header _ id|&gt;user&lt;|end _ header _ id|&gt;</cell><cell></cell></row><row><cell cols="5">{user _ message}&lt;|eot _ id|&gt;&lt;|start _ header _ id|&gt;assistant&lt;|end _ header _ id|&gt;</cell></row><row><cell cols="2">{system _ message}&lt;|eot _ id|&gt;</cell><cell></cell><cell></cell><cell></cell></row><row><cell>Listing 2: Parameter</cell><cell></cell><cell></cell><cell>Training Step</cell><cell></cell></row><row><cell></cell><cell>MLP</cell><cell>Italian</cell><cell>Italian</cell><cell>Italian</cell></row><row><cell></cell><cell>pre-train</cell><cell>pre-train</cell><cell>instruction-tuning</cell><cell>long instruction-tuning</cell></row><row><cell>batch size</cell><cell>256</cell><cell>128</cell><cell>128</cell><cell>128</cell></row><row><cell>lr</cell><cell>1e-3</cell><cell>1e-5</cell><cell>1e-5</cell><cell>1e-5</cell></row><row><cell>vision tower lr</cell><cell>-</cell><cell>2e-6</cell><cell>2e-6</cell><cell>2e-6</cell></row><row><cell>lr schedule</cell><cell>cosine</cell><cell>cosine</cell><cell>cosine</cell><cell>cosine</cell></row><row><cell>lr warmup ratio</cell><cell>0.03</cell><cell>0.03</cell><cell>0.03</cell><cell>0.03</cell></row><row><cell>weight decay</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>epochs</cell><cell>1</cell><cell>1</cell><cell>1</cell><cell>500 
steps</cell></row><row><cell>optimizer</cell><cell cols="2">AdamW AdamW</cell><cell>AdamW</cell><cell>AdamW</cell></row><row><cell>max length</cell><cell>8192</cell><cell>8192</cell><cell>8192</cell><cell>8192</cell></row><row><cell>DeepSpeed stage</cell><cell>3</cell><cell>3</cell><cell>3</cell><cell>3</cell></row></table></figure>
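The single-round LLaMA 3 instruct format of Listing 2 can be assembled with a small helper. This sketch concatenates the special tokens exactly as shown in the listing; the real tokenizer chat template may insert additional whitespace around the headers:

```python
def llama3_single_round(user_message, system_message):
    """Render one user/assistant exchange in the LLaMA 3 instruct format of Listing 2."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
        f"{system_message}<|eot_id|>"
    )
```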
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Hyperparameters used during each training step</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc></figDesc><table><row><cell>Model</cell><cell>GQA-IT* ↑</cell><cell>OK-VQA-IT ↑</cell><cell>SeedBench-IT ↑</cell><cell>POPE-IT* ↑</cell></row><row><cell>mBlip T0 XL [8]</cell><cell>0.13</cell><cell>0.13</cell><cell>0.51</cell><cell>0.49</cell></row><row><cell>LLaVA-NDiNO IT</cell><cell>0.27</cell><cell>0.19</cell><cell>0.67</cell><cell>0.84</cell></row><row><cell>LLaVA-NDiNO PT + IT</cell><cell>0.28</cell><cell>0.19</cell><cell>0.68</cell><cell>0.86</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>Results obtained for evaluation datasets natively in the Italian language. &lt;DATASET_NAME&gt;-IT refers to the filtered version of the original multilingual dataset containing only Italian instances. For both MTVQA and V-EXAMS the metric is exact match. The ↑ indicates that a higher value for the metric of that dataset means better performance</figDesc><table><row><cell>Model</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc></figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/swapUniba/LLaVA-NDiNO</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/jbochi/madlad400-3b-mt</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://github.com/EvolvingLMMs-Lab/lmms-eval</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://github.com/EleutherAI/lm-evaluation-harness</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We acknowledge the support of the PNRR project FAIR -Future AI Research (PE00000013), Spoke 6 -Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU. Models are built on the Leonardo supercomputer with the support of CINECA-Italian Super Computing Resource Allocation, class C project IscrC_LLMM (HP10CLKWTP).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bonetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Hromei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Stranisci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</title>
				<meeting>the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Improved baselines with visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="26296" to="26306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Llama 2: Open foundation and fine-tuned chat models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2307.09288" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2303.08774" />
		<title level="m">Gpt-4 technical report</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Sharir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Noy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zelnik-Manor</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.13915</idno>
		<title level="m">An image is worth 16x16 words, what is a video worth?</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2407.07895</idno>
		<title level="m">Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">mBLIP: Efficient bootstrapping of multilingual vision-LLMs</title>
		<author>
			<persName><forename type="first">G</forename><surname>Geigle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Timofte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Glavaš</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.alvr-1.2" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Gu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T.-J</forename><forename type="middle">R</forename><surname>Fu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hudson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Advances in Language and Vision Research (ALVR), Association for Computational Linguistics<address><addrLine>Bangkok, Thailand</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="7" to="25" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="19730" to="19742" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">No language left behind: Scaling human-centered machine translation</title>
		<author>
			<persName><surname>NLLB Team</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2207.04672" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Djolonga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Padlewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Changpinyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.18565</idno>
		<title level="m">Pali-x: On scaling up a multilingual vision and language model</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Changpinyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piergiovanni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Padlewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Salz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Goodman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Grycner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mustafa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.06794</idno>
		<title level="m">Pali: A jointly-scaled multilingual language-image model</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">X-LLaVA: Optimizing bilingual large vision-language alignment</title>
		<author>
			<persName><forename type="first">D</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Won</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lim</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2024.findings-naacl.158</idno>
		<ptr target="https://aclanthology.org/2024.findings-naacl.158" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gomez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<meeting><address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="2463" to="2473" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Cross-lingual language model pretraining</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">What matters when building vision-language models?</title>
		<author>
			<persName><forename type="first">H</forename><surname>Laurençon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tronchon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cord</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.02246</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Srinivasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Raman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bendersky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.01913</idno>
		<title level="m">Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Multieurlex -a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer</title>
		<author>
			<persName><forename type="first">I</forename><surname>Chalkidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fergadiotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2109.00904" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Adapting the tesseract open source ocr engine for multilingual ocr</title>
		<author>
			<persName><forename type="first">R</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Antonova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D.-S</forename><surname>Lee</surname></persName>
		</author>
		<ptr target="http://doi.acm.org/10.1145/1577802.1577804" />
	</analytic>
	<monogr>
		<title level="m">MOCR &apos;09: Proceedings of the International Workshop on Multilingual OCR, ACM International Conference Proceeding Series</title>
				<editor>
			<persName><forename type="first">V</forename><surname>Govindaraju</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Natarajan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Chaudhury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Lopresti</surname></persName>
		</editor>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A-okvqa: A benchmark for visual question answering using world knowledge</title>
		<author>
			<persName><forename type="first">D</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Marino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mottaghi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European conference on computer vision</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="146" to="162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Clevr: A diagnostic dataset for compositional language and elementary visual reasoning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hariharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Maaten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girshick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Exploring models and data for image question answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kiros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zemel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Kazemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Alvari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.12241</idno>
		<title level="m">Geomverse: A systematic evaluation of large models for geometric reasoning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2110.13214</idno>
		<title level="m">Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-C</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Connecting vision and language with localized narratives</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pont-Tuset</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uijlings</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Changpinyo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ferrari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Mimic-it: Multi-modal in-context instruction tuning</title>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2306.05425" />
		<idno type="arXiv">arXiv:2306.05425</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">A corpus of natural language for visual reasoning</title>
		<author>
			<persName><forename type="first">A</forename><surname>Suhr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:19435386" />
	</analytic>
	<monogr>
		<title level="m">Annual Meeting of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Raven: A dataset for relational and analogical visual reasoning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-C</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Learning to describe differences between pairs of similar images</title>
		<author>
			<persName><forename type="first">H</forename><surname>Jhamtani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berg-Kirkpatrick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
				<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Tallyqa: Answering complex counting questions</title>
		<author>
			<persName><forename type="first">M</forename><surname>Acharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kafle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Kanan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Visual7W: Grounded Question Answering in Images</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Groth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bernstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">A dataset of clinically generated visual questions and answers about radiology images</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gayen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ben Abacha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Demner-Fushman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific data</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">VQA: Visual Question Answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Antol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitchell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Zitnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Parikh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computer Vision (ICCV)</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F F B</forename><surname>Mahmood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Huang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.11985</idno>
		<title level="m">Mtvqa: Benchmarking multilingual text-centric visual question answering</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">Exams-v: A multidiscipline multilingual multimodal exam benchmark for evaluating vision language models</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Hristov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Dimitrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.10378</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Madlad-400: A multilingual and document-level large audited dataset</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kudugunta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Caswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kusupati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bapna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">The llama 3 herd of models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dubey</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2407.21783" />
		<idno type="arXiv">arXiv:2407.21783</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2103.00020" />
		<idno type="arXiv">arXiv:2103.00020</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Gqa-it: Italian question answering on image scene graphs</title>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Passaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Basili</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Italian Conference on Computational Linguistics (CLiC-it)</title>
		<imprint>
			<biblScope unit="page">92</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Gqa: A new dataset for real-world visual reasoning and compositional question answering</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">A</forename><surname>Hudson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<publisher>CVPR</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Ok-vqa: A visual question answering benchmark requiring external knowledge</title>
		<author>
			<persName><forename type="first">K</forename><surname>Marino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rastegari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mottaghi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.16125</idno>
		<title level="m">Seed-bench: Benchmarking multimodal llms with generative comprehension</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Evaluating object hallucination in large vision-language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-R</forename><surname>Wen</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=xozJw0kZXF" />
	</analytic>
	<monogr>
		<title level="m">The 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://github.com/EvolvingLMMs-Lab/lmms-eval" />
		<title level="m">Lmms-eval: Accelerating the development of large multimodal models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
