<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Is Multimodality still Required for Multimodal Machine Translation? A case study on English and Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National PhD in Artificial Intelligence, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large Language Models (LLMs) have demonstrated remarkable capabilities in machine translation. A related task is multimodal machine translation, where text is paired with an image. While intuition suggests that models supporting multimodal inputs (e.g. Large Vision-Language Models or LVLMs) are essential for this task due to their image understanding, we hypothesize that, in general, text contains several clues that might be enough for effective translation. In this work, we rigorously test both LLMs and LVLMs on the multimodal machine translation task for the English and Italian languages, thoroughly analyzing the impact of text and images on translation quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Large Vision-Language Models</kwd>
        <kwd>Machine Translation</kwd>
        <kwd>Multimodal Machine Translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Furthermore, we release code and resources related to</title>
        <p>this study1.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>One of the most prominent datasets for MMT, that is Multi30k [7], only provides captions. We believe that this dataset is not enough to evaluate the capabilities of models in the MMT task.</p>
      <p>In light of this, we aim to investigate the impact of both the additional visual input and the descriptiveness of the textual input for multimodal machine translation in LLMs and LVLMs. We conducted this study in both English and Italian, using our knowledge of these languages to carry out the study carefully. Hence, the contributions of this work are the following:</p>
      <list list-type="bullet">
        <list-item><p>We extend an existing multimodal machine translation dataset to include the Italian language;</p></list-item>
        <list-item><p>We create a new multimodal machine translation dataset for English and Italian, with a focus on short texts consisting of only a few words;</p></list-item>
        <list-item><p>We benchmark several LLMs and LVLMs on both datasets for this task, analyzing and studying the impact of the input modalities on the output.</p></list-item>
      </list>
      <sec id="sec-2-1">
        <title>The most used resource for MMT is Multi30k [7], a</title>
        <p>dataset consisting of parallel image descriptions. The
dataset has been created starting from the Flickr30k
[13] dataset, which contains 31,014 images sourced from
Flickr and a large number of image captions obtained
through Amazon Turk. Multi30k extended the dataset
with professional manual translations from English to
German. It was then further extended to French by Elliott
et al. [14] and Czech by Barrault et al. [15]. The dataset
• We extend an existing multimodal machine trans- has become a reliable benchmark for MMT and has been
lation dataset to include the Italian language; used in numerous works as their main dataset for
experi• We create a new multimodal machine translation mentation. Researchers have proposed several solutions
dataset for English and Italian, with a focus on to tackle the challenges of the MMT task. Specifically, Yao
short texts consisting of only a few words; and Wan [5] developed a multimodal transformer model,
• We benchmark several LLMs and LVLMs on both which employs a multimodal self-attention mechanism to
datasets for this task, analyzing and studying the adjust the attention score of each word w.r.t. the contents
impact of the input modalities on the output. of the image. VGAMT [6] adapts a text-only
encoderdecoder machine translation model to multimodality by
incorporating the features of the image in the
encoderside of the model and employing guided self-attention to
obtain better alignment between text and images.
SoulMix [16] leverages a manifold mixup method to mix the
predicted translation of several text-image pairs, where
the image is kept as is while the text is processed through
degradation schemes. To the best of our knowledge, there
are no works studying the efect of the granularity of text
in MMT using modern LVLMs supporting multilingual
inputs.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Large Vision-Language Models</title>
          <p>Early releases in open LLMs mainly focused on textual
processing and were tailored to the English language. For
example, the LLaMA 2 models [8], for which the language
distribution of the train set has been oficially reported,
were extensively trained and tested on English text data
without any mechanism to support other modalities. In
light of this, several works started proposing solutions
to bridge this gap. The main idea was to leverage a
pretrained LLM and extend it to an LVLM, therefore
avoiding the costly procedure of multimodal pre-training from</p>
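        <p>As a concrete illustration of this paradigm, the sketch below shows the kind of projection module used to bridge a frozen vision encoder and a pre-trained LLM; the dimensions, names, and two-layer design are illustrative assumptions, not the implementation of any specific model discussed here.</p>
        <preformat>
# Minimal sketch of a LLaVA-style vision-to-LLM bridge (illustrative, not an
# exact reproduction): patch features from a frozen vision encoder are mapped
# into the LLM embedding space by a small trainable projector.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector, a common choice in later LLaVA variants.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), from the vision encoder.
        # Output: visual "tokens" in the LLM latent space: (batch, num_patches, llm_dim).
        return self.proj(patch_features)

# The projected visual tokens are concatenated with the text token embeddings
# and fed to the LLM, which is then fine-tuned on multimodal instruction data.
        </preformat>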
        </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
      <p>In MMT, the model is given an input comprising a text in a specified source language and an image semantically related to the given text. The desired output is a translated text in a target language. The objective is for the output to be not only syntactically correct, that is, free of grammatical errors in the target language, but also accurately aligned with the input text, both syntactically (ensuring all relevant words from the input text are present in the output) and semantically (preserving the original meaning of the input text).</p>
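      <p>Stated compactly (with notation introduced here for convenience: a source text x, its paired image i, and a candidate translation y), the task amounts to finding the most probable translation under the model:</p>
      <disp-formula>
        <tex-math><![CDATA[\hat{y} = \arg\max_{y} P_{\theta}(y \mid x, i)]]></tex-math>
      </disp-formula>
      <p>Greedy decoding and beam search, compared later in Section 5.4, are two ways of approximating this maximization token by token; when the image is withheld, the model must rely on x alone.</p>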
      <p>As previously mentioned, research in multimodal machine translation has often focused on image captioning datasets. A caption is a short description of the image that meaningfully describes its most relevant aspects. However, we argue that, despite the caption being a short text, the image does not provide additional context w.r.t. the text. This is because: 1) a good caption already contains extensive information about the image; 2) the caption often contains enough words to allow for a proper translation without additional context. However, if the text consists of only a few words, the task becomes much more challenging. This is because, to perform an optimal translation, the model is also required to understand the meaning of each word in the input sentence. Specifically, translating polysemous words requires additional context, either from the textual or the visual modality. We showcase this in Figure 1, where we present an example of machine translation of two image-text pairs. In the instance from the MSCOCO [17] dataset, the word "remote" is translated as "remoto" (i.e. something that is far away) rather than its proper translation, that is "telecomando". Due to the absence of substantial textual clues, the model provides a translation that is not aligned w.r.t. the contents of the image. In the second instance, from the Multi30K dataset, the caption is instead correctly translated and aligns well with the image's contents. In this case, the word "vest" is correctly translated to "giubbotto" (i.e. a jacket), thanks to the additional words present in the text. In light of this, we aim to understand the relationship between the granularity of the input text and the associated image in multimodal machine translation. To do so, we need to collect two different datasets, one made of very short texts consisting of only a few words and one made of image captions.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <p>In this section, we describe the datasets that will be used for the experimentation. Specifically, we aim to test the ability of LVLMs in MMT for two different types of instances: 1) text containing a rich description of the image; 2) text containing only a few words. Going forward, we will reference the former as the "long" dataset and the latter as the "short" dataset.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset Collection</title>
        <p>For the "long" dataset, we collect the English 2016 Flickr test set from the Multi30K dataset. Specifically, we leverage a version uploaded on HuggingFace. For the "short" dataset, we collect lemmas from BabelNet [18]. BabelNet is a semantic network organized according to a synset hierarchy. A synset is a synonym set, containing all possible words that can be associated with that concept. Additionally, in BabelNet, each synset is linked with one or more images, providing useful resources for multimodality. It also provides lemmas in multiple languages, allowing access to the lemmas for all required languages.</p>
      <p>In our case, we collect both the first lemma in English
and Italian, as well as the best image for each synset.</p>
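        <p>As a sketch of this collection step, the snippet below queries the BabelNet HTTP API. The v5 endpoint and field names follow the public REST documentation to the best of our recollection, and the API key, synset identifier, and "best image" selection are placeholders rather than the exact pipeline used to build the dataset.</p>
        <preformat>
# Hedged sketch: collect the first EN/IT lemma and one linked image for a synset.
import requests

API = "https://babelnet.io/v5/getSynset"

def collect_instance(synset_id: str, key: str) -> dict:
    data = requests.get(API, params={"id": synset_id, "key": key}).json()
    lemmas = {}
    # Keep the first lemma found for each required language.
    for sense in data.get("senses", []):
        props = sense.get("properties", {})
        lang = props.get("language")
        if lang in ("EN", "IT") and lang not in lemmas:
            lemmas[lang] = props.get("fullLemma")
    # Take one of the images linked to the synset (the dataset uses the best one).
    images = data.get("images", [])
    image_url = images[0].get("url") if images else None
    return {"synset": synset_id, "lemmas": lemmas, "image": image_url}

# Example (hypothetical key): collect_instance("bn:00109359a", "YOUR_API_KEY")
        </preformat>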
        <p>However, these datasets cannot be used directly as collected. In fact, Multi30K does not provide labels in Italian, and BabelNet lemmas are not precise translations from English to Italian and vice versa. For example, the English lemma "economy of resources" is paired with the Italian lemma "efficienza", which is not a literal translation of the original text. In light of this, we perform manual annotation for the "long" dataset and manual verification for the "short" dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset Annotation</title>
        <p>For the "long" dataset, we begin by performing a preliminary Italian translation of the data with LLaMA 3.3 70B Instruct, which helps reduce the editing overload. After that, we manually check each translated instance and correct any machine translation errors that are present in the dataset. Specifically, we follow these guidelines when correcting the translated text: 1) we use Italian figures of speech whenever possible (e.g. we translate "shirtless man" as "uomo a torso nudo" instead of "uomo senza maglietta"); 2) we only keep English words when they represent commonly used terms across languages (e.g. we keep the word "cowboy" as is). For the "short" dataset, we manually filter each pair of lemmas in Italian and English to include only those that are proper translations of one another. After performing the previously described steps, we obtain the final versions of the "long" and "short" datasets. The "long" dataset consists of 1,000 instances, the same cardinality as the original Multi30k dataset, while the "short" dataset consists of 400 instances.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <sec id="sec-4-1">
        <title>In this section, we describe the evaluation setting that has been considered for all models (e.g. generation strategy), we discuss the obtained results and present some interesting additional experiments.</title>
      <p>Additionally, we aim to answer the following research questions: 1) Are LVLMs capable of performing MMT for both the "short" and "long" datasets? 2) Is performance affected by the presence of the image in the input? 3) Are LLMs as capable as LVLMs in MMT? 4) Does the generation strategy impact the quality of MMT?</p>
        <sec id="sec-4-1-1">
          <title>5.1. Evaluation Setting</title>
          <p>We use the same metrics as the original Multi30K dataset
for the "long" dataset, namely BLEU and METEOR.
Additionally, we also include COMET, since it has been widely
used in machine translation. For our short dataset, since
it consists of only a few words, we perform an exact
match, that is, we verify that the generated output is
identical to the ground truth label. However, to have a</p>
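        <p>The snippet below sketches this protocol end to end: the prompt follows the two templates just described, and the scoring function implements the multi-label exact match; the helper names are ours and illustrative, not part of a released codebase.</p>
        <preformat>
# Build the user message for one instance and score a generation (sketch).

def build_messages(text: str, src: str, tgt: str, with_image: bool) -> list:
    prompt = f'Translate the following text from {src} to {tgt}: "{text}". '
    if with_image:
        prompt += "Use the image as additional context for the translation. "
    prompt += "Provide only the translated text."
    # Each model's own chat template is applied to these messages at inference,
    # e.g. tokenizer.apply_chat_template(messages, add_generation_prompt=True).
    return [{"role": "user", "content": prompt}]

def exact_match(generated: str, synset_lemmas: list) -> bool:
    # Correct if the output exactly matches any lemma of the target-language synset.
    out = generated.strip()
    return any(out == lemma for lemma in synset_lemmas)

# exact_match("calmo", ["tranquillo", "calmo", "silenzioso", "quieto"])  -> True
        </preformat>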
        </sec>
        <sec id="sec-4-1-2">
          <title>5.2. Results</title>
          <p>We report results on the Multi30k test set in Table 1, while results for the BabelNet test set can be found in Table 2. Overall, both the "long" and "short" datasets are sensitive to the scale of the model, with larger models achieving better results on every metric. Furthermore, the translation from English to Italian makes the task more challenging for smaller models. As a matter of fact, Qwen 2.5 VL 7B Instruct achieves a score of .4800 in BLEU for the "long" dataset in translation from English to Italian, while it achieves a score of .5839 in translation from Italian to English. The same pattern is also present for the "short" dataset, where the model achieves a score of .4700 in exact match in translation from English to Italian, while it achieves a score of .5900 in translation from Italian to English. This pattern is less prevalent for bigger models; for example, Qwen 2.5 VL 72B Instruct achieves a score of .6186 in BLEU for the "long" dataset in translation from English to Italian and a score of .6027 in translation from Italian to English. This showcases that the natural language generation capabilities of smaller models are limited in a multilingual use case w.r.t. bigger models, since they achieve better performance when generating English text. Finally, results also showcase that, in general, the presence of the image in the input is better for translation. For example, Qwen 2.5 VL 7B Instruct achieves an exact match score of .5900 on the "short" dataset for translation from Italian to English when the image is provided in the input, while it achieves a score of .5150 when it is not provided. However, there are some exceptions; for example, LLaMA Scout performs better when the image is not provided as part of the input, which highlights the importance of testing the behaviour of different models for this task.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>5.3. Evaluation of LLMs against LVLMs</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>All models considered so far are LVLMs, that is, they have</title>
        <p>been extensively trained on a multimodal data mixture.</p>
        <p>However, since we have also studied these models for
MMT without providing the input image, the underlying
vision encoder used by LVLMs becomes useless, as no
visual input is provided. In light of this, we compare the
performance of two models of the same size and
architecture, where one is an LLM and the other is an LVLM. This
allows us to determine whether multimodal training can
still be beneficial for MMT even when an image is not
provided as additional input. To perform this experiment,
we rely on Qwen 2.5 VL 7B and Qwen 2.5 7B, which
guarantees fairness of the experiment between the two</p>
        <p>Multi30K
BLEU
.4132
.4793
models (since they share the same number of parame- the paths with the highest probability. Therefore, the
reters and underlying architecture). Results are reported sults are still reproducible, and randomness is not present.
in Table 3. Interestingly, the LVLM performs better than Results for the "long" and "short" datasets are reported
the LLM on both the "short" and "long" datasets. This in Table 4. Results indicate that performance improves
highlights that multimodal training still helps in MMT when using beam search, both for inference with and
when the image input is not provided. This is probably without the image associated with the text. Remarkably,
due to the style of the text that LVLMs are trained on. performance is also better for the "short" dataset,
indiFor example, LVLM training includes data containing cating that even for the generation of a short sequence
image captions, which still afects the model even when of tokens, beam search still proves more efective than
no image is provided in the input during inference. greedy decoding.</p>
        <sec id="sec-4-2-1">
          <title>5.4. Evaluation of generation strategy</title>
        </sec>
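        <p>For illustration, the sketch below contrasts the two decoding configurations using the Hugging Face generate() API; the model and tokenizer objects are assumed to be already loaded, and only the decoding arguments differ between the greedy and beam-search runs.</p>
        <preformat>
# Greedy decoding vs. beam search with 3 beams (sketch; model/tokenizer assumed loaded).
import torch

def translate(model, tokenizer, messages, use_beam_search: bool = False) -> str:
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    gen_kwargs = dict(max_new_tokens=128, do_sample=False)  # no sampling in either case
    if use_beam_search:
        gen_kwargs["num_beams"] = 3  # keep the 3 most probable paths at each step
    with torch.no_grad():
        output = model.generate(input_ids, **gen_kwargs)
    # Return only the newly generated tokens.
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
        </preformat>
      </sec>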
        <sec id="sec-4-2-2">
          <title>5.5. Error Analysis</title>
          <p>All results considered so far used greedy decoding as the We perform manual verification of a subset of instances
generation strategy. In greedy decoding, each new token for both the "long" and "short" datasets. We aim to find
that is generated is selected according to the highest prob- types of errors in instances where the generated lemma is
ability out of all the ones available in the model’s vocabu- not correct (for the "short" dataset) and where the
generlary. However, beam search has been widely considered ated translated sentence is not correct (for the "long"
as the standard generation strategy for the machine trans- dataset). For LLaMA Scout, most error cases for the
lation task [19]. In beam search, the model considers the "short" dataset are related to the model generating longer
 possible paths with the highest probability at each gen- outputs to describe the reasoning process or alternative
eration step, instead of only considering the path of the options. For example, the model may provide a list of
highest probability token for each generation step. This possible alternatives, separated by a newline character,
strategy enables the model to avoid greedy predictions, instead of a single string. This highlights that the model
where the overall probability of a greedy-generated path is not as capable of following instructions embedded
is lower than the overall probability of another path that within the prompt (that is, the string "Provide only the
wasn’t considered due to greedy generation. However, translated text") when the text to translate only contains
in modern LLMs, this strategy has been widely disre- a few words. This behavior is not as prevalent for the
garded. Even popular frameworks used for inference and "long" dataset where the model only provides the
transdeployment of LLMs are considering dropping support lated sentence directly. Additionally, this pattern is more
for this generation strategy3, since most models lever- present for outputs obtained when performing inference
age sampling-based strategies, where the next token is using the image, rather than text alone. This explains
sampled from the probability distribution learned from the lower result for exact match on the "short" dataset in
the model. This is due to computational eficiency, since translation from Italian to English for LLaMA Scout as
beam search considers multiple possible generation paths shown in Table 2. However, this does not seem to afect
it takes more time than greedy decoding. Therefore, we Qwen 2.5 VL 72B as much, since there is no instance
are interested in understanding how relevant is beam of generated text showcasing the previously described
search in modern LVLMs for the MMT task. In this case, problem. Finally, we also showcase a relevant problem in
we only consider the Qwen 2.5 VL 7B model and all pre- MMT for the "long" dataset. That is, properly evaluating
viously considered settings on this model. We perform domain-specific knowledge is complex in the MMT task.
beam search decoding with a number of beams equal For example, several instances within the original dataset
to 3. Note that there is still no sampling when using refer to the "football" sport (e.g. "A young man about to
this approach, since the strategy still relies on navigating throw a football."). When translating these instances from
Italian to English with the image paired to it, even when
Qwen2.5-VL-7B-Instruct GD
Qwen2.5-VL-7B-Instruct BS</p>
          <p>X
✓
X
✓</p>
          <p>BLEU
the word "football" was kept in the translated text (e.g. 6. Conclusions
"Un ragazzo pronto a lanciare un pallone da football."),
the model translated it with "rugby" (e.g. "A boy ready to In this work, we have extended the current
state-of-thethrow a rugby ball."). Interestingly, this pattern is not as art in MMT by providing a study on the English and
prevalent when the image is not provided to the model, Italian languages for the task. Specifically, we extended
which tends to follow the terminology used in the input the most relevant dataset in the state-of-the-art for MMT,
sentence (e.g. "A boy ready to throw a football."). This that is Multi30K and introduced a new benchmark based
pattern was also evident for the Qwen 2.5 VL 72B model, on BabelNet, which allows to study the efectiveness of
which is the best-performing model on the benchmark. MMT when the text only consists of few words.
MoreThis highlights that the models tend to prefer specific over, we have conducted extensive experimentation with
terminology and are overall deeply afected by the image several modern LVLMs, evaluating their performance in
that is paired with the input text. In Figure 2 we provide MMT across two diferent use cases ("long" and "short"
visual examples of these two types of errors we found input text). Finally, we have studied and discussed the
during manual verification. impact of several factors on the performance of the
models for MMT, namely the presence of an image along with
the input text, the scale of the model, the use of LLMs
instead of LVLMs, and the generation strategy. In the
future, we plan to further extend this study to more models
and to consider additional languages, like German and
French that are present in the original Multi30K dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</title>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used Grammarly in order to: Grammar and
spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref5"><label>[5]</label><mixed-citation>S. Yao, X. Wan, Multimodal transformer for multimodal machine translation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4346-4350.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>M. Futeral, C. Schmid, I. Laptev, B. Sagot, R. Bawden, Tackling ambiguity with images: Improved multimodal machine translation and contrastive evaluation, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>D. Elliott, S. Frank, K. Sima'an, L. Specia, Multi30K: Multilingual English-German image descriptions, in: Proceedings of the 5th Workshop on Vision and Language, 2016, pp. 70-74.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Advances in Neural Information Processing Systems, 2023.</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>Qwen Team, Qwen2.5-VL technical report, 2025. arXiv:2502.13923.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>Gemma Team, Gemma 3 technical report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>Meta AI, The llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025. URL: https://ai.meta.com/blog/llama-4-multimodal-intelligence/.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics 2 (2014) 67-78. URL: https://aclanthology.org/Q14-1006/. doi:10.1162/tacl_a_00166.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>D. Elliott, S. Frank, L. Barrault, F. Bougares, L. Specia, Findings of the second shared task on multimodal machine translation and multilingual image description, in: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 215-233. URL: http://www.aclweb.org/anthology/W17-4718.</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott, S. Frank, Findings of the third shared task on multimodal machine translation, in: Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018, pp. 304-323.</mixed-citation></ref>
      <ref id="ref16"><label>[16]</label><mixed-citation>X. Cheng, Z. Yao, Y. Xin, H. An, H. Li, Y. Li, Y. Zou, Soul-Mix: Enhancing multimodal machine translation with manifold mixup, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 11283-11294.</mixed-citation></ref>
      <ref id="ref17"><label>[17]</label><mixed-citation>T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, et al., Microsoft COCO: Common objects in context, 2015. URL: https://arxiv.org/abs/1405.0312. arXiv:1405.0312.</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>R. Navigli, S. P. Ponzetto, BabelNet: Building a very large multilingual semantic network, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 216-225.</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>M. Freitag, Y. Al-Onaizan, Beam search strategies for neural machine translation, in: Proceedings of the First Workshop on Neural Machine Translation, 2017, pp. 56-60. URL: https://aclanthology.org/W17-3207/. doi:10.18653/v1/W17-3207.</mixed-citation></ref>
    </ref-list>
  </back>
</article>