<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Is Multimodality still Required for Multimodal Machine Translation? A case study on English and Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National PhD in Artificial Intelligence, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large Language Models (LLMs) have demonstrated remarkable capabilities in machine translation. A related task is multimodal machine translation, where text is paired with an image. While intuition suggests that models supporting multimodal inputs (e.g. Large Vision-Language Models or LVLMs) are essential for this task due to their image understanding, we hypothesize that, in general, text contains several clues that might be enough for effective translation. In this work, we rigorously test both LLMs and LVLMs on the multimodal machine translation task for the English and Italian languages, thoroughly analyzing the impact of text and images on translation quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Large Vision-Language Models</kwd>
        <kwd>Machine Translation</kwd>
        <kwd>Multimodal Machine Translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Furthermore, we release code and resources related to</title>
        <p>this study1.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>One of the most prominent datasets for MMT, that is Multi30k [7], only provides captions. We believe that this dataset is not enough to evaluate the capabilities of models in the MMT task.</p>
      <p>In light of this, we aim to investigate the impact of both the additional visual input and the descriptiveness of the textual input for multimodal machine translation in LLMs and LVLMs. We conducted this study in both English and Italian, using our knowledge of these languages to carry out the study carefully. Hence, the contributions of this work are the following:</p>
      <list list-type="bullet">
        <list-item><p>We extend an existing multimodal machine translation dataset to include the Italian language;</p></list-item>
        <list-item><p>We create a new multimodal machine translation dataset for English and Italian, with a focus on short texts consisting of only a few words;</p></list-item>
        <list-item><p>We benchmark several LLMs and LVLMs on both datasets for this task, analyzing and studying the impact of the input modalities on the output.</p></list-item>
      </list>
      <sec id="sec-2-1">
        <title>The most used resource for MMT is Multi30k [7], a</title>
        <p>dataset consisting of parallel image descriptions. The
dataset has been created starting from the Flickr30k
[13] dataset, which contains 31,014 images sourced from
Flickr and a large number of image captions obtained
through Amazon Turk. Multi30k extended the dataset
with professional manual translations from English to
German. It was then further extended to French by Elliott
et al. [14] and Czech by Barrault et al. [15]. The dataset
• We extend an existing multimodal machine trans- has become a reliable benchmark for MMT and has been
lation dataset to include the Italian language; used in numerous works as their main dataset for
experi• We create a new multimodal machine translation mentation. Researchers have proposed several solutions
dataset for English and Italian, with a focus on to tackle the challenges of the MMT task. Specifically, Yao
short texts consisting of only a few words; and Wan [5] developed a multimodal transformer model,
• We benchmark several LLMs and LVLMs on both which employs a multimodal self-attention mechanism to
datasets for this task, analyzing and studying the adjust the attention score of each word w.r.t. the contents
impact of the input modalities on the output. of the image. VGAMT [6] adapts a text-only
encoderdecoder machine translation model to multimodality by
incorporating the features of the image in the
encoderside of the model and employing guided self-attention to
obtain better alignment between text and images.
SoulMix [16] leverages a manifold mixup method to mix the
predicted translation of several text-image pairs, where
the image is kept as is while the text is processed through
degradation schemes. To the best of our knowledge, there
are no works studying the efect of the granularity of text
in MMT using modern LVLMs supporting multilingual
inputs.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Large Vision-Language Models</title>
          <p>Early releases in open LLMs mainly focused on textual
processing and were tailored to the English language. For
example, the LLaMA 2 models [8], for which the language
distribution of the train set has been oficially reported,
were extensively trained and tested on English text data
without any mechanism to support other modalities. In
light of this, several works started proposing solutions
to bridge this gap. The main idea was to leverage a
pretrained LLM and extend it to an LVLM, therefore
avoiding the costly procedure of multimodal pre-training from</p>
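        <p>As a concrete illustration of this paradigm, the sketch below shows the kind of projection module used to bridge a frozen vision encoder and a pre-trained LLM; the dimensions, names, and two-layer design are illustrative assumptions, not the implementation of any specific model discussed here.</p>
        <preformat>
# Minimal sketch of a LLaVA-style vision-to-LLM bridge (illustrative, not an
# exact reproduction): patch features from a frozen vision encoder are mapped
# into the LLM embedding space by a small trainable projector.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector, a common choice in later LLaVA variants.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), from the vision encoder.
        # Output: visual "tokens" in the LLM latent space: (batch, num_patches, llm_dim).
        return self.proj(patch_features)

# The projected visual tokens are concatenated with the text token embeddings
# and fed to the LLM, which is then fine-tuned on multimodal instruction data.
        </preformat>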
        </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
      <p>In MMT, the model is given an input comprising a text in a specified source language and an image semantically related to the given text. The desired output is a translated text in a target language. The objective is for the output to be not only syntactically correct, that is, free of grammatical errors in the target language, but also accurately aligned with the input text, both syntactically (ensuring all relevant words from the input text are present in the output) and semantically (preserving the original meaning of the input text).</p>
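      <p>Stated compactly (with notation introduced here for convenience: a source text x, its paired image i, and a candidate translation y), the task amounts to finding the most probable translation under the model:</p>
      <disp-formula>
        <tex-math><![CDATA[\hat{y} = \arg\max_{y} P_{\theta}(y \mid x, i)]]></tex-math>
      </disp-formula>
      <p>Greedy decoding and beam search, compared later in Section 5.4, are two ways of approximating this maximization token by token; when the image is withheld, the model must rely on x alone.</p>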
      <p>As previously mentioned, research in multimodal machine translation has often focused on image captioning datasets. A caption is a short description of the image that meaningfully describes its most relevant aspects. However, we argue that, despite the caption being a short text, the image does not provide additional context w.r.t. the text. This is because: 1) a good caption already contains extensive information about the image; 2) the caption often contains enough words to allow for a proper translation without additional context. However, if the text consists of only a few words, the task becomes much more challenging. This is because, to perform an optimal translation, the model is also required to understand the meaning of each word in the input sentence. Specifically, translating polysemous words requires additional context, either from the textual or the visual modality. We showcase this in Figure 1, where we present an example of machine translation of two image-text pairs. In the instance from the MSCOCO [17] dataset, the word "remote" is translated as "remoto" (i.e. something that is far away) rather than its proper translation, that is "telecomando". Due to the absence of substantial textual clues, the model provides a translation that is not aligned w.r.t. the contents of the image. In the second instance, from the Multi30K dataset, the caption is instead correctly translated and aligns well with the image's contents. In this case, the word "vest" is correctly translated to "giubbotto" (i.e. a jacket), thanks to the additional words present in the text. In light of this, we aim to understand the relationship between the granularity of the input text and the associated image in multimodal machine translation. To do so, we need to collect two different datasets, one made of very short texts consisting of only a few words and one made of image captions.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <p>In this section, we describe the datasets that will be used for the experimentation. Specifically, we aim to test the ability of LVLMs in MMT for two different types of instances: 1) text containing a rich description of the image; 2) text containing only a few words. Going forward, we will reference the former as the "long" dataset and the latter as the "short" dataset.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset Collection</title>
        <p>For the "long" dataset, we collect the English 2016 Flickr test set from the Multi30K dataset. Specifically, we leverage a version uploaded on HuggingFace. For the "short" dataset, we collect lemmas from BabelNet [18]. BabelNet is a semantic network organized according to a synset hierarchy. A synset is a synonym set, containing all possible words that can be associated with that concept. Additionally, in BabelNet, each synset is linked with one or more images, providing useful resources for multimodality. It also provides lemmas in multiple languages, allowing access to the lemmas for all required languages.</p>
      <p>In our case, we collect both the first lemma in English
and Italian, as well as the best image for each synset.</p>
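        <p>As a sketch of this collection step, the snippet below queries the BabelNet HTTP API. The v5 endpoint and field names follow the public REST documentation to the best of our recollection, and the API key, synset identifier, and "best image" selection are placeholders rather than the exact pipeline used to build the dataset.</p>
        <preformat>
# Hedged sketch: collect the first EN/IT lemma and one linked image for a synset.
import requests

API = "https://babelnet.io/v5/getSynset"

def collect_instance(synset_id: str, key: str) -> dict:
    data = requests.get(API, params={"id": synset_id, "key": key}).json()
    lemmas = {}
    # Keep the first lemma found for each required language.
    for sense in data.get("senses", []):
        props = sense.get("properties", {})
        lang = props.get("language")
        if lang in ("EN", "IT") and lang not in lemmas:
            lemmas[lang] = props.get("fullLemma")
    # Take one of the images linked to the synset (the dataset uses the best one).
    images = data.get("images", [])
    image_url = images[0].get("url") if images else None
    return {"synset": synset_id, "lemmas": lemmas, "image": image_url}

# Example (hypothetical key): collect_instance("bn:00109359a", "YOUR_API_KEY")
        </preformat>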
        <p>However, these datasets cannot be used directly as collected. In fact, Multi30K does not provide labels in Italian, and BabelNet lemmas are not precise translations from English to Italian and vice versa. For example, the English lemma "economy of resources" is paired with the Italian lemma "efficienza", which is not a literal translation of the original text. In light of this, we perform manual annotation for the "long" dataset and manual verification for the "short" dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset Annotation</title>
        <p>For the "long" dataset, we begin by performing a preliminary Italian translation of the data with LLaMA 3.3 70B Instruct, which helps reduce the editing overload. After that, we manually check each translated instance and correct any machine translation errors that are present in the dataset. Specifically, we follow these guidelines when correcting the translated text: 1) we use Italian figures of speech whenever possible (e.g. we translate "shirtless man" as "uomo a torso nudo" instead of "uomo senza maglietta"); 2) we only keep English words when they represent commonly used terms across languages (e.g. we keep the word "cowboy" as is). For the "short" dataset, we manually filter each pair of lemmas in Italian and English to include only those that are proper translations of one another. After performing the previously described steps, we obtain the final versions of the "long" and "short" datasets. The "long" dataset consists of 1,000 instances, the same cardinality as the original Multi30k dataset, while the "short" dataset consists of 400 instances.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <sec id="sec-4-1">
        <title>In this section, we describe the evaluation setting that has been considered for all models (e.g. generation strategy), we discuss the obtained results and present some interesting additional experiments.</title>
      <p>Additionally, we aim to answer the following research questions: 1) Are LVLMs capable of performing MMT for both the "short" and "long" datasets? 2) Is performance affected by the presence of the image in the input? 3) Are LLMs as capable as LVLMs in MMT? 4) Does the generation strategy impact the quality of MMT?</p>
        <sec id="sec-4-1-1">
          <title>5.1. Evaluation Setting</title>
          <p>We use the same metrics as the original Multi30K dataset
for the "long" dataset, namely BLEU and METEOR.
Additionally, we also include COMET, since it has been widely
used in machine translation. For our short dataset, since
it consists of only a few words, we perform an exact
match, that is, we verify that the generated output is
identical to the ground truth label. However, to have a</p>
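        <p>The snippet below sketches this protocol end to end: the prompt follows the two templates just described, and the scoring function implements the multi-label exact match; the helper names are ours and illustrative, not part of a released codebase.</p>
        <preformat>
# Build the user message for one instance and score a generation (sketch).

def build_messages(text: str, src: str, tgt: str, with_image: bool) -> list:
    prompt = f'Translate the following text from {src} to {tgt}: "{text}". '
    if with_image:
        prompt += "Use the image as additional context for the translation. "
    prompt += "Provide only the translated text."
    # Each model's own chat template is applied to these messages at inference,
    # e.g. tokenizer.apply_chat_template(messages, add_generation_prompt=True).
    return [{"role": "user", "content": prompt}]

def exact_match(generated: str, synset_lemmas: list) -> bool:
    # Correct if the output exactly matches any lemma of the target-language synset.
    out = generated.strip()
    return any(out == lemma for lemma in synset_lemmas)

# exact_match("calmo", ["tranquillo", "calmo", "silenzioso", "quieto"])  -> True
        </preformat>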
        </sec>
        <sec id="sec-4-1-2">
          <title>5.2. Results</title>
          <p>We report results on the Multi30k test set in Table 1, while results for the BabelNet test set can be found in Table 2. Overall, both the "long" and "short" datasets are sensitive to the scale of the model, with larger models achieving better results on every metric. Furthermore, the translation from English to Italian makes the task more challenging for smaller models. As a matter of fact, Qwen 2.5 VL 7B Instruct achieves a score of .4800 in BLEU for the "long" dataset in translation from English to Italian, while it achieves a score of .5839 in translation from Italian to English. The same pattern is also present for the "short" dataset, where the model achieves a score of .4700 in exact match in translation from English to Italian, while it achieves a score of .5900 in translation from Italian to English. This pattern is less prevalent for bigger models; for example, Qwen 2.5 VL 72B Instruct achieves a score of .6186 in BLEU for the "long" dataset in translation from English to Italian and a score of .6027 in translation from Italian to English. This showcases that the natural language generation capabilities of smaller models are limited in a multilingual use case w.r.t. bigger models, since they achieve better performance when generating English text. Finally, results also showcase that, in general, the presence of the image in the input is better for translation. For example, Qwen 2.5 VL 7B Instruct achieves an exact match score of .5900 on the "short" dataset for translation from Italian to English when the image is provided in the input, while it achieves a score of .5150 when it is not provided. However, there are some exceptions; for example, LLaMA Scout performs better when the image is not provided as part of the input, which highlights the importance of testing the behaviour of different models for this task.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>5.3. Evaluation of LLMs against LVLMs</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>All models considered so far are LVLMs, that is, they have</title>
        <p>been extensively trained on a multimodal data mixture.</p>
        <p>However, since we have also studied these models for
MMT without providing the input image, the underlying
vision encoder used by LVLMs becomes useless, as no
visual input is provided. In light of this, we compare the
performance of two models of the same size and
architecture, where one is an LLM and the other is an LVLM. This
allows us to determine whether multimodal training can
still be beneficial for MMT even when an image is not
provided as additional input. To perform this experiment,
we rely on Qwen 2.5 VL 7B and Qwen 2.5 7B, which
guarantees fairness of the experiment between the two</p>
        <p>Multi30K
BLEU
.4132
.4793
models (since they share the same number of parame- the paths with the highest probability. Therefore, the
reters and underlying architecture). Results are reported sults are still reproducible, and randomness is not present.
in Table 3. Interestingly, the LVLM performs better than Results for the "long" and "short" datasets are reported
the LLM on both the "short" and "long" datasets. This in Table 4. Results indicate that performance improves
highlights that multimodal training still helps in MMT when using beam search, both for inference with and
when the image input is not provided. This is probably without the image associated with the text. Remarkably,
due to the style of the text that LVLMs are trained on. performance is also better for the "short" dataset,
indiFor example, LVLM training includes data containing cating that even for the generation of a short sequence
image captions, which still afects the model even when of tokens, beam search still proves more efective than
no image is provided in the input during inference. greedy decoding.</p>
        <sec id="sec-4-2-1">
          <title>5.4. Evaluation of generation strategy</title>
        </sec>
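        <p>For illustration, the sketch below contrasts the two decoding configurations using the Hugging Face generate() API; the model and tokenizer objects are assumed to be already loaded, and only the decoding arguments differ between the greedy and beam-search runs.</p>
        <preformat>
# Greedy decoding vs. beam search with 3 beams (sketch; model/tokenizer assumed loaded).
import torch

def translate(model, tokenizer, messages, use_beam_search: bool = False) -> str:
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    gen_kwargs = dict(max_new_tokens=128, do_sample=False)  # no sampling in either case
    if use_beam_search:
        gen_kwargs["num_beams"] = 3  # keep the 3 most probable paths at each step
    with torch.no_grad():
        output = model.generate(input_ids, **gen_kwargs)
    # Return only the newly generated tokens.
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
        </preformat>
      </sec>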
        <sec id="sec-4-2-2">
          <title>5.5. Error Analysis</title>
          <p>All results considered so far used greedy decoding as the We perform manual verification of a subset of instances
generation strategy. In greedy decoding, each new token for both the "long" and "short" datasets. We aim to find
that is generated is selected according to the highest prob- types of errors in instances where the generated lemma is
ability out of all the ones available in the model’s vocabu- not correct (for the "short" dataset) and where the
generlary. However, beam search has been widely considered ated translated sentence is not correct (for the "long"
as the standard generation strategy for the machine trans- dataset). For LLaMA Scout, most error cases for the
lation task [19]. In beam search, the model considers the "short" dataset are related to the model generating longer
 possible paths with the highest probability at each gen- outputs to describe the reasoning process or alternative
eration step, instead of only considering the path of the options. For example, the model may provide a list of
highest probability token for each generation step. This possible alternatives, separated by a newline character,
strategy enables the model to avoid greedy predictions, instead of a single string. This highlights that the model
where the overall probability of a greedy-generated path is not as capable of following instructions embedded
is lower than the overall probability of another path that within the prompt (that is, the string "Provide only the
wasn’t considered due to greedy generation. However, translated text") when the text to translate only contains
in modern LLMs, this strategy has been widely disre- a few words. This behavior is not as prevalent for the
garded. Even popular frameworks used for inference and "long" dataset where the model only provides the
transdeployment of LLMs are considering dropping support lated sentence directly. Additionally, this pattern is more
for this generation strategy3, since most models lever- present for outputs obtained when performing inference
age sampling-based strategies, where the next token is using the image, rather than text alone. This explains
sampled from the probability distribution learned from the lower result for exact match on the "short" dataset in
the model. This is due to computational eficiency, since translation from Italian to English for LLaMA Scout as
beam search considers multiple possible generation paths shown in Table 2. However, this does not seem to afect
it takes more time than greedy decoding. Therefore, we Qwen 2.5 VL 72B as much, since there is no instance
are interested in understanding how relevant is beam of generated text showcasing the previously described
search in modern LVLMs for the MMT task. In this case, problem. Finally, we also showcase a relevant problem in
we only consider the Qwen 2.5 VL 7B model and all pre- MMT for the "long" dataset. That is, properly evaluating
viously considered settings on this model. We perform domain-specific knowledge is complex in the MMT task.
beam search decoding with a number of beams equal For example, several instances within the original dataset
to 3. Note that there is still no sampling when using refer to the "football" sport (e.g. "A young man about to
this approach, since the strategy still relies on navigating throw a football."). When translating these instances from
Italian to English with the image paired to it, even when
Qwen2.5-VL-7B-Instruct GD
Qwen2.5-VL-7B-Instruct BS</p>
          <p>X
✓
X
✓</p>
          <p>BLEU
the word "football" was kept in the translated text (e.g. 6. Conclusions
"Un ragazzo pronto a lanciare un pallone da football."),
the model translated it with "rugby" (e.g. "A boy ready to In this work, we have extended the current
state-of-thethrow a rugby ball."). Interestingly, this pattern is not as art in MMT by providing a study on the English and
prevalent when the image is not provided to the model, Italian languages for the task. Specifically, we extended
which tends to follow the terminology used in the input the most relevant dataset in the state-of-the-art for MMT,
sentence (e.g. "A boy ready to throw a football."). This that is Multi30K and introduced a new benchmark based
pattern was also evident for the Qwen 2.5 VL 72B model, on BabelNet, which allows to study the efectiveness of
which is the best-performing model on the benchmark. MMT when the text only consists of few words.
MoreThis highlights that the models tend to prefer specific over, we have conducted extensive experimentation with
terminology and are overall deeply afected by the image several modern LVLMs, evaluating their performance in
that is paired with the input text. In Figure 2 we provide MMT across two diferent use cases ("long" and "short"
visual examples of these two types of errors we found input text). Finally, we have studied and discussed the
during manual verification. impact of several factors on the performance of the
models for MMT, namely the presence of an image along with
the input text, the scale of the model, the use of LLMs
instead of LVLMs, and the generation strategy. In the
future, we plan to further extend this study to more models
and to consider additional languages, like German and
French that are present in the original Multi30K dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</title>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used Grammarly in order to: Grammar and
spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref5"><label>[5]</label><mixed-citation>S. Yao, X. Wan, Multimodal transformer for multimodal machine translation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 4346-4350.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>M. Futeral, C. Schmid, I. Laptev, B. Sagot, R. Bawden, Tackling ambiguity with images: Improved multimodal machine translation and contrastive evaluation, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>D. Elliott, S. Frank, K. Sima'an, L. Specia, Multi30K: Multilingual English-German image descriptions, in: Proceedings of the 5th Workshop on Vision and Language, 2016, pp. 70-74.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Advances in Neural Information Processing Systems, 2023.</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>Qwen Team, Qwen2.5-VL technical report, 2025. arXiv:2502.13923.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>Gemma Team, Gemma 3 technical report, 2025. URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>Meta AI, The llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025. URL: https://ai.meta.com/blog/llama-4-multimodal-intelligence/.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics 2 (2014) 67-78. URL: https://aclanthology.org/Q14-1006/. doi:10.1162/tacl_a_00166.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>D. Elliott, S. Frank, L. Barrault, F. Bougares, L. Specia, Findings of the second shared task on multimodal machine translation and multilingual image description, in: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 215-233. URL: http://www.aclweb.org/anthology/W17-4718.</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>L. Barrault, F. Bougares, L. Specia, C. Lala, D. Elliott, S. Frank, Findings of the third shared task on multimodal machine translation, in: Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018, pp. 304-323.</mixed-citation></ref>
      <ref id="ref16"><label>[16]</label><mixed-citation>X. Cheng, Z. Yao, Y. Xin, H. An, H. Li, Y. Li, Y. Zou, Soul-Mix: Enhancing multimodal machine translation with manifold mixup, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 11283-11294.</mixed-citation></ref>
      <ref id="ref17"><label>[17]</label><mixed-citation>T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, et al., Microsoft COCO: Common objects in context, 2015. URL: https://arxiv.org/abs/1405.0312. arXiv:1405.0312.</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>R. Navigli, S. P. Ponzetto, BabelNet: Building a very large multilingual semantic network, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 216-225.</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>M. Freitag, Y. Al-Onaizan, Beam search strategies for neural machine translation, in: Proceedings of the First Workshop on Neural Machine Translation, 2017, pp. 56-60. URL: https://aclanthology.org/W17-3207/. doi:10.18653/v1/W17-3207.</mixed-citation></ref>
    </ref-list>
  </back>
</article>