<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Extending Italian Large Language Models for vision-language tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asia Beatrice Uboldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Germani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fastweb SpA</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National PhD in Artificial Intelligence, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the growing evolution of Large Language Models, there has also been a rising interest in extending these models to incorporate non-textual signals. Specifically, Large Vision-Language Models have been developed, which extend Large Language Models to understand and process visual signals. This allows them to solve complex vision-language tasks, further extending their inherent abilities in text-only task resolution. However, for the Italian language, most works still focus on text-only solutions without extending them to multimodality. In this work, we extend Large Language Models for the Italian language to multimodality and benchmark the performance of these models when trained using the same experimental setting.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Large Vision-Language Models</kwd>
        <kwd>Multimodality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last years, interest in Large Language Models (LLMs) has been growing steadily. The ability of these models to solve complex tasks, even when they have not been trained with that specific objective, makes them extremely useful for any natural language processing task. However, as often occurs in the Natural Language Processing research field, the abundance of English data meant that the first openly released LLMs only supported the English language (e.g. LLaMA 2 [1]), limiting the applicability of these models to other languages. To cover this gap, several LLMs were trained to directly support the Italian language, using either a monolingual or a multilingual strategy. Whichever the selected strategy, these models were obtained using one of the following methodologies: fine-tuning pre-existing models or training from scratch on datasets consisting mainly of Italian data. This trend allowed extending LLMs not only to multiple underrepresented languages but also to new modalities. An example is represented by Large Vision-Language Models (LVLMs), that are LLMs extended with a technique enabling them to process visual inputs together with textual ones. Also in this case, there are training procedures that allow leveraging existing LLMs instead of training from scratch for vision-language inputs. This makes the process both more efficient, since the pre-training phase is skipped, and more effective, as the textual knowledge of the model is leveraged to learn how to perform vision-language tasks. Despite this, many open LLMs supporting the Italian language have not been extended to support multimodality. This is due to the limited availability of training data for vision-language tasks in Italian, whereas English training data often comprises multiple diverse and rich tasks. Furthermore, with the proliferation of Italian LLMs, like Minerva [2] and Velvet, it becomes increasingly important to test their capabilities in the multimodal domain. This raises the question of whether it is possible to extend current LLMs trained for the Italian language to multimodality. Do these models perform well when extended to support it? In this work, we propose a study on the multimodal performance of Italian LLMs extended to LVLMs using a state-of-the-art approach.</p>
      <p>Specifically, this work extends current literature as follows:
• We train several LLMs supporting the Italian language to extend them to LVLMs;
• We benchmark these models using datasets that are natively in Italian;
• We study the effect of different prompt formatting at inference time and showcase the length bias in the response of LVLMs.</p>
      <p>Finally, we want to underline that we are forced to use machine translation for training data due to the scarcity of large-scale multimodal data for non-English languages. However, we focus our evaluation on natively Italian multimodal datasets. Therefore, if a large-scale multimodal dataset natively in Italian were to be released, we could expect further improvements in performance, since fewer machine translation errors would be present.</p>
      <p>Furthermore, we release code and resources related to this study (https://github.com/swapUniba/Extending-LLMs-VL-ITA).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>For LVLMs, several methodologies have been designed to adapt LLMs. One of the most prominent approaches is the one introduced in LLaVA [3], where visual embeddings extracted from a Vision Transformer [4] are projected into the latent space of an LLM. This strategy has been further refined in LLaVA 1.5 [5], where the projection matrix is replaced with a Multi-Layer Perceptron, and LLaVA-OneVision [6], a LLaVA-based model enhanced to also perform multi-image and video tasks. Other approaches include the one used in BLIP-2 [7], leveraging a QFormer module to extract the most relevant features of images, and Flamingo [8], where cross-attention layers are added to the LLM and relevant visual tokens are extracted using a Perceiver Resampler module. Additionally, there is also LLaVA-NeXT [5] (also known as LLaVA 1.5 HD), which introduces a technique to process high-resolution images. The idea is to resize the image to a higher resolution than the one supported by the underlying vision encoder and split it into multiple images. Embeddings are then extracted for each image, as well as for a version of the image resized to the resolution supported by the vision encoder to incorporate global details, and flattened into a single vector.</p>
      <p>For Italian LLMs, several models have been released which incorporate a great quantity of natively Italian training data. Minerva [2] is the first family of models trained from scratch on an open data mixture consisting of only English and Italian data. It has several checkpoints with different parameter counts, namely 1B, 3B and 7B. The 7B model was trained on a total of 2.48 trillion tokens of Italian, English and code. EuroLLM [9] is a family of LLMs developed in Europe to support all the 24 official European Union languages. Its two available checkpoints have 1.7B and 9B parameters. The models are pre-trained on a total of 4 trillion tokens, where 50% of the data is in English, 5% is code, and the remaining 45% are other languages (including Italian). Velvet is a family of LLMs trained on a balanced mixture of six languages, with particular emphasis on Italian (which makes up 23% of the data). The two available checkpoints for Velvet have 2B and 14B parameters. FastwebMIIA (Italian Artificial Intelligence Model, https://huggingface.co/Fastweb/FastwebMIIA-7B) is a 7-billion-parameter autoregressive model developed by Fastweb. Based on a decoder-only architecture with rotary positional embeddings, it has been trained on about 3 trillion tokens, with a strong focus on Italian. It uses a custom tokenizer optimised for Italian, English and programming languages, with a vocabulary of 50,000 tokens. It supports a context window of 16k tokens and has been trained in a distributed pipeline on NVIDIA H100 GPUs via MLDE and LLMFoundry.</p>
      <p>Furthermore, at the time of writing, LLaVA-NDiNO [10] is the only family of multimodal models extensively trained for the Italian language only, further showcasing the need for a more in-depth investigation of the current landscape of Italian LLMs and their extension to LVLMs. For LLM evaluation in Italian, many efforts have been carried out to extensively evaluate Italian LLMs. For example, Bacciu et al. [11] introduced an open LLM leaderboard for the Italian language, Moroni et al. [12] released ITA-Bench, a comprehensive evaluation suite for Italian LLMs consisting of both machine-translated and natively Italian benchmarks, and Attanasio et al. [13] released CALAMITA, a dynamic and growing benchmark for the Italian language. Finally, we also highlight that there are novel works that showcase how non-trivial it is to evaluate LLMs. For example, Wang et al. [14] found mismatches between the generated output and the output obtained using log-likelihood for next token prediction. Additionally, several works started to use an LLM-as-a-judge approach, where an LLM is used as a model for evaluation [15].</p>
      <sec id="sec-2-1">
        <title>As mentioned in the introduction, our aim is to extend</title>
        <p>existing Italian LLMs with multimodal capabilities. We</p>
      </sec>
      <sec id="sec-2-2">
        <title>2https://github.com/swapUniba/Extending-LLMs-VL-ITA</title>
      </sec>
      <sec id="sec-2-3">
        <title>3https://huggingface.co/Fastweb/FastwebMIIA-7B</title>
      </sec>
      <sec id="sec-2-4">
        <title>Question Answering, Visual Grounding, ...), which al</title>
        <p>Listing 1: Mistral chat template used for base models. lows the model to learn to correctly solve this type of
{user} and {assistant} are placeholders for the task, while the latter is a multi-turn dataset generated by
user and assistant messages respectively. prompting GPT-4. Thanks to this training mixture, the
&lt;s&gt;[INST] {user} [/INST] {assistant}&lt;/s&gt; model learns to both solve tasks and provide meaningful
and complete responses to user prompts. For
MultiInstruct, we perform some additional processing operations.</p>
        <p>
          Instructions are manually translated, therefore only the
chose Minerva, EuroLLM, Velvet and FastwebMIIA, data instances (e.g. questions and answers in a visual
since they are among the most recently released LLMs question-answer task) are machine-translated. For tasks
supporting the Italian language and clearly define the that use bounding boxes, we normalize the bounding box
amount of Italian data used in training. For each model, values to the [
          <xref ref-type="bibr" rid="ref4">0, 1</xref>
          ] range so that the values are
consiswe evaluate both its base and instruct variants at their tent with the reference images and independent of their
largest available parameter scale. The only exception resolution. For tasks that provide options to choose from
is represented by Velvet, for which only the instruct within the instruction, we format them as an ordered list
version is available. using either numbers, uppercase or lowercase letters, or
        </p>
        <p>For the vision backbone, we use the vision transformer plain text. In such cases, we also replace the target text to
of the CLIP [16] model, specifically, we focus on the large be predicted with the corresponding identifier (e.g. if the
checkpoint with patch size 14 and image size 336.4 We option is a number, the target text is also converted to a
use this model since it is often used in the state-of-the-art number). Finally, we append a string to guide model
reresearch as the visual backbone for LVLMs [3]. sponses, depending on the type of output that is expected:</p>
        <p>To train the models, we use the methodology of LLaVA "Rispondi solamente con il numero dell’opzione corretta
NeXT, because of both its performance and its open code- dalle scelte date." ("Answer with the option’s number
base, which allows for easier reproducibility of this study. from the given choices directly." in English) when the
opThis methodology is made of two steps: pre-training to tions are identified by numbers, "Rispondi solamente con
warm up the multi-layer perceptron projector and visual la lettera dell’opzione corretta dalle scelte date." ("Answer
instruction tuning to teach the model how to solve vision- with the option’s letter from the given choices directly."
language tasks. For both steps, training is performed in English) when the options are identified by letters,
using the next token prediction objective, implemented "Rispondi usando una zona di delimitazione." ("Answer
as cross-entropy loss. We report hyperparameters used using a bounding box." in English) when the target text is
for both steps in Table 1. For base models, we apply the a bounding box and, finally, "Rispondi usando una singola
Mistral chat template reported in Listing 1, since they do parola o frase." ("Answer the question using a single word
not have a chat template associated with them, while for or phrase." in English) for all other cases. In total, the
instruct models we apply their own chat template. training mixture combining these two datasets consists
of 172,335 instances.</p>
        <sec id="sec-2-4-1">
          <title>3.1. Training Mixture</title>
        </sec>
        <sec id="sec-2-4-2">
          <title>3.2. Hardware and Software</title>
        </sec>
        <sec id="sec-2-4-3">
          <title>Configuration</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>For both training steps, we use a state-of-the-art machine</title>
        <p>translation model to translate popular vision-language
        <p>For pre-training, we use the same dataset as LLaVA, translated to the Italian language. During pre-training, the whole model is kept frozen, except for the multi-layer perceptron. Thanks to this approach, the multi-layer perceptron weights are initialized so that the vision embeddings are correctly projected into the LLM's space. For visual instruction tuning, we consider a combination of two datasets: MultiInstruct [18] and the conversational split of the LLaVA-Instruct [3] dataset. The former is a collection of diverse vision-language tasks (e.g. Visual Question Answering, Visual Grounding, ...), which allows the model to learn to correctly solve this type of task, while the latter is a multi-turn dataset generated by prompting GPT-4. Thanks to this training mixture, the model learns both to solve tasks and to provide meaningful and complete responses to user prompts.</p>
        <p>For MultiInstruct, we perform some additional processing operations, sketched in the listing below. Instructions are manually translated, therefore only the data instances (e.g. questions and answers in a visual question-answer task) are machine-translated. For tasks that use bounding boxes, we normalize the bounding box values to the [0, 1] range so that the values are consistent with the reference images and independent of their resolution. For tasks that provide options to choose from within the instruction, we format them as an ordered list using either numbers, uppercase or lowercase letters, or plain text. In such cases, we also replace the target text to be predicted with the corresponding identifier (e.g. if the option is a number, the target text is also converted to a number). Finally, we append a string to guide model responses, depending on the type of output that is expected: "Rispondi solamente con il numero dell'opzione corretta dalle scelte date." ("Answer with the option's number from the given choices directly." in English) when the options are identified by numbers, "Rispondi solamente con la lettera dell'opzione corretta dalle scelte date." ("Answer with the option's letter from the given choices directly." in English) when the options are identified by letters, "Rispondi usando una zona di delimitazione." ("Answer using a bounding box." in English) when the target text is a bounding box and, finally, "Rispondi usando una singola parola o frase." ("Answer the question using a single word or phrase." in English) for all other cases. In total, the training mixture combining these two datasets consists of 172,335 instances.</p>
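        <p>A minimal sketch of these processing operations follows; the function names and task-type labels are hypothetical, chosen only for illustration.</p>
        <preformat>
def normalize_bbox(bbox, image_width, image_height):
    """Rescale absolute pixel coordinates (x1, y1, x2, y2) to the
    [0, 1] range, making them independent of the image resolution."""
    x1, y1, x2, y2 = bbox
    return (round(x1 / image_width, 3), round(y1 / image_height, 3),
            round(x2 / image_width, 3), round(y2 / image_height, 3))

def guide_string(task_type: str) -> str:
    """Select the Italian string appended to guide model responses."""
    if task_type == "options_numbers":
        return "Rispondi solamente con il numero dell'opzione corretta dalle scelte date."
    if task_type == "options_letters":
        return "Rispondi solamente con la lettera dell'opzione corretta dalle scelte date."
    if task_type == "bounding_box":
        return "Rispondi usando una zona di delimitazione."
    return "Rispondi usando una singola parola o frase."

print(normalize_bbox((50, 30, 250, 180), image_width=500, image_height=300))
# result: (0.1, 0.1, 0.5, 0.6)
        </preformat>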
      <sec id="sec-2-6">
        <title>4https://huggingface.co/openai/clip-vit-large-patch14-336</title>
      </sec>
      <sec id="sec-2-7">
        <title>5Fastweb Announcement</title>
        <p>6https://www.hpe.com/us/en/software/marketplace/
hpe-ml-development-environment.html
nodes. The software stack was based on open-source li- 4. Experiments
braries, including Transformers from Hugging Face [19],
which provides seamless integration with PyTorch [20] 4.1. Experimental Setting
ammneodndDtealelse.inpSepficeieendtl[y21h]a.nTdhliinsgsolaftrwgeardeastatascektshaansdbecoenmipnlsetxru- Telos,ewvaeluuastee
tthhreeevidsaiotans-elatsn:guGaQgAe-aIbTil[it2y2,o2f3t]h,eMseTmVQodAtaddhnuaetTdcaichb.fooiiIsmltrithaptyrala,sarrsoidacnwtariielnvaafleegrbceiatll-asinstroayaglfyebtaw-snrsiodascaraoedelficfeeicmremonnunocalfigytdtui,ieporwllasnehtaoimiolcnnhoeIfodteareanrtllesiataucornrrcweu-hldacaitirnraedeglcsptufuoaraorrge-es
i[ssum2wtaa4laenl]gyr,cieenEtsdrgX.aaAdntMaasMstlTeaaSVtst.-eeQVTdtAho[sen2pi5sdlni]ataa.ttatmuGosreQaaItntlApauslr-ciaoIaelTvlnnyi,edissaec.nsoaWnnsvpsoeiiltssicatutsotiaennfldogsqritdouesefeexrvst3-tie,tic0rosea0nnml0tlaaarinninnc---sovereign AI infrastructure, ensuring data localisation, guages, in this work we focus on the Italian split, which
traFnosrpatrraeinnciyngantdheremguoldaetlosr,ywceomuspeli2anGcPe.Us. The whole cMoTnVsiQstAs -oIfT8.8E4XqAuMesSt-iVonis-aancsowlleerctpiaoinrso.fWmeurlteifpelre-tcohioticaes
training procedure takes about 24 hours for each model. school exam questions in multiple languages. In this case,
we focus on the Italian split as well, which consists of
1,645 question-answer pairs. We refer to it as
EXAMS-VIT</p>
        <p>To take into account the effect of using different prompts for the same model, we test all models and all datasets using four different styles of formatting. Specifically, to evaluate these models, an additional string is added to the prompt to limit the generated output. In English, the string that is used depends on the model and the formatting of its training mixture; however, the original LLaVA, and most other models following its setup, used "Answer the question using a single word or phrase." for open-ended tasks and "Answer with the option's letter from the given choices directly." for closed-ended ones.</p>
        <p>Thanks to this, it is possible to use exact match as metric, where the generated output is compared directly to the ground truth (i.e. hard syntactic match), since the model is instructed to generate only the text that is relevant w.r.t. the label. Due to this, we want to understand if and how the model performance is affected by this string. If we change this string to one with a similar meaning, does the model generate outputs consistently? Does the position of the string matter? To answer these questions, we apply four different formattings to the datasets (a sketch of how they are applied follows the list):
• Post: "\nRispondi utilizzando una sola parola o frase." (or "\nRispondi utilizzando direttamente la lettera dell'opzione corretta tra quelle date." for closed-ended tasks) appended to the end of the instruction;
• Pre-Swap: "Rispondi utilizzando una sola parola o frase.\n" (or "Rispondi utilizzando direttamente la lettera dell'opzione corretta tra quelle date.\n" for closed-ended tasks) appended to the beginning of the instruction;
• Post-Swap: "\nRispondi in modo breve e diretto." (or "\nRispondi con la lettera." for closed-ended tasks) appended to the end of the instruction;
• Pre: "Rispondi in modo breve e diretto.\n" (or "Rispondi con la lettera.\n" for closed-ended tasks) appended to the beginning of the instruction.</p>
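        <p>The following sketch shows how the four variants can be derived from a single instruction; it is our illustration of the procedure described above, with the open-ended strings shown.</p>
        <preformat>
ORIG = "Rispondi utilizzando una sola parola o frase."
SWAP = "Rispondi in modo breve e diretto."

def format_prompt(instruction: str, style: str) -> str:
    """Apply one of the four formattings (open-ended variant)."""
    if style == "post":       # original string, end of the instruction
        return f"{instruction}\n{ORIG}"
    if style == "pre-swap":   # original string, beginning of the instruction
        return f"{ORIG}\n{instruction}"
    if style == "post-swap":  # paraphrased string, end of the instruction
        return f"{instruction}\n{SWAP}"
    if style == "pre":        # paraphrased string, beginning of the instruction
        return f"{SWAP}\n{instruction}"
    raise ValueError(f"unknown style: {style}")

for style in ("post", "pre-swap", "post-swap", "pre"):
    print(format_prompt("È nuvoloso?", style))
        </preformat>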
        <p>A model that performs well for all four formattings can be considered a consistent model, capable of answering user queries regardless of the syntax used in the request. Finally, all results are obtained using greedy decoding as sampling strategy at inference time, which removes randomness in generation and guarantees improved reproducibility of the obtained results. For all tasks, we use the question and answer pairs provided by the task itself. The only exception is EXAMS-V-IT where, since the question and choices are embedded within the image itself, we use the following string as question: "Fornisci una risposta alla domanda presente nell'immagine." ("Provide an answer to the question in the image" in English). All models are evaluated using the lmms-eval framework (https://github.com/EvolvingLMMs-Lab/lmms-eval), loaded in float16 as dtype, and inference is performed with a batch size of 1, ensuring reproducibility of the results. Finally, we lowercase text and ground truth and ignore whitespaces when evaluating using exact match.</p>
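        <p>A sketch of this inference configuration is shown below. The checkpoint name is illustrative (a public LLaVA-NeXT model), and this is not the exact lmms-eval invocation; the point is the greedy decoding (do_sample=False), the float16 loading and the batch size of 1.</p>
        <preformat>
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "[INST] &lt;image&gt;\nÈ nuvoloso? Rispondi usando una singola parola o frase. [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Greedy decoding: no sampling, so repeated runs give the same output.
output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
        </preformat>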
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results Discussion</title>
        <p>We report the results of the experiments in Table 2. For the sake of comparison against already existing models, we also report the results of LLaVA-NeXT 8B [26], a LLaVA-NeXT model trained from the LLaMA 3 Instruct 8B checkpoint, on these benchmarks. Overall, models trained on Italian perform well w.r.t. LLaVA-NeXT 8B. Remarkably, the base version of EuroLLM has the best average performance in GQA-IT, while the base version of FastwebMIIA has the best average performance in EXAMS-V-IT. In MTVQA-IT, Italian models tend to perform poorly w.r.t. LLaVA-NeXT 8B. We believe this is due to the low quantity of text-centric vision-language instances in the training mixture, since MultiInstruct tasks focus more on natural scenes and everyday images. We can reasonably expect an improvement in performance for text-centric tasks when integrating this type of task in the training mixture.</p>
        <p>Additionally, we showcase that the models are very sensitive to the formatting of the prompt. For example, while the base version of EuroLLM achieves the best average performance on GQA-IT, it performs well on only two out of the four formattings. This pattern can also be seen in the other models in our evaluation: in most cases, the models tend to perform better in a limited subset of formattings. After manually analyzing the generated outputs, we find that there are cases where the models generated the correct answer, but with additional contextual text. For example, for the question "È nuvoloso?" ("Is it cloudy?" in English) with label "Sì" ("Yes" in English), Minerva instruct answered "Sì" in the Post formatting, while it answered "Sì, è nuvoloso nell'immagine." ("Yes, it is cloudy in the image" in English) in the Post-Swap formatting. In both cases, the answer is correct, but the exact match metric fails to consider the second case as correct, since there is no hard syntactic match between the generated output and the label. In light of this, we propose further evaluation to study the relationship between performance and the length of the generated response.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluating for Response Length</title>
        <p>To further understand whether the models provide outputs that are relevant, we evaluate them by performing an approximate match between the label and the generated output. That is, we check that the label is a substring of the generated output. This allows us to cover cases where the model keeps generating contextual text together with the task answer. For example, for the question "C'è una palla da calcio nell'immagine?" ("Is there a football ball in the image?" in English) with label "Sì" ("Yes" in English), the model may generate "Sì, c'è una palla da calcio nell'immagine.". This case is considered incorrect by the exact match metric, since the generated output is not the same as the ground truth label. However, the answer is correct, and the ground truth label is in the generated string itself. Our approach allows covering these corner cases; note, however, that this strategy suffers from false positives. For example, for the question "C'è una mano nell'immagine?" ("Is there a hand in the image?" in English) with label "No", the model may generate "Sì, c'è una mano nell'immagine" ("Yes, there is a hand in the image" in English), and it would be considered a correct answer since "no" is a substring of "mano". We showcase some examples in Figure 1. To assess the performance of the models regardless of the response length, we consider the formatting where each model has performed the worst. We retrieve the generated outputs and corresponding ground truth labels and evaluate them using an approximate match. We expect an improvement in performance w.r.t. exact match. Note that we do not perform this evaluation for EXAMS-V since the task is closed-ended and the answers are the identifiers of the options (e.g. "A", "B"), making it impossible to evaluate the task using this strategy.</p>
        <p>Results for evaluation performed using this approach are reported in Table 3. As expected, we can appreciate a great improvement in performance for most models. For example, for the base version of EuroLLM-9B, performance rises from .0973 to .4807, and a similar trend can be seen in the instruct version of the model. For most models, we can observe an increase in performance in approximate match, except for Velvet, where the performance remains the same. To further validate this finding, we also evaluate under the same setting the formatting where the models performed best. Results for approximate match evaluation of the best formatting are reported in Table 4. Overall, the results are a lot more stable, and the degree of improvement is smaller than for the worst formatting evaluated using approximate match. This highlights that the models in their best formatting performed well because they were able to generate the expected output directly and consistently, without adding additional contextual text to the answer. However, we emphasize that the worst formatting evaluated with approximate match actually showcases better performance w.r.t. the best formatting evaluated with approximate match. For example, the base version of EuroLLM achieves an approximate match of .4807 on GQA-IT for its worst formatting, while it achieves an approximate match of .4497 for its best one. This pattern can be seen for all models, including LLaVA-NeXT, the only exception being Velvet, where performance is consistent for both formattings. This finding highlights that LLMs tend to provide better answers when they are able to provide a longer response.</p>
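        <p>A minimal sketch of the two matching strategies, as described above (our paraphrase, including the lowercasing and whitespace normalization used for exact match):</p>
        <preformat>
import re

def exact_match(generated: str, label: str) -> bool:
    """Hard syntactic match: lowercase and ignore whitespace."""
    norm = lambda s: re.sub(r"\s+", "", s.lower())
    return norm(generated) == norm(label)

def approximate_match(generated: str, label: str) -> bool:
    """The label must appear as a substring of the generated output."""
    return label.lower() in generated.lower()

print(exact_match("Sì, c'è una palla da calcio nell'immagine.", "Sì"))        # False
print(approximate_match("Sì, c'è una palla da calcio nell'immagine.", "Sì"))  # True
# Known false positive: "no" is a substring of "mano".
print(approximate_match("Sì, c'è una mano nell'immagine", "No"))              # True, but wrong
        </preformat>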
      </sec>
      <sec id="sec-2-10">
        <title>Finally, we also test the ability of the LVLMs in solving</title>
        <p>Italian text-only tasks, rather than vision-language ones.
This aims to determine whether the models retain the
knowledge they learned during their original text-only
training procedure. Since the models didn’t see text-only
data during vision-language training, we expect their
performance to be lower with respect to their original LLM
version. Since we only want to have a general estimate of
their performance, we consider a relatively small subset
of Italian tasks available through the lm-eval-harness8
framework. Namely, we consider Global-MMLU [27],
specifically its LITE subset. The dataset is a balanced
collection of culturally sensitive and culturally agnostic
MMLU tasks (a massive multitask test dataset consisting
of multiple-choice questions from various branches of
knowledge), where only languages with human
translation and post-edits are included. Results are reported in
Table 5. Surprisingly, there are models which perform
better after the visual instruction-tuning step. For
example, the base version of Minerva-7B performs better
on four out of the six categories of the dataset. Similar
behaviour is also showcased by other models, for
example, the instruct version of EuroLLM-9B also performs
better on four out of the six categories, while the base
version of FastwebMIIA performs better on five of them.
This showcases that a vision-language training procedure
8https://github.com/EleutherAI/lm-evaluation-harness
may also enhance the language-only performance of the
model. However, there is an outlier to this pattern, that
is Velvet-14B, where the original version of the model
performs better on all categories. Furthermore, for the
other models, there is no consistent improvement across
all categories. This highlights that, while
multimodality has helped improve the inherent knowledge of these
models, it is not guaranteed, and text-only evaluation is
still relevant for multimodal models.</p>
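        <p>As a sketch of how this text-only evaluation can be run programmatically (the Global-MMLU Lite task identifier below is an assumption; the exact name should be checked against the task registry of the installed lm-eval-harness version):</p>
        <preformat>
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Hypothetical path to one of the extended models.
    model_args="pretrained=path/to/extended-model,dtype=float16",
    tasks=["global_mmlu_lite_it"],  # assumed task name, verify in the registry
    batch_size=1,
)
print(results["results"])
        </preformat>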
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <sec id="sec-3-1">
        <title>In this work, we have expanded the current landscape of</title>
        <p>LVLMs for the Italian language. We have collected a pool
of LLMs supporting the Italian language, which only
process textual inputs. Then, we have extended them
to LVLMs, by employing a state-of-the-art approach,
namely LLaVA-NeXT, and a machine-translated corpus
of vision-language tasks in Italian. Additionally, we
evaluated them using only benchmarks that are natively in
Italian and also studied the efect on the length of the
generated response in evaluation. Finally, we also
benchmarked these models on an Italian text-only benchmark
to understand if the performance for text-only tasks was
worse after the visual instruction-tuning step. As future
work, we plan to further extend the training mixture
so that it also considers text-centric tasks in Italian,
improving model performance on this type of task that is
currently missing in the training mixture. Specifically,
we plan to incorporate multimodal document data to
enhance these models in document visual question
an</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>We acknowledge the support of the PNRR project FAIR Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</title>
        <p>swering. We also plan to further extend the evaluation
and to improve the approximate match strategy, which
soundness currently sufers from the possibility of false
positives.
Declaration on Generative AI
During the preparation of this work, the author(s) used Grammarly in order to: Grammar and
spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>large audited dataset</article-title>
          ,
          <source>Advances in Neural Informa- tion answering, arXiv preprint arXiv:2405.11985</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>tion Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>67284</fpage>
          -
          <lpage>67296</lpage>
          . (
          <year>2024</year>
          ). [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Multiinstruct: Improving [25]
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D. I.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
          </string-name>
          , I. Koy-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>ing, in: Proceedings of the 61st Annual Meeting tilingual multimodal exam benchmark for eval-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>11445</fpage>
          -
          <lpage>11465</lpage>
          . arXiv:
          <volume>2403</volume>
          .10378 (
          <year>2024</year>
          ). [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , C. De- [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>towicz</surname>
            , J. Davison,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma, knowledge,
          <year>2024</year>
          . URL: https://llava-vl.github.io/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          , S. Gugger, blog/2024-01-30
          <string-name>
            <surname>-</surname>
          </string-name>
          llava-next/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers: [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Adelani</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. G.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Proceedings of the 2020 Conference on Empirical sio</source>
          ,
          <string-name>
            <given-names>W. Q.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Susanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ng</surname>
          </string-name>
          , S. Longpre,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https:// B. Ermis,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          , Global mmlu: Understanding
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          www.aclweb.org/anthology/2020.
          <article-title>emnlp-demos.6. and addressing cultural and linguistic</article-title>
          biases in mul[20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ansel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Jain, tilingual evaluation,
          <year>2024</year>
          . URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Voznesensky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          , D. Berard, abs/2412.03304. arXiv:
          <volume>2412</volume>
          .
          <fpage>03304</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>P. Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Chintala</surname>
          </string-name>
          , Pytorch 2: Faster machine
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>formation and graph compilation</article-title>
          ,
          <source>in: 29th ACM</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>tems</surname>
          </string-name>
          , Volume
          <volume>2</volume>
          (
          <issue>ASPLOS</issue>
          '24), ACM,
          <year>2024</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          https://pytorch.org/assets/pytorch2-
          <fpage>2</fpage>
          .pdf. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <volume>1145</volume>
          /3620665.3640366. [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>ifcient data sampling and routing, 2024</article-title>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          //arxiv.org/abs/2212.03597. arXiv:
          <volume>2212</volume>
          .
          <fpage>03597</fpage>
          . [22]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , Gqa: A new
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>pattern recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6700</fpage>
          -
          <lpage>6709</lpage>
          . [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basili</surname>
          </string-name>
          , et al.,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          2021 Proceedings of the Eighth Italian
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          3033,
          <year>2021</year>
          . [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>