1. Introduction

Balancing Translation Quality and Environmental Impact: Comparing Large and Small Language Models

Antonio Castaldo

0 1

Petra Giommarelli

0 1

Johanna Monti

0 0 University of Naples L'Orientale , Via Chiatamone, 61/62, 80121 Naples , Italy 1 University of Pisa, Largo Bruno Pontecorvo , 3, 56127 Pisa , Italy

2025

Large Language Models (LLMs) have demonstrated remarkable performance in machine translation (MT), specifically concerning high-resource European languages. However, their extensive computational requirements raise sustainability concerns. This paper investigates the potential of smaller, fine-tuned language models as a more sustainable alternative for MT tasks. We conduct a comparative analysis of model performance in terms of translation quality and CO2eq emissions, and examine the key errors associated with using smaller models. Furthermore, we propose a novel metric that balances translation quality against environmental impact, aiming to inform more sustainable model selection in MT research and practice.

eol>machine translation large language models sustainability

1. Introduction

larger models. This setup allows us to assess the realworld viability of small models for machine translation MT has been a core topic in natural language processing when fine-tuned for specific language pairs and domains. (NLP) for several decades, evolving from rule-based sys- We conduct a comprehensive analysis of model perfortems to statistical methods, and more recently to neural mance, in terms of translation quality and CO2eq emismachine translation (NMT) and transformer-based mod- sions, validating our results with a human evaluation els. The emergence of LLMs has significantly advanced of the key errors associated with each model. Finally, the state-of-the-art in MT, demonstrating remarkable we introduce a metric called Carbon-Adjusted Quality performance on various NLP tasks [ 1 ]. Score (CAQS), designed to facilitate sustainable model se

Their ability to generate fluent, context-aware trans- lection, that quantifies the trade-of between translation lations in diferent domains has positioned LLMs at the quality and sustainability. forefront of MT research [ 2 ]. Their ability to model context, semantics, and discourse phenomena makes them highly attractive for both academic and industrial trans- 2. Background lation applications.

However, this performance comes at a significant envi- 2.1. LLMs and Translation ronmental cost. Training and deploying LLMs consumes LLMs have achieved state-of-the-art results in MT, by enormous computational resources, leading to consider- leveraging extensive pretraining on multilingual corpora, able carbon emissions and infrastructure demands [ 3, 4 ]. enabling them to deliver remarkable performance across These challenges have prompted the exploration of more a wide range of domains and language pairs [ 6 ]. In consustainable alternatives. trast to NMT systems, which rely primarily on paral

This paper investigates whether smaller language mod- lel corpora, LLMs are pretrained on massive web-scale els can serve as eficient and environmentally sustainable monolingual and multilingual datasets. This enables valid alternatives to LLMs in MT. Specifically, we will them to generate high-quality translations even in doifne-tune the Gemma-3-4B[ 5 ] model on a parallel English- mains where parallel data is limited [ 7 ]. Italian (EN-IT) parallel corpus, and evaluate its perfor- Notably, GPT-based models excel at producing conmance, with human and automatic evaluation, against textually accurate translations, efectively capturing disCLiC-it 2025: Eleventh Italian Conference on Computational Linguis- course relations and maintaining sentence-level cohertics, September 24 — 26, 2025, Cagliari, Italy ence. They consistently outperform encoder-decoder † The authors contributed equally. architectures such as Transformer-big and M2M100, par$ antonio.castaldo@phd.unipi.it (A. Castaldo); ticularly in zero-shot and few-shot settings [ 8 ]. petra.giommarelli@phd.unipi.it (P. Giommarelli); jmonti@unior.it Moreover, LLMs support document-level translation (J. M00o0n9t-i0)008-3325-787X (A. Castaldo) by leveraging discourse-aware context windows, which © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License enable the maintenance of lexical cohesion and consistent Attribution 4.0 International (CC BY 4.0). resolution of anaphoric references across sentences [ 9 ]. due to optimizations in model architecture and training This capability results in more fluent translations, making strategy [14].

LLMs increasingly favored in professional translation Moreover, recent studies show that even highly comsettings. plex capabilities like multi-step reasoning, previously

The adoption of LLMs, however, requires substantial thought to emerge only in models over 100B parameters, computational resources and infrastructure, which may can be acquired by SLMs through targeted fine-tuning not be feasible for all organizations or languages. Beyond and distillation. Distilling chain-of-thought reasoning these practical limitations, the widespread adoption of abilities, for instance, from GPT-3.5 into FlanT5 variants LLMs also raises significant concerns about their envi- (250M to 3B) resulted in significant performance improveronmental sustainability. ments on math reasoning tasks without the need for full retraining of the model’s weights [ 16 ]. 2.2. LLMs Sustainability A comprehensive survey of SLMs underscores the value of model compression techniques such as pruning, While Large Language Models (LLMs) have enabled re- quantization, and knowledge distillation. These enable markable progress in NLP, their growing environmental the deployment of eficient models on mobile and edge defootprint raises important sustainability concerns. Train- vices while maintaining competitive accuracy for many ing large-scale models such as GPT-3, with hundreds tasks [ 17 ]. The adoption of SLMs is particularly promisof billions of parameters, can consume up to 1.3 GWh ing for democratizing NLP, enabling smaller institutions of electricity, comparable to the yearly energy usage of and low-resource languages to benefit from modern AI more than 100 US homes [ 10 ]. This results in hundreds of without the environmental or infrastructural burden of tons of CO2 emissions, depending on the carbon intensity LLMs. of the power grid. In this study, we evaluate whether SLMs, when com

In addition to training, the inference phase of LLMs bined with modern fine-tuning strategies and lightweight also significantly contributes to their overall carbon foot- architectures, could ofer a pragmatic and sustainable print, particularly in large-scale deployments. While the path forward for machine translation and other NLP apenergy cost of a single inference is lower than that of plications. training, the cumulative emissions can become substantial depending on usage patterns. For example, serving a single ChatGPT prompt may emit over 4g of CO2eq, 3. Fine-tuning a SLM more than 20 times the emissions of a typical web search [ 11 ].

The same study emphasizes that total environmental impact depends on a combination of factors: model size, batch size, and hardware type. The latter reflects the impact of producing high-performance GPUs, which involves substantial embodied carbon emissions. Although these emissions occur at production time, they contribute to the model’s overall environmental cost throughout its operational lifetime.

To demonstrate the efectiveness of using SLMs as sus

tainable alternatives to larger, more resource-intensive models in machine translation, we compare two stateof-the-art models: GPT-4o-mini [ 18 ] and an open-source model, Gemma-3-4B [ 15 ], which is significantly smaller than its OpenAI counterpart.

We fine-tune Gemma-3-4B on a carefully curated subset of the OpenSubtitles corpus, obtained from the Opus Corpus [ 19 ]. We evaluate both models on a held-out test set of 400 segments for the English–Italian (EN-IT) language pair and present our findings.

2.3. Small Language Models Recent research has emphasized the growing feasibility

and importance of SLMs as eficient alternatives to LLMs in constrained environments [12, 13]. SLMs, typically ranging from hundreds of millions to a few billion parameters, are substantially more resource-eficient and accessible, especially when tailored to specific tasks.

SLMs benefit from architectural simplifications, such as compact tokenizers and reduced model width and depth, which are optimized to preserve key capabilities while minimizing parameter overhead [14]. Small models, like Gemma [ 15 ] and PanGu- -1.5B Pro model with only a few billion parameters have recently outperformed much larger models on several benchmarks

3.1. Dataset Curation For our experiments, we focused on the EN–IT subset

of the OpenSubtitles corpus, made available through the Opus Corpus repository. While OpenSubtitles is a rich resource for dialogue-based translation data, it also contains a considerable amount of noise due to its automatic extraction and alignment process. Therefore, careful curation was necessary to ensure the quality and relevance of the dataset.

We began by removing duplicate entries and any empty lines. Following this, we applied the langdetect [ 20 ] tool to verify the language of each sentence. This step was essential, as web-crawled corpora, although environmental impact of our training process. intended to be language-specific, occasionally contain CodeCarbon is a Python library that estimates carsegments in other languages. Sentences detected to be in bon emissions by tracking the energy consumption of languages outside our target pair, and that could not be computing resources (CPU, GPU, RAM) during code execlassified with a high confidence score, were filtered out. cution and combining this data with the carbon intensity

Finally, we applied COMET-QE [21], a quality estima- of the electricity grid based on geographic location. tion model, to score the remaining sentence pairs. Using The fine-tuning session consumed approximately 0.65 these scores, we selected the top 100,000 highest-quality kWh, resulting in an estimated 162 g CO2eq under an translations for use in our fine-tuning experiments. The average EU grid intensity of 250 gCO2/kWh. strategy of mining large datasets and selecting top-k sentence pairs based on quality metrics for fine-tuning helps 3.3. Gemma-3 Evaluation to further filter out noisy segments and ensures that the limited available data contribute maximally to model training [ 22 ]. This approach is consistent with our goal of reducing computational costs. By carefully curating a smaller but higher-quality dataset, we limit energy consumption and the associated environmental costs, while maximizing translation performance.

We conduct our evaluation on a held-out test set of 400

segments from the same corpus, ensuring no overlap with the training data. Table 2 reports the evaluation of EN–IT translation performance for Gemma-3-4B before and after LoRA fine-tuning, using BLEU [ 27], chrF [29], and COMET [30] as quality metrics. Our fine-tuned Gemma-3-4B model, with only 0.42% of additional trainable parameters, shows a notable improvement over the base version, achieving a +4 point gain in BLEU, a modest increase in chrF, and a +1 point gain in COMET. These results place our model on par with GPT-4o in COMET and above GPT-4o-mini in all three metrics.

In addition to performance, we also measure the environmental impact of inference using the CodeCarbon library. The estimated carbon emissions per inference for the fine-tuned model are approximately 0.028g CO 2eq, twice that of the base model, but significantly lower than GPT-4o models, each exceeding 0.42g per inference as estimated in a relevant study [ 31 ].

Our evaluation demonstrates that fine-tuning Gemma3-4B with LoRA leads to competitive performance gains with low additional environmental cost.

3.2. Training

The Gemma-3-4B model was fine-tuned for three epochs using Low-Rank Adaptation (LoRA) [ 23 ], a fine-tuning technique which injects small trainable matrices in the model’s weights. The adoption of LoRA for fine-tuning has shown strong empirical results in machine translation [24, 25], enhancing eficiency, while reducing train- 4. Quality-Sustainability Trade-Of ing time and computational costs. As demonstrated in experiments conducted by [26], fine-tuning with LoRA In our second experiment, to further assess the viability obtained the same improvements in terms of BLEU score of trading of quality for sustainability with the use of [27], while drastically reducing training time and modify- SLMs, we extend our evaluation on a set of multilingual ing only a small number of trainable parameters, with re- LMs, of diferent parameter sizes. We select the models spect to supervised fine-tuning involving all parameters for our evaluation based on state-of-the-art performance of the original network. In our case, we train efectively and usage in the research community. We benchmark 0.42% of the trainable parameters, corresponding to the each model on the same held-out EN–IT test set, using LoRA adapter matrices injected in Gemma-3-4B. BLEU, chrF and COMET, and log the CO2eq emissions per

Our fine-tuning pipeline was implemented using the inference using the CodeCarbon framework. Importantly, Hugging Face Transformers library [28], leveraging its we emphasize in our approach that a sustainable model integration with the PEFT library. For the LoRA config- choice should not be based on its parameter size alone, uration, we set the rank (r) to 16 and the scaling factor but actual carbon emissions. (alpha) to 16, with a dropout rate of 0.05 to improve As shown in Table 3, we highlight that the relationgeneralization. The training was carried out on a single ship between model size and emissions is non-linear. NVIDIA A100 GPU using mixed-precision (fp16) compu- For instance, Qwen-3B [ 32 ], despite its relatively small tation. We used the CodeCarbon1 library to monitor the size, exhibits disproportionately high emissions. This can be attributed to its reasoning behavior during inference, which results in extended reasoning outputs before gen- calculating a carbon-adjusted score that considers both erating a final answer. This behavior increases inference translation quality and sustainability. latency and environmental cost.

Similarly, the assumption that larger models necessarily produces more carbon emissions does not always hold. 5. Error Analysis This is the case for models developed with a Mixture-ofExperts (MoE) architectures. In these models, only a sub- To complement the quantitative results and better set of the total parameters is activated during inference. understand the practical implications of the qualityAs a result, MoE models like Mixtral, although large in sustainability trade-of, we conduct a manual error analaggregate size, can have lower or comparable emissions ysis on the translations generated by four representative to smaller, densely activated models. This decoupling models: our fine-tuned version of Gemma-3-4B, and the of parameter size and runtime eficiency highlights the baseline instruction-tuned Gemma-3-27B, Llama-3.2-3B need for measuring more empirical results, such as CO2eq and Llama-3.3-70B. emissions.

Therefore, we introduce a Carbon-Adjusted Quality Score (CAQS) metric as a measure of model costefectiveness, and we calculate it on each corpus translation generated by the models evaluated in our study. Our CAQS score penalizes each gram of carbon emissions exponentially, while ensuring that low-quality models are not rewarded more than high-quality ones, regardless of their eficiency. We define the CAQS metric as follows.

6. Conclusions

larger and environmentally demanding model, Llama-3.3- In this study, we investigated the potential of SLMs as 70B. In terms of weighted scores, both models show simi- sustainable alternatives to LLMs, for MT tasks focusing lar results, with very few major errors and a comparable on the EN-IT language pair. Our results demonstrate number of minor ones. The smallest Llama checkpoint that parameter-eficient fine-tuning of SLMs can achieve presents a very high number of both major and minor competitive translation quality while dramatically reducerrors, when compared to the Gemma-3-4B model. The ing environmental impact. The fine-tuned Gemma-3-4B ifndings may suggest that Llama-3’s architecture is sub- model achieved performance comparable to GPT-4o and optimal for translation tasks across model sizes, given outperformed GPT-4o-mini across all metrics, while conthat Gemma-3-4B matches the performance of its largest suming approximately 15 times less energy per inference. checkpoint. However, the results should be interpreted We complement these results with a MQM human evalwith caution, as our evaluation was limited to a small uation across a set of representative models, confirming test set and a single language pair. that Gemma-3-4B performed comparably to the much

In terms of error category distribution increasing pa- larger Llama-3.3-70B, producing only minor fluency and rameter size leads to an overall performance improve- spelling errors. ment, as seen in Table 5. This trend is particularly evident We also highlighted that the relationship between within the Gemma models, where the jump from 4B to model size and carbon emissions is non-linear and highly 27B parameters results in a significant drop in errors dependent on architectural choices, emphasizing the across all categories. In contrast, Llama-3.2 models ex- need for accurate measurements of carbon emissions. hibit a less linear improvement, suggesting diminishing Given the non-linear relation between model size returns from scaling model size. This observation, how- and environmental impact, we introduced the CAQS, a ever, is limited by the fact that only the smallest Gemma novel metric specifically designed to facilitate sustainable model was LoRA-adapted, while the LLaMA models were model selection by integrating translation quality and evaluated in their original form. A more rigorous com- carbon emissions. CAQS includes a sensitivity paramparison, involving both original and adapted versions eter that allows users to adjust how strongly quality is across model sizes, is left for future work. penalized by the model’s carbon footprint. According to

When comparing Gemma-3-4B and Llama-3.3-70B, we this metric, Gemma-3-4B and Magistral-Small emerged ifnd that most of the errors in the Gemma model are con- as the most eficient models in our study, ofering optimal centrated in surface-level issues, especially in spelling trade-ofs between sustainability and translation quality. diacritics. These errors, however, do not compromise

7. Limitations Acknowledgments This work has been funded by the Italian National PhD

programme in Artificial Intelligence, partnered by University of Pisa and University of Naples “L’Orientale”, through a doctoral grant (ID 39-411-24-DOT23A27WJ6603) established by Ex DM 318, of type 4.1, co-financed by the National Recovery and Resilience Plan. In light of practical constraints related to time and resources, the main limitations of our study lie in the relatively small sample of segments and the domain-specific nature of the OpenSubtitles corpus, used for both training and inference. For this reason, we highlight that our evaluation results may not be reproducible in other domains.

As our evaluation focuses on a relatively high-resource language pair (EN-IT), our findings may not be applicable for distant or low-resource pairs. Finally, our carbon emission measurements are specific to the computational infrastructure used (NVIDIA A100 GPUs, EU electricity grid). Results may difer when deploying models on diferent hardware configurations, cloud providers, or geographical regions.

Declaration on Generative AI

[5] Google

DeepMind

, Gemma: Open models for responsible ai , https://deepmind.google/models/ gemma/, 2024 . Accessed: 2025 -05-27.

[6]

Hendy ,

Abdelrehim ,

Sharaf ,

Raunak ,

Gabr ,

Matsushita ,

Y. J.

Kim ,

Afify ,

H. H.

Awadalla , How good are gpt models at machine translation? a comprehensive evaluation, 2023 . URL: https://arxiv.org/abs/2302.09210. arXiv: 2302 . 09210 .

[7]

He ,

Liang ,

Jiao ,

Zhang ,

Yang ,

Wang ,

Tu ,

Shi ,

Wang , Exploring human-like translation strategy with large language models , Transactions of the Association for Computational Linguistics 12 ( 2024 ) 229 - 246 . doi: 10 .1162/tacl_a_ 00642 .

[8]

Moslem ,

Haque ,

J. D.

Kelleher , A. Way, Adaptive machine translation with large language models , arXiv preprint arXiv:2301.13294 ( 2023 ).

[9]

Wang ,

Lyu ,

Ji ,

Zhang ,

Yu ,

Shi ,

Tu , Document-level machine translation with large language models , arXiv preprint arXiv:2304.02210 ( 2023 ).

[10]

U.S.

Energy Information Administration , Electricity use in homes, 2023 . URL: https://www.eia.gov/energyexplained/ use -of-energy/electricity-use-in-homes .php, accessed: 2025 -06-16.

[11]

Nguyen ,

Zhou ,

Ding , S. Liu, Towards sustainable large language model serving , 2024 . URL: https: //arxiv.org/abs/2501. 01990 . arXiv: 2501 . 01990 .

[1]

Lyu ,

Du ,

Xu ,

Duan ,

Wang , New trends [12]

Wang ,

Zhang ,

Wu ,

Mo , Q. Lu, in machine translation with large language models , W. Wang,

Li ,

Xu ,

Tang ,

He ,

Ma , 2023 . M. Huang , S. Wang , A comprehensive survey of

[2]

Jiao ,

Wang , J.-t. Huang,

Wang ,

Shi , Z. Tu, small language models in the era of large lanIs chatgpt a good translator? yes with gpt-4 as the guage models: Techniques, enhancements , appliengine, arXiv preprint arXiv:2301.08745 ( 2023 ). cations, collaboration with llms, and trustworthi-

[3]

Singh ,

N. P.

Patel ,

Ehtesham ,

Kumar , T. Ta- ness, 2024 . URL: https://arxiv.org/abs/2411.03350. laei Khoei , A survey of sustainability in large lan- arXiv:2411.03350. guage models: Applications , economics, and chal- [13] Y.-C. Lin , S.

Sharma , H.

Manikandan , J. Kumar, lenges, arXiv preprint arXiv:2412.04782 ( 2025 ). T. H. King , J. Zheng , Eficient multitask learning

[4]

M. C.

Rillig ,

Ågerstrand ,

Bi , K. A. Gould, in small language models through upside-down reU. Sauerland, Risks and benefits of large lan- inforcement learning , 2025 . URL: https://arxiv.org / guage models for the environment , Environmen- abs/2502.09854. arXiv:2502.09854. tal Science & Technology 57 ( 2023 ) 3464 - 3466 . [14]

Tang , K. Han, F . Liu,

Ni ,

Tian , et al., Rethinkdoi:10.1021/acs.est.3c01106. ing optimization and architecture for tiny language 2106 .09685, arXiv: 2106 .09685 [cs]. models, in: Proceedings of the 41st International [24]

Zheng ,

Hong ,

Liu ,

Wang ,

Su , Conference on Machine Learning, PMLR , 2024 .

Liang , S. Wu, Fine-tuning large language

[15]

Team ,

Kamath ,

Ferret ,

Pathak , N. Vieil- models for domain-specific machine translalard , R. Merhej, et al, Gemma 3 technical re- tion, 2024 . URL: https://arxiv.org/abs/2402.15061. port, 2025 . URL: https://arxiv.org/abs/2503.19786. arXiv: 2402 .15061. arXiv: 2503 . 19786 . [25] D. M. Alves , N. M.

Guerreiro , J.

Alves , J. Pombal,

[16]

Fu ,

Peng ,

Ou ,

Sabharwal , T. Khot, Special- R. Rei , J. G. C. de Souza , P.

Colombo , A. F. T.

Marizing smaller language models towards multi-step tins, Steering large language models for machine reasoning , in: Proceedings of the 40th International translation with finetuning and in-context learnConference on Machine Learning, PMLR , 2023 . ing, 2023 . URL: https://arxiv.org/abs/2310.13448.

[17]

C. V.

Nguyen ,

Shen ,

Aponte ,

Xia , arXiv: 2310 . 13448 . et al., A survey of small language models , 2024 . [26]

Zhang ,

Rajabi ,

Duh ,

Koehn , Machine arXiv: 2410 .20011. translation with large language models: Prompting,

[18] OpenAI , Gpt-4o, https://openai.com/gpt-4o, 2024 . few-shot learning, and fine-tuning with QLoRA , Accessed: 2025 -07- 23 . in: P. Koehn , B.

Haddow , T.

Kocmi , C. Monz (Eds.),

[19]

Tiedemann , S. Thottingal, OPUS-MT - Building Proceedings of the Eighth Conference on Machine open translation services for the World , in: A. Mar- Translation, Association for Computational Lintins,

Moniz ,

Fumega ,

Martins ,

Batista , guistics, Singapore, 2023 , pp. 468 - 481 . URL: https: L. Coheur , C. Parra , I. Trancoso, M. Turchi, //aclanthology.org/ 2023 .wmt- 1 .43/. doi: 10 .18653/ A. Bisazza,

Moorkens ,

Guerberof , M. Nurmi- v1/ 2023 .wmt- 1 .43. nen , L. Marg, M. L. Forcada (Eds.), Proceedings of [27]

Papineni ,

Roukos ,

Ward , W.-J. Zhu, Bleu: the 22nd Annual Conference of the European As- a method for automatic evaluation of machine sociation for Machine Translation , European Asso- translation, in: P. Isabelle , E.

Charniak , D.

Lin ciation for Machine Translation , Lisboa, Portugal, (Eds.), Proceedings of the 40th Annual Meeting of 2020 , pp. 479 - 480 . URL: https://aclanthology.org/ the Association for Computational Linguistics, As2020.eamt-1 .61. sociation for Computational Linguistics , Philadel-

[20]

Nakatani , Langdetect: Language detection library phia , Pennsylvania, USA, 2002 , pp. 311 - 318 . URL: for python, https://pypi.org/project/langdetect/, https://aclanthology.org/P02-1040/. doi: 10 .3115/ 2014 . Port of Google's language-detection library . 1073083 .1073135.

[22]

E. A.

Chimoto ,

B. A.

Bassett , Comet-qe and ac- //aclanthology.org/W15-3049/. doi: 10 .18653/v1/ tive learning for low-resource machine transla- W15-3049. tion , 2022 . URL: https://arxiv.org/abs/2210.15696. [30]

Rei ,

Stewart ,

A. C.

Farinha ,

Lavie , COMET: arXiv: 2210 .15696. A neural framework for MT evaluation , in: B. Web-

[23]

E. J.

Hu ,

Shen ,

Wallis ,

Allen-Zhu ,

Li , ber, T. Cohn,

He , Y. Liu (Eds.), Proceedings

Wang , L.

Wang , W. Chen, LoRA: Low-Rank Adap- of the 2020 Conference on Empirical Methods tation of Large Language Models , 2021 . URL: http: in Natural Language Processing (EMNLP) , As//arxiv.org/abs/2106.09685. doi: 10 .48550/arXiv. sociation for Computational Linguistics, Online, 2020 , pp. 2685 - 2702 . URL: https://aclanthology.org/ 2020 .emnlp-main. 213 /. doi: 10 .18653/v1/ 2020 . emnlp-main. 213 .

[31]

Jegham ,

Abdelatti ,

Elmoubarki ,

Hendawi , How hungry is ai? benchmarking energy, water, and carbon footprint of llm inference , 2025 . URL: https://arxiv.org/abs/2505.09598. arXiv: 2505 . 09598 .

[32]

Yang ,

Li ,

Yang ,

Zhang ,

Hui ,

Zheng ,

Yu ,

Gao ,

Huang ,

Lv ,

Zheng ,

Liu ,

Zhou ,

Huang ,

Hu ,

Ge ,

Wei ,

Lin ,

Tang ,

Yang ,

Tu ,

Zhang ,

Yang ,

Zhou ,

Lin ,

Dang ,

Bao ,

Yang ,

Yu ,

Deng ,

Li ,

Xue ,

Li ,

Zhang ,

Wang ,

Zhu ,

Men ,

Gao , S. Liu,

Luo ,

Li ,

Tang ,

Yin ,

Ren ,

Wang ,

Zhang ,

Ren ,

Fan ,

Su ,

Zhang ,

Wan ,

Liu ,

Wang ,

Cui ,

Zhang ,

Zhou ,

Qiu , Qwen3 technical report , 2025 . URL: https://arxiv. org/abs/2505.09388. arXiv: 2505 . 09388 .

[33] Mistral-AI , :, A.

Rastogi , A. Q.

Jiang , A.

Lo , G. Berrada, G. Lample, J.

Rute , J.

Barmentlo , K.

Yadav , e. a. Kartik Khandelwal, Magistral, 2025 . URL: https://arxiv.org/abs/2506.10910. arXiv: 2506 . 10910 .

[34]

Grattafiori ,

Dubey ,

Jauhri ,

Pandey ,

Kadian ,

Al-Dahle ,

Letman ,

Mathur , A . Schelten, e. a. Alex Vaughan, The llama 3 herd of models , 2024 . URL: https://arxiv.org/abs/2407.21783. arXiv: 2407 . 21783 .

[35]

Abdin ,

Aneja ,

Awadalla ,

Awadallah ,

A. A.

Awan ,

Bach ,

Bahree ,

Bakhtiari , J. Bao, e. a. Harkirat Behl, Phi-3 technical report: A highly capable language model locally on your phone , 2024 . URL: https://arxiv.org/abs/2404.14219. arXiv: 2404 . 14219 .

[36]

Lommel ,

Gladkof ,

Melby ,

S. E.

Wright , I. Strandvik,

Gasova ,

Vaasa ,

Benzo ,

R. M.

Sparano ,

Foresi ,

Innis , L. Han, G . Nenadic, The multi-range theory of translation quality measurement: Mqm scoring models and statistical quality control , 2024 . URL: https://arxiv.org/abs/2405.16969. arXiv: 2405 . 16969 .