<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pranav Kasela</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Braga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Ghiotto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Pilzer</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viviani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Raganato</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAUIN Dipartimento di Automatica e Informatica</institution>
          ,
          <addr-line>Politecnico di Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication - DISCo</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NVIDIA AI Technology Center</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università degli Studi di Pavia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present DIETA, a small, decoder-only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian-English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, complemented by 352 million back-translated sentence pairs generated with pretrained models. Additionally, we create and release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian-English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most other sub-3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian-English machine translation: https://github.com/pkasela/DIETA-Machine-Translation.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Translation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Italian-English Translations</kwd>
        <kwd>Parallel Corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The main contributions of this work are: (i) training and releasing a specialized, small decoder-only Transformer model optimized for high-quality Italian–English translation; (ii) creating and publicly releasing a large-scale, carefully curated parallel corpus from diverse sources, and generating a synthetic corpus through back-translation; (iii) introducing the new WikiNews-25 evaluation set to facilitate benchmarking on recent, human-corrected content; (iv) conducting thorough evaluations using multiple MT metrics.</p>
      <sec id="sec-1-1">
        <title>This section outlines the creation of a large Italian–English sentence pair corpus and a synthetic dataset derived from Web News and crawled data.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
    </sec>
    <sec id="sec-3">
      <title>3. Data Collection and Preparation</title>
      <sec id="sec-3-1">
        <title>Publicly available bilingual corpora play a central role in</title>
        <p>the development and evaluation of Machine Translation
(MT) systems. Among these, OPUS [20, 16] is a
wellknown source of multilingual datasets that have been
widely used in both statistical and neural MT research.</p>
        <p>Large-scale web-crawled corpora such as ParaCrawl [11]
and NLLB [21] are particularly noteworthy for their
coverage and scale, making them important resources for
training state-of-the-art multilingual MT models.</p>
        <p>Recent Transformer models such as mBART-50 [22], NLLB-200 [21], MADLAD-400 [23], Tower [24], and Gemma-2 [25] have shown that expanding language coverage and model capacity can significantly enhance many-to-many translation quality. However, the computational demands of these massive models, and the inherent competition for representational capacity across hundreds of languages, often leave room for improvement on specific language pairs such as English–Italian.</p>
        <p>For many language directions, the open OPUS-MT
family [17, 16] remains a widely used baseline, yet its more
compact architectures lag behind the newest LLM-based
systems in fluency and versatility.</p>
        <p>General-purpose models like the GPT and LLaMA series, when prompted or instruction-tuned, achieve impressive zero-shot MT results. Specialised variants, like GemmaX2-28 [26], further narrow the gap with commercial MT engines. Meanwhile, to strengthen the representation of Italian within multilingual LLMs, several initiatives have introduced Italian-focused systems. Models such as LLaMAntino [19], Minerva [27], Cerbero [28], ModelloItalia [29], and DanteLLM [30] leverage hundreds of billions of Italian tokens and human feedback to yield substantial improvements in Italian generation and understanding. Nonetheless, these models are designed as general-purpose language models and are not optimised specifically for the MT task.</p>
        <p>In this work, we introduce a compact, 0.5B-parameter
decoder-only model, trained from scratch on a total of
768 million parallel and synthetic sentence pairs,
delivering a purpose-built, open solution for English↔Italian
machine translation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>After cleaning, the corpus contains 207 864 437</title>
        <p>high-quality sentence pairs. For bidirectional training,
each pair is duplicated with explicit direction tags,
resulting in a total of 415 728 874 source–target examples,
as illustrated in Figure 2.
To build a decoder-only model for bidirectional English ↔
Italian translation, we make use of every public bitext for
the pair available in OPUS [20]. Sources span Web crawls
[31, 21, 11], Wikipedia [10, 32, 14], parliamentary/legal
proceedings [9, 33], and film/TV subtitles [ 12]. Because
the NLLB corpus [21] contains CCMatrix, we keep only
the NLLB portion to prevent duplication.</p>
        <sec id="sec-3-2-1">
          <title>Cleaning and quality control. We remove exact du</title>
          <p>plicates using OpusTools and OpusFilter [34, 35], then
pass each remaining sentence pair to the Phi-4 LLM [36]
with the binary prompt shown in Figure 1. Pairs that
receive no are discarded.
3.2. Synthetic Data via Back-Translation</p>
        </sec>
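        <p>For illustration, a minimal sketch of this filtering step is given below. It assumes the publicly released microsoft/phi-4 checkpoint loaded through a Hugging Face text-generation pipeline; the exact prompt template, batching, and inference code used for DIETA may differ.</p>
        <preformat>
# Minimal sketch of the binary quality filter (Figure 1); not the authors' exact script.
# Assumption: the Hugging Face checkpoint "microsoft/phi-4" and a chat-style pipeline.
from transformers import pipeline

filter_llm = pipeline("text-generation", model="microsoft/phi-4", device_map="auto")

PROMPT = (
    "Given the English and Italian sentences below, are they translations "
    "of each other? Answer with yes or no only.\n\nENG: {en}\nIT: {it}"
)

def keep_pair(en: str, it: str) -> bool:
    """Return True when Phi-4 answers 'yes' for the candidate pair."""
    out = filter_llm(
        [{"role": "user", "content": PROMPT.format(en=en, it=it)}],
        max_new_tokens=3,
        do_sample=False,
    )
    answer = out[0]["generated_text"][-1]["content"].strip().lower()
    return answer.startswith("yes")

pairs = [("The cat sleeps.", "Il gatto dorme."), ("Good morning.", "Arrivederci.")]
kept = [(en, it) for en, it in pairs if keep_pair(en, it)]  # the second pair should be rejected
        </preformat>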
      </sec>
      <sec id="sec-3-3">
        <title>To expand the parallel training corpus, we generated</title>
        <p>additional sentence pairs by back-translation [37]. As
monolingual sources we used the NewsCrawl1 corpora
[38] and the web-scale FineWeb collection [39, 40].</p>
        <p>NewsCrawl. We translated Italian articles from 2008–2018 and English articles from 2023 with the OPUS-MT-TC-BIG model [17, 41, 16]. The remaining segments (Italian 2019–2024 and English 2024) were translated with NLLB-200-3.3B [42]. In total, this yielded 144,189,087 synthetic sentence pairs, comprising 67.8 M Italian and 76.3 M English sentences.</p>
        <p>FineWeb. From the multilingual FineWeb2 we translated 108.5 M Italian sentences, and from the English FineWeb crawl we translated 100 M English sentences, resulting in a total of 208,516,318 sentences, using the multilingual GemmaX2-28-9B-v0.1 model [26].</p>
        <p>All translations were generated with the CTranslate2 toolkit (https://github.com/OpenNMT/CTranslate2) in greedy decoding mode for efficient inference with large Transformer models.</p>
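        <p>A minimal sketch of this decoding setup is shown below. It assumes an OPUS-MT model already converted with the ct2-transformers-converter tool and its SentencePiece model on disk; the paths and model choice are illustrative rather than the exact ones used to build our synthetic corpus.</p>
        <preformat>
# Greedy back-translation with CTranslate2 (illustrative paths, assumed conversion step).
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="opus-mt-it-en-ct2/source.spm")
translator = ctranslate2.Translator("opus-mt-it-en-ct2", device="cuda")

def back_translate(sentences):
    # Tokenize with SentencePiece, translate greedily (beam_size=1), detokenize.
    batch = [sp.encode(s, out_type=str) for s in sentences]
    results = translator.translate_batch(batch, beam_size=1)
    return [sp.decode(r.hypotheses[0]) for r in results]

print(back_translate(["Il gatto dorme sul divano."]))
        </preformat>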
      </sec>
      <sec id="sec-3-4">
        <title>Duplicating the OPUS parallel pairs to cover both trans</title>
        <p>lation directions (i.e., from English to Italian and vice
versa) yields 415,728,874 direction-specific examples.
When combined with the 144,195,695 NewsCrawl and
208,516,318 FineWeb synthetic pairs, the total training
set comprises 768,440,887 source–target examples. We
shufle the corpus once before mini-batch construction.
3.4. Evaluation Sets</p>
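        <p>As a concrete illustration, the duplication into direction-specific examples can be expressed as follows; the exact tag strings are assumed from the sample formatting shown in Figure 2.</p>
        <preformat>
# Turn one cleaned pair into two direction-tagged training examples (assumed tag format).
def make_examples(en: str, it: str):
    return [
        f"ENG: {en} IT: {it}",   # English to Italian
        f"IT: {it} ENG: {en}",   # Italian to English
    ]

pairs = [("The cat sleeps.", "Il gatto dorme.")]
examples = [ex for en, it in pairs for ex in make_examples(en, it)]
# 207,864,437 cleaned pairs therefore yield 415,728,874 direction-specific examples.
        </preformat>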
      </sec>
      <sec id="sec-3-5">
        <title>In addition to standard benchmarks, we release</title>
        <p>WikiNews-25, a 450-segment test set based on 2025
WikiNews sentences. Machine translations generated
by Google Translate were post-edited using English as
the source language, retaining only those sentences that
required substantive corrections.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>This section describes the tokenizer, the model architecture, and the training strategy adopted to develop our proposed models.</title>
      </sec>
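      <p>A minimal sketch of a DIETA-like configuration in the x-Transformers library (the framework used for our implementation, see the training schedule below) is reported here; the flag names follow the library documentation, and the context length and any unstated defaults are assumptions.</p>
      <preformat>
# DIETA-like decoder sketch with x-transformers (assumed flags and context length).
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens=51_200,           # Minerva SentencePiece vocabulary size
    max_seq_len=512,             # assumed maximum context length
    attn_layers=Decoder(
        dim=2048,                # hidden dimension
        depth=6,                 # six identical layers
        heads=32,                # attention heads
        pre_norm=False,          # post-norm configuration
        ff_mult=4,               # 4x feed-forward expansion
        ff_relu_squared=True,    # squared-ReLU activation
        rotary_pos_emb=True,     # rotary position embeddings
        residual_attn=True,      # residual attention accumulation
        attn_qk_norm=True,       # query-key normalization
    ),
)
      </preformat>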
      <sec id="sec-4-2">
        <title>Training Schedule. Our models are implemented us</title>
        <p>ing the x-Transformers framework.4 Training is
performed for a single epoch over the dataset described in
Section 3, utilizing the Lion optimizer [47] with a learning
rate of 2 × 10− 4 and a linear decay schedule preceded by
a warm-up phase covering the first 10% of training steps.
We release five variants of our trained model checkpoints:
• DIETA: trained from scratch on the high-quality
filtered parallel corpus (415.7M sentence pairs).
• DIETA+BT: trained on the parallel corpus plus</p>
        <p>NewsCrawl back-translations (total 559,924,569 pairs).
• DIETA+cont: continues DIETA for a second epoch on
the same 559,924,569-pair mixture.
• DIETA+nosynth: continues DIETA for a second epoch
on the original parallel data only.
• DIETA+allsynth: continues DIETA+cont for a third epoch
on the full corpus (parallel + NewsCrawl + FineWeb),
totalling 768,440,887 pairs.</p>
      </sec>
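      <p>The optimization recipe can be sketched as follows, assuming the lion-pytorch package and an illustrative step budget; the actual training loop, batch sizes, and step counts are not reproduced here.</p>
      <preformat>
# Lion optimizer with 10% linear warm-up and linear decay to zero (illustrative numbers).
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(8, 8)      # stand-in for the DIETA model sketched above
total_steps = 100_000              # illustrative; one epoch over the training corpus
warmup_steps = total_steps // 10   # first 10% of steps

optimizer = Lion(model.parameters(), lr=2e-4)

def lr_lambda(step: int) -> float:
    warm = step / max(1, warmup_steps)
    decay = (total_steps - step) / max(1, total_steps - warmup_steps)
    return max(0.0, min(warm, decay))   # linear warm-up, then linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
      </preformat>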
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>We evaluate a broad range of translation systems, providing for each the parameter count, model architecture, and main language coverage:</title>
        <p>• EuroLLM-1.7B (utter-project/EuroLLM-1.7B-Instruct;
1.7 B, LLaMA-style dense Transformer) — trained on
∼ 4 T multilingual tokens and instruction-tuned on
EuroBlocks; covers 35 EU + major languages;
• EuroLLM-9B (utter-project/EuroLLM-9B-Instruct; 9.15</p>
        <p>B) —same recipe as above at larger scale;</p>
      </sec>
      <sec id="sec-5-2">
        <title>4https://github.com/lucidrains/x-transformers</title>
      </sec>
      <sec id="sec-5-3">
        <title>2https://github.com/OpenNMT/CTranslate2</title>
        <p>3sapienzanlp/Minerva-7B-instruct-v1.0
Tokenizer. We use the 51,200-entry SentencePiece
vocabulary from the Minerva family of models [27].3 Un- • LLaMAntino-8B
(swap-uniba/LLaMAntino-3-ANITAlike general-purpose multilingual tokenizers, Minerva’s 8B-Inst-DPO-ITA; 8 B, Meta-Llama-3 backbone) — EN
vocabulary was specifically trained on a balanced cor- ↔ IT instruction + DPO tuned;
pus of high-quality Italian and English texts, resulting
in optimized sub-word segments aligned closely to the • Maestrale v0.4 (mii-llm/maestrale-chat-v0.4-beta; 7.2
morphological and orthographic structures of both lan- B, Mistral-7B continued-pretrain + SFT + DPO on 1.7
guages. This choice ensures that our models efectively M Italian instructions);
capture nuances specific to the Italian–English language
pair.
• mBART-50
(facebook/mbart-large-50-many-to-manymmt; 0.61 B seq-to-seq Transformer) —50-language
many-to-many MT;
• opus-mt (small) EN→IT / IT→EN
(Helsinki</p>
        <p>NLP/opus-mt-*; ∼ 270 M, Marian-Transformer);
• Minerva-7B (sapienzanlp/Minerva-7B-instruct-v1.0; 7 on mT5 and attains state-of-the-art correlation at
WMTB, Mistral-like) —pre-trained on 2.5 T tokens (50 % IT, 24 (we make use of google/metricx-24-hybrid-xl-v2p6),
50 % EN) + safety tuning; and COMET trains an XLM-R encoder on millions of
human-scored triplets (we use
Unbabel/wmt22-comet• PhiMaestra-3 (LeonardPuettmann/PhiMaestra-3- da as the comet model for evaluation). The third group
Translation; 3.8 B, Phi-3 mini) —fine-tuned on 0.5 M dispenses with references: QE MetricX (a “-QE” flavour
Tatoeba EN↔IT pairs; of MetricX-24) and COMETKiwi infer absolute
transla• Cerbero-7B (galatolo/cerbero-7b; 7 B, Mistral-7B tion quality directly from the source–hypothesis pair,
base) —Italian-centric LLM trained on synthetic Cer- enabling evaluation in real-time or on data lacking gold
bero corpus; references, we make use of
Unbabel/wmt23-cometkiwida-xl. Using all three families lets us cross-check surface
• NLLB-200 (600 M / 1.3 B / 3.3 B) (facebook/nllb-200-*) accuracy, semantic adequacy and reference-free quality
Transformer family covering 200 languages; estimation within a single experimental framework. Due
to resource constraints we report only automatic
evaluation; we leave human assessment to future work.
• opus-mt-big EN→IT / IT→EN (Helsinki-NLP/opus- Datasets. We evaluate selected baselines and our
modmt-tc-big-*; ∼ 560 M Transformer model with back- els on four widely used test collections: NTREX-128
translation); [48], Tatoeba [41], WMT-24pp [49], and FLORES-200 [15].</p>
        <p>NTREX-128, which is based on WMT-19 [50], includes
• ModelloItalia-9B (sapienzanlp/modello-italia-9b; 9 B, 1,997 sentences translated from English into 128 target</p>
        <p>GPT-NeoX) —Italian LLM by iGenius/CINECA; languages, including Italian. Tatoeba is a
community• Llama-3.1-8B-ITA (DeepMount00/Llama-3.1-8b-ITA; sourced corpus that focuses on everyday conversational
8 B, Meta-Llama-3.1 fine-tuned for Italian); language and informal registers, allowing us to assess our
models’ robustness beyond formal contexts. WMT-24pp
• Tower-7B (Unbabel/TowerInstruct-7B-v0.2; 6.7 B, is a professionally translated extension of the WMT24
LLaMA-2 base) —10-language MT and post-editing dataset [38] on new languages, such as Italian.
FLOREStasks; 200 is composed of professionally translated
Wikipedia• Gemma-2B / 9B (ModelSpace/GemmaX2-28-{2B,9B}; based sentences per language, covering encyclopedic
con3.2 B / 10.2 B, Gemma-2 continued-pretrain + MT SFT tent distinct from the news domain.
for 28 languages); Additionally, to specifically evaluate translation
quality on recent texts, we introduce and use our new
benchmark, WikiNews-25, as described earlier in Section 3.
• MADLAD-3B / 7B (google/madlad400-{3b,7b}-mt; 3 B
/ 7.2 B, T5) —400+-language MT trained on up to 1 T
tokens.</p>
        <p>Automatic metrics. To assess the MT systems, we
grouped the evaluation metrics into three categories:
• Surface – overlap: sacrebleu (BLEU–4) and chrF ;
• Neural, reference–based: BLEURT, Google’s
MetricX</p>
        <p>24, and Unbabel’s COMET ;
• Neural, reference–free (QE): the QE MetricX variant
and COMETKiwi.</p>
      </sec>
      <sec id="sec-5-4">
        <title>The first group measures literal agreement with the</title>
        <p>reference: sacrebleu implements the standard BLEU
computation with canonical tokenisation for reproducible
scores, while chrF computes a character -gram
Fscore that is more robust to morphological variation.
The second group regresses directly towards human
Direct-Assessment/MQM ratings: BLEURT fine-tunes
BERT/RemBERT to predict adequacy and fluency, in
particular, we relied on BLEURT-20 model, MetricX-24 builds</p>
      </sec>
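      <p>For the surface metrics, a minimal sketch with the sacrebleu library is shown below; the neural metrics are computed with the released checkpoints listed above and are not reproduced here.</p>
      <preformat>
# Corpus-level BLEU and chrF with sacrebleu (illustrative inputs).
import sacrebleu

hypotheses = ["Il gatto dorme sul divano."]
references = [["Il gatto sta dormendo sul divano."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}  chrF = {chrf.score:.1f}")
      </preformat>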
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>Decoding policy. Unless otherwise indicated, system outputs were generated with greedy decoding. Whenever a model name ends with the sufix “ -b5” we used beam search with beam size 5.</title>
        <p>In what follows we comment on the outcomes obtained
by our DIETA model against the 15+ baselines introduced
in Section 5. We discuss one benchmark at a time, always
reporting the same seven automatic metrics and both
translation directions (EN→IT/IT→EN). With the
exception of metricx and qemetricx, higher is better.
6.1. NTREX-128
Table 1 reports NTREX-128 results. Overall performance
scales with size: Gemma-9B-b5 leads on every metric
(≈ 51/49 BLEU, 72/70 chrF, BLEURT 0.36/0.48, MetricX
1.60/2.43, COMET 0.90/0.89). Our compact DIETA+cont
reaches 36/43 BLEU, 62/66 chrF, BLEURT 0.20/0.41 and
NTREX-128 Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.
lands just behind the largest models and surpasses
evB and OPUS-MT-big. The remaining gap appears chiefly
ery competitor below 3B, confirming that targeted
backfor the largest decoders.
in reference-free QE, where MetricX is ≈
0.3–0.5 higher
translation closes most of the size-related gap, remaining
while holding MetricX and COMET-Kiwi values on par
6.2. Tatoeba
ing Tower-7B and surpassing all models ≤</p>
      </sec>
      <sec id="sec-6-2">
        <title>Reference-based metrics echo this: DIETA+cont sits within</title>
        <p>0.01–0.02 COMET of Gemma-2B-b5, while BLEURT is
only 0.01–0.02 behind Madlad-7B. MetricX and
COMET</p>
      </sec>
      <sec id="sec-6-3">
        <title>Kiwi remain scale-sensitive, DIETA trails the 9 B tier by</title>
      </sec>
      <sec id="sec-6-4">
        <title>3 B parameters.</title>
        <p>∼ 0.9 MetricX points.
parameter</p>
      </sec>
      <sec id="sec-6-5">
        <title>DIETA model delivers mid-table performance—competitive with 7 B systems and clearly ahead</title>
        <p>Tatoeba Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.
sacrebleu(↑)
chrf(↑)
bleurt(↑)
metricx(↓)
comet(↑)
qemetricx(↓)
cometkiwi(↑)
0.875/0.875). The largest gap remains in reference-free
quality estimation: MetricX for DIETA is ≈ 0.6 points
higher than the 9 B leader.
7B, and our DIETA+cont/ DIETA+allsynth, sits within ≈ 4
BLEU and 0.02 COMET. In particular, DIETA+all synth
scores 45.7 BLEU / 67.6 chrF (EN→IT) and 43.8 BLEU /
67.3 chrF (IT→EN), essentially matching Tower-7B and</p>
      </sec>
      <sec id="sec-6-6">
        <title>NLLB-3.3B despite being 14× smaller. Reference-based</title>
        <p>metrics mirror this parity (COMET 0.826/0.868), while</p>
      </sec>
      <sec id="sec-6-7">
        <title>MetricX and COMET-Kiwi still favour the largest decoders by roughly 0.3–0.4 points.</title>
        <p>6.6. Cross-benchmark Analysis</p>
        <sec id="sec-6-7-1">
          <title>Parameter eficiency.</title>
        </sec>
      </sec>
      <sec id="sec-6-8">
        <title>All five checkpoints share the</title>
        <p>same 0.5B backbone, yet DIETA+cont and DIETA+allsynth
typically rank in the second quartile of every leaderboard,
on par with 1–3B models and sometimes matching 7B
WMT24pp Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.
systems, while using ≤ 6% of the parameters of the
stateof-the-art 9B baselines. Synthetic data provide clear gains:
relative to the parallel-only DIETA, DIETA+BT adds +1− 3</p>
      </sec>
      <sec id="sec-6-9">
        <title>BLEU on four suites, and the continued-training variants</title>
        <p>add a further +0.5− 2 BLEU at no increase in model size.</p>
        <sec id="sec-6-9-1">
          <title>Directionality.</title>
          <p>For four of the five test sets
(NTREX128, Tatoeba, WMT24pp, FLORES-200) the IT→EN
direction stays 2− 12 BLEU easier, reflecting richer target-side
data during training. WikiNews-25 is the only outlier:
here, EN→IT is slightly easier, reversing the usual trend.</p>
        </sec>
      </sec>
      <sec id="sec-6-10">
        <title>In all cases the gap between directions narrows as more</title>
        <p>back-translated Italian is introduced, indicating that the
synthetic signal helps balance morphological complexity.</p>
        <sec id="sec-6-10-1">
          <title>Summary.</title>
          <p>A single 0.5 B decoder can deliver robust
performance across news, conversational, encyclopaedic
and recency-sensitive domains when fed with 768 M
carefully curated sentence pairs. Continued training on
mixed parallel + BT data (DIETA+cont) is the best
allround recipe; an additional pass that folds in FineWeb</p>
        </sec>
      </sec>
      <sec id="sec-6-11">
        <title>BT (DIETA+allsynth) further strengthens out-of-domain</title>
        <p>generalisation (FLORES, WikiNews). Remaining
headroom lies almost entirely in reference-free QE metrics,
suggesting future work on QE-aware objectives rather
than larger models.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Works</title>
      <sec id="sec-7-1">
        <title>We presented a family of five</title>
      </sec>
      <sec id="sec-7-2">
        <title>DIETA variants, built on</title>
        <p>the same 0.5 B-parameter decoder-only Transformer and
trained on up to 768 M carefully curated parallel +
backtranslated sentence pairs.</p>
        <p>Across five diverse
benchmarks, the best variants, DIETA+cont and DIETA+allsynth,
consistently places in the second performance tier,
matching or surpassing models 2–3 × larger and trailing the
current 9 B state-of-the-art by only a few BLEU/COMET
points. This shows that data scale and task-specific
training can compensate for an order-of-magnitude
reduction in parameters, yielding models that fit on a single
consumer GPU while remaining competitive with much
larger LLMs. We also released WikiNews-25, a
humanpost-edited English–Italian test set built from 2025 news,
adding recent news to evaluation. As future work, we
plan to (i) reduce the reference-free QE gap through
QEaware fine-tuning, (ii) extend DIETA with
parametereficient scaling such as sparse MoE, and (iii) enable edge
deployment via distillation and 8/4-bit quantisation.
Flores Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support. This work was partially supported by the European Union – Next Generation EU within the project NRPP M4C2, Investment 1.3, DD. 341, 15 March 2022 – FAIR – Future Artificial Intelligence Research – Spoke 4 - PE00000013 - D53C22002380006, and by the MUR under the grant “Dipartimenti di Eccellenza 2023-2027” of the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy. This work was completed in part at the CINECA Open Hackathon, part of the Open Hackathons program. The authors would like to acknowledge OpenACC-Standard.org for their support. We would also like to thank Daniele Di Bari for his helpful feedback and support.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <sec id="sec-9-1">
        <title>During the preparation of this work, the authors used</title>
      </sec>
      <sec id="sec-9-2">
        <title>GPT3.5 and GPT-4 in order to: Grammar and spelling</title>
        <p>check, Paraphrase and reword. After using these
tools/services, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s
content.
org/2024.lrec-main.32/.
[2]</p>
      </sec>
      <sec id="sec-9-3">
        <title>M. Braga, P. Kasela, A. Raganato, G. Pasi,</title>
        <p>
          Synthetic data generation with large language
models for personalized com
          <xref ref-type="bibr" rid="ref1">munity question
answering, in: 2024</xref>
          IEEE/WIC International Conference
on
        </p>
      </sec>
      <sec id="sec-9-4">
        <title>Web Intelligence and Intelligent Agent Technology (WI-IAT), 2024, pp. 360–366. doi:10.1109/</title>
        <p>WI-IAT62293.2024.00057.
Se-pqa:
Web
Con3589335.3651445.
[4]</p>
        <p>G. Peikos, P. Kasela, G. Pasi, Leveraging large
language models for medical information extraction
and query generation, in: 2024 IEEE/WIC
International Conference on Web Intelligence and
Intelligent Agent Technology (WI-IAT), 2024, pp. 367–372.
doi:10.1109/WI-IAT62293.2024.00058.
[5]</p>
      </sec>
      <sec id="sec-9-5">
        <title>A. Raganato, F. Bartoli, C. Crocamo, D. Cavaleri,</title>
      </sec>
      <sec id="sec-9-6">
        <title>G. Carrà, G. Pasi, M. Viviani, Leveraging prompt</title>
        <p>engineering and large language models for
automating madrs score computation for depression
severity assessment,</p>
        <p>in: Ital-IA 2024: 4th National</p>
      </sec>
      <sec id="sec-9-7">
        <title>Conference on Artificial Intelligence, organized by</title>
      </sec>
      <sec id="sec-9-8">
        <title>CINI., Naples, Italy, 2024. [6]</title>
      </sec>
      <sec id="sec-9-9">
        <title>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,</title>
      </sec>
      <sec id="sec-9-10">
        <title>L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At</title>
        <p>tention is all you need, Advances in neural
information processing systems 30 (2017).</p>
        <p>Comput. Surv. 56 (2023).
[8]</p>
      </sec>
      <sec id="sec-9-11">
        <title>A. Hendy, M. Abdelrehim, A. Sharaf, V. Raunak,</title>
      </sec>
      <sec id="sec-9-12">
        <title>M. Gabr, H. Matsushita, Y. J. Kim, M. Afify, H. H.</title>
        <p>Awadalla, How good are gpt models at machine
translation? a comprehensive evaluation, arXiv
preprint arXiv:2302.09210 (2023).
[9] P. Koehn, Europarl: A parallel corpus for statistical
machine translation, in: Proceedings of machine
translation summit x: papers, 2005, pp. 79–86.
[10] J. Tiedemann, Parallel data, tools and interfaces in
opus, in: N. C. C. Chair), K. Choukri, T. Declerck,</p>
      </sec>
      <sec id="sec-9-13">
        <title>M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk,</title>
      </sec>
      <sec id="sec-9-14">
        <title>S. Piperidis (Eds.), Proceedings of the Eight Interna</title>
        <p>tional Conference on Language Resources and
Evaluation (LREC’12), European Language Resources</p>
      </sec>
      <sec id="sec-9-15">
        <title>Association (ELRA), Istanbul, Turkey, 2012. [11]</title>
      </sec>
      <sec id="sec-9-16">
        <title>M. Esplà-Gomis, M. L. Forcada, G. Ramírez-Sánchez,</title>
      </sec>
      <sec id="sec-9-17">
        <title>H. Hoang, Paracrawl: Web-scale parallel corpora</title>
        <p>for the languages of the eu, in: Proceedings of
Machine Translation Summit XVII: Translator, Project
and User Tracks, 2019, pp. 118–119.
large pre-trained language models: A survey, ACM
titles2018: Statistical rescoring of sentence
alignments in large, noisy parallel corpora, in: The</p>
      </sec>
      <sec id="sec-9-18">
        <title>Eleventh International Conference on Language</title>
        <p>Resources and Evaluation (LREC 2018), 2018. [23] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia,
[13] P. Lison, J. Tiedemann, OpenSubtitles2016: Ex- C. A. Choquette-Choo, K. Lee, D. Xin, A. Kusupati,
tracting large parallel corpora from movie and TV R. Stella, A. Bapna, O. Firat, Madlad-400: A
multisubtitles, in: Proceedings of the Tenth International lingual and document-level large audited dataset,
Conference on Language Resources and Evaluation 2023. arXiv:2309.04662.
(LREC’16), European Language Resources Associa- [24] D. M. Alves, J. Pombal, N. M. Guerreiro, P. H.
Martion (ELRA), Portorož, Slovenia, 2016, pp. 923–929. tins, J. Alves, A. Farajian, B. Peters, R. Rei, P.
Fernan[14] H. Schwenk, V. Chaudhary, S. Sun, H. Gong, des, S. Agrawal, et al., Tower: An open multilingual
F. Guzmán, Wikimatrix: Mining 135m parallel sen- large language model for translation-related tasks,
tences in 1620 language pairs from wikipedia, in: arXiv preprint arXiv:2402.17733 (2024).
The 16th Conference of the European Chapter of the [25] G. Team, M. Riviere, S. Pathak, P. G. Sessa,
Association for Computational Linguistics, 2021, pp. C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard,
1351–1361. B. Shahriari, A. Ramé, et al., Gemma 2: Improving
[15] M. R. Costa-Jussà, J. Cross, O. Çelebi, M. Elbayad, open language models at a practical size, arXiv
K. Heafield, K. Hefernan, E. Kalbassi, J. Lam, preprint arXiv:2408.00118 (2024).</p>
        <p>D. Licht, J. Maillard, et al., No language left be- [26] M. Cui, P. Gao, W. Liu, J. Luan, B. Wang,
Multihind: Scaling human-centered machine translation, lingual machine translation with open large
lanarXiv preprint arXiv:2207.04672 (2022). guage models at practical scale: An empirical
[16] J. Tiedemann, M. Aulamo, D. Bakshandaeva, study, 2025. URL: https://arxiv.org/abs/2502.02481.</p>
        <p>
          M. Boggia, S.-A. Grönroos, T. Nieminen, A. Ra- arXiv:2502.02481.
ganato, Y. Scherrer, R. Vázquez, S. Virpioja, De- [27] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S.
Comocratizing neural machine translation with opus- nia, E. Barba, S. Orlandini, G. Fiameni, R.
Navmt, Language Resources and Evaluation 58 (2024) igli, Minerva LLMs: The first family of large
713–755. language models trained from scratch on Italian
[17] J. Tiedemann, S. Thottingal, OPUS-MT – building data, in: F. Dell’Orletta, A. Lenci, S. Montemagni,
open translation services for the world, in: Pro- R. Sprugnoli (Eds.), Proceedings of the 10th Italian
ceedings of the 22nd Annual Conference of the Eu- Conference on Computational Linguistics
(CLiCropean Association for
          <xref ref-type="bibr" rid="ref1">Machine Translation, Euro- it 2024</xref>
          ), CEUR Workshop Proceedings, Pisa, Italy,
pean Association for
          <xref ref-type="bibr" rid="ref1">Machine Translation, Lisboa, 2024</xref>
          , pp. 707–719. URL: https://aclanthology.org/
Portugal, 2020. 2024.clicit-1.77/.
[18] R. Orlando, L. Moroni, P.-L. H. Cabot, S. Conia, [28] F. A. Galatolo, M. G. Cimino, Cerbero-7b: A leap
forE. Barba, S. Orlandini, G. Fiameni, R. Navigli, Min- ward in language-specific llms through enhanced
erva llms: The first family of large language models chat corpus generation and evaluation, arXiv
trained from scratch on italian data, in: Proceedings preprint arXiv:2311.15698 (2023).
of the 10th Italian Conference on Computational [29] R. Navigli, S. Conia, B. Ross, Biases in large
Linguistics (CLiC-it 2024), 2024, pp. 707–719. language models: Origins, inventory, and
discus[19] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, sion, J. Data and Information Quality 15 (2023).
        </p>
        <p>
          G. Fiameni, G. Semeraro, Llamantino: Llama 2 mod- doi:10.1145/3597307.
els for efective text generation in italian language, [30] A. Bacciu, C. Campagnano, G. Trappolini, F.
SilarXiv preprint arXiv:2312.09993 (2023). vestri, DanteLLM: Let’s push Italian LLM research
[20] J. Tiedemann, OPUS – parallel corpora for ev- forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste,
eryone, in: Proceedings of the 19th Annual Con- A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of
ference of the European Association for
          <xref ref-type="bibr" rid="ref1">Machine the 2024</xref>
          Joint International Conference on
ComTranslation: Projects/Products, Baltic Journal of putational Linguistics, Language Resources and
Modern Computing, Riga, Latvia, 2016. URL: https: Evaluation (LREC-COLING 2024), ELRA and ICCL,
//aclanthology.org/2016.eamt-2.8/. Torino, Italia, 2024, pp. 4343–4355. URL: https:
[21] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, //aclanthology.org/2024.lrec-main.388/.
        </p>
        <p>S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaud- [31] H. Schwenk, G. Wenzek, S. Edunov, E. Grave,
hary, et al., Beyond english-centric multilingual A. Joulin, A. Fan, CCMatrix: Mining billions of
highmachine translation, Journal of Machine Learning quality parallel sentences on the web, in: C. Zong,
Research 22 (2021) 1–48. F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the
[22] Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, 59th Annual Meeting of the Association for
ComV. Chaudhary, J. Gu, A. Fan, Multilingual trans- putational Linguistics and the 11th International
lation with extensible multilingual pretraining and Joint Conference on Natural Language Processing
ifnetuning, arXiv preprint arXiv:2008.00401 (2020). (Volume 1: Long Papers), Association for
Computational Linguistics, Online, 2021. MT, in: The Fifth Conference on Machine
Trans[32] K. Wołk, K. Marasek, Building subject-aligned com- lation, Association for Computational Linguistics,
parable corpora and mining it for truly parallel sen- Online, 2020.</p>
        <p>
          tence pairs, Procedia Technology 18 (2014) 126–132. [42] NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi,
[33] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, M. Elbayad, K. Heafield, K. Hefernan, E. Kalbassi,
T. Erjavec, D. Tufis, D. Varga, The jrc-acquis: A J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G.
Wenmultilingual aligned parallel corpus with 20+ lan- zek, A. Youngblood, B. Akula, L. Barrault, G.
Mejiaguages, in: LREC, 2006. Gonzalez, P. Hansanti, J. Hofman, S. Jarrett, K. R.
[34] M. Aulamo, U. Sulubacak, S. Virpioja, J. Tiedemann, Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews,
Opustools and parallel corpus diagnostics, in: Pro- N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao,
ceedings of the Twelfth Language Resources and V. Goswami, F. Guzmán, P. Koehn, A. Mourachko,
Evaluation Conference, 2020, pp. 3782–3789. C. Ropers, S. Saleem, H. Schwenk, J. Wang, No
[35] M. Aulamo, S. Virpioja, J. Tiedemann, OpusFil- language left behind: Scaling human-centered
mater: A configurable parallel corpus filtering tool- chine translation (2022).
box, in: Proceedings of the 58th Annual Meeting [43] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu,
Roof the Association for Computational Linguistics: former: Enhanced transformer with rotary position
System Demonstrations, Association for Computa- embedding, Neurocomputing 568 (2024) 127063.
tional Linguistics, 2020. [44] R. He, A. Ravula, B. Kanagal, J. Ainslie, Realformer:
[36] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, Transformer likes residual attention, in: Findings
S. Gunasekar, M. Harrison, R. J. Hewett, M. Java- of the Association for Computational Linguistics:
heripi, P. Kaufmann, et al., Phi-4 technical report, ACL-IJCNLP 2021, 2021, pp. 929–943.
arXiv preprint arXiv:2412.08905 (2024). [45] A. Henry, P. R. Dachapally, S. S. Pawar, Y. Chen,
[37] R. Sennrich, B. Haddow, A. Birch, Improving neural Query-key normalization for transformers, in:
Findmachine translation models with monolingual data, ings of the Association for Computational
Linguisin: Proceedings of the 54th Annual Meeting of the tics: EMNLP 2020, 2020, pp. 4246–4253.
Association for Computational Linguistics (Volume [46] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski,
1: Long Papers), 2016, pp. 86–96. J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos,
[38] T. Kocmi, E. Avramidis, R. Bawden, O. Bojar, I. Alabdulmohsin, et al., Scaling vision transformers
A. Dvorkovich, C. Federmann, M. Fishel, M. Freitag, to 22 billion parameters, in: International
conferT. Gowda, R. Grundkiewicz, B. Haddow, M. Karpin- ence on machine learning, PMLR, 2023, pp. 7480–
ska, P. Koehn, B. Marie, C. Monz, K. Murray, M. Na- 7512.
gata, M. Popel, M. Popović, M. Shmatova, S. Ste- [47] X. Chen, C. Liang, D. Huang, E. Real, K. Wang,
ingrímsson, V. Zouhar, Findings of the WMT24 H. Pham, X. Dong, T. Luong, C.-J. Hsieh, Y. Lu, et al.,
general machine translation shared task: The LLM Symbolic discovery of optimization algorithms,
Adera is here but MT is not solved yet, in: B. Haddow, vances in neural information processing systems
T. Kocmi, P. Koehn, C. Monz (Eds.), Proceedings 36 (2023) 49205–49233.
of the Ninth Conference on Machine Translation, [48] C. Federmann, T. Kocmi, Y. Xin, NTREX-128 – news
Association for Computational Linguistics, Miami, test references for MT evaluation of 128 languages,
USA, 2024. in: The First Workshop on Scaling Up Multilingual
[39] G. Penedo, H. Kydlíček, L. B. allal, A. Lozhkov, Evaluation, Association for Computational
LinguisM. Mitchell, C. Rafel, L. V. Werra, T. Wolf, The tics, Online, 2022.
ifneweb datasets: Decanting the web for the finest [49] D. Deutsch, E. Briakou, I. Caswell, M. Finkelstein,
text data at scale, in: The Thirty-eight Confer- R. Galor, J. Juraska, G. Kovacs, A. Lui, R. Rei, J. Riesa,
ence on Neural Information Processing Systems et al., Wmt24++: Expanding the language coverage
Datasets and Bench
          <xref ref-type="bibr" rid="ref1">marks Track, 2024</xref>
          . URL: https: of wmt24 to 55 languages &amp; dialects, arXiv preprint
//openreview.net/forum?id=n6SCkn2QaG. arXiv:2502.12404 (2025).
[40] G. Penedo, H. Kydlíček, V. Sabolčec, B. Messmer, [50] L. Barrault, O. Bojar, M. R. Costa-jussà, C.
FederN. Foroutan, A. H. Kargaran, C. Rafel, M. Jaggi, L. V. mann, M. Fishel, Y. Graham, B. Haddow, M. Huck,
Werra, T. Wolf, Fineweb2: One pipeline to scale P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal,
them all – adapting pre-training data processing M. Post, M. Zampieri, Findings of the 2019
conto every language, 2025. URL: https://arxiv.org/abs/ ference on machine translation (WMT19), in: The
2506.20920. arXiv:2506.20920. Fourth Conference on Machine Translation,
Associ[41] J. Tiedemann, The tatoeba translation challenge – ation for Computational Linguistics, Florence, Italy,
realistic data sets for low resource and multilingual 2019.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Braga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          , G. Pasi,
          <string-name>
            <surname>AdaKron:</surname>
          </string-name>
          <article-title>An adapter-based parameter eficient model tuning with kronecker product</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>350</fpage>
          -
          <lpage>357</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kasela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braga</surname>
          </string-name>
          , G. Pasi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <article-title>Personalized community question answering</article-title>
          , in: [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          , T. H.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , M. Kouylekov, OpensubNguyen,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Heintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          , Re-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>