<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pranav Kasela</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Braga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Ghiotto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Pilzer</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viviani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Raganato</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAUIN Dipartimento di Automatica e Informatica</institution>
          ,
          <addr-line>Politecnico di Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics</institution>
          ,
          <addr-line>Systems and Communication - DISCo</addr-line>
          ,
          <institution>University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NVIDIA AI Technology Center</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università degli Studi di Pavia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present DIETA, a small, decoder-only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian-English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, complemented by 352 million back-translated sentence pairs generated with pretrained models. Additionally, we create and release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian-English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most other sub-3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian-English machine translation: https://github.com/pkasela/DIETA-Machine-Translation.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Translation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Italian-English Translations</kwd>
        <kwd>Parallel Corpus</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The main contributions of this work are: (i) training and releasing a specialized, small decoder-only Transformer model optimized for high-quality Italian–English translation; (ii) creating and publicly releasing a large-scale, carefully curated parallel corpus from diverse sources, and generating a synthetic corpus through back-translation; (iii) introducing the new WikiNews-25 evaluation set to facilitate benchmarking on recent, human-corrected content; (iv) conducting thorough evaluations using multiple MT metrics.</p>
      <sec id="sec-1-1">
        <title>This section outlines the creation of a large Italian–English sentence pair corpus and a synthetic dataset derived from Web News and crawled data.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
    </sec>
    <sec id="sec-3">
      <title>3. Data Collection and Preparation</title>
      <sec id="sec-3-1">
        <title>Publicly available bilingual corpora play a central role in</title>
        <p>the development and evaluation of Machine Translation
(MT) systems. Among these, OPUS [20, 16] is a
wellknown source of multilingual datasets that have been
widely used in both statistical and neural MT research.</p>
        <p>Large-scale web-crawled corpora such as ParaCrawl [11]
and NLLB [21] are particularly noteworthy for their
coverage and scale, making them important resources for
training state-of-the-art multilingual MT models.</p>
        <p>Recent Transformer models such as mBART-50 [22], NLLB-200 [21], MADLAD-400 [23], Tower [24], and Gemma-2 [25] have shown that expanding language coverage and model capacity can significantly enhance many-to-many translation quality. However, the computational demands of these massive models, and the inherent competition for representational capacity across hundreds of languages, often leave room for improvement on specific language pairs such as English–Italian.</p>
        <p>For many language directions, the open OPUS-MT
family [17, 16] remains a widely used baseline, yet its more
compact architectures lag behind the newest LLM-based
systems in fluency and versatility.</p>
        <p>General-purpose models like the GPT and LLaMA series, when prompted or instruction-tuned, achieve impressive zero-shot MT results. Specialised variants, like GemmaX2-28 [26], further narrow the gap with commercial MT engines. Meanwhile, to strengthen the representation of Italian within multilingual LLMs, several initiatives have introduced Italian-focused systems. Models such as LLaMAntino [19], Minerva [27], Cerbero [28], ModelloItalia [29], and DanteLLM [30] leverage hundreds of billions of Italian tokens and human feedback to yield substantial improvements in Italian generation and understanding. Nonetheless, these models are designed as general-purpose language models and are not optimised specifically for the MT task.</p>
        <p>In this work, we introduce a compact, 0.5B-parameter
decoder-only model, trained from scratch on a total of
768 million parallel and synthetic sentence pairs,
delivering a purpose-built, open solution for English↔Italian
machine translation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>After cleaning, the corpus contains 207 864 437</title>
        <p>high-quality sentence pairs. For bidirectional training,
each pair is duplicated with explicit direction tags,
resulting in a total of 415 728 874 source–target examples,
as illustrated in Figure 2.
To build a decoder-only model for bidirectional English ↔
Italian translation, we make use of every public bitext for
the pair available in OPUS [20]. Sources span Web crawls
[31, 21, 11], Wikipedia [10, 32, 14], parliamentary/legal
proceedings [9, 33], and film/TV subtitles [ 12]. Because
the NLLB corpus [21] contains CCMatrix, we keep only
the NLLB portion to prevent duplication.</p>
        <sec id="sec-3-2-1">
          <title>Cleaning and quality control. We remove exact du</title>
          <p>plicates using OpusTools and OpusFilter [34, 35], then
pass each remaining sentence pair to the Phi-4 LLM [36]
with the binary prompt shown in Figure 1. Pairs that
receive no are discarded.
3.2. Synthetic Data via Back-Translation</p>
        </sec>
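        <p>For illustration, a minimal sketch of this filtering step is given below. It assumes the publicly released microsoft/phi-4 checkpoint loaded through a Hugging Face text-generation pipeline; the exact prompt template, batching, and inference code used for DIETA may differ.</p>
        <preformat>
# Minimal sketch of the binary quality filter (Figure 1); not the authors' exact script.
# Assumption: the Hugging Face checkpoint "microsoft/phi-4" and a chat-style pipeline.
from transformers import pipeline

filter_llm = pipeline("text-generation", model="microsoft/phi-4", device_map="auto")

PROMPT = (
    "Given the English and Italian sentences below, are they translations "
    "of each other? Answer with yes or no only.\n\nENG: {en}\nIT: {it}"
)

def keep_pair(en: str, it: str) -> bool:
    """Return True when Phi-4 answers 'yes' for the candidate pair."""
    out = filter_llm(
        [{"role": "user", "content": PROMPT.format(en=en, it=it)}],
        max_new_tokens=3,
        do_sample=False,
    )
    answer = out[0]["generated_text"][-1]["content"].strip().lower()
    return answer.startswith("yes")

pairs = [("The cat sleeps.", "Il gatto dorme."), ("Good morning.", "Arrivederci.")]
kept = [(en, it) for en, it in pairs if keep_pair(en, it)]  # the second pair should be rejected
        </preformat>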
      </sec>
      <sec id="sec-3-3">
        <title>To expand the parallel training corpus, we generated</title>
        <p>additional sentence pairs by back-translation [37]. As
monolingual sources we used the NewsCrawl1 corpora
[38] and the web-scale FineWeb collection [39, 40].</p>
        <p>NewsCrawl. We translated Italian articles from 2008–2018 and English articles from 2023 with the OPUS-MT-TC-BIG model [17, 41, 16]. The remaining segments (Italian 2019–2024 and English 2024) were translated with NLLB-200-3.3B [42]. In total, this yielded 144,189,087 synthetic sentence pairs, comprising 67.8 M Italian and 76.3 M English sentences.</p>
        <p>FineWeb. From the multilingual FineWeb2 we translated 108.5 M Italian sentences, and from the English FineWeb crawl we translated 100 M English sentences, resulting in a total of 208,516,318 sentences, using the multilingual GemmaX2-28-9B-v0.1 model [26].</p>
        <p>All translations were generated with the CTranslate2 toolkit (https://github.com/OpenNMT/CTranslate2) in greedy decoding mode for efficient inference with large Transformer models.</p>
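        <p>A minimal sketch of this decoding setup is shown below. It assumes an OPUS-MT model already converted with the ct2-transformers-converter tool and its SentencePiece model on disk; the paths and model choice are illustrative rather than the exact ones used to build our synthetic corpus.</p>
        <preformat>
# Greedy back-translation with CTranslate2 (illustrative paths, assumed conversion step).
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="opus-mt-it-en-ct2/source.spm")
translator = ctranslate2.Translator("opus-mt-it-en-ct2", device="cuda")

def back_translate(sentences):
    # Tokenize with SentencePiece, translate greedily (beam_size=1), detokenize.
    batch = [sp.encode(s, out_type=str) for s in sentences]
    results = translator.translate_batch(batch, beam_size=1)
    return [sp.decode(r.hypotheses[0]) for r in results]

print(back_translate(["Il gatto dorme sul divano."]))
        </preformat>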
      </sec>
      <sec id="sec-3-4">
        <title>Duplicating the OPUS parallel pairs to cover both trans</title>
        <p>lation directions (i.e., from English to Italian and vice
versa) yields 415,728,874 direction-specific examples.
When combined with the 144,195,695 NewsCrawl and
208,516,318 FineWeb synthetic pairs, the total training
set comprises 768,440,887 source–target examples. We
shufle the corpus once before mini-batch construction.
3.4. Evaluation Sets</p>
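        <p>As a concrete illustration, the duplication into direction-specific examples can be expressed as follows; the exact tag strings are assumed from the sample formatting shown in Figure 2.</p>
        <preformat>
# Turn one cleaned pair into two direction-tagged training examples (assumed tag format).
def make_examples(en: str, it: str):
    return [
        f"ENG: {en} IT: {it}",   # English to Italian
        f"IT: {it} ENG: {en}",   # Italian to English
    ]

pairs = [("The cat sleeps.", "Il gatto dorme.")]
examples = [ex for en, it in pairs for ex in make_examples(en, it)]
# 207,864,437 cleaned pairs therefore yield 415,728,874 direction-specific examples.
        </preformat>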
      </sec>
      <sec id="sec-3-5">
        <title>In addition to standard benchmarks, we release</title>
        <p>WikiNews-25, a 450-segment test set based on 2025
WikiNews sentences. Machine translations generated
by Google Translate were post-edited using English as
the source language, retaining only those sentences that
required substantive corrections.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>This section describes the tokenizer, the model architecture, and the training strategy adopted to develop our proposed models.</title>
      </sec>
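      <p>A minimal sketch of a DIETA-like configuration in the x-Transformers library (the framework used for our implementation, see the training schedule below) is reported here; the flag names follow the library documentation, and the context length and any unstated defaults are assumptions.</p>
      <preformat>
# DIETA-like decoder sketch with x-transformers (assumed flags and context length).
from x_transformers import TransformerWrapper, Decoder

model = TransformerWrapper(
    num_tokens=51_200,           # Minerva SentencePiece vocabulary size
    max_seq_len=512,             # assumed maximum context length
    attn_layers=Decoder(
        dim=2048,                # hidden dimension
        depth=6,                 # six identical layers
        heads=32,                # attention heads
        pre_norm=False,          # post-norm configuration
        ff_mult=4,               # 4x feed-forward expansion
        ff_relu_squared=True,    # squared-ReLU activation
        rotary_pos_emb=True,     # rotary position embeddings
        residual_attn=True,      # residual attention accumulation
        attn_qk_norm=True,       # query-key normalization
    ),
)
      </preformat>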
      <sec id="sec-4-2">
        <title>Training Schedule. Our models are implemented us</title>
        <p>ing the x-Transformers framework.4 Training is
performed for a single epoch over the dataset described in
Section 3, utilizing the Lion optimizer [47] with a learning
rate of 2 × 10− 4 and a linear decay schedule preceded by
a warm-up phase covering the first 10% of training steps.
We release five variants of our trained model checkpoints:
• DIETA: trained from scratch on the high-quality
filtered parallel corpus (415.7M sentence pairs).
• DIETA+BT: trained on the parallel corpus plus</p>
        <p>NewsCrawl back-translations (total 559,924,569 pairs).
• DIETA+cont: continues DIETA for a second epoch on
the same 559,924,569-pair mixture.
• DIETA+nosynth: continues DIETA for a second epoch
on the original parallel data only.
• DIETA+allsynth: continues DIETA+cont for a third epoch
on the full corpus (parallel + NewsCrawl + FineWeb),
totalling 768,440,887 pairs.</p>
      </sec>
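      <p>The optimization recipe can be sketched as follows, assuming the lion-pytorch package and an illustrative step budget; the actual training loop, batch sizes, and step counts are not reproduced here.</p>
      <preformat>
# Lion optimizer with 10% linear warm-up and linear decay to zero (illustrative numbers).
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(8, 8)      # stand-in for the DIETA model sketched above
total_steps = 100_000              # illustrative; one epoch over the training corpus
warmup_steps = total_steps // 10   # first 10% of steps

optimizer = Lion(model.parameters(), lr=2e-4)

def lr_lambda(step: int) -> float:
    warm = step / max(1, warmup_steps)
    decay = (total_steps - step) / max(1, total_steps - warmup_steps)
    return max(0.0, min(warm, decay))   # linear warm-up, then linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
      </preformat>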
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>We evaluate a broad range of translation systems, providing for each the parameter count, model architecture, and main language coverage:</title>
        <p>• EuroLLM-1.7B (utter-project/EuroLLM-1.7B-Instruct;
1.7 B, LLaMA-style dense Transformer) — trained on
∼ 4 T multilingual tokens and instruction-tuned on
EuroBlocks; covers 35 EU + major languages;
• EuroLLM-9B (utter-project/EuroLLM-9B-Instruct; 9.15</p>
        <p>B) —same recipe as above at larger scale;</p>
      </sec>
      <sec id="sec-5-2">
        <title>4https://github.com/lucidrains/x-transformers</title>
      </sec>
      <sec id="sec-5-3">
        <title>2https://github.com/OpenNMT/CTranslate2</title>
        <p>3sapienzanlp/Minerva-7B-instruct-v1.0
Tokenizer. We use the 51,200-entry SentencePiece
vocabulary from the Minerva family of models [27].3 Un- • LLaMAntino-8B
(swap-uniba/LLaMAntino-3-ANITAlike general-purpose multilingual tokenizers, Minerva’s 8B-Inst-DPO-ITA; 8 B, Meta-Llama-3 backbone) — EN
vocabulary was specifically trained on a balanced cor- ↔ IT instruction + DPO tuned;
pus of high-quality Italian and English texts, resulting
in optimized sub-word segments aligned closely to the • Maestrale v0.4 (mii-llm/maestrale-chat-v0.4-beta; 7.2
morphological and orthographic structures of both lan- B, Mistral-7B continued-pretrain + SFT + DPO on 1.7
guages. This choice ensures that our models efectively M Italian instructions);
capture nuances specific to the Italian–English language
pair.
• mBART-50
(facebook/mbart-large-50-many-to-manymmt; 0.61 B seq-to-seq Transformer) —50-language
many-to-many MT;
• opus-mt (small) EN→IT / IT→EN
(Helsinki</p>
        <p>NLP/opus-mt-*; ∼ 270 M, Marian-Transformer);
• Minerva-7B (sapienzanlp/Minerva-7B-instruct-v1.0; 7 on mT5 and attains state-of-the-art correlation at
WMTB, Mistral-like) —pre-trained on 2.5 T tokens (50 % IT, 24 (we make use of google/metricx-24-hybrid-xl-v2p6),
50 % EN) + safety tuning; and COMET trains an XLM-R encoder on millions of
human-scored triplets (we use
Unbabel/wmt22-comet• PhiMaestra-3 (LeonardPuettmann/PhiMaestra-3- da as the comet model for evaluation). The third group
Translation; 3.8 B, Phi-3 mini) —fine-tuned on 0.5 M dispenses with references: QE MetricX (a “-QE” flavour
Tatoeba EN↔IT pairs; of MetricX-24) and COMETKiwi infer absolute
transla• Cerbero-7B (galatolo/cerbero-7b; 7 B, Mistral-7B tion quality directly from the source–hypothesis pair,
base) —Italian-centric LLM trained on synthetic Cer- enabling evaluation in real-time or on data lacking gold
bero corpus; references, we make use of
Unbabel/wmt23-cometkiwida-xl. Using all three families lets us cross-check surface
• NLLB-200 (600 M / 1.3 B / 3.3 B) (facebook/nllb-200-*) accuracy, semantic adequacy and reference-free quality
Transformer family covering 200 languages; estimation within a single experimental framework. Due
to resource constraints we report only automatic
evaluation; we leave human assessment to future work.
• opus-mt-big EN→IT / IT→EN (Helsinki-NLP/opus- Datasets. We evaluate selected baselines and our
modmt-tc-big-*; ∼ 560 M Transformer model with back- els on four widely used test collections: NTREX-128
translation); [48], Tatoeba [41], WMT-24pp [49], and FLORES-200 [15].</p>
        <p>NTREX-128, which is based on WMT-19 [50], includes
• ModelloItalia-9B (sapienzanlp/modello-italia-9b; 9 B, 1,997 sentences translated from English into 128 target</p>
        <p>GPT-NeoX) —Italian LLM by iGenius/CINECA; languages, including Italian. Tatoeba is a
community• Llama-3.1-8B-ITA (DeepMount00/Llama-3.1-8b-ITA; sourced corpus that focuses on everyday conversational
8 B, Meta-Llama-3.1 fine-tuned for Italian); language and informal registers, allowing us to assess our
models’ robustness beyond formal contexts. WMT-24pp
• Tower-7B (Unbabel/TowerInstruct-7B-v0.2; 6.7 B, is a professionally translated extension of the WMT24
LLaMA-2 base) —10-language MT and post-editing dataset [38] on new languages, such as Italian.
FLOREStasks; 200 is composed of professionally translated
Wikipedia• Gemma-2B / 9B (ModelSpace/GemmaX2-28-{2B,9B}; based sentences per language, covering encyclopedic
con3.2 B / 10.2 B, Gemma-2 continued-pretrain + MT SFT tent distinct from the news domain.
for 28 languages); Additionally, to specifically evaluate translation
quality on recent texts, we introduce and use our new
benchmark, WikiNews-25, as described earlier in Section 3.
• MADLAD-3B / 7B (google/madlad400-{3b,7b}-mt; 3 B
/ 7.2 B, T5) —400+-language MT trained on up to 1 T
tokens.</p>
        <p>Automatic metrics. To assess the MT systems, we
grouped the evaluation metrics into three categories:
• Surface – overlap: sacrebleu (BLEU–4) and chrF ;
• Neural, reference–based: BLEURT, Google’s
MetricX</p>
        <p>24, and Unbabel’s COMET ;
• Neural, reference–free (QE): the QE MetricX variant
and COMETKiwi.</p>
      </sec>
      <sec id="sec-5-4">
        <title>The first group measures literal agreement with the</title>
        <p>reference: sacrebleu implements the standard BLEU
computation with canonical tokenisation for reproducible
scores, while chrF computes a character -gram
Fscore that is more robust to morphological variation.
The second group regresses directly towards human
Direct-Assessment/MQM ratings: BLEURT fine-tunes
BERT/RemBERT to predict adequacy and fluency, in
particular, we relied on BLEURT-20 model, MetricX-24 builds</p>
      </sec>
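      <p>For the surface metrics, a minimal sketch with the sacrebleu library is shown below; the neural metrics are computed with the released checkpoints listed above and are not reproduced here.</p>
      <preformat>
# Corpus-level BLEU and chrF with sacrebleu (illustrative inputs).
import sacrebleu

hypotheses = ["Il gatto dorme sul divano."]
references = [["Il gatto sta dormendo sul divano."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}  chrF = {chrf.score:.1f}")
      </preformat>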
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>Decoding policy. Unless otherwise indicated, system outputs were generated with greedy decoding. Whenever a model name ends with the sufix “ -b5” we used beam search with beam size 5.</title>
        <p>In what follows we comment on the outcomes obtained
by our DIETA model against the 15+ baselines introduced
in Section 5. We discuss one benchmark at a time, always
reporting the same seven automatic metrics and both
translation directions (EN→IT/IT→EN). With the
exception of metricx and qemetricx, higher is better.
6.1. NTREX-128
Table 1 reports NTREX-128 results. Overall performance
scales with size: Gemma-9B-b5 leads on every metric
(≈ 51/49 BLEU, 72/70 chrF, BLEURT 0.36/0.48, MetricX
1.60/2.43, COMET 0.90/0.89). Our compact DIETA+cont
reaches 36/43 BLEU, 62/66 chrF, BLEURT 0.20/0.41 and
NTREX-128 Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.
lands just behind the largest models and surpasses
evB and OPUS-MT-big. The remaining gap appears chiefly
ery competitor below 3B, confirming that targeted
backfor the largest decoders.
in reference-free QE, where MetricX is ≈
0.3–0.5 higher
translation closes most of the size-related gap, remaining
while holding MetricX and COMET-Kiwi values on par
6.2. Tatoeba
ing Tower-7B and surpassing all models ≤</p>
      </sec>
      <sec id="sec-6-2">
        <title>Reference-based metrics echo this: DIETA+cont sits within</title>
        <p>0.01–0.02 COMET of Gemma-2B-b5, while BLEURT is
only 0.01–0.02 behind Madlad-7B. MetricX and
COMET</p>
      </sec>
      <sec id="sec-6-3">
        <title>Kiwi remain scale-sensitive, DIETA trails the 9 B tier by</title>
      </sec>
      <sec id="sec-6-4">
        <title>3 B parameters.</title>
        <p>∼ 0.9 MetricX points.
parameter</p>
      </sec>
      <sec id="sec-6-5">
        <title>DIETA model delivers mid-table performance—competitive with 7 B systems and clearly ahead</title>
        <p>Tatoeba Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.
sacrebleu(↑)
chrf(↑)
bleurt(↑)
metricx(↓)
comet(↑)
qemetricx(↓)
cometkiwi(↑)
0.875/0.875). The largest gap remains in reference-free
quality estimation: MetricX for DIETA is ≈ 0.6 points
higher than the 9 B leader.
7B, and our DIETA+cont/ DIETA+allsynth, sits within ≈ 4
BLEU and 0.02 COMET. In particular, DIETA+all synth
scores 45.7 BLEU / 67.6 chrF (EN→IT) and 43.8 BLEU /
67.3 chrF (IT→EN), essentially matching Tower-7B and</p>
      </sec>
      <sec id="sec-6-6">
        <title>NLLB-3.3B despite being 14× smaller. Reference-based</title>
        <p>metrics mirror this parity (COMET 0.826/0.868), while</p>
      </sec>
      <sec id="sec-6-7">
        <title>MetricX and COMET-Kiwi still favour the largest decoders by roughly 0.3–0.4 points.</title>
        <p>6.6. Cross-benchmark Analysis</p>
        <sec id="sec-6-7-1">
          <title>Parameter eficiency.</title>
        </sec>
      </sec>
      <sec id="sec-6-8">
        <title>All five checkpoints share the</title>
        <p>same 0.5B backbone, yet DIETA+cont and DIETA+allsynth
typically rank in the second quartile of every leaderboard,
on par with 1–3B models and sometimes matching 7B
WMT24pp Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.
systems, while using ≤ 6% of the parameters of the
stateof-the-art 9B baselines. Synthetic data provide clear gains:
relative to the parallel-only DIETA, DIETA+BT adds +1− 3</p>
      </sec>
      <sec id="sec-6-9">
        <title>BLEU on four suites, and the continued-training variants</title>
        <p>add a further +0.5− 2 BLEU at no increase in model size.</p>
        <sec id="sec-6-9-1">
          <title>Directionality.</title>
          <p>For four of the five test sets
(NTREX128, Tatoeba, WMT24pp, FLORES-200) the IT→EN
direction stays 2− 12 BLEU easier, reflecting richer target-side
data during training. WikiNews-25 is the only outlier:
here, EN→IT is slightly easier, reversing the usual trend.</p>
        </sec>
      </sec>
      <sec id="sec-6-10">
        <title>In all cases the gap between directions narrows as more</title>
        <p>back-translated Italian is introduced, indicating that the
synthetic signal helps balance morphological complexity.</p>
        <sec id="sec-6-10-1">
          <title>Summary.</title>
          <p>A single 0.5 B decoder can deliver robust
performance across news, conversational, encyclopaedic
and recency-sensitive domains when fed with 768 M
carefully curated sentence pairs. Continued training on
mixed parallel + BT data (DIETA+cont) is the best
allround recipe; an additional pass that folds in FineWeb</p>
        </sec>
      </sec>
      <sec id="sec-6-11">
        <title>BT (DIETA+allsynth) further strengthens out-of-domain</title>
        <p>generalisation (FLORES, WikiNews). Remaining
headroom lies almost entirely in reference-free QE metrics,
suggesting future work on QE-aware objectives rather
than larger models.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Works</title>
      <sec id="sec-7-1">
        <title>We presented a family of five</title>
      </sec>
      <sec id="sec-7-2">
        <title>DIETA variants, built on</title>
        <p>the same 0.5 B-parameter decoder-only Transformer and
trained on up to 768 M carefully curated parallel +
backtranslated sentence pairs.</p>
        <p>Across five diverse
benchmarks, the best variants, DIETA+cont and DIETA+allsynth,
consistently places in the second performance tier,
matching or surpassing models 2–3 × larger and trailing the
current 9 B state-of-the-art by only a few BLEU/COMET
points. This shows that data scale and task-specific
training can compensate for an order-of-magnitude
reduction in parameters, yielding models that fit on a single
consumer GPU while remaining competitive with much
larger LLMs. We also released WikiNews-25, a
humanpost-edited English–Italian test set built from 2025 news,
adding recent news to evaluation. As future work, we
plan to (i) reduce the reference-free QE gap through
QEaware fine-tuning, (ii) extend DIETA with
parametereficient scaling such as sparse MoE, and (iii) enable edge
deployment via distillation and 8/4-bit quantisation.
Flores Translation Results. The sufix -b5 indicates that beam search with 5 beams was used during generation.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support. This work was partially supported by the European Union – Next Generation EU within the project NRPP M4C2, Investment 1.3, DD. 341, 15 March 2022 – FAIR – Future Artificial Intelligence Research – Spoke 4 - PE00000013 - D53C22002380006, and by the MUR under the grant “Dipartimenti di Eccellenza 2023-2027” of the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy. This work was completed in part at the CINECA Open Hackathon, part of the Open Hackathons program. The authors would like to acknowledge OpenACC-Standard.org for their support. We would also like to thank Daniele Di Bari for his helpful feedback and support.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <sec id="sec-9-1">
        <title>During the preparation of this work, the authors used</title>
      </sec>
      <sec id="sec-9-2">
        <title>GPT3.5 and GPT-4 in order to: Grammar and spelling</title>
        <p>check, Paraphrase and reword. After using these
tools/services, the authors reviewed and edited the content as
needed and take full responsibility for the publication’s
content.
org/2024.lrec-main.32/.
[2]</p>
      </sec>
      <sec id="sec-9-3">
        <title>M. Braga, P. Kasela, A. Raganato, G. Pasi,</title>
        <p>
          Synthetic data generation with large language
models for personalized com
          <xref ref-type="bibr" rid="ref1">munity question
answering, in: 2024</xref>
          IEEE/WIC International Conference
on
        </p>
      </sec>
      <sec id="sec-9-4">
        <title>Web Intelligence and Intelligent Agent Technology (WI-IAT), 2024, pp. 360–366. doi:10.1109/</title>
        <p>WI-IAT62293.2024.00057.
Se-pqa:
Web
Con3589335.3651445.
[4]</p>
        <p>G. Peikos, P. Kasela, G. Pasi, Leveraging large
language models for medical information extraction
and query generation, in: 2024 IEEE/WIC
International Conference on Web Intelligence and
Intelligent Agent Technology (WI-IAT), 2024, pp. 367–372.
doi:10.1109/WI-IAT62293.2024.00058.
[5]</p>
      </sec>
      <sec id="sec-9-5">
        <title>A. Raganato, F. Bartoli, C. Crocamo, D. Cavaleri,</title>
      </sec>
      <sec id="sec-9-6">
        <title>G. Carrà, G. Pasi, M. Viviani, Leveraging prompt</title>
        <p>engineering and large language models for
automating madrs score computation for depression
severity assessment,</p>
        <p>in: Ital-IA 2024: 4th National</p>
      </sec>
      <sec id="sec-9-7">
        <title>Conference on Artificial Intelligence, organized by</title>
      </sec>
      <sec id="sec-9-8">
        <title>CINI., Naples, Italy, 2024. [6]</title>
      </sec>
      <sec id="sec-9-9">
        <title>A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,</title>
      </sec>
      <sec id="sec-9-10">
        <title>L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At</title>
        <p>tention is all you need, Advances in neural
information processing systems 30 (2017).</p>
        <p>Comput. Surv. 56 (2023).
[8]</p>
      </sec>
      <sec id="sec-9-11">
        <title>A. Hendy, M. Abdelrehim, A. Sharaf, V. Raunak,</title>
      </sec>
      <sec id="sec-9-12">
        <title>M. Gabr, H. Matsushita, Y. J. Kim, M. Afify, H. H.</title>
        <p>Awadalla, How good are gpt models at machine
translation? a comprehensive evaluation, arXiv
preprint arXiv:2302.09210 (2023).
[9] P. Koehn, Europarl: A parallel corpus for statistical
machine translation, in: Proceedings of machine
translation summit x: papers, 2005, pp. 79–86.
[10] J. Tiedemann, Parallel data, tools and interfaces in
opus, in: N. C. C. Chair), K. Choukri, T. Declerck,</p>
      </sec>
      <sec id="sec-9-13">
        <title>M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk,</title>
      </sec>
      <sec id="sec-9-14">
        <title>S. Piperidis (Eds.), Proceedings of the Eight Interna</title>
        <p>tional Conference on Language Resources and
Evaluation (LREC’12), European Language Resources</p>
      </sec>
      <sec id="sec-9-15">
        <title>Association (ELRA), Istanbul, Turkey, 2012. [11]</title>
      </sec>
      <sec id="sec-9-16">
        <title>M. Esplà-Gomis, M. L. Forcada, G. Ramírez-Sánchez,</title>
      </sec>
      <sec id="sec-9-17">
        <title>H. Hoang, Paracrawl: Web-scale parallel corpora</title>
        <p>for the languages of the eu, in: Proceedings of
Machine Translation Summit XVII: Translator, Project
and User Tracks, 2019, pp. 118–119.
large pre-trained language models: A survey, ACM
titles2018: Statistical rescoring of sentence
alignments in large, noisy parallel corpora, in: The</p>
      </sec>
      <sec id="sec-9-18">
        <title>Eleventh International Conference on Language</title>
        <p>Resources and Evaluation (LREC 2018), 2018. [23] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia,
[13] P. Lison, J. Tiedemann, OpenSubtitles2016: Ex- C. A. Choquette-Choo, K. Lee, D. Xin, A. Kusupati,
tracting large parallel corpora from movie and TV R. Stella, A. Bapna, O. Firat, Madlad-400: A
multisubtitles, in: Proceedings of the Tenth International lingual and document-level large audited dataset,
Conference on Language Resources and Evaluation 2023. arXiv:2309.04662.
(LREC’16), European Language Resources Associa- [24] D. M. Alves, J. Pombal, N. M. Guerreiro, P. H.
Martion (ELRA), Portorož, Slovenia, 2016, pp. 923–929. tins, J. Alves, A. Farajian, B. Peters, R. Rei, P.
Fernan[14] H. Schwenk, V. Chaudhary, S. Sun, H. Gong, des, S. Agrawal, et al., Tower: An open multilingual
F. Guzmán, Wikimatrix: Mining 135m parallel sen- large language model for translation-related tasks,
tences in 1620 language pairs from wikipedia, in: arXiv preprint arXiv:2402.17733 (2024).
The 16th Conference of the European Chapter of the [25] G. Team, M. Riviere, S. Pathak, P. G. Sessa,
Association for Computational Linguistics, 2021, pp. C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard,
1351–1361. B. Shahriari, A. Ramé, et al., Gemma 2: Improving
[15] M. R. Costa-Jussà, J. Cross, O. Çelebi, M. Elbayad, open language models at a practical size, arXiv
K. Heafield, K. Hefernan, E. Kalbassi, J. Lam, preprint arXiv:2408.00118 (2024).</p>
        <p>D. Licht, J. Maillard, et al., No language left be- [26] M. Cui, P. Gao, W. Liu, J. Luan, B. Wang,
Multihind: Scaling human-centered machine translation, lingual machine translation with open large
lanarXiv preprint arXiv:2207.04672 (2022). guage models at practical scale: An empirical
[16] J. Tiedemann, M. Aulamo, D. Bakshandaeva, study, 2025. URL: https://arxiv.org/abs/2502.02481.</p>
        <p>
          M. Boggia, S.-A. Grönroos, T. Nieminen, A. Ra- arXiv:2502.02481.
ganato, Y. Scherrer, R. Vázquez, S. Virpioja, De- [27] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S.
Comocratizing neural machine translation with opus- nia, E. Barba, S. Orlandini, G. Fiameni, R.
Navmt, Language Resources and Evaluation 58 (2024) igli, Minerva LLMs: The first family of large
713–755. language models trained from scratch on Italian
[17] J. Tiedemann, S. Thottingal, OPUS-MT – building data, in: F. Dell’Orletta, A. Lenci, S. Montemagni,
open translation services for the world, in: Pro- R. Sprugnoli (Eds.), Proceedings of the 10th Italian
ceedings of the 22nd Annual Conference of the Eu- Conference on Computational Linguistics
(CLiCropean Association for
          <xref ref-type="bibr" rid="ref1">Machine Translation, Euro- it 2024</xref>
          ), CEUR Workshop Proceedings, Pisa, Italy,
pean Association for
          <xref ref-type="bibr" rid="ref1">Machine Translation, Lisboa, 2024</xref>
          , pp. 707–719. URL: https://aclanthology.org/
Portugal, 2020. 2024.clicit-1.77/.
[18] R. Orlando, L. Moroni, P.-L. H. Cabot, S. Conia, [28] F. A. Galatolo, M. G. Cimino, Cerbero-7b: A leap
forE. Barba, S. Orlandini, G. Fiameni, R. Navigli, Min- ward in language-specific llms through enhanced
erva llms: The first family of large language models chat corpus generation and evaluation, arXiv
trained from scratch on italian data, in: Proceedings preprint arXiv:2311.15698 (2023).
of the 10th Italian Conference on Computational [29] R. Navigli, S. Conia, B. Ross, Biases in large
Linguistics (CLiC-it 2024), 2024, pp. 707–719. language models: Origins, inventory, and
discus[19] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, sion, J. Data and Information Quality 15 (2023).
        </p>
        <p>
          G. Fiameni, G. Semeraro, Llamantino: Llama 2 mod- doi:10.1145/3597307.
els for efective text generation in italian language, [30] A. Bacciu, C. Campagnano, G. Trappolini, F.
SilarXiv preprint arXiv:2312.09993 (2023). vestri, DanteLLM: Let’s push Italian LLM research
[20] J. Tiedemann, OPUS – parallel corpora for ev- forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste,
eryone, in: Proceedings of the 19th Annual Con- A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of
ference of the European Association for
          <xref ref-type="bibr" rid="ref1">Machine the 2024</xref>
          Joint International Conference on
ComTranslation: Projects/Products, Baltic Journal of putational Linguistics, Language Resources and
Modern Computing, Riga, Latvia, 2016. URL: https: Evaluation (LREC-COLING 2024), ELRA and ICCL,
//aclanthology.org/2016.eamt-2.8/. Torino, Italia, 2024, pp. 4343–4355. URL: https:
[21] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, //aclanthology.org/2024.lrec-main.388/.
        </p>
        <p>S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaud- [31] H. Schwenk, G. Wenzek, S. Edunov, E. Grave,
hary, et al., Beyond english-centric multilingual A. Joulin, A. Fan, CCMatrix: Mining billions of
highmachine translation, Journal of Machine Learning quality parallel sentences on the web, in: C. Zong,
Research 22 (2021) 1–48. F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the
[22] Y. Tang, C. Tran, X. Li, P.-J. Chen, N. Goyal, 59th Annual Meeting of the Association for
ComV. Chaudhary, J. Gu, A. Fan, Multilingual trans- putational Linguistics and the 11th International
lation with extensible multilingual pretraining and Joint Conference on Natural Language Processing
ifnetuning, arXiv preprint arXiv:2008.00401 (2020). (Volume 1: Long Papers), Association for
Computational Linguistics, Online, 2021. MT, in: The Fifth Conference on Machine
Trans[32] K. Wołk, K. Marasek, Building subject-aligned com- lation, Association for Computational Linguistics,
parable corpora and mining it for truly parallel sen- Online, 2020.</p>
        <p>
          tence pairs, Procedia Technology 18 (2014) 126–132. [42] NLLB Team, M. R. Costa-jussà, J. Cross, O. Çelebi,
[33] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, M. Elbayad, K. Heafield, K. Hefernan, E. Kalbassi,
T. Erjavec, D. Tufis, D. Varga, The jrc-acquis: A J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G.
Wenmultilingual aligned parallel corpus with 20+ lan- zek, A. Youngblood, B. Akula, L. Barrault, G.
Mejiaguages, in: LREC, 2006. Gonzalez, P. Hansanti, J. Hofman, S. Jarrett, K. R.
[34] M. Aulamo, U. Sulubacak, S. Virpioja, J. Tiedemann, Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews,
Opustools and parallel corpus diagnostics, in: Pro- N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao,
ceedings of the Twelfth Language Resources and V. Goswami, F. Guzmán, P. Koehn, A. Mourachko,
Evaluation Conference, 2020, pp. 3782–3789. C. Ropers, S. Saleem, H. Schwenk, J. Wang, No
[35] M. Aulamo, S. Virpioja, J. Tiedemann, OpusFil- language left behind: Scaling human-centered
mater: A configurable parallel corpus filtering tool- chine translation (2022).
box, in: Proceedings of the 58th Annual Meeting [43] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, Y. Liu,
Roof the Association for Computational Linguistics: former: Enhanced transformer with rotary position
System Demonstrations, Association for Computa- embedding, Neurocomputing 568 (2024) 127063.
tional Linguistics, 2020. [44] R. He, A. Ravula, B. Kanagal, J. Ainslie, Realformer:
[36] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, Transformer likes residual attention, in: Findings
S. Gunasekar, M. Harrison, R. J. Hewett, M. Java- of the Association for Computational Linguistics:
heripi, P. Kaufmann, et al., Phi-4 technical report, ACL-IJCNLP 2021, 2021, pp. 929–943.
arXiv preprint arXiv:2412.08905 (2024). [45] A. Henry, P. R. Dachapally, S. S. Pawar, Y. Chen,
[37] R. Sennrich, B. Haddow, A. Birch, Improving neural Query-key normalization for transformers, in:
Findmachine translation models with monolingual data, ings of the Association for Computational
Linguisin: Proceedings of the 54th Annual Meeting of the tics: EMNLP 2020, 2020, pp. 4246–4253.
Association for Computational Linguistics (Volume [46] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski,
1: Long Papers), 2016, pp. 86–96. J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos,
[38] T. Kocmi, E. Avramidis, R. Bawden, O. Bojar, I. Alabdulmohsin, et al., Scaling vision transformers
A. Dvorkovich, C. Federmann, M. Fishel, M. Freitag, to 22 billion parameters, in: International
conferT. Gowda, R. Grundkiewicz, B. Haddow, M. Karpin- ence on machine learning, PMLR, 2023, pp. 7480–
ska, P. Koehn, B. Marie, C. Monz, K. Murray, M. Na- 7512.
gata, M. Popel, M. Popović, M. Shmatova, S. Ste- [47] X. Chen, C. Liang, D. Huang, E. Real, K. Wang,
ingrímsson, V. Zouhar, Findings of the WMT24 H. Pham, X. Dong, T. Luong, C.-J. Hsieh, Y. Lu, et al.,
general machine translation shared task: The LLM Symbolic discovery of optimization algorithms,
Adera is here but MT is not solved yet, in: B. Haddow, vances in neural information processing systems
T. Kocmi, P. Koehn, C. Monz (Eds.), Proceedings 36 (2023) 49205–49233.
of the Ninth Conference on Machine Translation, [48] C. Federmann, T. Kocmi, Y. Xin, NTREX-128 – news
Association for Computational Linguistics, Miami, test references for MT evaluation of 128 languages,
USA, 2024. in: The First Workshop on Scaling Up Multilingual
[39] G. Penedo, H. Kydlíček, L. B. allal, A. Lozhkov, Evaluation, Association for Computational
LinguisM. Mitchell, C. Rafel, L. V. Werra, T. Wolf, The tics, Online, 2022.
ifneweb datasets: Decanting the web for the finest [49] D. Deutsch, E. Briakou, I. Caswell, M. Finkelstein,
text data at scale, in: The Thirty-eight Confer- R. Galor, J. Juraska, G. Kovacs, A. Lui, R. Rei, J. Riesa,
ence on Neural Information Processing Systems et al., Wmt24++: Expanding the language coverage
Datasets and Bench
          <xref ref-type="bibr" rid="ref1">marks Track, 2024</xref>
          . URL: https: of wmt24 to 55 languages &amp; dialects, arXiv preprint
//openreview.net/forum?id=n6SCkn2QaG. arXiv:2502.12404 (2025).
[40] G. Penedo, H. Kydlíček, V. Sabolčec, B. Messmer, [50] L. Barrault, O. Bojar, M. R. Costa-jussà, C.
FederN. Foroutan, A. H. Kargaran, C. Rafel, M. Jaggi, L. V. mann, M. Fishel, Y. Graham, B. Haddow, M. Huck,
Werra, T. Wolf, Fineweb2: One pipeline to scale P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal,
them all – adapting pre-training data processing M. Post, M. Zampieri, Findings of the 2019
conto every language, 2025. URL: https://arxiv.org/abs/ ference on machine translation (WMT19), in: The
2506.20920. arXiv:2506.20920. Fourth Conference on Machine Translation,
Associ[41] J. Tiedemann, The tatoeba translation challenge – ation for Computational Linguistics, Florence, Italy,
realistic data sets for low resource and multilingual 2019.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Braga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          , G. Pasi,
          <string-name>
            <surname>AdaKron:</surname>
          </string-name>
          <article-title>An adapter-based parameter eficient model tuning with kronecker product</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>350</fpage>
          -
          <lpage>357</lpage>
          . URL: https://aclanthology.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kasela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braga</surname>
          </string-name>
          , G. Pasi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <article-title>Personalized community question answering</article-title>
          , in: [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          , T. H.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , M. Kouylekov, OpensubNguyen,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Heintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          , Re-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>