<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2024.acl-long.643</article-id>
      <title-group>
        <article-title>No longer left behind: Self-training Reasoning Models in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
        </contrib>
        <aff>University of Roma Tor Vergata</aff>
        <aff>University of Edinburgh</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>11917</fpage>
      <lpage>11928</lpage>
      <abstract>
        <p>Although reasoning is, by nature, language-agnostic, the extent to which large language models (LLMs) can perform consistent multilingual reasoning remains limited. Their capacity to deliver step-wise explanations is largely constrained to the dominant languages present in their pre-training data, thereby limiting cross-lingual generalisation and hindering broader global applicability. While recent work has explored a range of strategies to extend reasoning capabilities beyond English, these efforts typically remain grounded in surface-level spoken language phenomena, which may not be optimal for abstract or formal reasoning tasks. In this study, we focus on Italian and English, two languages with markedly different syntactic and morphological properties, to assess whether advancements in multilingual reasoning remain consistent and transferable across typologically diverse settings. To this end, we introduce a modular framework that guides LLMs to abstract the reasoning process into a structured problem space before generating step-wise reasoning trajectories. The approach leverages self-training to enhance alignment and generalisation. Experimental results demonstrate stable and significant gains in multilingual reasoning across models and tasks, with improved consistency between English and Italian.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual Reasoning</kwd>
        <kwd>Self-training</kwd>
        <kwd>Large Reasoning Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the era of large language models (LLMs), approaches such as Chain-of-Thought (CoT) and related methods seek to emulate human reasoning through language generation, an ability that, in principle, ought not to be constrained by the particularities of any spoken language. Yet, a growing body of evidence indicates that the reasoning capabilities of LLMs vary significantly across languages, largely as a consequence of imbalances in pre-training data. LLMs perform better in dominant languages, notably English, while exhibiting reduced reasoning competence in less-represented languages.</p>
      <p>Research advances in multilingual reasoning are increasingly aimed at closing the performance differences among languages, enhancing the models’ capabilities through in-context learning interventions [1, 2, 3], SFT strategies that range from language-specific augmentation [4, 5] to task-oriented tuning [6], and preference optimisation [7, 8]. Although these approaches have enabled the development of effective methods for transferring and aligning multilingual reasoning capabilities, we argue that several critical challenges continue to hinder progress. First and foremost, the benefits of in-context interventions appear to be confined to large-scale LLMs, which are better equipped to interpret and follow instructions in a systematic way; however, they must also have robust multilingual proficiency. Therefore, many works rely on SFT techniques that maintain reduced costs when used with specialised, smaller-scale LLMs. Secondly, they require vast amounts of complex reasoning annotations and tremendous tuning efforts to obtain multilingual LLMs capable of delivering reasoning through SFT and preference optimisation techniques.</p>
      <p>To enhance multilingual reasoning in LLMs, we propose a modular approach that first instructs the model to abstractly formalise the problem and then generate structured, step-by-step reasoning trajectories that converge towards a consistent reasoning process across languages. Our approach decomposes problem solutions into a sequence of formal, language-agnostic sub-problems that are solved sequentially and can be more effectively utilised by models.</p>
      <p>The decomposition consists of two high-level modules: Formalisation and Reasoning Execution. As illustrated in Figure 1, we guide the models to: (i) identify the relevant information within the problem, formalising variables and predicates while delivering symbolic transformations; (ii) generate a reasoning execution trajectory in which the transformations are applied using symbolic representations that explicitly articulate the solution, ultimately yielding an answer in the same query language.</p>
      <p>Previous works proposed English-based strategies that operate via logical formalisms coupled with external symbolic solvers [9, 10]. Yet, fully symbolic approaches face a key bottleneck: they require a complete translation from natural to formal language, which can hinder both efficiency and flexibility, introducing additional layers of complexity.</p>
      <p>[Figure 1: Overview of the approach. Left: an mSVAMP-style problem posed in Italian (“Un gruppo di 200 studenti ha una varietà di hobby...”; in English: “A group of 200 students has various hobbies. 50 like to read, 29 like to play cricket, and the rest like to either dance or bake. How many like to dance if the number that like to bake is 2 less than twice the number that prefer playing cricket?”) is abstracted into a &lt;formalisation&gt; block (S=200, R=50, C=29, B=2C−2, R+C+D+B=S), solved in a &lt;reasoning&gt; block (B = 2(29) − 2 = 56; 50 + 29 + D + 56 = 200, so D = 65), and answered in the query language (&lt;answer&gt; “La risposta è 65.” / “The answer is 65.”). Right: the training pipeline, in which an LLM is warmed up via SFT on annotated and refined demonstrations and then self-improved via RL (GRPO) as the policy model.]</p>
      <p>To achieve a better trade-off, we treat formalisations in an eclectic manner and propose methods to disentangle content from logical reasoning without introducing rigorous formalisms.</p>
      <p>To this end, following Ranaldi and Pucci [11], we instruct larger LLMs to generate synthetic demonstrations through Structured Abstractive Generative Explanation (SAGE), which are then used to perform Self-training on smaller LLMs.</p>
      <p>As part of the warm-up phase, we experiment with multiple alignment strategies, ranging from supervised fine-tuning (Instruction-Tuning) to preference optimisation techniques (Reinforcement Learning). We conducted an extensive empirical evaluation to assess the impact of different tuning and alignment strategies.</p>
      <p>In multilingual reasoning tasks, our approach demonstrated significant improvements, resulting in an overall increase in exact matching on the proposed tasks, which led to the following results and conclusions:
• Structuring multilingual reasoning in LLMs as formal reasoning trajectories (SAGE), which leverages language-agnostic reasoning logic, improves accuracy and generates more verifiable outputs through a transparent and structured process.
• Leveraging self-training heuristics that combine both tuning and preference optimisation leads to more robust, generalisable, and language-aligned models. While tuning based on synthetic demonstrations proves effective, it alone fails to yield consistently strong performance across all languages. Conversely, relying solely on preference optimisation can provide performance gains, but at the cost of significant computational overhead.
• Our approach allows the disentanglement of content from logical reasoning, improving multilingual reasoning in LLMs, thus benefiting different language spaces.</p>
    </sec>
    <sec id="sec-method">
      <title>2. Method</title>
      <p>We propose a self-training framework that augments standard fine-tuning with a set of preference optimisation policies (§ 2.1) designed to improve self-refinement. The approach iteratively alternates between preference-based optimisation (via reinforcement learning) and supervised fine-tuning, directing the model to abstract the underlying problem and articulate a step-wise, formal solution (§ 2.2). The iterative process terminates once the model’s performance either converges or reaches a predefined maximum number of iterations.</p>
      <sec id="sec-1-1">
        <title>2.1. Preference Estimation</title>
        <p>RL strategies operate via preference estimation. This generally involves aligning the policy model with preferences using a reward model, which learns to predict preferences based on comparisons and guides the optimisation process. Although this approach is practical, it has problems with generalisation, scalability, robustness, and alignment. In GRPO, rule-based reward models are used and the rules are explicitly defined, whereas DPO is generally based on a series of naive string-matching functions with ground-truth values. Accordingly, we define the following preference policies:</p>
        <sec id="sec-1-1-1">
          <title>DPO Preference Estimation</title>
          <p>We adopt a string-matching function in line with existing approaches for English [8, 12]. We then refine this procedure by filtering out generations that do not adhere to the expected structural pattern and a well-formed format.</p>
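          <p>As a concrete illustration, a minimal sketch of this preference construction, under our assumptions (the tag layout follows the SAGE template in Appendix A; function names are illustrative, not the released implementation), is:</p>
          <preformat>
import re

# Illustrative sketch: build DPO preference pairs via string matching plus
# format filtering. Tag layout follows the SAGE template; names are assumptions.

PATTERN = re.compile(
    r"&lt;formalisation&gt;.*?&lt;/formalisation&gt;\s*"
    r"&lt;reasoning&gt;.*?&lt;/reasoning&gt;\s*"
    r"&lt;answer&gt;(.*?)&lt;/answer&gt;",
    re.DOTALL,
)

def extract_answer(completion):
    """Return the answer span if the completion is well-formed, else None."""
    match = PATTERN.search(completion)
    return match.group(1).strip() if match else None

def build_dpo_pairs(question, completions, gold):
    """Pair correct, well-formed completions with incorrect ones."""
    chosen, rejected = [], []
    for completion in completions:
        answer = extract_answer(completion)
        if answer is None:          # filter out ill-formed generations
            continue
        (chosen if gold in answer else rejected).append(completion)
    return [
        {"prompt": question, "chosen": c, "rejected": r}
        for c, r in zip(chosen, rejected)
    ]
</preformat>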
        </sec>
        <sec id="sec-1-1-2">
          <title>GRPO Preference Estimation</title>
          <p>Following Ranaldi and Pucci [11], we define rule-based metrics that control the accuracy, the structure, and the form of the generations.</p>
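          <p>As a sketch, a rule-based reward of this kind can be expressed as a sum of checks on structure, form, and accuracy; the weights and the specific checks below are illustrative assumptions rather than the exact rules used in our experiments:</p>
          <preformat>
import re

# Illustrative rule-based reward for GRPO: structure + form + accuracy.
# Weights and checks are assumptions for exposition only.

TAGS = ("formalisation", "reasoning", "answer")

def rule_based_reward(completion, gold_answer):
    reward = 0.0
    # Structure: every SAGE tag must open and close exactly once.
    if all(completion.count("&lt;" + t + "&gt;") == 1
           and completion.count("&lt;/" + t + "&gt;") == 1 for t in TAGS):
        reward += 0.25
    answer = re.search(r"&lt;answer&gt;(.*?)&lt;/answer&gt;", completion, re.DOTALL)
    # Form: the final answer block must contain a numeric value.
    if answer and re.search(r"\d", answer.group(1)):
        reward += 0.25
    # Accuracy: the stated answer must match the gold target.
    if answer and gold_answer in answer.group(1):
        reward += 0.5
    return reward
</preformat>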
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Self-training</title>
        <p>Conventional self-training begins by fine-tuning the base model ℳ on the supervised SFT dataset ℒ, yielding an updated model ℳ′. At this stage, we assume that ℳ′ has acquired the ability to address the target problem. Specifically, when presented with a question q, the model generates a formal reasoning sequence r̂ together with the corresponding answer â.</p>
        <p>Self-training. We begin by sampling multiple completions from ℳ′ in response to a set of questions q drawn from the unlabelled pool 𝒰. We then apply the preference estimation heuristics to construct preference-based samples according to the different optimisation strategies: pairwise comparisons for DPO and grouped completions for GRPO. These generations are compiled into a dataset 𝒟, which is subsequently used to further train the model using the corresponding objective functions (ℒ_DPO and ℒ_GRPO), resulting in an updated model ℳ″. Then we use ℳ″ to generate a new pseudo-labeled dataset for the next-round tuning:</p>
        <p>𝒟̂ = {(q, r̂, â) | q ∼ 𝒰, (r̂, â) ∼ ℳ″(·|q)}.   (1)</p>
        <p>After generation, the dataset 𝒟̂ is refined by removing incorrect answers and eliminating duplicates. Consequently, the resulting pseudo-labeled dataset, denoted as 𝒟̂′, is a subset of the original dataset, i.e., 𝒟̂′ ⊂ 𝒟̂. The final training dataset is constructed by combining the original labeled dataset ℒ with the newly generated pseudo-labeled dataset 𝒟̂′. During this process, each new dataset is used to train from the original base model ℳ, rather than continually fine-tuning ℳ′, to mitigate the risk of overfitting.</p>
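        <p>For concreteness, the pseudo-labelling and refinement step of Eq. (1) can be sketched as follows; the sampling and answer-extraction hooks are hypothetical placeholders, not library calls:</p>
        <preformat>
# Illustrative pseudo-labelling step (Eq. 1): sample rationales from the
# current model and keep only correct, non-duplicate generations.

def build_pseudo_labels(questions, generate, extract_answer):
    """questions: items with a question and its gold answer;
    generate(q): draws a rationale from M''(.|q);
    extract_answer(r): reads the value inside &lt;answer&gt;...&lt;/answer&gt;."""
    dataset, seen = [], set()
    for item in questions:
        rationale = generate(item["question"])
        predicted = extract_answer(rationale)
        # Refinement: discard incorrect answers and duplicates.
        if predicted != item["answer"] or rationale in seen:
            continue
        seen.add(rationale)
        dataset.append({"question": item["question"],
                        "rationale": rationale,
                        "answer": predicted})
    return dataset
</preformat>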
      </sec>
      <sec id="sec-1-3">
        <title>2.3. Single-training</title>
        <p>Algorithm 1: Self-training [11]</p>
        <preformat>
Input:  pre-trained language model ℳ
Input:  labeled dataset ℒ = {(q_i, r_i, a_i)}_{i=1..N}
Input:  unlabeled dataset 𝒰 = {(q_j, a_j)}_{j=1..M}
Input:  mode ∈ {DPO, GRPO}
Output: fine-tuned model ℳ′

# Warm-up stage
1: Fine-tune ℳ on ℒ to get ℳ′
2: repeat
3:    if mode = DPO then
4:        Generate DPO dataset 𝒟 = {(q_j, r_j⁺, r_j⁻)}_{j=1..M},
          where q_j ∼ 𝒰 and r_j⁺, r_j⁻ ∼ ℳ′(·|q_j)
5:        Tune ℳ′ with ℒ_DPO on 𝒟 to get ℳ″
      end if
      if mode = GRPO then
          Generate GRPO dataset 𝒟 = {(q_j, G_j)}_{j=1..M},
          where q_j ∼ 𝒰 and G_j = {o_1, ..., o_k} ∼ ℳ′(·|q_j)
          Compute relative preferences within each group G_j and
          assign pairwise relative scores to the outputs in G_j
          Tune ℳ′ with ℒ_GRPO on 𝒟 to get ℳ″
      end if
      # SFT step
6:    Build pseudo-labeled dataset 𝒟̂ = {(q_j, r̂_j, â_j)}_{j=1..M},
          where q_j ∼ 𝒰 and (r̂_j, â_j) ∼ ℳ″(·|q_j)
      Select 𝒟̂′ ⊂ 𝒟̂ where â_j = a_j
7:    Update ℒ ← 𝒟̂′ ∪ ℒ
      Train ℳ on ℒ to get a new ℳ′
8: until convergence or max iteration is reached
</preformat>
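        <p>The control flow of Algorithm 1 can be summarised by the sketch below; the injected callables stand in for the warm-up SFT, the DPO/GRPO trainers, and the pseudo-labelling step, and are placeholders rather than actual library functions:</p>
        <preformat>
# High-level sketch of Algorithm 1. `steps` is a dict of callables:
# "sft", "dpo", "grpo", and "pseudo_label"; all of them are hypothetical
# placeholders for the corresponding training stages.

def self_training(base_model, labeled, unlabeled, steps, mode="GRPO", max_iters=3):
    current = steps["sft"](base_model, labeled)            # warm-up stage
    for _ in range(max_iters):
        if mode == "DPO":
            current = steps["dpo"](current, unlabeled)     # pairwise preferences
        else:
            current = steps["grpo"](current, unlabeled)    # grouped completions
        # SFT step: pseudo-label the unlabelled questions, keep correct ones,
        # and retrain from the original base model on the enlarged set.
        labeled = labeled + steps["pseudo_label"](current, unlabeled)
        current = steps["sft"](base_model, labeled)
    return current
</preformat>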
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>As outlined in the introduction, our objective is to develop a method for enhancing the reasoning capabilities of LLMs beyond English, with a particular emphasis on Italian. Our experiments are conducted on multilingual reasoning tasks. We evaluate four models (§ 3.1), trained according to the procedure detailed in § 3.2, on two mathematical reasoning benchmarks (§ 3.3), using the experimental configurations described in § 3.4.</p>
      <sec id="sec-2-1">
        <title>3.1. Models</title>
        <p>To conduct our study on different models and have a term of comparison, we use Llama3-8B [13] and DeepSeekMath-7B-Instruct [14] (DeepSeek-7B). Furthermore, to show the scalability and effectiveness of our approach on further models, we introduce additional smaller-scale models: EuroLLM-1.7B and Velvet-2B. For comparative purposes, we conduct individual training operating only with SFT, DPO, and GRPO.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Training Methods</title>
        <p>As introduced in § 2, we use iterative steps of SFT and RL. We follow standard practice and perform a warm-up phase based on an SFT step using the synthetic demonstrations discussed in § 3.3.2. Then, we conduct the self-training by progressively applying the SFT and RL optimisation algorithms. Following pilot studies (discussed later), we set the total number of iterations to three (excluding warm-up), and we use the same number in the settings where we use only one of SFT and RL.</p>
        <p>Supervised Fine-tuning. Regarding the SFT phase, we employed 8-bit quantization and LoRA. We tune the model for one epoch (warm-up) and for one epoch in each iteration, using the learning rates according to the specific model configuration, as detailed in Appendix D.</p>
        <p>Preference Optimisation (RL). We employ the HuggingFace trainers to ensure reproducibility. For DPO, we set the learning rate to 1e-6 and β to 0.1. The optimisation process is set at a maximum of 2000 steps, saving the checkpoint corresponding to the lowest validation loss. For GRPO, we set the learning rate to 5e-6; the optimisation process is likewise set at a maximum of 2000 steps, saving the checkpoint corresponding to the lowest validation loss. Details in Appendix D.</p>
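        <p>A configuration sketch consistent with the values above, assuming the HuggingFace TRL and PEFT libraries (only the learning rates, β for DPO, and the step budget come from the text; the remaining arguments, including the LoRA rank, are illustrative defaults):</p>
        <preformat>
# Sketch of the tuning configuration described above (assumed TRL/PEFT stack).

from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import DPOConfig, GRPOConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)      # 8-bit quantization (SFT)
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")  # assumed rank

dpo_config = DPOConfig(
    output_dir="dpo-checkpoints",
    learning_rate=1e-6,
    beta=0.1,
    max_steps=2000,      # keep the checkpoint with the lowest validation loss
)

grpo_config = GRPOConfig(
    output_dir="grpo-checkpoints",
    learning_rate=5e-6,
    max_steps=2000,
)
</preformat>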
      </sec>
      <sec id="sec-2-3-data">
        <title>3.3. Data</title>
        <sec id="sec-2-3-data-1">
          <title>3.3.1. Evaluation Set</title>
          <p>To study the reasoning performances of the trained models, we operate via mGSM, mSVAMP, and we introduce mGSM-Symbolic, focusing on English and Italian.</p>
          <p>Mathematical Reasoning Tasks. We use the multilingual extensions of GSM8K and SVAMP: respectively, Multilingual Grade School Math (mGSM) and Multilingual Simple Variations on Arithmetic Math word Problems (mSVAMP). In the original cases, the authors proposed a benchmark of English mathematical problems with the following structure: a word problem in natural language and a target answer in numbers. For both versions, a subset of instances from the official list of examples was translated into 11 different languages, maintaining the structure of the input and output.</p>
          <p>mGSM-Symbolic. Mirzadeh et al. [15] improved GSM8K (the ancestor of mGSM) by proposing GSM-Symbolic, which introduces symbolic patterns in GSM8K that complexify the task and challenge the LLMs’ capabilities. We propose mGSM-Symbolic, the multilingual extension of GSM-Symbolic. In particular, we conduct an automatic translation phase, reviewed by qualified annotators, into 10 different languages. The dataset is available on GitHub and HuggingFace.</p>
        </sec>
        <sec id="sec-2-3-data-2">
          <title>3.3.2. Training Set</title>
          <p>Instead of using natural language rationales, we employ synthetic demonstrations to train models to solve tasks following the two phases in Figure 1. Specifically, we instruct a robust model capable of addressing multilingual mathematical tasks by formalising problems and solving them in a language-agnostic manner. We employ GPT-4o as annotator, instructing it with the prompt detailed in Appendix A (we define this procedure as Self-training). Different works train an expert version of the same model that is going to be refined for generating synthetic demonstrations, which are subsequently used for self-training (we define this procedure as Full Self-training).</p>
          <p>Multilingual Demonstrations. We annotate a subset of the mSVAMP dataset containing 250 samples for all languages to have in-domain demonstrations. After the annotation process, we check the quality of the demonstrations using rule-based heuristics and GPT-4o-mini as an additional evaluator (details in Appendix C).</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.4. Experimental Setup</title>
        <p>In-context Learning. We evaluate the baseline models (without tuning) using a 6-shot strategy defined as Direct and CoT. Moreover, we instruct the models to solve the problem following SAGE.</p>
        <p>Training. We assess the impact of the Self-training approaches (§ 3) by conducting different tuning configurations:
• SFT, RL: We tune the models using the synthetic demonstrations, as detailed in Appendix B.
• Self-training: We warm up the models using the synthetic demonstrations as detailed above and conduct the self-training strategies using both policies.
• Full Self-training: Finally, to observe the impact of the self-generated demonstrations, we conduct the annotation, the SFT (warm-up), and the Full Self-training phase completely on the self-generated data of the same expert model.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <p>Reasoning can be effectively grounded in a language-agnostic form, which LLMs can leverage to enhance multilingual task performance. SAGE facilitates this by guiding LLMs towards structured symbolic solutions, enabling them to produce robust and consistent outputs across languages. While SAGE yields strong results in GPT-4o, its benefits do not readily extend to smaller models. To address this, we adopt a self-training strategy that enables smaller models to acquire formal reasoning capabilities independently of explicit instruction, ultimately achieving greater consistency than GPT-4o (§ 4.1).</p>
      <p>Notably, self-training not only outperforms standalone SFT and reinforcement learning approaches, but also enables models to achieve stronger performance with substantially less training data (§ 4.2). Furthermore, we demonstrate the scalability of this method by successfully applying self-training to additional small-scale models (§ 4.3).</p>
      <p>In-context Learning. Table 2 presents the performance of SAGE applied to GPT-4o, showing clear improvements over previous prompting-based strategies such as Direct and CoT. The use of in-context instructions encourages the model to organise problem-solving in a structured manner, promoting step-wise reasoning and planning. This results in more consistent reasoning trajectories that are less influenced by language-specific patterns, thereby reducing performance disparities across languages.</p>
      <p>Multilingual Reasoning. Table 1 presents results for SAGE with GPT-4o on mGSM-Symbolic, with a particular focus on English and Italian. The performance remains consistent with that observed on mGSM, as indicated by the values in brackets. Notably, the Self-training strategy enhances the models’ abstraction capabilities, allowing them to perform well even in the more formal and structured setting of mGSM-Symbolic, where typical linguistic biases are reduced. In contrast, baseline methods yield substantially lower scores, underscoring the effectiveness of SAGE’s formalisation in supporting multilingual reasoning.</p>
      <p>The impact of Full Self-training. Current alignment strategies typically rely on demonstrations produced by expert models belonging to the same model family. Ranaldi and Freitas [6] demonstrate that in-family learning exerts a stronger influence on the performance of student models. In our work, we adopt the Full Self-training approach and show that self-generated demonstrations can lead to more robust outcomes than those derived from GPT-4o. As illustrated in Figure 2, models trained with their own annotations exhibit greater consistency.</p>
      <sec id="sec-3-1">
        <title>4.1. Language-Agnostic Reasoning</title>
        <p>SAGE positively influences the models’ performance in multilingual reasoning, yielding substantial benefits on the proposed tasks.</p>
        <sec id="sec-3-1-1">
          <title>Models</title>
          <p>[Table residue: per-model rows GPT-4o (+SAGE), Llama3-8B (+Self-training), DeepSeek-7B (+Self-training), Velvet-2B (+Self-training), EuroLLM-1.7B (+Self-training); numerical scores not recovered.]</p>
          <p>The role of RL. Table 2 reports the results obtained using GRPO. As shown in Table 3, GRPO consistently outperforms DPO, both when applied in isolation and when integrated with SFT within the full Self-training framework. As outlined in Section 2.1, GRPO does not rely on an annotated dataset for supervision. Instead, similar to prior work, a rule-based algorithm serves as a proxy reward model. Unlike DPO, which operates at the level of individual instances, GRPO is specifically designed to optimise groups of completions across languages, making it well-suited to the multilingual nature of the proposed task.</p>
          <p>4.3. Transferability in Smaller Models. To evaluate the transferability of Self-training and SAGE to smaller-scale models, we extend our experiments to include Llama-3-1B, EuroLLM-1.7B, and Velvet-2B.</p>
          <p>These models were selected based on three criteria: their inherent multilingual design, their promising performance in mathematical reasoning tasks, and their relatively low parameter count, which enabled efficient experimentation across training regimes.</p>
          <p>We adopt the experimental setup detailed in § 3.1, applying SFT, GRPO, and our full Self-training procedure. Table 3 reports the average results obtained on the mGSM-Symbolic benchmark. Across all models, Self-training with SAGE consistently outperforms both SFT and RL-based baselines.</p>
          <p>Supervised Fine-Tuning. Supervised Fine-Tuning (SFT) is a standard approach for adapting a model ℳ to reasoning tasks using a labelled dataset ℒ. Each instance in ℒ consists of a question q, a corresponding step-by-step explanation r, and a final answer a. The answer is derived from the explanation using regular expressions. A generated rationale r̂ is deemed valid if the extracted answer â matches the reference answer a. Formally, the labelled dataset with N instances is defined as:</p>
          <p>ℒ = {(q_i, r_i, a_i)}_{i=1..N}.   (2)</p>
          <p>SFT updates the parameters θ of model ℳ by minimising the negative log-likelihood of the target rationale:</p>
          <p>ℒ_SFT(θ) = −E_{(q,r)∼ℒ} [ Σ_{t=1..T} log π_θ(r_t | q, r_{1:t−1}) ],   (3)</p>
          <p>where T is the length of the rationale r, and r_t denotes its t-th token.</p>
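          <p>For concreteness, Eq. (3) is the usual token-level negative log-likelihood over the rationale; a minimal PyTorch sketch with assumed tensor shapes is:</p>
          <preformat>
import torch
import torch.nn.functional as F

# Token-level negative log-likelihood of the rationale (Eq. 3).
# logits: (batch, seq_len, vocab) produced given the question prefix;
# rationale_ids: (batch, seq_len) target tokens; padding is masked out.

def sft_loss(logits, rationale_ids, pad_token_id):
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, rationale_ids.unsqueeze(-1)).squeeze(-1)
    mask = (rationale_ids != pad_token_id).float()
    return -(token_log_probs * mask).sum() / mask.sum()   # mean NLL over rationale tokens
</preformat>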
          <p>Self-training. Self-training refers to a family of SFT-based methods that have recently gained renewed interest for their effectiveness in enhancing reasoning capabilities [16]. These methods typically follow a two-stage process. First, a base model ℳ is fine-tuned on a labelled subset ℒ to obtain a teacher model ℳ′. This teacher is then used to annotate an unlabelled dataset 𝒰, producing a pseudo-labelled dataset ℒ̂. In the second stage, a student model ℳ is trained on the combination of the original data ℒ and the pseudo-labelled data ℒ̂, with the aim of surpassing the performance of the teacher ℳ′.</p>
          <p>Empirical studies have shown that the quality of pseudo-labels plays a critical role in determining the effectiveness of self-training. To address this, Wang et al. [12] propose an iterative refinement procedure, wherein the model ℳ is progressively improved, ensuring increasingly accurate pseudo-labelled data across iterations.</p>
          <p>Direct Preference Optimisation. Reinforcement Learning with Human Feedback (RLHF), particularly through Proximal Policy Optimisation (PPO), has proven effective for aligning language models with human preferences, optimising the objective</p>
          <p>E_{(q,a)∼𝒟} [ r(q, a) − β log ( π_θ(a|q) / π_SFT(a|q) ) ],   (4)</p>
          <p>where π_SFT denotes the original model trained via SFT, and β serves as a regularization hyperparameter to constrain policy updates. However, RLHF typically requires multiple auxiliary components, including a reward model, making the training process computationally intensive and technically complex. To address this, Rafailov et al. [19] proposed Direct Preference Optimisation (DPO), which allows models to be aligned directly with human preferences without the need to train a separate reward model.</p>
          <p>DPO begins with a warm-up phase based on supervised fine-tuning. For a given input q, the reference policy π_ref generates two candidate completions:</p>
          <p>y_1, y_2 ∼ π_ref(·|q).   (5)</p>
          <p>These are then paired based on preference to form the DPO training set:</p>
          <p>𝒟_DPO = {(q, y_w, y_l)}_{i=1..N},   (6)</p>
          <p>where y_w is the preferred response and y_l is the less preferred one. The policy model ℳ is then optimised by minimising the following objective:</p>
          <p>E_{(q, y_w, y_l)∼𝒟_DPO} [ −log σ( Δ(y_w|q) − Δ(y_l|q) ) ],   (7)</p>
          <p>where the score function is defined as Δ(·|·) = β log ( π_θ(·|·) / π_ref(·|·) ), and the parameter β regulates how far the new policy π_θ may deviate from the reference policy. While DPO offers a more streamlined alternative to RLHF by avoiding explicit reward modelling, it is limited by its reliance on fixed pairwise preference comparisons. This can hinder its capacity to generalise across tasks that exhibit contextual or structural variation [20].</p>
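          <p>Under this definition, the objective of Eq. (7) reduces to a logistic loss on the difference of policy/reference log-ratios; a small PyTorch sketch (with β factored out of Δ, which is equivalent):</p>
          <preformat>
import torch.nn.functional as F

# DPO loss (Eq. 7): -log sigmoid(beta * (log-ratio of y_w - log-ratio of y_l)).
# Each argument is the summed sequence log-probability of the completion.

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    ratio_w = policy_logp_w - ref_logp_w   # log pi_theta(y_w|q) - log pi_ref(y_w|q)
    ratio_l = policy_logp_l - ref_logp_l   # log pi_theta(y_l|q) - log pi_ref(y_l|q)
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
</preformat>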
        </sec>
        <sec id="sec-3-1-2">
          <title>Group Relative Policy Optimisation</title>
          <p>To overcome these limitations, Shao et al. [21] introduced Group Relative Policy Optimisation (GRPO), a refinement of PPO that improves training stability by using group-based reward estimation. Instead of relying on pairwise comparisons, GRPO evaluates completions within groups and assigns rewards based on relative performance within those groups.</p>
          <p>Given a batch of responses from the policy model π_θ, GRPO estimates relative advantages across the group and applies the following optimisation objective:</p>
          <p>E_{(q,o)∼𝒟} [ A_rel(o|q) log π_θ(o|q) − β KL( π_θ ‖ π_ref ) ],   (8)</p>
          <p>where π_θ is the updated policy and π_ref is the original pre-trained policy. The KL divergence term prevents the updated policy from diverging excessively from its prior, with the coefficient β determining the strength of this regularisation.</p>
          <p>The relative advantage A_rel(o|q) is computed as:</p>
          <p>A_rel(o|q) = ( r(o|q) − μ ) / σ,   (9)</p>
          <p>where r(o|q) denotes the reward assigned to the response o, and μ and σ are the mean and standard deviation of the reward distribution within the group.</p>
          <p>GRPO has demonstrated particular efficacy in multi-task and multilingual reasoning contexts. By comparing responses within structurally related groups, it allows for more adaptive and robust policy updates, supporting better generalisation and stability across tasks. Empirical findings confirm that GRPO improves consistency, robustness, and data efficiency when compared to traditional PPO-based methods.</p>
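          <p>The group-relative advantage of Eq. (9) amounts to standardising each completion’s reward against its own group; as a sketch:</p>
          <preformat>
import torch

# Group-relative advantage (Eq. 9): standardise rewards within each group.
# group_rewards: (num_groups, group_size) rule-based rewards r(o|q).

def relative_advantages(group_rewards, eps=1e-6):
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)
</preformat>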
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Multilingual Reasoning</title>
        <p>Recent efforts to assess the capabilities of LLMs have focused on their performance in complex reasoning tasks, particularly in mathematical problem-solving. Benchmark datasets such as GSM8K and SVAMP have been widely adopted for this purpose. To extend such evaluation to multilingual contexts, Shi et al. [22] introduced mGSM, a multilingual variant of GSM8K, created by manually translating 250 test samples into various languages. Chen et al. [23] proposed mSVAMP, a multilingual extension of SVAMP following the same approach.</p>
        <p>Multiple strategies have been proposed to enhance multilingual reasoning in LLMs. These include translation-based approaches [24], SFT [25], and preference-based alignment methods [7], each of which demonstrates gains in multilingual performance. Nonetheless, these methods rely heavily on high-quality annotated data. SFT suffers from forgetting and poor generalisation, while preference-based alignment adds computational overhead through critic-based systems. Another line of research has explored the use of in-context prompting, whereby LLMs are instructed to reason step by step through carefully designed prompts. Although this strategy has proven useful in certain tasks [2], its reliance on English, combined with its inefficacy for smaller models [1], limits its applicability. Moreover, reasoning under this framework is typically induced by the prompt’s structure, making it difficult to generalise across languages or domains.</p>
        <p>While reasoning is inherently independent of language, the extent to which LLMs demonstrate consistent reasoning across linguistic boundaries remains limited. We aim to disentangle logical reasoning from linguistic surface forms by adopting a language-agnostic formalism. We propose converting problems expressed in any language into a shared formal representation that is abstract, manipulable, and semantically grounded. Reasoning operates over this intermediate form, with the final answer rendered in the target language. To support this, we instruct LLMs to abstract and solve problems via self-training, enabling scalable multilingual reasoning without the need for prompt engineering.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion &amp; Future</title>
    </sec>
    <sec id="sec-5">
      <title>Works</title>
      <p>Although reasoning is inherently language-agnostic, LLMs’ outputs often reflect biases towards dominant pre-training languages, particularly English. While models show strong multilingual capabilities, their step-wise reasoning remains inconsistent across languages. Focusing on English and Italian, we propose a modular approach that abstracts the problem into a language-agnostic formalism, followed by structured reasoning. Using self-training, we align reasoning performances, achieving gains in both accuracy and consistency.</p>
      <p>This work contributes to a series of studies aimed at expanding the proficiency of LLMs beyond English. In our research, we have explored interventions at every stage, from pre-training [26, 27] and post-training [4, 11] to inference methods [1, 2, 3], and recently on multimodal reasoning [28]. In parallel, the aim is to propose methodologies based on human-inspired principles [29, 30, 31, 32] that steer models away from heuristics that lead to verbatim-based [33] or symbolic-semantic memorisation [34]. Our overarching goal is to ensure that Italian is not left behind, applying state-of-the-art approaches to enhance generative capabilities, linguistic proficiency, and other emerging competencies of contemporary LLMs in Italian.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, A tree-of-thoughts to broaden multi-step reasoning across languages, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 1229–1241. URL: https://aclanthology.org/2024.findings-naacl.78. doi:10.18653/v1/2024.findings-naacl.78.</p>
      <p>[2] L. Ranaldi, G. Pucci, B. Haddow, A. Birch, Empowering multi-step reasoning across languages via program-aided language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 12171–12187. URL: https://aclanthology.org/2024.emnlp-main.678. doi:10.18653/v1/2024.emnlp-main.678.</p>
      <p>[3] L. Ranaldi, B. Haddow, A. Birch, When natural language is not enough: The limits of in-context learning demonstrations in multilingual reasoning, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 7369–7396. URL: https://aclanthology.org/2025.findings-naacl.412/. doi:10.18653/v1/2025.findings-naacl.412.</p>
      <p>[4] L. Ranaldi, G. Pucci, Does the English matter? Elicit cross-lingual abilities of large language models, in: D. Ataman (Ed.), Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics, Singapore, 2023, pp. 173–183. URL: https://aclanthology.org/2023.mrl-1.14. doi:10.18653/v1/2023.mrl-1.14.</p>
      <p>[5] L. Ranaldi, G. Pucci, A. Freitas, Does the order matter? Curriculum learning over languages, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 5212–5220. URL: https://aclanthology.org/2024.lrec-main.464/.</p>
      <p>[6] L. Ranaldi, A. Freitas, Aligning large and small language models via chain-of-thought reasoning, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 1812–1827. URL: https://aclanthology.org/2024.eacl-long.109/.</p>
      <p>[7] J. Dang, A. Ahmadian, K. Marchisio, J. Kreutzer, A. Üstün, S. Hooker, RLHF can speak many languages: Unlocking multilingual preference optimization for LLMs, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 13134–13156. URL: https://aclanthology.org/2024.emnlp-main.729/. doi:10.18653/v1/2024.emnlp-main.729.</p>
      <p>[8] L. Ranaldi, A. Freitas, Self-refine instruction-tuning for aligning reasoning in language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 2325–2347. URL: https://aclanthology.org/2024.emnlp-main.139/. doi:10.18653/v1/2024.emnlp-main.139.</p>
      <p>[9] V. Gaur, N. Saunshi, Reasoning in large language models through symbolic math word problems, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 5889–5903. URL: https://aclanthology.org/2023.findings-acl.364. doi:10.18653/v1/2023.findings-acl.364.</p>
      <p>[10] L. Pan, A. Albalak, X. Wang, W. Wang, Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 3806–3824. URL: https://aclanthology.org/2023.findings-emnlp.248/. doi:10.18653/v1/2023.findings-emnlp.248.</p>
      <p>[11] L. Ranaldi, G. Pucci, Multilingual reasoning via self-training, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 11566–11582. URL: https://aclanthology.org/2025.naacl-long.577/. doi:10.18653/v1/2025.naacl-long.577.</p>
      <p>[24] … Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7961–7973. URL: https://aclanthology.org/2024.findings-acl.473/. doi:10.18653/v1/2024.findings-acl.473.</p>
      <p>[25] A. Üstün, V. Aryabumi, Z. Yong, W.-Y. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H.-L. Ooi, A. Kayid, F. Vargus, P. Blunsom, S. Longpre, N. Muennighoff, M. Fadaee, J. Kreutzer, S. Hooker, Aya model: An instruction finetuned open-access multilingual language model, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15894–15939. URL: https://aclanthology.org/2024.acl-long.845/. doi:10.18653/v1/2024.acl-long.845.</p>
      <p>[26] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling easiness for training transformers with curriculum learning, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 937–948. URL: https://aclanthology.org/2023.ranlp-1.101/.</p>
      <p>[27] L. Ranaldi, G. Pucci, F. M. Zanzotto, How far does the sequence of compositions impact multilingual pre-training?, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 796–804. URL: https://aclanthology.org/2024.clicit-1.86/.</p>
      <p>[28] L. Ranaldi, F. Ranaldi, G. Pucci, R2-MultiOmnia: Leading multilingual multimodal reasoning via self-training, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 8220–8234. URL: https://aclanthology.org/2025.acl-long.402/. doi:10.18653/v1/2025.acl-long.402.</p>
      <p>[29] L. Ranaldi, G. Pucci, Knowing knowledge: Epistemological study of knowledge in transformers, Applied Sciences 13 (2023). URL: https://www.mdpi.com/2076-3417/13/2/677. doi:10.3390/app13020677.</p>
      <p>[30] G. Pucci, F. M. Zanzotto, L. Ranaldi, Animate, or inanimate, that is the question for large language models, Information 16 (2025). URL: https://www.mdpi.com/2078-2489/16/6/493. doi:10.3390/info16060493.</p>
      <p>[31] M. Mastromattei, L. Ranaldi, F. Fallucchi, F. M. Zanzotto, Syntax and prejudice: ethically-charged biases of a syntax-based hate speech recognizer unveiled, PeerJ Computer Science 8 (2022) e859. URL: http://dx.doi.org/10.7717/peerj-cs.859. doi:10.7717/peerj-cs.859.</p>
      <p>[32] L. Ranaldi, Survey on the role of mechanistic interpretability in generative AI, Big Data and Cognitive Computing 9 (2025). URL: https://www.mdpi.com/2504-2289/9/8/193. doi:10.3390/bdcc9080193.</p>
      <p>[33] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827/. doi:10.18653/v1/2024.findings-acl.827.</p>
      <p>[34] F. Ranaldi, A. Zugarini, L. Ranaldi, F. M. Zanzotto, Protoknowledge shapes behaviour of LLMs in downstream tasks: Memorization and generalization with knowledge graphs, 2025. URL: https://arxiv.org/abs/2505.15501. arXiv:2505.15501.</p>
    </sec>
    <sec id="sec-6">
      <title>A. SAGE Instruction Template</title>
      <p>#Role
You are an experienced expert skilled in multilingual mathematical reasoning problems.
#Task
You are presented with a mathematical reasoning problem in a given language. Follow the steps below
rigorously to formalise and solve it.
#Instructions
#Question
{question}
1) Formalisation (Language-Agnostic): Identify and define the key mathematical components of
the problem, such as variables, functions, operations, and constraints. Structure these components
in an abstract manner to ensure a clear and precise formulation. Label this step as
&lt;formalisation&gt;....&lt;/formalisation&gt;
2) Reasoning Execution: Solve the problem systematically by breaking it into logical steps. Clearly
justify each step using natural language explanations while maintaining logical rigor. Express the final
answer in the same language as the input query. Label this step as &lt;reasoning&gt;....&lt;/reasoning&gt;
Final Answer: Present the extracted answer in a concise format, marked as “The answer is: [num]” in
the same language as the query. Label this step as &lt;answer&gt;....&lt;/answer&gt;</p>
    </sec>
    <sec id="sec-6-b">
      <title>B. Synthetic Demonstrations</title>
      <p>We use SAGE to generate synthetic demonstrations for training smaller LLMs. We use GPT-4o as an annotator and use the annotations to warm up the models with the proposed methodologies. We then conduct a complete Self-training phase. Moreover, we conduct the Self-training by using self-generated data (generated by the trained models themselves); we define these configurations ‘Full’ Self-training. In both cases, the demonstrations are generated by prompting the models using the instructions detailed in Appendix A. However, while GPT-4o follows the instructions well (in fact, we did not find any significant issues), the other models generate outcomes that include errors. To handle this, we evaluated the quality of the generated demonstrations by filtering out inaccurate examples to get a gold instruction set. In particular, we removed all inaccurate answers (outputs that do not match the exact target string metric). Then, we check whether the demonstrations correctly follow the steps indicated in our prompt (see Table 4) using GPT-4o-mini and the prompt in Appendix ??.</p>
    </sec>
    <sec id="sec-7">
      <title>C. Evaluation Metrics</title>
      <p>We used a double check to assess the accuracy of the responses delivered in the different experiments. In the first step, we used an exact-match heuristic. However, since some experiments required a more accurate response check, we used GPT-4o-mini as a judge.</p>
    </sec>
    <sec id="sec-7-d">
      <title>D. Models and Hyperparameters</title>
      <p>As evaluation sets, we use the tasks introduced in § 3.3. These tasks are used to assess the performance of LLMs, but they do not have reserved sets for evaluation and training. Therefore, to produce a training set, we split mSVAMP into training and testing. Table 6 shows the instances of each dataset in training and testing. To ensure the languages are perfectly balanced, we translated 350 samples from English to Telugu (a language not present in mSVAMP). This subset was used for training purposes only.</p>
      <sec id="sec-7-1">
        <title>Task Total Test Train. Set # dim</title>
        <p>mGSM 0.5
mGSM-Symbolic 0.5
mSVAMP 2</p>
        <p>The data are perfectly balanced between the languages in the
proposed tasks. However, as described in Appendix B, the
qualities of the annotations are not perfect. Behind filtering
the annotations, we obtained a reduced dataset. To have fair,</p>
        <p>balanced subsets, we use 1k samples in total. We use 1k
samples when instructing the models for DPO and SFT. For
the Self-training, we used as the initial subset (§2.2) 60% of
the filtered samples balanced between all languages.</p>
        <p>Hyperparameters In §3.2, we described the standard
Self-training setting. However, we have proposed diferent G. Number of Iterations
experimental settings. In the Self-training experimental
setting, we conducted three iterations as proposed in [12, 14]. Following pilot experiments, we set the number of iterations
In the SFT-only and RL-only settings, we used warm-up and of self-tuning at three. Figure 7 shows the performance trend
four epochs and 8000 steps, respectively. We conducted this by increasing the number of iterations, epochs and steps after
study after the pilot experiments shown in the previous warm-up (wup).</p>
        <p>sections.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>E. Models Versions</title>
      <sec id="sec-8-1">
        <title>Model</title>
        <p>Llama3-8(-instruct)
Phi-3(-mini-instruct)
DeepSeekMath-7B
GPT-4o
GPT-4o-mini</p>
        <p>Declaration on Generative AI</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>