<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1/2024.acl-long.643</article-id>
      <title-group>
        <article-title>No longer left behind: Self-training Reasoning Models in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
        </contrib>
        <aff>University of Roma Tor Vergata</aff>
        <aff>University of Edinburgh</aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>11917</fpage>
      <lpage>11928</lpage>
      <abstract>
        <p>Although reasoning is, by nature, language-agnostic, the extent to which large language models (LLMs) can perform consistent multilingual reasoning remains limited. Their capacity to deliver step-wise explanations is largely constrained to the dominant languages present in their pre-training data, thereby limiting cross-lingual generalisation and hindering broader global applicability. While recent work has explored a range of strategies to extend reasoning capabilities beyond English, these efforts typically remain grounded in surface-level spoken language phenomena, which may not be optimal for abstract or formal reasoning tasks. In this study, we focus on Italian and English, two languages with markedly different syntactic and morphological properties, to assess whether advancements in multilingual reasoning remain consistent and transferable across typologically diverse settings. To this end, we introduce a modular framework that guides LLMs to abstract the reasoning process into a structured problem space before generating step-wise reasoning trajectories. The approach leverages self-training to enhance alignment and generalisation. Experimental results demonstrate stable and significant gains in multilingual reasoning across models and tasks, with improved consistency between English and Italian.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual Reasoning</kwd>
        <kwd>Self-training</kwd>
        <kwd>Large Reasoning Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the era of large language models (LLMs), approaches such as Chain-of-Thought (CoT) and related methods seek to emulate human reasoning through language generation, an ability that, in principle, ought not to be constrained by the particularities of any spoken language. Yet, a growing body of evidence indicates that the reasoning capabilities of LLMs vary significantly across languages, largely as a consequence of imbalances in pre-training data. LLMs perform better in dominant languages, notably English, while exhibiting reduced reasoning competence in less-represented languages.</p>
      <p>Research advances in multilingual reasoning are increasingly aimed at closing the performance differences among languages, enhancing the models’ capabilities through in-context learning interventions [1, 2, 3], SFT strategies that range from language-specific augmentation [4, 5] to task-oriented tuning [6], and preference optimisation [7, 8]. Although these approaches have enabled the development of effective methods for transferring and aligning multilingual reasoning capabilities, we argue that several critical challenges continue to hinder progress. First and foremost, the benefits of in-context interventions appear to be confined to large-scale LLMs, which are better equipped to interpret and follow instructions in a systematic way; however, they must also have robust multilingual proficiency. Therefore, many works rely on SFT techniques that maintain reduced costs when used with specialised, smaller-scale LLMs. Secondly, they require vast amounts of complex reasoning annotations and tremendous tuning efforts to obtain multilingual LLMs capable of delivering reasoning through SFT and preference optimisation techniques.</p>
      <p>To enhance multilingual reasoning in LLMs, we propose a modular approach that first instructs the model to abstractly formalise the problem and then generate structured, step-by-step reasoning trajectories that converge towards a consistent reasoning process across languages. Our approach decomposes problem solutions into a sequence of formal, language-agnostic sub-problems that are solved sequentially and can be more effectively utilised by models.</p>
      <p>The decomposition consists of two high-level modules: Formalisation and Reasoning Execution. As illustrated in Figure 1, we guide the models to: (i) identify the relevant information within the problem, formalising variables and predicates while delivering symbolic transformations; (ii) generate a reasoning execution trajectory in which the transformations are applied using symbolic representations that explicitly articulate the solution, ultimately yielding an answer in the same query language.</p>
      <p>Previous works proposed English-based strategies that operate via logical formalisms coupled with external symbolic solvers [9, 10]. Yet, fully symbolic approaches face a key bottleneck: they require a complete translation from natural to formal language, which can hinder both efficiency and flexibility, introducing additional layers of complexity.</p>
      <p>[Figure 1: Overview of the approach. Left: an mSVAMP-style problem posed in Italian (“Un gruppo di 200 studenti ha una varietà di hobby...”; in English: “A group of 200 students has various hobbies. 50 like to read, 29 like to play cricket, and the rest like to either dance or bake. How many like to dance if the number that like to bake is 2 less than twice the number that prefer playing cricket?”) is abstracted into a &lt;formalisation&gt; block (S=200, R=50, C=29, B=2C−2, R+C+D+B=S), solved in a &lt;reasoning&gt; block (B = 2(29) − 2 = 56; 50 + 29 + D + 56 = 200, so D = 65), and answered in the query language (&lt;answer&gt; “La risposta è 65.” / “The answer is 65.”). Right: the training pipeline, in which an LLM is warmed up via SFT on annotated and refined demonstrations and then self-improved via RL (GRPO) as the policy model.]</p>
      <p>To achieve a better trade-off, we treat formalisations in an eclectic manner and propose methods to disentangle content from logical reasoning without introducing rigorous formalisms.</p>
      <p>To this end, following Ranaldi and Pucci [11], we instruct larger LLMs to generate synthetic demonstrations through Structured Abstractive Generative Explanation (SAGE), which are then used to perform Self-training on smaller LLMs.</p>
      <p>As part of the warm-up phase, we experiment with multiple alignment strategies, ranging from supervised fine-tuning (Instruction-Tuning) to preference optimisation techniques (Reinforcement Learning). We conducted an extensive empirical evaluation to assess the impact of different tuning and alignment strategies.</p>
      <p>In multilingual reasoning tasks, our approach demonstrated significant improvements, resulting in an overall increase in exact matching on the proposed tasks, which led to the following results and conclusions:
• Structuring multilingual reasoning in LLMs as formal reasoning trajectories (SAGE), which leverages language-agnostic reasoning logic, improves accuracy and generates more verifiable outputs through a transparent and structured process.
• Leveraging self-training heuristics that combine both tuning and preference optimisation leads to more robust, generalisable, and language-aligned models. While tuning based on synthetic demonstrations proves effective, it alone fails to yield consistently strong performance across all languages. Conversely, relying solely on preference optimisation can provide performance gains, but at the cost of significant computational overhead.
• Our approach allows the disentanglement of content from logical reasoning, improving multilingual reasoning in LLMs, thus benefiting different language spaces.</p>
    </sec>
    <sec id="sec-method">
      <title>2. Method</title>
      <p>We propose a self-training framework that augments standard fine-tuning with a set of preference optimisation policies (§ 2.1) designed to improve self-refinement. The approach iteratively alternates between preference-based optimisation (via reinforcement learning) and supervised fine-tuning, directing the model to abstract the underlying problem and articulate a step-wise, formal solution (§ 2.2). The iterative process terminates once the model’s performance either converges or reaches a predefined maximum number of iterations.</p>
      <sec id="sec-1-1">
        <title>2.1. Preference Estimation</title>
        <p>RL strategies operate via preference estimation. This generally involves aligning the policy model with preferences using a reward model, which learns to predict preferences based on comparisons and guides the optimisation process. Although this approach is practical, it has problems with generalisation, scalability, robustness, and alignment. In GRPO, rule-based reward models are used and the rules are explicitly defined, whereas DPO is generally based on a series of naive string-matching functions with ground-truth values. Accordingly, we define the following preference policies:</p>
        <sec id="sec-1-1-1">
          <title>DPO Preference Estimation</title>
          <p>We adopt a string-matching function in line with existing approaches for English [8, 12]. We then refine this procedure by filtering out generations that do not adhere to the expected structural pattern and a well-formed format.</p>
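          <p>As a concrete illustration, a minimal sketch of this preference construction, under our assumptions (the tag layout follows the SAGE template in Appendix A; function names are illustrative, not the released implementation), is:</p>
          <preformat>
import re

# Illustrative sketch: build DPO preference pairs via string matching plus
# format filtering. Tag layout follows the SAGE template; names are assumptions.

PATTERN = re.compile(
    r"&lt;formalisation&gt;.*?&lt;/formalisation&gt;\s*"
    r"&lt;reasoning&gt;.*?&lt;/reasoning&gt;\s*"
    r"&lt;answer&gt;(.*?)&lt;/answer&gt;",
    re.DOTALL,
)

def extract_answer(completion):
    """Return the answer span if the completion is well-formed, else None."""
    match = PATTERN.search(completion)
    return match.group(1).strip() if match else None

def build_dpo_pairs(question, completions, gold):
    """Pair correct, well-formed completions with incorrect ones."""
    chosen, rejected = [], []
    for completion in completions:
        answer = extract_answer(completion)
        if answer is None:          # filter out ill-formed generations
            continue
        (chosen if gold in answer else rejected).append(completion)
    return [
        {"prompt": question, "chosen": c, "rejected": r}
        for c, r in zip(chosen, rejected)
    ]
</preformat>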
        </sec>
        <sec id="sec-1-1-2">
          <title>GRPO Preference Estimation</title>
          <p>Following Ranaldi and Pucci [11], we define rule-based metrics that control the accuracy, the structure, and the form of the generations.</p>
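          <p>As a sketch, a rule-based reward of this kind can be expressed as a sum of checks on structure, form, and accuracy; the weights and the specific checks below are illustrative assumptions rather than the exact rules used in our experiments:</p>
          <preformat>
import re

# Illustrative rule-based reward for GRPO: structure + form + accuracy.
# Weights and checks are assumptions for exposition only.

TAGS = ("formalisation", "reasoning", "answer")

def rule_based_reward(completion, gold_answer):
    reward = 0.0
    # Structure: every SAGE tag must open and close exactly once.
    if all(completion.count("&lt;" + t + "&gt;") == 1
           and completion.count("&lt;/" + t + "&gt;") == 1 for t in TAGS):
        reward += 0.25
    answer = re.search(r"&lt;answer&gt;(.*?)&lt;/answer&gt;", completion, re.DOTALL)
    # Form: the final answer block must contain a numeric value.
    if answer and re.search(r"\d", answer.group(1)):
        reward += 0.25
    # Accuracy: the stated answer must match the gold target.
    if answer and gold_answer in answer.group(1):
        reward += 0.5
    return reward
</preformat>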
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Self-training</title>
        <p>Conventional self-training begins by fine-tuning the base model ℳ on the supervised SFT dataset ℒ, yielding an updated model ℳ′. At this stage, we assume that ℳ′ has acquired the ability to address the target problem. Specifically, when presented with a question q, the model generates a formal reasoning sequence r̂ together with the corresponding answer â.</p>
        <p>Self-training. We begin by sampling multiple completions from ℳ′ in response to a set of questions q drawn from the unlabelled pool 𝒰. We then apply the preference estimation heuristics to construct preference-based samples according to the different optimisation strategies: pairwise comparisons for DPO and grouped completions for GRPO. These generations are compiled into a dataset 𝒟, which is subsequently used to further train the model using the corresponding objective functions (ℒ_DPO and ℒ_GRPO), resulting in an updated model ℳ″. Then we use ℳ″ to generate a new pseudo-labeled dataset for the next-round tuning:</p>
        <p>𝒟̂ = {(q, r̂, â) | q ∼ 𝒰, (r̂, â) ∼ ℳ″(·|q)}.   (1)</p>
        <p>After generation, the dataset 𝒟̂ is refined by removing incorrect answers and eliminating duplicates. Consequently, the resulting pseudo-labeled dataset, denoted as 𝒟̂′, is a subset of the original dataset, i.e., 𝒟̂′ ⊂ 𝒟̂. The final training dataset is constructed by combining the original labeled dataset ℒ with the newly generated pseudo-labeled dataset 𝒟̂′. During this process, each new dataset is used to train from the original base model ℳ, rather than continually fine-tuning ℳ′, to mitigate the risk of overfitting.</p>
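        <p>For concreteness, the pseudo-labelling and refinement step of Eq. (1) can be sketched as follows; the sampling and answer-extraction hooks are hypothetical placeholders, not library calls:</p>
        <preformat>
# Illustrative pseudo-labelling step (Eq. 1): sample rationales from the
# current model and keep only correct, non-duplicate generations.

def build_pseudo_labels(questions, generate, extract_answer):
    """questions: items with a question and its gold answer;
    generate(q): draws a rationale from M''(.|q);
    extract_answer(r): reads the value inside &lt;answer&gt;...&lt;/answer&gt;."""
    dataset, seen = [], set()
    for item in questions:
        rationale = generate(item["question"])
        predicted = extract_answer(rationale)
        # Refinement: discard incorrect answers and duplicates.
        if predicted != item["answer"] or rationale in seen:
            continue
        seen.add(rationale)
        dataset.append({"question": item["question"],
                        "rationale": rationale,
                        "answer": predicted})
    return dataset
</preformat>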
      </sec>
      <sec id="sec-1-3">
        <title>2.3. Single-training</title>
        <p>Algorithm 1: Self-training [11]</p>
        <preformat>
Input:  pre-trained language model ℳ
Input:  labeled dataset ℒ = {(q_i, r_i, a_i)}_{i=1..N}
Input:  unlabeled dataset 𝒰 = {(q_j, a_j)}_{j=1..M}
Input:  mode ∈ {DPO, GRPO}
Output: fine-tuned model ℳ′

# Warm-up stage
1: Fine-tune ℳ on ℒ to get ℳ′
2: repeat
3:    if mode = DPO then
4:        Generate DPO dataset 𝒟 = {(q_j, r_j⁺, r_j⁻)}_{j=1..M},
          where q_j ∼ 𝒰 and r_j⁺, r_j⁻ ∼ ℳ′(·|q_j)
5:        Tune ℳ′ with ℒ_DPO on 𝒟 to get ℳ″
      end if
      if mode = GRPO then
          Generate GRPO dataset 𝒟 = {(q_j, G_j)}_{j=1..M},
          where q_j ∼ 𝒰 and G_j = {o_1, ..., o_k} ∼ ℳ′(·|q_j)
          Compute relative preferences within each group G_j and
          assign pairwise relative scores to the outputs in G_j
          Tune ℳ′ with ℒ_GRPO on 𝒟 to get ℳ″
      end if
      # SFT step
6:    Build pseudo-labeled dataset 𝒟̂ = {(q_j, r̂_j, â_j)}_{j=1..M},
          where q_j ∼ 𝒰 and (r̂_j, â_j) ∼ ℳ″(·|q_j)
      Select 𝒟̂′ ⊂ 𝒟̂ where â_j = a_j
7:    Update ℒ ← 𝒟̂′ ∪ ℒ
      Train ℳ on ℒ to get a new ℳ′
8: until convergence or max iteration is reached
</preformat>
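        <p>The control flow of Algorithm 1 can be summarised by the sketch below; the injected callables stand in for the warm-up SFT, the DPO/GRPO trainers, and the pseudo-labelling step, and are placeholders rather than actual library functions:</p>
        <preformat>
# High-level sketch of Algorithm 1. `steps` is a dict of callables:
# "sft", "dpo", "grpo", and "pseudo_label"; all of them are hypothetical
# placeholders for the corresponding training stages.

def self_training(base_model, labeled, unlabeled, steps, mode="GRPO", max_iters=3):
    current = steps["sft"](base_model, labeled)            # warm-up stage
    for _ in range(max_iters):
        if mode == "DPO":
            current = steps["dpo"](current, unlabeled)     # pairwise preferences
        else:
            current = steps["grpo"](current, unlabeled)    # grouped completions
        # SFT step: pseudo-label the unlabelled questions, keep correct ones,
        # and retrain from the original base model on the enlarged set.
        labeled = labeled + steps["pseudo_label"](current, unlabeled)
        current = steps["sft"](base_model, labeled)
    return current
</preformat>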
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>As outlined in the introduction, our objective is to develop a method for enhancing the reasoning capabilities of LLMs beyond English, with a particular emphasis on Italian. Our experiments are conducted on multilingual reasoning tasks. We evaluate four models (§ 3.1), trained according to the procedure detailed in § 3.2, on two mathematical reasoning benchmarks (§ 3.3), using the experimental configurations described in § 3.4.</p>
      <sec id="sec-2-1">
        <title>3.1. Models</title>
        <p>To conduct our study on different models and have a term of comparison, we use Llama3-8B [13] and DeepSeekMath-7B-Instruct [14] (DeepSeek-7B). Furthermore, to show the scalability and effectiveness of our approach on further models, we introduce additional smaller-scale models: EuroLLM-1.7B and Velvet-2B. For comparative purposes, we conduct individual training operating only with SFT, DPO, and GRPO.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Training Methods</title>
        <p>As introduced in § 2, we use iterative steps of SFT and RL. We follow standard practice and perform a warm-up phase based on an SFT step using the synthetic demonstrations discussed in § 3.3.2. Then, we conduct the self-training by progressively applying the SFT and RL optimisation algorithms. Following pilot studies (discussed later), we set the total number of iterations to three (excluding warm-up), and we use the same number in the settings where we use only one of SFT and RL.</p>
        <p>Supervised Fine-tuning. Regarding the SFT phase, we employed 8-bit quantization and LoRA. We tune the model for one epoch (warm-up) and for one epoch in each iteration, using the learning rates according to the specific model configuration, as detailed in Appendix D.</p>
        <p>Preference Optimisation (RL). We employ the HuggingFace trainers to ensure reproducibility. For DPO, we set the learning rate to 1e-6 and β to 0.1. The optimisation process is set at a maximum of 2000 steps, saving the checkpoint corresponding to the lowest validation loss. For GRPO, we set the learning rate to 5e-6; the optimisation process is likewise set at a maximum of 2000 steps, saving the checkpoint corresponding to the lowest validation loss. Details in Appendix D.</p>
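        <p>A configuration sketch consistent with the values above, assuming the HuggingFace TRL and PEFT libraries (only the learning rates, β for DPO, and the step budget come from the text; the remaining arguments, including the LoRA rank, are illustrative defaults):</p>
        <preformat>
# Sketch of the tuning configuration described above (assumed TRL/PEFT stack).

from peft import LoraConfig
from transformers import BitsAndBytesConfig
from trl import DPOConfig, GRPOConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)      # 8-bit quantization (SFT)
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")  # assumed rank

dpo_config = DPOConfig(
    output_dir="dpo-checkpoints",
    learning_rate=1e-6,
    beta=0.1,
    max_steps=2000,      # keep the checkpoint with the lowest validation loss
)

grpo_config = GRPOConfig(
    output_dir="grpo-checkpoints",
    learning_rate=5e-6,
    max_steps=2000,
)
</preformat>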
      </sec>
      <sec id="sec-2-3-data">
        <title>3.3. Data</title>
        <sec id="sec-2-3-data-1">
          <title>3.3.1. Evaluation Set</title>
          <p>To study the reasoning performances of the trained models, we operate via mGSM, mSVAMP, and we introduce mGSM-Symbolic, focusing on English and Italian.</p>
          <p>Mathematical Reasoning Tasks. We use the multilingual extensions of GSM8K and SVAMP: respectively, Multilingual Grade School Math (mGSM) and Multilingual Simple Variations on Arithmetic Math word Problems (mSVAMP). In the original cases, the authors proposed a benchmark of English mathematical problems with the following structure: a word problem in natural language and a target answer in numbers. For both versions, a subset of instances from the official list of examples was translated into 11 different languages, maintaining the structure of the input and output.</p>
          <p>mGSM-Symbolic. Mirzadeh et al. [15] improved GSM8K (the ancestor of mGSM) by proposing GSM-Symbolic, which introduces symbolic patterns in GSM8K that complexify the task and challenge the LLMs’ capabilities. We propose mGSM-Symbolic, the multilingual extension of GSM-Symbolic. In particular, we conduct an automatic translation phase, reviewed by qualified annotators, into 10 different languages. The dataset is available on GitHub and HuggingFace.</p>
        </sec>
        <sec id="sec-2-3-data-2">
          <title>3.3.2. Training Set</title>
          <p>Instead of using natural language rationales, we employ synthetic demonstrations to train models to solve tasks following the two phases in Figure 1. Specifically, we instruct a robust model capable of addressing multilingual mathematical tasks by formalising problems and solving them in a language-agnostic manner. We employ GPT-4o as annotator, instructing it with the prompt detailed in Appendix A (we define this procedure as Self-training). Different works train an expert version of the same model that is going to be refined for generating synthetic demonstrations, which are subsequently used for self-training (we define this procedure as Full Self-training).</p>
          <p>Multilingual Demonstrations. We annotate a subset of the mSVAMP dataset containing 250 samples for all languages to have in-domain demonstrations. After the annotation process, we check the quality of the demonstrations using rule-based heuristics and GPT-4o-mini as an additional evaluator (details in Appendix C).</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>3.4. Experimental Setup</title>
        <p>In-context Learning. We evaluate the baseline models (without tuning) using a 6-shot strategy defined as Direct and CoT. Moreover, we instruct the models to solve the problem following SAGE.</p>
        <p>Training. We assess the impact of the Self-training approaches (§ 3) by conducting different tuning configurations:
• SFT, RL: We tune the models using the synthetic demonstrations, as detailed in Appendix B.
• Self-training: We warm up the models using the synthetic demonstrations as detailed above and conduct the self-training strategies using both policies.
• Full Self-training: Finally, to observe the impact of the self-generated demonstrations, we conduct the annotation, the SFT (warm-up), and the Full Self-training phase completely on the self-generated data of the same expert model.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Results</title>
      <p>Reasoning can be effectively grounded in a language-agnostic form, which LLMs can leverage to enhance multilingual task performance. SAGE facilitates this by guiding LLMs towards structured symbolic solutions, enabling them to produce robust and consistent outputs across languages. While SAGE yields strong results in GPT-4o, its benefits do not readily extend to smaller models. To address this, we adopt a self-training strategy that enables smaller models to acquire formal reasoning capabilities independently of explicit instruction, ultimately achieving greater consistency than GPT-4o (§ 4.1).</p>
      <p>Notably, self-training not only outperforms standalone SFT and reinforcement learning approaches, but also enables models to achieve stronger performance with substantially less training data (§ 4.2). Furthermore, we demonstrate the scalability of this method by successfully applying self-training to additional small-scale models (§ 4.3).</p>
      <p>In-context Learning. Table 2 presents the performance of SAGE applied to GPT-4o, showing clear improvements over previous prompting-based strategies such as Direct and CoT. The use of in-context instructions encourages the model to organise problem-solving in a structured manner, promoting step-wise reasoning and planning. This results in more consistent reasoning trajectories that are less influenced by language-specific patterns, thereby reducing performance disparities across languages.</p>
      <p>Multilingual Reasoning. Table 1 presents results for SAGE with GPT-4o on mGSM-Symbolic, with a particular focus on English and Italian. The performance remains consistent with that observed on mGSM, as indicated by the values in brackets. Notably, the Self-training strategy enhances the models’ abstraction capabilities, allowing them to perform well even in the more formal and structured setting of mGSM-Symbolic, where typical linguistic biases are reduced. In contrast, baseline methods yield substantially lower scores, underscoring the effectiveness of SAGE’s formalisation in supporting multilingual reasoning.</p>
      <p>The impact of Full Self-training. Current alignment strategies typically rely on demonstrations produced by expert models belonging to the same model family. Ranaldi and Freitas [6] demonstrate that in-family learning exerts a stronger influence on the performance of student models. In our work, we adopt the Full Self-training approach and show that self-generated demonstrations can lead to more robust outcomes than those derived from GPT-4o. As illustrated in Figure 2, models trained with their own annotations exhibit greater consistency.</p>
      <sec id="sec-3-1">
        <title>4.1. Language-Agnostic Reasoning</title>
        <p>SAGE positively influences the models’ performance in multilingual reasoning, yielding substantial benefits on the proposed tasks.</p>
        <sec id="sec-3-1-1">
          <title>Models</title>
          <p>[Table residue: per-model rows GPT-4o (+SAGE), Llama3-8B (+Self-training), DeepSeek-7B (+Self-training), Velvet-2B (+Self-training), EuroLLM-1.7B (+Self-training); numerical scores not recovered.]</p>
          <p>The role of RL. Table 2 reports the results obtained using GRPO. As shown in Table 3, GRPO consistently outperforms DPO, both when applied in isolation and when integrated with SFT within the full Self-training framework. As outlined in Section 2.1, GRPO does not rely on an annotated dataset for supervision. Instead, similar to prior work, a rule-based algorithm serves as a proxy reward model. Unlike DPO, which operates at the level of individual instances, GRPO is specifically designed to optimise groups of completions across languages, making it well-suited to the multilingual nature of the proposed task.</p>
          <p>4.3. Transferability in Smaller Models. To evaluate the transferability of Self-training and SAGE to smaller-scale models, we extend our experiments to include Llama-3-1B, EuroLLM-1.7B, and Velvet-2B.</p>
          <p>These models were selected based on three criteria: their inherent multilingual design, their promising performance in mathematical reasoning tasks, and their relatively low parameter count, which enabled efficient experimentation across training regimes.</p>
          <p>We adopt the experimental setup detailed in § 3.1, applying SFT, GRPO, and our full Self-training procedure. Table 3 reports the average results obtained on the mGSM-Symbolic benchmark. Across all models, Self-training with SAGE consistently outperforms both SFT and RL-based baselines.</p>
          <p>Supervised Fine-Tuning. Supervised Fine-Tuning (SFT) is a standard approach for adapting a model ℳ to reasoning tasks using a labelled dataset ℒ. Each instance in ℒ consists of a question q, a corresponding step-by-step explanation r, and a final answer a. The answer is derived from the explanation using regular expressions. A generated rationale r̂ is deemed valid if the extracted answer â matches the reference answer a. Formally, the labelled dataset with N instances is defined as:</p>
          <p>ℒ = {(q_i, r_i, a_i)}_{i=1..N}.   (2)</p>
          <p>SFT updates the parameters θ of model ℳ by minimising the negative log-likelihood of the target rationale:</p>
          <p>ℒ_SFT(θ) = −E_{(q,r)∼ℒ} [ Σ_{t=1..T} log π_θ(r_t | q, r_{1:t−1}) ],   (3)</p>
          <p>where T is the length of the rationale r, and r_t denotes its t-th token.</p>
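          <p>For concreteness, Eq. (3) is the usual token-level negative log-likelihood over the rationale; a minimal PyTorch sketch with assumed tensor shapes is:</p>
          <preformat>
import torch
import torch.nn.functional as F

# Token-level negative log-likelihood of the rationale (Eq. 3).
# logits: (batch, seq_len, vocab) produced given the question prefix;
# rationale_ids: (batch, seq_len) target tokens; padding is masked out.

def sft_loss(logits, rationale_ids, pad_token_id):
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, rationale_ids.unsqueeze(-1)).squeeze(-1)
    mask = (rationale_ids != pad_token_id).float()
    return -(token_log_probs * mask).sum() / mask.sum()   # mean NLL over rationale tokens
</preformat>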
          <p>Self-training. Self-training refers to a family of SFT-based methods that have recently gained renewed interest for their effectiveness in enhancing reasoning capabilities [16]. These methods typically follow a two-stage process. First, a base model ℳ is fine-tuned on a labelled subset ℒ to obtain a teacher model ℳ′. This teacher is then used to annotate an unlabelled dataset 𝒰, producing a pseudo-labelled dataset ℒ̂. In the second stage, a student model ℳ is trained on the combination of the original data ℒ and the pseudo-labelled data ℒ̂, with the aim of surpassing the performance of the teacher ℳ′.</p>
          <p>Empirical studies have shown that the quality of pseudo-labels plays a critical role in determining the effectiveness of self-training. To address this, Wang et al. [12] propose an iterative refinement procedure, wherein the model ℳ is progressively improved, ensuring increasingly accurate pseudo-labelled data across iterations.</p>
          <p>Direct Preference Optimisation. Reinforcement Learning with Human Feedback (RLHF), particularly through Proximal Policy Optimisation (PPO), has proven effective for aligning language models with human preferences, optimising the objective</p>
          <p>E_{(q,a)∼𝒟} [ r(q, a) − β log ( π_θ(a|q) / π_SFT(a|q) ) ],   (4)</p>
          <p>where π_SFT denotes the original model trained via SFT, and β serves as a regularization hyperparameter to constrain policy updates. However, RLHF typically requires multiple auxiliary components, including a reward model, making the training process computationally intensive and technically complex. To address this, Rafailov et al. [19] proposed Direct Preference Optimisation (DPO), which allows models to be aligned directly with human preferences without the need to train a separate reward model.</p>
          <p>DPO begins with a warm-up phase based on supervised fine-tuning. For a given input q, the reference policy π_ref generates two candidate completions:</p>
          <p>y_1, y_2 ∼ π_ref(·|q).   (5)</p>
          <p>These are then paired based on preference to form the DPO training set:</p>
          <p>𝒟_DPO = {(q, y_w, y_l)}_{i=1..N},   (6)</p>
          <p>where y_w is the preferred response and y_l is the less preferred one. The policy model ℳ is then optimised by minimising the following objective:</p>
          <p>E_{(q, y_w, y_l)∼𝒟_DPO} [ −log σ( Δ(y_w|q) − Δ(y_l|q) ) ],   (7)</p>
          <p>where the score function is defined as Δ(·|·) = β log ( π_θ(·|·) / π_ref(·|·) ), and the parameter β regulates how far the new policy π_θ may deviate from the reference policy. While DPO offers a more streamlined alternative to RLHF by avoiding explicit reward modelling, it is limited by its reliance on fixed pairwise preference comparisons. This can hinder its capacity to generalise across tasks that exhibit contextual or structural variation [20].</p>
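          <p>Under this definition, the objective of Eq. (7) reduces to a logistic loss on the difference of policy/reference log-ratios; a small PyTorch sketch (with β factored out of Δ, which is equivalent):</p>
          <preformat>
import torch.nn.functional as F

# DPO loss (Eq. 7): -log sigmoid(beta * (log-ratio of y_w - log-ratio of y_l)).
# Each argument is the summed sequence log-probability of the completion.

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    ratio_w = policy_logp_w - ref_logp_w   # log pi_theta(y_w|q) - log pi_ref(y_w|q)
    ratio_l = policy_logp_l - ref_logp_l   # log pi_theta(y_l|q) - log pi_ref(y_l|q)
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
</preformat>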
        </sec>
        <sec id="sec-3-1-2">
          <title>Group Relative Policy Optimisation</title>
          <p>To overcome these limitations, Shao et al. [21] introduced Group Relative Policy Optimisation (GRPO), a refinement of PPO that improves training stability by using group-based reward estimation. Instead of relying on pairwise comparisons, GRPO evaluates completions within groups and assigns rewards based on relative performance within those groups.</p>
          <p>Given a batch of responses from the policy model π_θ, GRPO estimates relative advantages across the group and applies the following optimisation objective:</p>
          <p>E_{(q,o)∼𝒟} [ A_rel(o|q) log π_θ(o|q) − β KL( π_θ ‖ π_ref ) ],   (8)</p>
          <p>where π_θ is the updated policy and π_ref is the original pre-trained policy. The KL divergence term prevents the updated policy from diverging excessively from its prior, with the coefficient β determining the strength of this regularisation.</p>
          <p>The relative advantage A_rel(o|q) is computed as:</p>
          <p>A_rel(o|q) = ( r(o|q) − μ ) / σ,   (9)</p>
          <p>where r(o|q) denotes the reward assigned to the response o, and μ and σ are the mean and standard deviation of the reward distribution within the group.</p>
          <p>GRPO has demonstrated particular efficacy in multi-task and multilingual reasoning contexts. By comparing responses within structurally related groups, it allows for more adaptive and robust policy updates, supporting better generalisation and stability across tasks. Empirical findings confirm that GRPO improves consistency, robustness, and data efficiency when compared to traditional PPO-based methods.</p>
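          <p>The group-relative advantage of Eq. (9) amounts to standardising each completion’s reward against its own group; as a sketch:</p>
          <preformat>
import torch

# Group-relative advantage (Eq. 9): standardise rewards within each group.
# group_rewards: (num_groups, group_size) rule-based rewards r(o|q).

def relative_advantages(group_rewards, eps=1e-6):
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)
</preformat>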
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>5.2. Multilingual Reasoning</title>
        <p>Recent efforts to assess the capabilities of LLMs have focused on their performance in complex reasoning tasks, particularly in mathematical problem-solving. Benchmark datasets such as GSM8K and SVAMP have been widely adopted for this purpose. To extend such evaluation to multilingual contexts, Shi et al. [22] introduced mGSM, a multilingual variant of GSM8K, created by manually translating 250 test samples into various languages. Chen et al. [23] proposed mSVAMP, a multilingual extension of SVAMP following the same approach.</p>
        <p>Multiple strategies have been proposed to enhance multilingual reasoning in LLMs. These include translation-based approaches [24], SFT [25], and preference-based alignment methods [7], each of which demonstrates gains in multilingual performance. Nonetheless, these methods rely heavily on high-quality annotated data. SFT suffers from forgetting and poor generalisation, while preference-based alignment adds computational overhead through critic-based systems. Another line of research has explored the use of in-context prompting, whereby LLMs are instructed to reason step by step through carefully designed prompts. Although this strategy has proven useful in certain tasks [2], its reliance on English, combined with its inefficacy for smaller models [1], limits its applicability. Moreover, reasoning under this framework is typically induced by the prompt’s structure, making it difficult to generalise across languages or domains.</p>
        <p>While reasoning is inherently independent of language, the extent to which LLMs demonstrate consistent reasoning across linguistic boundaries remains limited. We aim to disentangle logical reasoning from linguistic surface forms by adopting a language-agnostic formalism. We propose converting problems expressed in any language into a shared formal representation that is abstract, manipulable, and semantically grounded. Reasoning operates over this intermediate form, with the final answer rendered in the target language. To support this, we instruct LLMs to abstract and solve problems via self-training, enabling scalable multilingual reasoning without the need for prompt engineering.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion &amp; Future</title>
    </sec>
    <sec id="sec-5">
      <title>Works</title>
      <p>Although reasoning is inherently language-agnostic, LLMs’ outputs often reflect biases towards dominant pre-training languages, particularly English. While models show strong multilingual capabilities, their step-wise reasoning remains inconsistent across languages. Focusing on English and Italian, we propose a modular approach that abstracts the problem into a language-agnostic formalism, followed by structured reasoning. Using self-training, we align reasoning performances, achieving gains in both accuracy and consistency.</p>
      <p>This work contributes to a series of studies aimed at expanding the proficiency of LLMs beyond English. In our research, we have explored interventions at every stage, from pre-training [26, 27] and post-training [4, 11] to inference methods [1, 2, 3], and recently on multimodal reasoning [28]. In parallel, the aim is to propose methodologies based on human-inspired principles [29, 30, 31, 32] that steer models away from heuristics that lead to verbatim-based [33] or symbolic-semantic memorisation [34]. Our overarching goal is to ensure that Italian is not left behind, applying state-of-the-art approaches to enhance generative capabilities, linguistic proficiency, and other emerging competencies of contemporary LLMs in Italian.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti, F. M. Zanzotto, A tree-of-thoughts to broaden multi-step reasoning across languages, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 1229–1241. URL: https://aclanthology.org/2024.findings-naacl.78. doi:10.18653/v1/2024.findings-naacl.78.</p>
      <p>[2] L. Ranaldi, G. Pucci, B. Haddow, A. Birch, Empowering multi-step reasoning across languages via program-aided language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 12171–12187. URL: https://aclanthology.org/2024.emnlp-main.678. doi:10.18653/v1/2024.emnlp-main.678.</p>
      <p>[3] L. Ranaldi, B. Haddow, A. Birch, When natural language is not enough: The limits of in-context learning demonstrations in multilingual reasoning, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 7369–7396. URL: https://aclanthology.org/2025.findings-naacl.412/. doi:10.18653/v1/2025.findings-naacl.412.</p>
      <p>[4] L. Ranaldi, G. Pucci, Does the English matter? Elicit cross-lingual abilities of large language models, in: D. Ataman (Ed.), Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL), Association for Computational Linguistics, Singapore, 2023, pp. 173–183. URL: https://aclanthology.org/2023.mrl-1.14. doi:10.18653/v1/2023.mrl-1.14.</p>
      <p>[5] L. Ranaldi, G. Pucci, A. Freitas, Does the order matter? Curriculum learning over languages, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 5212–5220. URL: https://aclanthology.org/2024.lrec-main.464/.</p>
      <p>[6] L. Ranaldi, A. Freitas, Aligning large and small language models via chain-of-thought reasoning, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 1812–1827. URL: https://aclanthology.org/2024.eacl-long.109/.</p>
      <p>[7] J. Dang, A. Ahmadian, K. Marchisio, J. Kreutzer, A. Üstün, S. Hooker, RLHF can speak many languages: Unlocking multilingual preference optimization for LLMs, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 13134–13156. URL: https://aclanthology.org/2024.emnlp-main.729/. doi:10.18653/v1/2024.emnlp-main.729.</p>
      <p>[8] L. Ranaldi, A. Freitas, Self-refine instruction-tuning for aligning reasoning in language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 2325–2347. URL: https://aclanthology.org/2024.emnlp-main.139/. doi:10.18653/v1/2024.emnlp-main.139.</p>
      <p>[9] V. Gaur, N. Saunshi, Reasoning in large language models through symbolic math word problems, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 5889–5903. URL: https://aclanthology.org/2023.findings-acl.364. doi:10.18653/v1/2023.findings-acl.364.</p>
      <p>[10] L. Pan, A. Albalak, X. Wang, W. Wang, Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 3806–3824. URL: https://aclanthology.org/2023.findings-emnlp.248/. doi:10.18653/v1/2023.findings-emnlp.248.</p>
      <p>[11] L. Ranaldi, G. Pucci, Multilingual reasoning via self-training, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 11566–11582. URL: https://aclanthology.org/2025.naacl-long.577/. doi:10.18653/v1/2025.naacl-long.577.</p>
      <p>[24] … Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7961–7973. URL: https://aclanthology.org/2024.findings-acl.473/. doi:10.18653/v1/2024.findings-acl.473.</p>
      <p>[25] A. Üstün, V. Aryabumi, Z. Yong, W.-Y. Ko, D. D’souza, G. Onilude, N. Bhandari, S. Singh, H.-L. Ooi, A. Kayid, F. Vargus, P. Blunsom, S. Longpre, N. Muennighoff, M. Fadaee, J. Kreutzer, S. Hooker, Aya model: An instruction finetuned open-access multilingual language model, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15894–15939. URL: https://aclanthology.org/2024.acl-long.845/. doi:10.18653/v1/2024.acl-long.845.</p>
      <p>[26] L. Ranaldi, G. Pucci, F. M. Zanzotto, Modeling easiness for training transformers with curriculum learning, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, INCOMA Ltd., Shoumen, Bulgaria, Varna, Bulgaria, 2023, pp. 937–948. URL: https://aclanthology.org/2023.ranlp-1.101/.</p>
      <p>[27] L. Ranaldi, G. Pucci, F. M. Zanzotto, How far does the sequence of compositions impact multilingual pre-training?, in: F. Dell’Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 796–804. URL: https://aclanthology.org/2024.clicit-1.86/.</p>
      <p>[28] L. Ranaldi, F. Ranaldi, G. Pucci, R2-MultiOmnia: Leading multilingual multimodal reasoning via self-training, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 8220–8234. URL: https://aclanthology.org/2025.acl-long.402/. doi:10.18653/v1/2025.acl-long.402.</p>
      <p>[29] L. Ranaldi, G. Pucci, Knowing knowledge: Epistemological study of knowledge in transformers, Applied Sciences 13 (2023). URL: https://www.mdpi.com/2076-3417/13/2/677. doi:10.3390/app13020677.</p>
      <p>[30] G. Pucci, F. M. Zanzotto, L. Ranaldi, Animate, or inanimate, that is the question for large language models, Information 16 (2025). URL: https://www.mdpi.com/2078-2489/16/6/493. doi:10.3390/info16060493.</p>
      <p>[31] M. Mastromattei, L. Ranaldi, F. Fallucchi, F. M. Zanzotto, Syntax and prejudice: ethically-charged biases of a syntax-based hate speech recognizer unveiled, PeerJ Computer Science 8 (2022) e859. URL: http://dx.doi.org/10.7717/peerj-cs.859. doi:10.7717/peerj-cs.859.</p>
      <p>[32] L. Ranaldi, Survey on the role of mechanistic interpretability in generative AI, Big Data and Cognitive Computing 9 (2025). URL: https://www.mdpi.com/2504-2289/9/8/193. doi:10.3390/bdcc9080193.</p>
      <p>[33] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi, C. Giannone, A. Favalli, R. Romagnoli, F. M. Zanzotto, Investigating the impact of data contamination of large language models in text-to-SQL translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 13909–13920. URL: https://aclanthology.org/2024.findings-acl.827/. doi:10.18653/v1/2024.findings-acl.827.</p>
      <p>[34] F. Ranaldi, A. Zugarini, L. Ranaldi, F. M. Zanzotto, Protoknowledge shapes behaviour of LLMs in downstream tasks: Memorization and generalization with knowledge graphs, 2025. URL: https://arxiv.org/abs/2505.15501. arXiv:2505.15501.</p>
    </sec>
    <sec id="sec-6">
      <title>A. SAGE Instruction Template</title>
      <p>#Role
You are an experienced expert skilled in multilingual mathematical reasoning problems.
#Task
You are presented with a mathematical reasoning problem in a given language. Follow the steps below
rigorously to formalise and solve it.
#Instructions
#Question
{question}
1) Formalisation (Language-Agnostic): Identify and define the key mathematical components of
the problem, such as variables, functions, operations, and constraints. Structure these components
in an abstract manner to ensure a clear and precise formulation. Label this step as
&lt;formalisation&gt;....&lt;/formalisation&gt;
2) Reasoning Execution: Solve the problem systematically by breaking it into logical steps. Clearly
justify each step using natural language explanations while maintaining logical rigor. Express the final
answer in the same language as the input query. Label this step as &lt;reasoning&gt;....&lt;/reasoning&gt;
Final Answer: Present the extracted answer in a concise format, marked as “The answer is: [num]” in
the same language as the query. Label this step as &lt;answer&gt;....&lt;/answer&gt;</p>
    </sec>
    <sec id="sec-6-b">
      <title>B. Synthetic Demonstrations</title>
      <p>We use SAGE to generate synthetic demonstrations for training smaller LLMs. We use GPT-4o as an annotator and use the annotations to warm up the models with the proposed methodologies. We then conduct a complete Self-training phase. Moreover, we conduct the Self-training by using self-generated data (generated by the trained models themselves); we define these configurations ‘Full’ Self-training. In both cases, the demonstrations are generated by prompting the models using the instructions detailed in Appendix A. However, while GPT-4o follows the instructions well (in fact, we did not find any significant issues), the other models generate outcomes that include errors. To handle this, we evaluated the quality of the generated demonstrations by filtering out inaccurate examples to get a gold instruction set. In particular, we removed all inaccurate answers (outputs that do not match the exact target string metric). Then, we check whether the demonstrations correctly follow the steps indicated in our prompt (see Table 4) using GPT-4o-mini and the prompt in Appendix ??.</p>
    </sec>
    <sec id="sec-7">
      <title>C. Evaluation Metrics</title>
      <p>We used a double check to assess the accuracy of the responses delivered in the different experiments. In the first step, we used an exact-match heuristic. However, since some experiments required a more accurate response check, we used GPT-4o-mini as a judge.</p>
    </sec>
    <sec id="sec-7-d">
      <title>D. Models and Hyperparameters</title>
      <p>As evaluation sets, we use the tasks introduced in § 3.3. These tasks are used to assess the performance of LLMs, but they do not have reserved sets for evaluation and training. Therefore, to produce a training set, we split mSVAMP into training and testing. Table 6 shows the instances of each dataset in training and testing. To ensure the languages are perfectly balanced, we translated 350 samples from English to Telugu (a language not present in mSVAMP). This subset was used for training purposes only.</p>
      <sec id="sec-7-1">
        <title>Task Total Test Train. Set # dim</title>
        <p>mGSM 0.5
mGSM-Symbolic 0.5
mSVAMP 2</p>
        <p>The data are perfectly balanced between the languages in the
proposed tasks. However, as described in Appendix B, the
qualities of the annotations are not perfect. Behind filtering
the annotations, we obtained a reduced dataset. To have fair,</p>
        <p>balanced subsets, we use 1k samples in total. We use 1k
samples when instructing the models for DPO and SFT. For
the Self-training, we used as the initial subset (§2.2) 60% of
the filtered samples balanced between all languages.</p>
        <p>Hyperparameters In §3.2, we described the standard
Self-training setting. However, we have proposed diferent G. Number of Iterations
experimental settings. In the Self-training experimental
setting, we conducted three iterations as proposed in [12, 14]. Following pilot experiments, we set the number of iterations
In the SFT-only and RL-only settings, we used warm-up and of self-tuning at three. Figure 7 shows the performance trend
four epochs and 8000 steps, respectively. We conducted this by increasing the number of iterations, epochs and steps after
study after the pilot experiments shown in the previous warm-up (wup).</p>
        <p>sections.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>E. Models Versions</title>
      <sec id="sec-8-1">
        <title>Model</title>
        <p>Llama3-8(-instruct)
Phi-3(-mini-instruct)
DeepSeekMath-7B
GPT-4o
GPT-4o-mini</p>
        <p>Declaration on Generative AI</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>