<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The limits of Italian in Reasoning Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science, University of Aberdeen</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi Roma "Tor Vergata"</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Earlier works have shown the efficacy of reasoning methods in eliciting step-wise reasoning from large language models (LLMs) by operating via in-context demonstrations. These strategies, exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), have been shown to reason well in monolingual contexts, primarily in English. However, there has been limited investigation into their capabilities in other languages, especially Italian. To gain a deeper understanding of the role of reasoning methods, we propose a multidimensional analysis tailored to Italian, focusing on arithmetic and symbolic reasoning tasks. Our findings indicate that the effectiveness of reasoning methods varies significantly beyond English. Expressly, CoT, which relies on natural language demonstrations, is limited to English. Conversely, the structured nature of PAL in-context demonstrations facilitates multilingual comprehension, enabling LLMs to generate programmatic answers in Italian as well. Finally, for a more complete overview, we observe that additional alignment methods do not improve downstream performance; in contrast, in some cases, they restrict the abilities of the original models.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Reasoning Methods</kwd>
        <kwd>Multilingual Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) are able to tackle tasks using prompts formed by structured patterns, a process known as in-context learning [1]. This method allows the models to solve tasks without modifying their underlying parameters, relying solely on the provided inputs. The success of in-context learning has consequently heightened interest in analysing the factors that influence its effectiveness [2, 3, 4].</p>
      <p>Regarding reasoning methods, two effective strategies have emerged: Chain-of-Thought (CoT) [5, 6] and Program-Aided Language Models (PAL) [7, 8]. CoT decomposes a reasoning task into a series of intermediate steps using natural language, making it more general and human-understandable. In contrast, PAL employs Python functions to provide reasoning solutions, with its step-by-step programming approach leading to more systematic and structured reasoning.</p>
      <p>Although earlier research primarily showcased the functioning of reasoning methods in English, recent studies have expanded to explore multilingual approaches. Shi et al. [9] showed that the effectiveness of CoT rationales is limited to the languages most represented in LLMs' pre-training data. Huang et al. [10] addressed the problem by proposing prompting mechanisms that translate the problem into English, while Ranaldi et al. [11] elicit multi- and cross-lingual alignments for enabling reasoning, and Ranaldi et al. [12] propose self-correction mechanisms. The focus is limited to proposing performance solutions for a few languages, leaving behind the study of the role and the impact of languages such as Italian.</p>
      <p>In this paper, we conduct an in-depth study to evaluate the role of reasoning methods in Italian. Taking previous work a step further, we study the operation of reasoning methods by analysing the effects of different types of reasoning methods on LLMs' Italian reasoning capabilities. This leads to the main research questions of this paper: (i) What role do natural language and structured in-context demonstrations play in reasoning planning in Italian? (ii) What are the impacts and limits of natural language demonstrations? (iii) Do Italian-aligned and Italian-centred models respond differently to reasoning methods?</p>
      <p>To answer these questions, we operate via CoT and PAL (shown in Table 1 and Table 2). For multilingual CoT, we use natural language demonstrations both in English and in Italian, following Shi et al. [9]. For PAL, we propose a novel method by extending the original English version [7]. We use reasoning tasks covering mathematical reasoning, commonsense reasoning, and natural language inference, in their original versions (English) and adapted to Italian (resources available). These tasks are MGSM [9] and MSVAMP [13], which consist of mathematical reasoning problems, and XCOPA [14], PAWS-X [15], and XNLI [16], which consist of commonsense reasoning and natural language inference.</p>
      <p>Finally, we select a range of different LLMs: we employ GPT [17] models for the results obtained in multilingual tasks; Phi-3 [18] and Mixtral [19] for the results obtained on Italian benchmarks; different versions of Llama-2 and Llama-3 [20] (with versions adapted for Italian, i.e., Llamantino-2 and -3 [21, 22]); EuroLLM [23]; and finally two Italian-centred LLMs for the improvements achieved by smaller-scale versions. We operate using the original models, and we propose aligned versions using state-of-the-art instruction-tuning methods based on synthetic data [24] transferred to multilingual cases [25, 26].</p>
      <p>The main contribution and findings of our paper are:</p>
      <sec id="sec-1-1">
        <title>Techniques like Chain-of-Thought (CoT) prompting [6]</title>
        <p>and Program-Aided Language Models (PAL) [7] have
improved LLMs’ performances by encouraging the
generation of intermediate reasoning steps. However, while</p>
        <p>CoT explanations are not always faithful to the actual
• Reasoning methods improve performance in Ital- reasoning process of the model, with final answers that
ian reasoning tasks as well as in English. How- may not logically follow from the reasoning chain, the
ever, although both methods bring tangible ben- structured nature of PAL limits ambiguities and leads the
efits, several limitations emerge in the natural LLMs to deliver structured generations.
language demonstrations employed in CoT. On
the other side of the coin, we observe that the 2.2. Multilingual Reasoning
structured reasoning demonstrations (i.e., PAL)
elicit the models to plan the solution in a more
modularised way. Consequently, this benefits
the final performance in both English and
nonEnglish tasks.</p>
        <p>Earlier research studied the performances of CoT
prompting in diferent languages. Shi et al. [9] tested the
efectiveness of native in-context CoT that are rationales in a
specific language ( Native-CoT in Table 1). Qin et al. [27],
inspired by [10] and [28], proposed two-step CoT
prompt• We display the positive impact of structured in- ing. Finally, Ranaldi et al. [12] proposed a prompt-based
context demonstrations on solution planning in self-correction strategy. However, these studies have
Italian. We then demonstrate that since struc- focused on demonstrating the performance of CoT and
tured reasoning demonstrations are less ambigu- derived methods on large English-focused LLMs. Thus,
ous than natural language, they are more adapt- previous works left a gap in the study of the type of
mulable for math reasoning tasks and have a more tilingual demonstrations and their impacts and efects on
noticeable impact in more articulate languages reasoning on diferent scales of LLMs.
such as Italian.
• Finally, we show that the diferent LLMs analyzed
in our contribution are able to understand
problems in both English and Italian. However,
performance in English is higher despite diferent
approaches used to equate Italian and English
proficiency. This reveals that the limitation is not
derived from proficiency in a specific language
but rather from the language’s intrinsic dificulty</p>
        <p>To the best of our knowledge, this is the first work
that investigates the impact of reasoning methods for
the Italian and demonstrates how these strategies can
consistently boost LLMs’ performance, equipping them
with the ability to generate step-wise explanatory
reasoning for their predictions. We share the data used at
the following link.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Reasoning Methods</title>
      <p>In-context reasoning methods elicit large language models (LLMs) to deliver step-wise reasoned answers, as presented in §2.1. These methods demonstrate their functionality in several tasks, but evaluations and further studies are primarily conducted in English, leaving other languages unexplored (§2.2). To this end, we propose a methodical study of the effect of reasoning methods beyond English, mainly focusing on Italian (§2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. In-context Learning</title>
        <p>Techniques like Chain-of-Thought (CoT) prompting [6] and Program-Aided Language Models (PAL) [7] have improved LLMs' performance by encouraging the generation of intermediate reasoning steps. However, while CoT explanations are not always faithful to the actual reasoning process of the model, with final answers that may not logically follow from the reasoning chain, the structured nature of PAL limits ambiguities and leads LLMs to deliver structured generations.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multilingual Reasoning</title>
        <p>Earlier research studied the performance of CoT prompting in different languages. Shi et al. [9] tested the effectiveness of native in-context CoT, that is, rationales in a specific language (Native-CoT in Table 1). Qin et al. [27], inspired by [10] and [28], proposed two-step CoT prompting. Finally, Ranaldi et al. [12] proposed a prompt-based self-correction strategy. However, these studies have focused on demonstrating the performance of CoT and derived methods on large English-focused LLMs. Thus, previous works left a gap in the study of the types of multilingual demonstrations and their impacts and effects on reasoning at different scales of LLMs.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Reasoning in Italian</title>
        <p>Table 1 shows the Italian Native-CoT demonstration used in our study:</p>
        <preformat>Q: Roger ha 5 palline da tennis. Ha comprato altre 2 lattine di palline da tennis. Ogni barattolo contiene 3 palline da tennis. Quante palline da tennis ha ora?
A: Roger inizia con 5 palline. 2 barattoli da 3 palline da tennis ciascuno fanno 6 palline da tennis. 5 + 6 = 11. La risposta è 11.</preformat>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup</title>
      <p>We take the next step by proposing an in-depth evaluation that studies the effect of the in-context demonstrations used in the reasoning methods. Hence, we conduct our analysis on different LLMs chosen by family, capabilities, and scope of construction (§3.2), with reasoning tasks (§3.1). The goal is to examine the impact of various types of demonstrations in Italian, addressing the limitations and enhanced functionality these methods can offer.</p>
      <p>Our experiments explore the following key points: a) constructing a robust evaluation by extending PAL (see Table 2) and applying Italian CoT methods on different models using carefully designed benchmarking tasks; b) investigating the effects of in-context demonstrations; c) analysing the varying effects of in-context reasoning methods across different models (e.g., models without any further adaptation, and models adapted for the Italian language).</p>
      <sec id="sec-3-data">
        <title>3.1. Data</title>
        <p>We introduce five different reasoning tasks: MGSM [9], MSVAMP [13], XNLI [16], PAWS-X [15], and XCOPA [14]; they have been constructed for multilingual evaluations and are described in detail in Appendix 7.</p>
      </sec>
      <sec id="sec-3-models">
        <title>3.2. Models</title>
        <p>We select LLMs based on performance and the purpose of their construction. These models are best exemplified by the GPT [17] and Llama-2 and -3 [20] families for the performances shown in multilingual reasoning tasks [9], two models from the Mistral family [19], and EuroLLM¹ [23] and Phi-3 [18] for the proficiency shown on the Italian leaderboard. Finally, discerning between training types, we select Italian-aligned models (Llamantino-2 [21] and Llamantino-3 [22]) and Italian-centred models (modello-Italia, Minerva-3b, and Minerva-1b). GPT-3.5 is used via API, while the other models are available in open-source format. Appendix 12 describes the parameters and versions used in detail. (We released data &amp; code at the following link.)</p>
        <p>¹ NB: we identify EuroLLM as Italian-centred even though it has been pre-trained on different European languages in the same way [23].</p>
      </sec>
      <sec id="sec-3-1">
        <title>PAL beyond English</title>
        <p>To extend the multilingual evaluation to the PAL reasoning method, we propose a specially constructed language-specific version by transferring the prompts proposed in [9] into program-like demonstrations, as done in [7]. Table 2 shows the Italian PAL demonstration:</p>
        <preformat>Q: Roger ha 5 palline da tennis. Ha comprato altre 2 lattine di palline da tennis. Ogni barattolo contiene 3 palline da tennis. Quante palline da tennis ha ora?
A: # Roger ha 5 palline da tennis.
tennis_balls = 5
# compra 2 lattine, ciascuna ha 3 palline da tennis
bought_balls = 2 * 3
# Le palline totali sono
answer = tennis_balls + bought_balls
# La risposta è 11</preformat>
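        <p>The final PAL prediction is obtained by executing the generated program and reading the value bound to the answer variable, rather than parsing free text. A minimal sketch of this execution step (the run_pal helper is ours for illustration; the program mirrors the demonstration above):</p>
        <preformat>generated_program = """
# Roger ha 5 palline da tennis.
tennis_balls = 5
# compra 2 lattine, ciascuna ha 3 palline da tennis
bought_balls = 2 * 3
# Le palline totali sono
answer = tennis_balls + bought_balls
"""

def run_pal(program):
    """Execute a generated PAL program and return its `answer` variable."""
    namespace = {}
    exec(program, namespace)  # NB: only execute trusted or sandboxed model output
    return namespace.get("answer")

print(run_pal(generated_program))  # -> 11</preformat>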
        <sec id="sec-3-1-1">
          <title>3.3. Prompting &amp; Evaluation</title>
          <p>We operate in two ways, concerning mathematical and understanding &amp; commonsense tasks. For mathematical tasks, we align the original CoT and PAL to Italian: we use Native-CoT [9] (Table 1) and the adapted method proposed in [27] (Appendix 10); concerning PAL, we introduce Italian demonstrations as in Table 2. For understanding and commonsense tasks, we define input templates that lead LLMs to follow the instructions and aid generation. We construct prompts following [29], using the CoT prompting method to elicit multi-step generations. Finally, we evaluate performance using the accuracy score. Hence, we measure the exact match between generated outputs and labels². We maintain the generation temperatures as recommended in the official papers. For GPT-3.5, we use the API, while for the others, we use the versions available on Hugging Face (see Appendix 12).</p>
          <p>² We extract target labels from the generated answers using regular expressions before calculating the exact match. For each task, we use Instruction Templates to guide the model to stable generations and facilitate evaluation.</p>
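          <p>As a concrete illustration of this scoring step, a minimal sketch (the regular expression and helper names are ours, not the released evaluation code):</p>
          <preformat>import re

def extract_label(generation):
    """Pull the final number out of a generation
    (e.g. '... La risposta è 11.') before exact-match scoring."""
    matches = re.findall(r"-?\d+(?:[.,]\d+)?", generation)
    return matches[-1].replace(",", ".") if matches else None

def accuracy(generations, labels):
    """Exact match between extracted answers and gold labels."""
    hits = sum(extract_label(g) == str(l) for g, l in zip(generations, labels))
    return hits / len(labels)

print(accuracy(["5 + 6 = 11. La risposta è 11."], ["11"]))  # 1.0</preformat>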
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results &amp; Discussions</title>
      <sec id="sec-4-1">
        <title>Large language models (LLMs) benefit from reasoning</title>
        <p>methods in English and in Italian as well. As discussed
in §4.1, the in-context demonstrations beyond English
elicit the LLMs to deliver multilingual reasoned answers;
however, the operation difers depending on the type of
method.</p>
        <p>Although demonstrations lead the models to generate
more robust answers, improving Italian as well, the
operation of these techniques appears to be efective only Figure 2: Diference between PAL and CoT (highlighted the
in some models. As analysed in §4.2, in-context ratio- original and adapted models)
nales in natural language have a diferent efect. On the
other side of the coin, structured program-of-thoughts
demonstrations lead the models to more stable
generations. Hence, the impact of in-context demonstrations
varies according to the quality and quantity of rationales
and the scale of model parameters (§4.3).</p>
        <p>Finally, in §4.4, we examine the efects of alignment
approaches by discerning the factors that influence the
generation of the final response and highlighting the
matter of native language demonstrations.
mainly positive, some phenomena emerge, such as
differences (the baseline Direct outperforms the reasoning
method) and a disparity between CoT and PAL between
Original- and Italian-Aligned models. Specifically, (i) PAL
(⋆) outperforms CoT (∙ ) in Figure 1 and (ii) the
ItalianAligned models outperform the Original-Model in Italian
task but not in English. To understand these
dynamics in depth in §4.2, we explore how the demonstration
structure impacts the models’ generations.
4.1. Reasoning in Italian
In-context reasoning methods empower the LLMs’ mul- 4.2. Natural Language Efects
tilingual performances in arithmetic and symbolic
reasoning tasks. Figure 1 shows the diferences
between Native-CoT and Native-PAL, and the baselines
(Direct). The use of in-context Italian demonstrations
brings clear benefits. GPT-3.5 and Llama-based models
(Llama2-70 and Llamantino3) obtain noticeable benefits
from Native-based prompting approaches (complete
results in Appendix 14). Although these LLMs benefit the
most from introducing reasoning methods in the
prompting stage, further improvements are observable even in
LLMs with fewer parameters (i.e., EuroLLM, Phi-3,
Llama2-7, and Llama3-8 as well adapted versions Llamantino-2
and -3, complete results in Appendices 15, 16). These
results demonstrate the sensitivity of Italian in-context
prompting in understanding and commonsense
reasoning (Appendix 17). However, although the averages are
The efect of the reasoning method relies on the
solution strategy. Structured in-context demonstrations in
a program-like manner are more efective than natural
language rationales. Figure 2 displays that the
diferences between Native-PAL and Native-CoT are
consistently positive. Moreover, the Italian-Aligned models
(i.e., Llamantino-based) obtain better results of original
models in Italian tasks when Native-PAL is used. Since
the natural language of in-context rationales does not
provide the same benefits as PAL, we examined the
generations delivered to investigate the origin of the
diferences.</p>
        <p>The results indicate that even though the CoT
incontext demonstrations in the Italian natural language
are the same as those in English, the generations have
diferent structures (Appendix 9, Table 7). In-depth, a
relationship emerges between performance and the
average number of steps required to get correct answers.</p>
        <p>The number of , i.e., the steps to reach the final
solution, represented by natural language sentences, are
on average between 2 and 5 for the Italian answers and
around 3 and 5 for English; in PAL, they are concentrated
around 3 and 4. This shows that natural language,
especially Italian, rich in intricate linguistic structures, is
not the best for solving mathematical, symbolic tasks. In
contrast, PAL seems more appropriate due to its rigid
structure and better support for generative reasoning
passages.
4.3. Demonstrations Impacts
In-context demonstrations play a key role in complex
tasks because they promote reasoning, as discussed in
§4.1. We investigated the performance trend as in-context
demonstrations increased, repeating the previous
experiments focusing on MGSM using zero- from 6-shots. The
results show that the impact of in-context demonstrations
across the languages is related to the quality and
quantity of demonstrations. A distinction emerges between
models and the number of de facto useful
demonstrations. GPT-3.5 with 4-shots achieves results comparable
to 6-shots (average accuracies in Figure 6). This balance
does not occur in Llama-based and Mixtral, which
underperforms as in-context demonstrations increase. Finally,
the smaller models have conspicuous improvements as
the number of demonstrations increases.</p>
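        <p>The zero- to 6-shot conditions differ only in how many demonstrations are prepended to the test question. A minimal sketch of this prompt construction (the helper name and abbreviated demonstrations are ours):</p>
        <preformat>def build_k_shot_prompt(demonstrations, question, k):
    """Prepend the first k in-context demonstrations to the test question;
    k = 0 corresponds to the Direct baseline."""
    shots = demonstrations[:k]
    return "\n\n".join(shots + ["Q: " + question + "\nA:"])

demos = [
    "Q: Roger ha 5 palline da tennis. ...\nA: ... La risposta è 11.",
    # ... up to six demonstrations drawn from the MGSM exemplars
]
print(build_k_shot_prompt(demos, "Quante palline restano?", k=1))</preformat>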
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Language of Reasoning Makes the Difference</title>
        <p>Multilingual in-context demonstrations aid LLMs in applying solution strategies; however, the language used to reason matters. By eliciting LLMs to deliver multi-step English answers, we observed significant improvements in accuracy. Complementing previous work, we used two strategies: (i) in-context demonstrations of reasoning answers in a specific language (Native-method); (ii) the same in-context setting, then eliciting the model to provide the solution in English (Cross-method). As shown in Table 3, the Cross-methods provide tangible benefits both in PAL and CoT. These latter results emphasize the LLMs' understanding and production abilities.</p>
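        <p>A minimal sketch of the two prompting conditions (the instruction wording is ours; the exact templates are in the appendices):</p>
        <preformat>def native_prompt(demos_it, question_it):
    """Native-method: Italian demonstrations, solution expected in Italian."""
    return "\n\n".join(demos_it + ["Q: " + question_it + "\nA:"])

def cross_prompt(demos_it, question_it):
    """Cross-method: same Italian demonstrations, but the model is
    elicited to provide the step-wise solution in English."""
    instruction = "Provide the step-by-step solution in English."
    return "\n\n".join(demos_it + [instruction, "Q: " + question_it + "\nA:"])</preformat>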
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Findings &amp; Future Works</title>
      <sec id="sec-5-1">
        <title>The advances of reasoning methods emerge beyond the</title>
        <p>We investigate the impact that reasoning methods cause English. Our analysis shows that properly elicited LLMs
on final performance by expanding the study about the can deliver reasoned answers in Italian as well. By
oprole and the limits of them in Italian. The main find- erating via CoT and PAL, we revealed that in-context
ings and tangible recommendations can be outlined as demonstrations play a strategic role in improving
per6. Conclusion
formance in direct proportion to their quality and quan- [7] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
tity. Our research highlights the need for a customised J. Callan, G. Neubig, Pal: Program-aided language
strategy for employing reasoning methods for LLMs. It models, arXiv preprint arXiv:2211.10435 (2022).
supports the demand for a reasonable combination of [8] W. Chen, X. Ma, X. Wang, W. W. Cohen, Program
model scale, reasoning technique, and strategic use of of thoughts prompting: Disentangling computation
in-context learning to elicit the prospect of multilingual from reasoning for numerical reasoning tasks, 2023.
LLMs. arXiv:2211.12588.
[9] F. Shi, M. Suzgun, M. Freitag, X. Wang, S.
Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
Acknowledgements D. Zhou, D. Das, J. Wei, Language models are
multilingual chain-of-thought reasoners, 2022.</p>
        <p>This work was funded by UK Research and Innovation arXiv:2210.03057.
(UKRI) under the UK government’s Horizon Europe fund- [10] H. Huang, T. Tang, D. Zhang, W. X. Zhao,
ing guarantee grant number 10039436 and PRIN 2022 T. Song, Y. Xia, F. Wei, Not all languages are
creProject - Class-tAIs CUP: E53D230081000. ated equal in llms: Improving multilingual
capability by cross-lingual-thought prompting, 2023.</p>
        <p>References arXiv:2305.07004.
[11] L. Ranaldi, G. Pucci, A. Freitas, Empowering
cross[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, lingual abilities of instruction-tuned large language
J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, models by translation-following demonstrations,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
FindG. Krueger, T. Henighan, R. Child, A. Ramesh, ings of the Association for Computational
LinD. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, guistics ACL 2024, Association for Computational
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, Linguistics, Bangkok, Thailand and virtual
meetC. Berner, S. McCandlish, A. Radford, I. Sutskever, ing, 2024, pp. 7961–7973. URL: https://aclanthology.
D. Amodei, Language models are few-shot learners, org/2024.findings-acl.473. doi: 10.18653/v1/2024.
2020. arXiv:2005.14165. findings-acl.473.
[2] O. Rubin, J. Herzig, J. Berant, Learning to retrieve [12] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti,
prompts for in-context learning, in: M. Carpuat, F. M. Zanzotto, A tree-of-thoughts to broaden
M.-C. de Marnefe, I. V. Meza Ruiz (Eds.), Pro- multi-step reasoning across languages, in: K. Duh,
ceedings of the 2022 Conference of the North H. Gomez, S. Bethard (Eds.), Findings of the
AssociAmerican Chapter of the Association for Com- ation for Computational Linguistics: NAACL 2024,
putational Linguistics: Human Language Tech- Association for Computational Linguistics,
Mexnologies, Association for Computational Linguis- ico City, Mexico, 2024, pp. 1229–1241. URL: https:
tics, Seattle, United States, 2022, pp. 2655–2671. //aclanthology.org/2024.findings-naacl.78. doi: 10.
URL: https://aclanthology.org/2022.naacl-main.191. 18653/v1/2024.findings-naacl.78.
doi:10.18653/v1/2022.naacl-main.191. [13] N. Chen, Z. Zheng, N. Wu, M. Gong, Y. Song,
[3] J. Zhao, Y. Xie, K. Kawaguchi, J. He, M. Xie, Auto- D. Zhang, J. Li, Breaking language barriers in
mulmatic model selection with large language models tilingual mathematical reasoning: Insights and
obfor reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), servations, 2023. arXiv:2310.20246.
Findings of the Association for Computational Lin- [14] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu,
guistics: EMNLP 2023, Association for Computa- I. Vulić, A. Korhonen, XCOPA: A multilingual
tional Linguistics, Singapore, 2023, pp. 758–783. dataset for causal commonsense reasoning, in:
URL: https://aclanthology.org/2023.findings-emnlp. B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
Proceed55. doi:10.18653/v1/2023.findings-emnlp.55. ings of the 2020 Conference on Empirical
Meth[4] Y. Zhang, S. Feng, C. Tan, Active example selection ods in Natural Language Processing (EMNLP),
Asfor in-context learning, 2022. arXiv:2211.04486. sociation for Computational Linguistics, Online,
[5] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwa- 2020, pp. 2362–2376. URL: https://aclanthology.
sawa, Large language models are zero-shot reason- org/2020.emnlp-main.185. doi:10.18653/v1/2020.
ers, 2023. arXiv:2205.11916. emnlp-main.185.
[6] J. Wei, X. Wang, D. Schuurmans, M. Bosma, [15] Y. Yang, Y. Zhang, C. Tar, J. Baldridge,
PAWSB. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of- X: A cross-lingual adversarial dataset for
parathought prompting elicits reasoning in large lan- phrase identification, in: K. Inui, J. Jiang, V. Ng,
guage models, 2023. arXiv:2201.11903. X. Wan (Eds.), Proceedings of the 2019
Conference on Empirical Methods in Natural Language
[27] L. Qin, Q. Chen, F. Wei, S. Huang, W. Che, ing auto-regressive multi-layer artificial neural
netCross-lingual prompting: Improving zero-shot works to predict financial time series, Information
chain-of-thought reasoning across languages, in: 13 (2022). URL: https://www.mdpi.com/2078-2489/
H. Bouamor, J. Pino, K. Bali (Eds.), Proceed- 13/11/524. doi:10.3390/info13110524.
ings of the 2023 Conference on Empirical Meth- [35] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti,
ods in Natural Language Processing, Associa- F. M. Zanzotto, Empowering multi-step
reasontion for Computational Linguistics, Singapore, ing across languages via tree-of-thoughts, 2024.
2023, pp. 2695–2709. URL: https://aclanthology. arXiv:2311.08097.
org/2023.emnlp-main.163. doi:10.18653/v1/2023. [36] R. Li, L. B. Allal, Y. Zi, N. Muennighof, D. Kocetkov,
emnlp-main.163. C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu,
[28] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene,
S. Narang, A. Chowdhery, D. Zhou, Self-consistency M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O.
Shliimproves chain of thought reasoning in language azhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee,
models, 2023. arXiv:2203.11171. L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov,
[29] K. Ahuja, H. Diddee, R. Hada, M. Ochieng, Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D.
AbK. Ramesh, P. Jain, A. Nambi, T. Ganu, S. Se- ulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy,
gal, M. Ahmed, K. Bali, S. Sitaram, MEGA: U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P.
VilMultilingual evaluation of generative AI, in: legas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee,
H. Bouamor, J. Pino, K. Bali (Eds.), Proceed- N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf,
ings of the 2023 Conference on Empirical Meth- J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J.
ods in Natural Language Processing, Associa- Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy,
tion for Computational Linguistics, Singapore, D. Fried, D. Bahdanau, Y. Jernite, C. M.
Ferran2023, pp. 4232–4267. URL: https://aclanthology. dis, S. Hughes, T. Wolf, A. Guha, L. von Werra,
org/2023.emnlp-main.258. doi:10.18653/v1/2023. H. de Vries, Starcoder: may the source be with you!,
emnlp-main.258. 2023. arXiv:2305.06161.
[30] L. Ranaldi, G. Pucci, B. Haddow, A. Birch,
Empowering multi-step reasoning across languages via
program-aided language models, in: Y. Al-Onaizan,
M. Bansal, Y.-N. Chen (Eds.), Proceedings of the
2024 Conference on Empirical Methods in
Natural Language Processing, Association for
Computational Linguistics, Miami, Florida, USA, 2024, pp.
12171–12187. URL: https://aclanthology.org/2024.</p>
        <p>emnlp-main.678.
[31] L. Ranaldi, A. Freitas, Self-refine
instructiontuning for aligning reasoning in language
models, 2024. URL: https://arxiv.org/abs/2405.00402.</p>
        <p>arXiv:2405.00402.
[32] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi,</p>
        <p>C. Giannone, A. Favalli, R. Romagnoli, F. M.
Zanzotto, Investigating the impact of data
contamination of large language models in text-to-SQL
translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
Findings of the Association for Computational
Linguistics ACL 2024, Association for Computational
Linguistics, Bangkok, Thailand and virtual meeting,
2024, pp. 13909–13920. URL: https://aclanthology.
org/2024.findings-acl.827. doi: 10.18653/v1/2024.</p>
        <p>findings-acl.827.
[33] L. Ranaldi, G. Pucci, Knowing knowledge:
Epistemological study of knowledge in
transformers, Applied Sciences 13 (2023). URL: https://
www.mdpi.com/2076-3417/13/2/677. doi:10.3390/
app13020677.
[34] L. Ranaldi, M. Gerardi, F. Fallucchi, Cryptonet:
Us</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Proposed Task</title>
      <table-wrap>
        <caption><p>Benchmarks used in our evaluation.</p></caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Task</th><th>Languages</th><th>#Languages</th></tr>
          </thead>
          <tbody>
            <tr><td>MGSM</td><td>mathematical reasoning</td><td>Bengali (bn), Chinese (zh), French (fr), Thai (th), German (de), Japanese (jp), Russian (ru), Telugu (te), Spanish (es), Swahili (sw), English (en)</td><td>11</td></tr>
            <tr><td>MSVAMP</td><td>mathematical reasoning</td><td>Bengali (bn), Chinese (zh), French (fr), Thai (th), German (de), Japanese (jp), Russian (ru), Spanish (es), Swahili (sw), English (en)</td><td>10</td></tr>
            <tr><td>XNLI</td><td>natural language inference</td><td>English (en), German (de), Russian (ru), French (fr), Spanish (es), Chinese (zh), Vietnamese (vi), Arabic (ar), Greek (el), Thai (th), Bulgarian (bg), Urdu (ur), Swahili (sw), Hindi (hi), Turkish (tr)</td><td>15</td></tr>
            <tr><td>XCOPA</td><td>commonsense reasoning</td><td>Chinese (zh), Italian (it), Vietnamese (vi), Turkish (tr), Thai (th), Estonian (et), Tamil (ta), Swahili (sw), Haitian (ht), Quechua (qu), Indonesian (id)</td><td>11</td></tr>
            <tr><td>PAWS-X</td><td>paraphrase identification</td><td>English (en), German (de), Japanese (jp), French (fr), Spanish (es), Chinese (zh), Korean (ko), Italian (it)</td><td>8</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>8. In-context Demonstrations</title>
    </sec>
    <sec id="sec-8">
      <title>9. Natural Language Structure</title>
      <p>Analysing the composition of the languages in the answers provided by the different models is useful to understand whether a certain model follows the in-context prompts by generating language-specific answers and, if so, what the error rate is. To qualitatively estimate the generated responses, we analyse the sentences present in the responses generated by the models under study. Given an answer A composed of a set of sentences {s1, s2, ..., sn}, we define the number of steps of A as the number of sentences the model generates to deliver the solution. Since the in-context rationales provided have an average of 4 steps (min 3, max 5) [9] and do not include the final keyword "Answer:" or "The answer is:", we do not count the final keyword, for a more realistic value, as it often merely repeats the last sentence. Formally, let A be composed of n sentences, where the last sentence represents the final answer: the number of remaining sentences in A gives the total number of steps. We compute this value for the generations of the models analysed and report the results in Table 7.</p>
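      <p>A minimal sketch of this count (the sentence splitter and keyword filter are simplifications, ours for illustration):</p>
      <preformat>import re

FINAL_KEYWORDS = ("La risposta è", "The answer is", "Answer:")

def count_steps(answer):
    """Number of sentences in a generated answer, excluding the final
    'The answer is ...' restatement, as defined above."""
    parts = [s.strip() for s in re.split(r"[.!?]+\s+", answer.strip())]
    sentences = [s for s in parts if s]
    return sum(1 for s in sentences if not s.startswith(FINAL_KEYWORDS))

cot = ("Roger inizia con 5 palline. 2 barattoli da 3 palline "
       "fanno 6. 5 + 6 = 11. La risposta è 11.")
print(count_steps(cot))  # 3</preformat>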
    </sec>
    <sec id="sec-9">
      <title>10. State-of-art Prompting Methods</title>
      <p>The Cross-ToT prompt used in our evaluation:</p>
      <preformat>Cross-ToT

Simulate the collaboration of {n} mathematicians answering a question
in their mother tongue: L1, L2, ... and Ln. They all start Step 1 from
a separate thought process, step by step, each explaining their thought
process. Following Step 1, each expert refines and develops their
thought process by comparing themselves with others. This process
continues until a definitive answer to the question is obtained.

Question: [Question in Language 1]

Answer: [num].</preformat>
    </sec>
    <sec id="sec-10">
      <title>11. Program-Aided Language Models Prompts</title>
      <p>In this paper, as introduced in §3.3, we propose a novel cross-lingual extension of the Program-Aided Language Models [7] (Cross-PAL) method. The following blocks show the prompts used for the final evaluation.</p>
      <preformat>Program-Aided Language Models (PAL)

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: # Roger started with 5 tennis balls.
tennis_balls = 5
# 2 cans of 3 tennis balls each is
bought_balls = 2 * 3
# The answer is
answer = tennis_balls + bought_balls
# The answer is 11

Q: Kyle bought last year's best-selling book for $19.50. This is with
a 25% discount from the original price. What was the original price?
A:</preformat>
    </sec>
    <sec id="sec-11">
      <title>12. Model and Hyperparameters</title>
      <p>In our experimental setting, as introduced in Section 3.2, we propose different LLMs: (i) one model from the GPT family [17]: GPT-3.5 (gpt-3.5-turbo-0125); (ii) three models from the Llama family [20]: Llama2-7b, Llama2-70b, and Llama-3-8-instruct; (iii) two models of the MistralAI family: Mistral-7b and Mixtral [19]; and (iv) finally, Phi-3-mini [18].</p>
      <p>In particular, GPT models are used via API, while for the others, we used versions of the models quantized to 4-bit with GPTQ (see detailed versions in Table 12). Furthermore, we have added additional LLMs: three versions of Llama-based models adapted for Italian [21, 22] and three Italian-centred models (modello-Italia, Minerva-3b, and Minerva-1b).</p>
      <p>As discussed in the limitations, our choices are related to reproducibility and the cost associated with non-open-source models. We use the closed-source API and the 4-bit GPTQ quantized versions of the models on 8 48GB NVIDIA RTX A6000 GPUs for all experiments, performed only in inference.</p>
      <p>Finally, the generation temperature varies from t = 0 for the GPT models to t = 0.5 for the Llama2 models. We choose these temperatures for (mostly) deterministic outputs, with a maximum token length of 256. The other parameters are left unchanged, as recommended by the official resources. We will release the code and the dataset upon acceptance of the paper.</p>
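      <p>For reference, a minimal sketch of loading one of the 4-bit GPTQ checkpoints from Table 12 and generating with these settings (assuming a transformers installation with GPTQ support; ours for illustration, not the released code):</p>
      <preformat>from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # 4-bit GPTQ, Table 12
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Q: Roger ha 5 palline da tennis. ...\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # maximum token length used in our experiments
    do_sample=False,     # greedy decoding, i.e. temperature 0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))</preformat>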
    </sec>
    <sec id="sec-12">
      <title>13. Models Versions</title>
      <table-wrap>
        <caption><p>Model versions used in our experiments.</p></caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Version</th></tr>
          </thead>
          <tbody>
            <tr><td>Llama2-7</td><td>meta-llama/Llama-2-7b</td></tr>
            <tr><td>Llama2-70</td><td>meta-llama/Llama-2-70b</td></tr>
            <tr><td>Llama3-8</td><td>meta-llama/Meta-Llama-3-8B-Instruct</td></tr>
            <tr><td>Phi-3-mini</td><td>microsoft/Phi-3-mini-128k-instruct</td></tr>
            <tr><td>Mistral-7</td><td>mistralai/Mistral-7B-Instruct-v0.2</td></tr>
            <tr><td>Mixtral8x7</td><td>TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ</td></tr>
            <tr><td>GPT-3.5-turbo</td><td>OpenAI API (gpt-3.5-turbo-0125)</td></tr>
            <tr><td>Llamantino2-70</td><td>swap-uniba/LLaMAntino-2-70b-hf-UltraChat-ITA</td></tr>
            <tr><td>Llamantino2-7</td><td>swap-uniba/LLaMAntino-2-chat-7b-hf-UltraChat-ITA</td></tr>
            <tr><td>Llamantino3-7</td><td>swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA</td></tr>
            <tr><td>modello-italia</td><td>sapienzanlp/modello-italia-9b-bf16</td></tr>
            <tr><td>Minerva-3b</td><td>sapienzanlp/Minerva-3B-base-v1.0</td></tr>
            <tr><td>Minerva-1b</td><td>sapienzanlp/Minerva-1B-base-v1.0</td></tr>
            <tr><td>EuroLLM</td><td>utter-project/EuroLLM-1.7B-Instruct</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-13">
      <title>14. Results Arithmetic Reasoning Tasks - English and Italian</title>
    </sec>
    <sec id="sec-14">
      <title>15. Results Arithmetic Reasoning Tasks - Italian-Aligned Models</title>
    </sec>
    <sec id="sec-15">
      <title>16. Results Arithmetic Reasoning Tasks - Italian-centred Models</title>
    </sec>
    <sec id="sec-16">
      <title>17. Results Commonsense, Inference, and Understanding Tasks</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>