<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The limits of Italian in Reasoning Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science, University of Aberdeen</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi Roma "Tor Vergata"</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Earlier works have shown the efficacy of reasoning methods in eliciting step-wise reasoning from large language models (LLMs) by operating via in-context demonstrations. These strategies, exemplified by Chain-of-Thought (CoT) and Program-Aided Language Models (PAL), have been shown to reason well in monolingual contexts, primarily in English. However, there has been limited investigation into their capabilities in other languages, especially Italian. To gain a deeper understanding of the role of reasoning methods, we propose a multidimensional analysis tailored to Italian, focusing on arithmetic and symbolic reasoning tasks. Our findings indicate that the effectiveness of reasoning methods varies significantly beyond English. Expressly, CoT, which relies on natural language demonstrations, is limited to English. Conversely, the structured nature of PAL in-context demonstrations facilitates multilingual comprehension, enabling LLMs to generate programmatic answers in Italian as well. Finally, for a more complete overview, we observe that additional alignment methods do not improve downstream performance; in contrast, in some cases, they restrict the abilities of the original models.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Reasoning Methods</kwd>
        <kwd>Multilingual Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) are able to tackle tasks using prompts formed by structured patterns, a process known as in-context learning [1]. This method allows the models to solve tasks without modifying their underlying parameters, relying solely on the provided inputs. The success of in-context learning has consequently heightened interest in analysing the factors that influence its effectiveness [2, 3, 4].</p>
      <p>Regarding reasoning methods, two effective strategies have emerged: Chain-of-Thought (CoT) [5, 6] and Program-Aided Language Models (PAL) [7, 8]. CoT decomposes a reasoning task into a series of intermediate steps using natural language, making it more general and human-understandable. In contrast, PAL employs Python functions to provide reasoning solutions, with its step-by-step programming approach leading to more systematic and structured reasoning.</p>
      <p>Although earlier research primarily showcased the functioning of reasoning methods in English, recent studies have expanded to explore multilingual approaches. Shi et al. [9] showed that the effectiveness of CoT rationales is limited to the languages most represented in LLMs' pre-training data. Huang et al. [10] addressed the problem by proposing prompting mechanisms that translate the problem into English, while Ranaldi et al. [11] elicit multi- and cross-lingual alignments for enabling reasoning, and Ranaldi et al. [12] propose self-correction mechanisms. The focus is limited to proposing performance solutions for a few languages, leaving behind the study of the role and the impact of languages such as Italian.</p>
      <p>In this paper, we conduct an in-depth study to evaluate the role of reasoning methods in Italian. Taking previous work a step further, we study the operation of reasoning methods by analysing the effects of different types of reasoning methods on LLMs' Italian reasoning capabilities. This leads to the main research questions of this paper: (i) What role do natural language and structured in-context demonstrations play in reasoning planning in Italian? (ii) What are the impacts and limits of natural language demonstrations? (iii) Do Italian-aligned and Italian-centred models respond differently to reasoning methods?</p>
      <p>To answer these questions, we operate via CoT and PAL (shown in Table 1 and Table 2). For multilingual CoT, we use natural language demonstrations both in English and in Italian, following Shi et al. [9]. For PAL, we propose a novel method by extending the original English version [7]. We use reasoning tasks covering mathematical reasoning, commonsense reasoning, and natural language inference, in their original versions (English) and adapted to Italian (resources available). These tasks are MGSM [9] and MSVAMP [13], which consist of mathematical reasoning problems, and XCOPA [14], PAWS-X [15], and XNLI [16], which consist of commonsense reasoning and natural language inference.</p>
      <p>Finally, we select a range of different LLMs: we employ GPT [17] models for the results obtained in multilingual tasks; Phi-3 [18] and Mixtral [19] for the results obtained on Italian benchmarks; different versions of Llama-2 and Llama-3 [20] (with versions adapted for Italian, i.e., Llamantino-2 and -3 [21, 22]); EuroLLM [23]; and finally two Italian-centred LLMs for the improvements achieved by smaller-scale versions. We operate using the original models, and we propose aligned versions using state-of-the-art instruction-tuning methods based on synthetic data [24] transferred to multilingual cases [25, 26].</p>
      <p>The main contribution and findings of our paper are:</p>
      <sec id="sec-1-1">
        <title>Techniques like Chain-of-Thought (CoT) prompting [6]</title>
        <p>and Program-Aided Language Models (PAL) [7] have
improved LLMs’ performances by encouraging the
generation of intermediate reasoning steps. However, while</p>
        <p>CoT explanations are not always faithful to the actual
• Reasoning methods improve performance in Ital- reasoning process of the model, with final answers that
ian reasoning tasks as well as in English. How- may not logically follow from the reasoning chain, the
ever, although both methods bring tangible ben- structured nature of PAL limits ambiguities and leads the
efits, several limitations emerge in the natural LLMs to deliver structured generations.
language demonstrations employed in CoT. On
the other side of the coin, we observe that the 2.2. Multilingual Reasoning
structured reasoning demonstrations (i.e., PAL)
elicit the models to plan the solution in a more
modularised way. Consequently, this benefits
the final performance in both English and
nonEnglish tasks.</p>
        <p>Earlier research studied the performances of CoT
prompting in diferent languages. Shi et al. [9] tested the
efectiveness of native in-context CoT that are rationales in a
specific language ( Native-CoT in Table 1). Qin et al. [27],
inspired by [10] and [28], proposed two-step CoT
prompt• We display the positive impact of structured in- ing. Finally, Ranaldi et al. [12] proposed a prompt-based
context demonstrations on solution planning in self-correction strategy. However, these studies have
Italian. We then demonstrate that since struc- focused on demonstrating the performance of CoT and
tured reasoning demonstrations are less ambigu- derived methods on large English-focused LLMs. Thus,
ous than natural language, they are more adapt- previous works left a gap in the study of the type of
mulable for math reasoning tasks and have a more tilingual demonstrations and their impacts and efects on
noticeable impact in more articulate languages reasoning on diferent scales of LLMs.
such as Italian.
• Finally, we show that the diferent LLMs analyzed
in our contribution are able to understand
problems in both English and Italian. However,
performance in English is higher despite diferent
approaches used to equate Italian and English
proficiency. This reveals that the limitation is not
derived from proficiency in a specific language
but rather from the language’s intrinsic dificulty</p>
        <p>To the best of our knowledge, this is the first work
that investigates the impact of reasoning methods for
the Italian and demonstrates how these strategies can
consistently boost LLMs’ performance, equipping them
with the ability to generate step-wise explanatory
reasoning for their predictions. We share the data used at
the following link.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Reasoning Methods</title>
      <p>In-context reasoning methods elicit large language models (LLMs) to deliver step-wise reasoned answers, as presented in §2.1. These methods demonstrate their functionality in several tasks, but evaluations and further studies are primarily conducted in English, leaving other languages unexplored (§2.2). To this end, we propose a methodical study of the effect of reasoning methods beyond English, mainly focusing on Italian (§2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. In-context Learning</title>
        <p>Techniques like Chain-of-Thought (CoT) prompting [6] and Program-Aided Language Models (PAL) [7] have improved LLMs' performance by encouraging the generation of intermediate reasoning steps. However, while CoT explanations are not always faithful to the actual reasoning process of the model, with final answers that may not logically follow from the reasoning chain, the structured nature of PAL limits ambiguities and leads LLMs to deliver structured generations.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multilingual Reasoning</title>
        <p>Earlier research studied the performance of CoT prompting in different languages. Shi et al. [9] tested the effectiveness of native in-context CoT, that is, rationales in a specific language (Native-CoT in Table 1). Qin et al. [27], inspired by [10] and [28], proposed two-step CoT prompting. Finally, Ranaldi et al. [12] proposed a prompt-based self-correction strategy. However, these studies have focused on demonstrating the performance of CoT and derived methods on large English-focused LLMs. Thus, previous works left a gap in the study of the types of multilingual demonstrations and their impacts and effects on reasoning at different scales of LLMs.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Reasoning in Italian</title>
        <p>Table 1 shows the Italian Native-CoT demonstration used in our study:</p>
        <preformat>Q: Roger ha 5 palline da tennis. Ha comprato altre 2 lattine di palline da tennis. Ogni barattolo contiene 3 palline da tennis. Quante palline da tennis ha ora?
A: Roger inizia con 5 palline. 2 barattoli da 3 palline da tennis ciascuno fanno 6 palline da tennis. 5 + 6 = 11. La risposta è 11.</preformat>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup</title>
      <p>We take the next step by proposing an in-depth evaluation that studies the effect of the in-context demonstrations used in the reasoning methods. Hence, we conduct our analysis on different LLMs chosen by family, capabilities, and scope of construction (§3.2), with reasoning tasks (§3.1). The goal is to examine the impact of various types of demonstrations in Italian, addressing the limitations and enhanced functionality these methods can offer.</p>
      <p>Our experiments explore the following key points: a) constructing a robust evaluation by extending PAL (see Table 2) and applying Italian CoT methods on different models using carefully designed benchmarking tasks; b) investigating the effects of in-context demonstrations; c) analysing the varying effects of in-context reasoning methods across different models (e.g., models without any further adaptation, and models adapted for the Italian language).</p>
      <sec id="sec-3-data">
        <title>3.1. Data</title>
        <p>We introduce five different reasoning tasks: MGSM [9], MSVAMP [13], XNLI [16], PAWS-X [15], and XCOPA [14]; they have been constructed for multilingual evaluations and are described in detail in Appendix 7.</p>
      </sec>
      <sec id="sec-3-models">
        <title>3.2. Models</title>
        <p>We select LLMs based on performance and the purpose of their construction. These models are best exemplified by the GPT [17] and Llama-2 and -3 [20] families for the performances shown in multilingual reasoning tasks [9], two models from the Mistral family [19], and EuroLLM¹ [23] and Phi-3 [18] for the proficiency shown on the Italian leaderboard. Finally, discerning between training types, we select Italian-aligned models (Llamantino-2 [21] and Llamantino-3 [22]) and Italian-centred models (modello-Italia, Minerva-3b, and Minerva-1b). GPT-3.5 is used via API, while the other models are available in open-source format. Appendix 12 describes the parameters and versions used in detail. (We released data &amp; code at the following link.)</p>
        <p>¹ NB: we identify EuroLLM as Italian-centred even though it has been pre-trained on different European languages in the same way [23].</p>
      </sec>
      <sec id="sec-3-1">
        <title>PAL beyond English</title>
        <p>To extend the multilingual evaluation to the PAL reasoning method, we propose a specially constructed language-specific version by transferring the prompts proposed in [9] into program-like demonstrations, as done in [7]. Table 2 shows the Italian PAL demonstration:</p>
        <preformat>Q: Roger ha 5 palline da tennis. Ha comprato altre 2 lattine di palline da tennis. Ogni barattolo contiene 3 palline da tennis. Quante palline da tennis ha ora?
A: # Roger ha 5 palline da tennis.
tennis_balls = 5
# compra 2 lattine, ciascuna ha 3 palline da tennis
bought_balls = 2 * 3
# Le palline totali sono
answer = tennis_balls + bought_balls
# La risposta è 11</preformat>
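        <p>The final PAL prediction is obtained by executing the generated program and reading the value bound to the answer variable, rather than parsing free text. A minimal sketch of this execution step (the run_pal helper is ours for illustration; the program mirrors the demonstration above):</p>
        <preformat>generated_program = """
# Roger ha 5 palline da tennis.
tennis_balls = 5
# compra 2 lattine, ciascuna ha 3 palline da tennis
bought_balls = 2 * 3
# Le palline totali sono
answer = tennis_balls + bought_balls
"""

def run_pal(program):
    """Execute a generated PAL program and return its `answer` variable."""
    namespace = {}
    exec(program, namespace)  # NB: only execute trusted or sandboxed model output
    return namespace.get("answer")

print(run_pal(generated_program))  # -> 11</preformat>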
        <sec id="sec-3-1-1">
          <title>3.3. Prompting &amp; Evaluation</title>
          <p>We operate in two ways, concerning mathematical and understanding &amp; commonsense tasks. For mathematical tasks, we align the original CoT and PAL to Italian: we use Native-CoT [9] (Table 1) and the adapted method proposed in [27] (Appendix 10); concerning PAL, we introduce Italian demonstrations as in Table 2. For understanding and commonsense tasks, we define input templates that lead LLMs to follow the instructions and aid generation. We construct prompts following [29], using the CoT prompting method to elicit multi-step generations. Finally, we evaluate performance using the accuracy score. Hence, we measure the exact match between generated outputs and labels². We maintain the generation temperatures as recommended in the official papers. For GPT-3.5, we use the API, while for the others, we use the versions available on Hugging Face (see Appendix 12).</p>
          <p>² We extract target labels from the generated answers using regular expressions before calculating the exact match. For each task, we use Instruction Templates to guide the model to stable generations and facilitate evaluation.</p>
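          <p>As a concrete illustration of this scoring step, a minimal sketch (the regular expression and helper names are ours, not the released evaluation code):</p>
          <preformat>import re

def extract_label(generation):
    """Pull the final number out of a generation
    (e.g. '... La risposta è 11.') before exact-match scoring."""
    matches = re.findall(r"-?\d+(?:[.,]\d+)?", generation)
    return matches[-1].replace(",", ".") if matches else None

def accuracy(generations, labels):
    """Exact match between extracted answers and gold labels."""
    hits = sum(extract_label(g) == str(l) for g, l in zip(generations, labels))
    return hits / len(labels)

print(accuracy(["5 + 6 = 11. La risposta è 11."], ["11"]))  # 1.0</preformat>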
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results &amp; Discussions</title>
      <sec id="sec-4-1">
        <title>Large language models (LLMs) benefit from reasoning</title>
        <p>methods in English and in Italian as well. As discussed
in §4.1, the in-context demonstrations beyond English
elicit the LLMs to deliver multilingual reasoned answers;
however, the operation difers depending on the type of
method.</p>
        <p>Although demonstrations lead the models to generate
more robust answers, improving Italian as well, the
operation of these techniques appears to be efective only Figure 2: Diference between PAL and CoT (highlighted the
in some models. As analysed in §4.2, in-context ratio- original and adapted models)
nales in natural language have a diferent efect. On the
other side of the coin, structured program-of-thoughts
demonstrations lead the models to more stable
generations. Hence, the impact of in-context demonstrations
varies according to the quality and quantity of rationales
and the scale of model parameters (§4.3).</p>
        <p>Finally, in §4.4, we examine the efects of alignment
approaches by discerning the factors that influence the
generation of the final response and highlighting the
matter of native language demonstrations.
mainly positive, some phenomena emerge, such as
differences (the baseline Direct outperforms the reasoning
method) and a disparity between CoT and PAL between
Original- and Italian-Aligned models. Specifically, (i) PAL
(⋆) outperforms CoT (∙ ) in Figure 1 and (ii) the
ItalianAligned models outperform the Original-Model in Italian
task but not in English. To understand these
dynamics in depth in §4.2, we explore how the demonstration
structure impacts the models’ generations.
4.1. Reasoning in Italian
In-context reasoning methods empower the LLMs’ mul- 4.2. Natural Language Efects
tilingual performances in arithmetic and symbolic
reasoning tasks. Figure 1 shows the diferences
between Native-CoT and Native-PAL, and the baselines
(Direct). The use of in-context Italian demonstrations
brings clear benefits. GPT-3.5 and Llama-based models
(Llama2-70 and Llamantino3) obtain noticeable benefits
from Native-based prompting approaches (complete
results in Appendix 14). Although these LLMs benefit the
most from introducing reasoning methods in the
prompting stage, further improvements are observable even in
LLMs with fewer parameters (i.e., EuroLLM, Phi-3,
Llama2-7, and Llama3-8 as well adapted versions Llamantino-2
and -3, complete results in Appendices 15, 16). These
results demonstrate the sensitivity of Italian in-context
prompting in understanding and commonsense
reasoning (Appendix 17). However, although the averages are
The efect of the reasoning method relies on the
solution strategy. Structured in-context demonstrations in
a program-like manner are more efective than natural
language rationales. Figure 2 displays that the
diferences between Native-PAL and Native-CoT are
consistently positive. Moreover, the Italian-Aligned models
(i.e., Llamantino-based) obtain better results of original
models in Italian tasks when Native-PAL is used. Since
the natural language of in-context rationales does not
provide the same benefits as PAL, we examined the
generations delivered to investigate the origin of the
diferences.</p>
        <p>The results indicate that even though the CoT
incontext demonstrations in the Italian natural language
are the same as those in English, the generations have
diferent structures (Appendix 9, Table 7). In-depth, a
relationship emerges between performance and the
average number of steps required to get correct answers.</p>
        <p>The number of , i.e., the steps to reach the final
solution, represented by natural language sentences, are
on average between 2 and 5 for the Italian answers and
around 3 and 5 for English; in PAL, they are concentrated
around 3 and 4. This shows that natural language,
especially Italian, rich in intricate linguistic structures, is
not the best for solving mathematical, symbolic tasks. In
contrast, PAL seems more appropriate due to its rigid
structure and better support for generative reasoning
passages.
4.3. Demonstrations Impacts
In-context demonstrations play a key role in complex
tasks because they promote reasoning, as discussed in
§4.1. We investigated the performance trend as in-context
demonstrations increased, repeating the previous
experiments focusing on MGSM using zero- from 6-shots. The
results show that the impact of in-context demonstrations
across the languages is related to the quality and
quantity of demonstrations. A distinction emerges between
models and the number of de facto useful
demonstrations. GPT-3.5 with 4-shots achieves results comparable
to 6-shots (average accuracies in Figure 6). This balance
does not occur in Llama-based and Mixtral, which
underperforms as in-context demonstrations increase. Finally,
the smaller models have conspicuous improvements as
the number of demonstrations increases.</p>
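        <p>The zero- to 6-shot conditions differ only in how many demonstrations are prepended to the test question. A minimal sketch of this prompt construction (the helper name and abbreviated demonstrations are ours):</p>
        <preformat>def build_k_shot_prompt(demonstrations, question, k):
    """Prepend the first k in-context demonstrations to the test question;
    k = 0 corresponds to the Direct baseline."""
    shots = demonstrations[:k]
    return "\n\n".join(shots + ["Q: " + question + "\nA:"])

demos = [
    "Q: Roger ha 5 palline da tennis. ...\nA: ... La risposta è 11.",
    # ... up to six demonstrations drawn from the MGSM exemplars
]
print(build_k_shot_prompt(demos, "Quante palline restano?", k=1))</preformat>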
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Language of Reasoning Makes the Difference</title>
        <p>Multilingual in-context demonstrations aid LLMs in applying solution strategies; however, the language used to reason matters. By eliciting LLMs to deliver multi-step English answers, we observed significant improvements in accuracy. Complementing previous work, we used two strategies: (i) in-context demonstrations of reasoning answers in a specific language (Native-method); (ii) the same in-context setting, then eliciting the model to provide the solution in English (Cross-method). As shown in Table 3, the Cross-methods provide tangible benefits both in PAL and CoT. These latter results emphasize the LLMs' understanding and production abilities.</p>
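        <p>A minimal sketch of the two prompting conditions (the instruction wording is ours; the exact templates are in the appendices):</p>
        <preformat>def native_prompt(demos_it, question_it):
    """Native-method: Italian demonstrations, solution expected in Italian."""
    return "\n\n".join(demos_it + ["Q: " + question_it + "\nA:"])

def cross_prompt(demos_it, question_it):
    """Cross-method: same Italian demonstrations, but the model is
    elicited to provide the step-wise solution in English."""
    instruction = "Provide the step-by-step solution in English."
    return "\n\n".join(demos_it + [instruction, "Q: " + question_it + "\nA:"])</preformat>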
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Findings &amp; Future Works</title>
      <sec id="sec-5-1">
        <title>The advances of reasoning methods emerge beyond the</title>
        <p>We investigate the impact that reasoning methods cause English. Our analysis shows that properly elicited LLMs
on final performance by expanding the study about the can deliver reasoned answers in Italian as well. By
oprole and the limits of them in Italian. The main find- erating via CoT and PAL, we revealed that in-context
ings and tangible recommendations can be outlined as demonstrations play a strategic role in improving
per6. Conclusion
formance in direct proportion to their quality and quan- [7] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
tity. Our research highlights the need for a customised J. Callan, G. Neubig, Pal: Program-aided language
strategy for employing reasoning methods for LLMs. It models, arXiv preprint arXiv:2211.10435 (2022).
supports the demand for a reasonable combination of [8] W. Chen, X. Ma, X. Wang, W. W. Cohen, Program
model scale, reasoning technique, and strategic use of of thoughts prompting: Disentangling computation
in-context learning to elicit the prospect of multilingual from reasoning for numerical reasoning tasks, 2023.
LLMs. arXiv:2211.12588.
[9] F. Shi, M. Suzgun, M. Freitag, X. Wang, S.
Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
Acknowledgements D. Zhou, D. Das, J. Wei, Language models are
multilingual chain-of-thought reasoners, 2022.</p>
        <p>This work was funded by UK Research and Innovation arXiv:2210.03057.
(UKRI) under the UK government’s Horizon Europe fund- [10] H. Huang, T. Tang, D. Zhang, W. X. Zhao,
ing guarantee grant number 10039436 and PRIN 2022 T. Song, Y. Xia, F. Wei, Not all languages are
creProject - Class-tAIs CUP: E53D230081000. ated equal in llms: Improving multilingual
capability by cross-lingual-thought prompting, 2023.</p>
        <p>References arXiv:2305.07004.
[11] L. Ranaldi, G. Pucci, A. Freitas, Empowering
cross[1] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, lingual abilities of instruction-tuned large language
J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, models by translation-following demonstrations,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
FindG. Krueger, T. Henighan, R. Child, A. Ramesh, ings of the Association for Computational
LinD. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, guistics ACL 2024, Association for Computational
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, Linguistics, Bangkok, Thailand and virtual
meetC. Berner, S. McCandlish, A. Radford, I. Sutskever, ing, 2024, pp. 7961–7973. URL: https://aclanthology.
D. Amodei, Language models are few-shot learners, org/2024.findings-acl.473. doi: 10.18653/v1/2024.
2020. arXiv:2005.14165. findings-acl.473.
[2] O. Rubin, J. Herzig, J. Berant, Learning to retrieve [12] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti,
prompts for in-context learning, in: M. Carpuat, F. M. Zanzotto, A tree-of-thoughts to broaden
M.-C. de Marnefe, I. V. Meza Ruiz (Eds.), Pro- multi-step reasoning across languages, in: K. Duh,
ceedings of the 2022 Conference of the North H. Gomez, S. Bethard (Eds.), Findings of the
AssociAmerican Chapter of the Association for Com- ation for Computational Linguistics: NAACL 2024,
putational Linguistics: Human Language Tech- Association for Computational Linguistics,
Mexnologies, Association for Computational Linguis- ico City, Mexico, 2024, pp. 1229–1241. URL: https:
tics, Seattle, United States, 2022, pp. 2655–2671. //aclanthology.org/2024.findings-naacl.78. doi: 10.
URL: https://aclanthology.org/2022.naacl-main.191. 18653/v1/2024.findings-naacl.78.
doi:10.18653/v1/2022.naacl-main.191. [13] N. Chen, Z. Zheng, N. Wu, M. Gong, Y. Song,
[3] J. Zhao, Y. Xie, K. Kawaguchi, J. He, M. Xie, Auto- D. Zhang, J. Li, Breaking language barriers in
mulmatic model selection with large language models tilingual mathematical reasoning: Insights and
obfor reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), servations, 2023. arXiv:2310.20246.
Findings of the Association for Computational Lin- [14] E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu,
guistics: EMNLP 2023, Association for Computa- I. Vulić, A. Korhonen, XCOPA: A multilingual
tional Linguistics, Singapore, 2023, pp. 758–783. dataset for causal commonsense reasoning, in:
URL: https://aclanthology.org/2023.findings-emnlp. B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
Proceed55. doi:10.18653/v1/2023.findings-emnlp.55. ings of the 2020 Conference on Empirical
Meth[4] Y. Zhang, S. Feng, C. Tan, Active example selection ods in Natural Language Processing (EMNLP),
Asfor in-context learning, 2022. arXiv:2211.04486. sociation for Computational Linguistics, Online,
[5] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwa- 2020, pp. 2362–2376. URL: https://aclanthology.
sawa, Large language models are zero-shot reason- org/2020.emnlp-main.185. doi:10.18653/v1/2020.
ers, 2023. arXiv:2205.11916. emnlp-main.185.
[6] J. Wei, X. Wang, D. Schuurmans, M. Bosma, [15] Y. Yang, Y. Zhang, C. Tar, J. Baldridge,
PAWSB. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of- X: A cross-lingual adversarial dataset for
parathought prompting elicits reasoning in large lan- phrase identification, in: K. Inui, J. Jiang, V. Ng,
guage models, 2023. arXiv:2201.11903. X. Wan (Eds.), Proceedings of the 2019
Conference on Empirical Methods in Natural Language
[27] L. Qin, Q. Chen, F. Wei, S. Huang, W. Che, ing auto-regressive multi-layer artificial neural
netCross-lingual prompting: Improving zero-shot works to predict financial time series, Information
chain-of-thought reasoning across languages, in: 13 (2022). URL: https://www.mdpi.com/2078-2489/
H. Bouamor, J. Pino, K. Bali (Eds.), Proceed- 13/11/524. doi:10.3390/info13110524.
ings of the 2023 Conference on Empirical Meth- [35] L. Ranaldi, G. Pucci, F. Ranaldi, E. S. Ruzzetti,
ods in Natural Language Processing, Associa- F. M. Zanzotto, Empowering multi-step
reasontion for Computational Linguistics, Singapore, ing across languages via tree-of-thoughts, 2024.
2023, pp. 2695–2709. URL: https://aclanthology. arXiv:2311.08097.
org/2023.emnlp-main.163. doi:10.18653/v1/2023. [36] R. Li, L. B. Allal, Y. Zi, N. Muennighof, D. Kocetkov,
emnlp-main.163. C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu,
[28] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene,
S. Narang, A. Chowdhery, D. Zhou, Self-consistency M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O.
Shliimproves chain of thought reasoning in language azhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee,
models, 2023. arXiv:2203.11171. L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov,
[29] K. Ahuja, H. Diddee, R. Hada, M. Ochieng, Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D.
AbK. Ramesh, P. Jain, A. Nambi, T. Ganu, S. Se- ulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy,
gal, M. Ahmed, K. Bali, S. Sitaram, MEGA: U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P.
VilMultilingual evaluation of generative AI, in: legas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee,
H. Bouamor, J. Pino, K. Bali (Eds.), Proceed- N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf,
ings of the 2023 Conference on Empirical Meth- J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J.
ods in Natural Language Processing, Associa- Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy,
tion for Computational Linguistics, Singapore, D. Fried, D. Bahdanau, Y. Jernite, C. M.
Ferran2023, pp. 4232–4267. URL: https://aclanthology. dis, S. Hughes, T. Wolf, A. Guha, L. von Werra,
org/2023.emnlp-main.258. doi:10.18653/v1/2023. H. de Vries, Starcoder: may the source be with you!,
emnlp-main.258. 2023. arXiv:2305.06161.
[30] L. Ranaldi, G. Pucci, B. Haddow, A. Birch,
Empowering multi-step reasoning across languages via
program-aided language models, in: Y. Al-Onaizan,
M. Bansal, Y.-N. Chen (Eds.), Proceedings of the
2024 Conference on Empirical Methods in
Natural Language Processing, Association for
Computational Linguistics, Miami, Florida, USA, 2024, pp.
12171–12187. URL: https://aclanthology.org/2024.</p>
        <p>emnlp-main.678.
[31] L. Ranaldi, A. Freitas, Self-refine
instructiontuning for aligning reasoning in language
models, 2024. URL: https://arxiv.org/abs/2405.00402.</p>
        <p>arXiv:2405.00402.
[32] F. Ranaldi, E. S. Ruzzetti, D. Onorati, L. Ranaldi,</p>
        <p>C. Giannone, A. Favalli, R. Romagnoli, F. M.
Zanzotto, Investigating the impact of data
contamination of large language models in text-to-SQL
translation, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
Findings of the Association for Computational
Linguistics ACL 2024, Association for Computational
Linguistics, Bangkok, Thailand and virtual meeting,
2024, pp. 13909–13920. URL: https://aclanthology.
org/2024.findings-acl.827. doi: 10.18653/v1/2024.</p>
        <p>findings-acl.827.
[33] L. Ranaldi, G. Pucci, Knowing knowledge:
Epistemological study of knowledge in
transformers, Applied Sciences 13 (2023). URL: https://
www.mdpi.com/2076-3417/13/2/677. doi:10.3390/
app13020677.
[34] L. Ranaldi, M. Gerardi, F. Fallucchi, Cryptonet:
Us</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Proposed Task</title>
      <table-wrap>
        <caption><p>Benchmarks used in our evaluation.</p></caption>
        <table>
          <thead>
            <tr><th>Dataset</th><th>Task</th><th>Languages</th><th>#Languages</th></tr>
          </thead>
          <tbody>
            <tr><td>MGSM</td><td>mathematical reasoning</td><td>Bengali (bn), Chinese (zh), French (fr), Thai (th), German (de), Japanese (jp), Russian (ru), Telugu (te), Spanish (es), Swahili (sw), English (en)</td><td>11</td></tr>
            <tr><td>MSVAMP</td><td>mathematical reasoning</td><td>Bengali (bn), Chinese (zh), French (fr), Thai (th), German (de), Japanese (jp), Russian (ru), Spanish (es), Swahili (sw), English (en)</td><td>10</td></tr>
            <tr><td>XNLI</td><td>natural language inference</td><td>English (en), German (de), Russian (ru), French (fr), Spanish (es), Chinese (zh), Vietnamese (vi), Arabic (ar), Greek (el), Thai (th), Bulgarian (bg), Urdu (ur), Swahili (sw), Hindi (hi), Turkish (tr)</td><td>15</td></tr>
            <tr><td>XCOPA</td><td>commonsense reasoning</td><td>Chinese (zh), Italian (it), Vietnamese (vi), Turkish (tr), Thai (th), Estonian (et), Tamil (ta), Swahili (sw), Haitian (ht), Quechua (qu), Indonesian (id)</td><td>11</td></tr>
            <tr><td>PAWS-X</td><td>paraphrase identification</td><td>English (en), German (de), Japanese (jp), French (fr), Spanish (es), Chinese (zh), Korean (ko), Italian (it)</td><td>8</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-7">
      <title>8. In-context Demonstrations</title>
    </sec>
    <sec id="sec-8">
      <title>9. Natural Language Structure</title>
      <p>Analysing the composition of the languages in the answers provided by the different models is useful to understand whether a certain model follows the in-context prompts by generating language-specific answers and, if so, what the error rate is. To qualitatively estimate the generated responses, we analyse the sentences present in the responses generated by the models under study. Given an answer A composed of a set of sentences {s1, s2, ..., sn}, we define the number of steps of A as the number of sentences the model generates to deliver the solution. Since the in-context rationales provided have an average of 4 steps (min 3, max 5) [9] and do not include the final keyword "Answer:" or "The answer is:", we do not count the final keyword, for a more realistic value, as it often merely repeats the last sentence. Formally, let A be composed of n sentences, where the last sentence represents the final answer: the number of remaining sentences in A gives the total number of steps. We compute this value for the generations of the models analysed and report the results in Table 7.</p>
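      <p>A minimal sketch of this count (the sentence splitter and keyword filter are simplifications, ours for illustration):</p>
      <preformat>import re

FINAL_KEYWORDS = ("La risposta è", "The answer is", "Answer:")

def count_steps(answer):
    """Number of sentences in a generated answer, excluding the final
    'The answer is ...' restatement, as defined above."""
    parts = [s.strip() for s in re.split(r"[.!?]+\s+", answer.strip())]
    sentences = [s for s in parts if s]
    return sum(1 for s in sentences if not s.startswith(FINAL_KEYWORDS))

cot = ("Roger inizia con 5 palline. 2 barattoli da 3 palline "
       "fanno 6. 5 + 6 = 11. La risposta è 11.")
print(count_steps(cot))  # 3</preformat>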
    </sec>
    <sec id="sec-9">
      <title>10. State-of-art Prompting Methods</title>
      <p>The Cross-ToT prompt used in our evaluation:</p>
      <preformat>Cross-ToT

Simulate the collaboration of {n} mathematicians answering a question
in their mother tongue: L1, L2, ... and Ln. They all start Step 1 from
a separate thought process, step by step, each explaining their thought
process. Following Step 1, each expert refines and develops their
thought process by comparing themselves with others. This process
continues until a definitive answer to the question is obtained.

Question: [Question in Language 1]

Answer: [num].</preformat>
    </sec>
    <sec id="sec-10">
      <title>11. Program-Aided Language Models Prompts</title>
      <p>In this paper, as introduced in §3.3, we propose a novel cross-lingual extension of the Program-Aided Language Models [7] (Cross-PAL) method. The following blocks show the prompts used for the final evaluation.</p>
      <preformat>Program-Aided Language Models (PAL)

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: # Roger started with 5 tennis balls.
tennis_balls = 5
# 2 cans of 3 tennis balls each is
bought_balls = 2 * 3
# The answer is
answer = tennis_balls + bought_balls
# The answer is 11

Q: Kyle bought last year's best-selling book for $19.50. This is with
a 25% discount from the original price. What was the original price?
A:</preformat>
    </sec>
    <sec id="sec-11">
      <title>12. Model and Hyperparameters</title>
      <p>In our experimental setting, as introduced in Section 3.2, we propose different LLMs: (i) one model from the GPT family [17]: GPT-3.5 (gpt-3.5-turbo-0125); (ii) three models from the Llama family [20]: Llama2-7b, Llama2-70b, and Llama-3-8-instruct; (iii) two models of the MistralAI family: Mistral-7b and Mixtral [19]; and (iv) finally, Phi-3-mini [18].</p>
      <p>In particular, GPT models are used via API, while for the others, we used versions of the models quantized to 4-bit with GPTQ (see detailed versions in Table 12). Furthermore, we have added additional LLMs: three versions of Llama-based models adapted for Italian [21, 22] and three Italian-centred models (modello-Italia, Minerva-3b, and Minerva-1b).</p>
      <p>As discussed in the limitations, our choices are related to reproducibility and the cost associated with non-open-source models. We use the closed-source API and the 4-bit GPTQ quantized versions of the models on 8 48GB NVIDIA RTX A6000 GPUs for all experiments, performed only in inference.</p>
      <p>Finally, the generation temperature varies from t = 0 for the GPT models to t = 0.5 for the Llama2 models. We choose these temperatures for (mostly) deterministic outputs, with a maximum token length of 256. The other parameters are left unchanged, as recommended by the official resources. We will release the code and the dataset upon acceptance of the paper.</p>
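      <p>For reference, a minimal sketch of loading one of the 4-bit GPTQ checkpoints from Table 12 and generating with these settings (assuming a transformers installation with GPTQ support; ours for illustration, not the released code):</p>
      <preformat>from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"  # 4-bit GPTQ, Table 12
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Q: Roger ha 5 palline da tennis. ...\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,  # maximum token length used in our experiments
    do_sample=False,     # greedy decoding, i.e. temperature 0
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))</preformat>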
    </sec>
    <sec id="sec-12">
      <title>13. Models Versions</title>
      <table-wrap>
        <caption><p>Model versions used in our experiments.</p></caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Version</th></tr>
          </thead>
          <tbody>
            <tr><td>Llama2-7</td><td>meta-llama/Llama-2-7b</td></tr>
            <tr><td>Llama2-70</td><td>meta-llama/Llama-2-70b</td></tr>
            <tr><td>Llama3-8</td><td>meta-llama/Meta-Llama-3-8B-Instruct</td></tr>
            <tr><td>Phi-3-mini</td><td>microsoft/Phi-3-mini-128k-instruct</td></tr>
            <tr><td>Mistral-7</td><td>mistralai/Mistral-7B-Instruct-v0.2</td></tr>
            <tr><td>Mixtral8x7</td><td>TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ</td></tr>
            <tr><td>GPT-3.5-turbo</td><td>OpenAI API (gpt-3.5-turbo-0125)</td></tr>
            <tr><td>Llamantino2-70</td><td>swap-uniba/LLaMAntino-2-70b-hf-UltraChat-ITA</td></tr>
            <tr><td>Llamantino2-7</td><td>swap-uniba/LLaMAntino-2-chat-7b-hf-UltraChat-ITA</td></tr>
            <tr><td>Llamantino3-7</td><td>swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA</td></tr>
            <tr><td>modello-italia</td><td>sapienzanlp/modello-italia-9b-bf16</td></tr>
            <tr><td>Minerva-3b</td><td>sapienzanlp/Minerva-3B-base-v1.0</td></tr>
            <tr><td>Minerva-1b</td><td>sapienzanlp/Minerva-1B-base-v1.0</td></tr>
            <tr><td>EuroLLM</td><td>utter-project/EuroLLM-1.7B-Instruct</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-13">
      <title>14. Results Arithmetic Reasoning Tasks - English and Italian</title>
    </sec>
    <sec id="sec-14">
      <title>15. Results Arithmetic Reasoning Tasks - Italian-Aligned Models</title>
    </sec>
    <sec id="sec-15">
      <title>16. Results Arithmetic Reasoning Tasks - Italian-centred Models</title>
    </sec>
    <sec id="sec-16">
      <title>17. Results Commonsense, Inference, and Understanding Tasks</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>