
Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti

Tommaso Caselli

Malvina Nissim

Arianna Bisazza


Center for Language and Cognition (CLCG), University of Groningen, Oude Kijk in 't Jatstraat 26, 9712 EK Groningen, The Netherlands

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely driven by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.

Keywords: Large language models, Sequential reasoning, Puzzle, Rebus, Crosswords, Enigmistica Italiana

1. Introduction

including open-source systems and proprietary models, via few-shot prompting. Moreover, we fine-tune a small but capable LLM on verbalized rebus solving, outperforming state-of-the-art systems by a wide margin. Finally, we conduct a fine-grained assessment of LLMs’ sequential reasoning steps, explaining model performance in terms of word complexity and memorization.

Beyond rebus solving, our evaluation sheds light on the limits of current LLMs in multi-step reasoning settings, highlighting challenges with their application to complex sequential instruction-following scenarios.¹

2. Background and Related Work

Italian Enigmistica and Rebuses The Italian language is characterized by a rich and long-standing tradition of puzzle games, including rebuses, dating back to the 19th century [5].² In Italian rebuses, a first pass (prima lettura) representing an intermediate solution of the puzzle is produced by combining graphemes with underlying image elements in a left-to-right direction (Figure 1). Then, the letters and words of the first pass undergo a re-segmentation (cesura) according to a solution key (chiave di lettura³), which specifies the length of words in the solution (frase risolutiva). The verbalized rebuses we introduce in this work are variants of textual rebuses (rebus descritto or verbis), where the text-based puzzle is crafted by replacing first pass words with their crossword definitions in a templated format (Figure 1).

Linguistic Puzzles as NLP Progress Metrics Language games have recently been adopted as challenging tasks for LLM evaluation [3, 9, 10]. While works in this area have historically focused on English crosswords [11, 12, 4, 13], recent tests focus on a more diverse set of games such as the New York Times’ “Connections” [14] and “Wordle” [15]. Automatic crossword solvers were also developed for French [16], German [17] and Italian [18, 19], while didactic crossword generators are available for Italian [20] and Turkish [21]. Relatedly, the Italian evaluation campaign EVALITA⁴ recently hosted two shared tasks focusing on the word-guessing game “La Ghigliottina” (The Guillotine) [22, 23]. To our knowledge, our work is the first to attempt the computational modeling and evaluation of rebus-solving systems.

Importantly, language games such as rebuses are not easily translatable into other languages due to their structural and cultural elements. This makes them a scarce but valuable resource for language-specific evaluations of language processing systems.

LLMs as Sequential Reasoners State-of-the-art LLMs were shown to struggle to follow sequential instructions presented in a single query [24], but their performances improved significantly with ad-hoc training [25]. This acts as an initial motivation for our rebus-solving fine-tuning experiments. In our evaluation, we also adopt few-shot prompting [26] and chain-of-thought reasoning [27], which were both shown to strongly improve LLMs’ abilities when solving complex multi-step tasks.

3. Experimental Setup

Data We begin by extracting all rebuses’ first passes and solutions available on Eureka5,⁵ an online repository of Italian puzzles. We refer to the resulting dataset containing 223k unique rebuses sourced from various publications as EurekaRebus. For crossword definitions, we use ItaCW [20], containing 125k unique definition-word pairs. We select only EurekaRebus examples in which all first pass words match an existing ItaCW definition to enable verbalization, maintaining 83,157 examples for our modeling experiments.⁶ Since several ItaCW words are associated with multiple definitions, we randomly sample definitions to promote diversity in the resulting verbalized rebuses. A test set of 2k examples⁷ is kept aside for evaluation, and the remaining 81k examples are used for model training.

Models We fine-tune Phi-3 Mini 3.8B 4K [28], the most capable LLM below 4B parameters for a wide range of Italian language tasks.⁸ We use quantized low-rank adapters (QLoRA) [29, 30] for efficient fine-tuning with Unsloth⁹ and Transformers [31], training the model for 5,000 steps with a batch size of 16 over 81k examples. For comparing our model performances, we select GPT-4o [32] and Claude-3.5 Sonnet [33] as the current state-of-the-art for proprietary LLMs and the instruction-tuned variants of Qwen-2 72B [34] and LLaMA-3 70B [35] as the best-performing open-source LLMs according to the Invalsi Italian benchmark [36]. These four systems are used as untrained baselines thanks to their instruction-following abilities and prompted for rebus solving in a few-shot setting.

Format Table 1 presents an example in the templated format used for fine-tuning Phi-3.¹⁰ The model is prompted to reason step-by-step by 1) solving crossword definitions sequentially (definition resolution); 2) producing a first pass copying letters and definitions’ words; [...]

¹ Code, data and models are available on GitHub and Hugging Face.
² Refer to Miola [6], Bartezzaghi [7], Ichino [8] for a comprehensive overview of peculiarities and norms in modern Italian rebuses.
³ Referred to as diagramma in jargon.
⁴ https://www.evalita.it
⁵ http://www.eureka5.it, additional details in Appendix A. Rebus illustrations are not available in Eureka5.
⁶ Since verbalized rebuses are produced from textual contents only, crossword definitions may refer to different word meanings (e.g. [Two soccer players] is used to represent the word “wings” in Figure 1 despite not matching the word sense “bird wings” of the original image). This does not affect the validity of our task.
⁷ Composed by Test id and Test ood, described in Section 5.
⁸ https://hf.co/spaces/FinancialSupport/open_ita_llm_leaderboard
⁹ https://github.com/unslothai/unsloth
¹⁰ An English example is available in Table 9.

Prompt: Risolvi gli indizi tra parentesi per ottenere una prima lettura, e usa la chiave di lettura per ottenere la soluzione del rebus.
Rebus: U [Lo è il passacavallo] LO [È fatta di vimini] F F [Decimi di chilo] S [Disusato soprabito] A [Un rampicante dei Tropici]
Chiave di lettura: 3 6 12 8
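The two mechanical steps of a verbalized rebus, assembling the first pass and re-segmenting it with the solution key, can be illustrated with a minimal Python sketch. It uses the MOR rebus from Appendix C, whose definition answers (Talia, luci, geni) and solution are reported there; the function names `build_first_pass` and `resegment` are hypothetical and are not taken from the authors' code, and the D1/D2/D3 step labels are omitted from the input string.

```python
import re


def build_first_pass(rebus: str, answers: list[str]) -> str:
    """Replace each [definition] with its answer, keeping standalone letters."""
    remaining = iter(answers)
    # A token is either a bracketed definition or a run of plain letters.
    tokens = re.findall(r"\[[^\]]*\]|[^\s\[\]]+", rebus)
    return " ".join(next(remaining) if tok.startswith("[") else tok for tok in tokens)


def resegment(first_pass: str, key: list[int]) -> list[str]:
    """Split the first pass letters into words whose lengths follow the key."""
    letters = "".join(ch for ch in first_pass if ch.isalpha())
    if len(letters) != sum(key):
        raise ValueError("first pass length does not match the solution key")
    words, start = [], 0
    for length in key:
        words.append(letters[start:start + length])
        start += length
    return words


rebus = ("MOR [Musa della commedia] L [Si accendono per vedere] "
         "NO [Uomini di intelligenza superiore]")
first_pass = build_first_pass(rebus, ["talia", "luci", "geni"])
segments = resegment(first_pass, [7, 12])
solution = " ".join(segments).capitalize()
print(first_pass)  # MOR talia L luci NO geni
print(solution)    # Mortali allucinogeni
```

The hard part of the task, resolving each definition to the right word, is deliberately left out here: the answers are passed in by hand, and only the deterministic bookkeeping is shown.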

• First Pass Words/Letter Accuracy: Proportion of correct words and letters in the generated first pass. Lower scores may indicate issues with assembling a first pass from previous information.
• First Pass Exact Match (EM): Proportion of generated first passes matching the gold reference.
• Solution Key Match: Proportion of generated solution words matching the lengths specified by the solution key. Lower scores may indicate difficulty in respecting the given length constraints.
• Solution First Pass Match: Proportion of first pass characters employed to construct solution words. Lower scores indicate issues with using generated first pass characters in the solution.¹¹
• Solution Words Accuracy: Proportion of correct words in the generated solution.
• Solution Exact Match (EM): Proportion of generated solutions matching the gold reference.
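Three of the solution-level metrics listed above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the authors' evaluation code: the function names are hypothetical, and the position-wise, case-insensitive comparison and the choice of denominator (number of words in the key or in the gold solution) are assumptions.

```python
def solution_key_match(pred_words: list[str], key: list[int]) -> float:
    """Share of key positions where the predicted word has the required length."""
    if not key:
        return 0.0
    hits = sum(
        1 for i, length in enumerate(key)
        if i < len(pred_words) and len(pred_words[i]) == length
    )
    return hits / len(key)


def word_accuracy(pred_words: list[str], gold_words: list[str]) -> float:
    """Share of gold words reproduced at the same position (case-insensitive)."""
    if not gold_words:
        return 0.0
    hits = sum(
        1 for i, gold in enumerate(gold_words)
        if i < len(pred_words) and pred_words[i].lower() == gold.lower()
    )
    return hits / len(gold_words)


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the whole sequence matches the gold reference, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


gold = ["Mortali", "allucinogeni"]
pred = ["Mortali", "allucinogena"]  # hypothetical wrong prediction
print(solution_key_match(pred, [7, 12]))          # 1.0
print(word_accuracy(pred, gold))                  # 0.5
print(exact_match(" ".join(pred), " ".join(gold)))  # 0.0
```

The example shows why the metrics are reported separately: a prediction can respect every length constraint in the key while still containing a wrong word.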

4. Results

Model               Setup
LLaMA-3 70B         5-shot prompt
Qwen-2 72B          5-shot prompt
GPT-4o              5-shot prompt
Claude-3.5 Sonnet   5-shot prompt
Phi-3 3.8B (ours)   fine-tuned

Word Complexity and Frequency Affect LLM Fine-tuning Performance For every word in the first passes and solutions of test set examples, we measure LLMs’ overall accuracy in predicting it for the full test set. We then correlate this score to various quantities that could explain LLMs’ performances. More specifically, we use 1) the word frequency in the training set; 2) the word frequency in Paisà [38], a large web Italian corpus; and 3) the length of the word (number of characters). We find a significant positive correlation (correlation coefficient = 0.44) between first pass word prediction accuracy and training frequency for the fine-tuned Phi-3 model, suggesting that model performance is strongly related to training coverage. Word length in characters is also found to negatively affect our model’s performance, albeit to a smaller extent (correlation coefficient = −0.11). The performance of prompted models is unrelated to both properties for first pass words, indicating that these results are the product of fine-tuning.¹²

¹² Paisà frequency is never found to correlate significantly. Full correlation results are available in Table 6.

LLM Fine-Tuning Fails to Generalize to Unseen Words To further confirm the importance of fine-tuning word coverage in defining model performances, we evaluate our fine-tuned model in out-of-distribution settings. For this evaluation, the 2k examples of the test set from previous sections are divided into two subsets: one in which all first pass words were seen during fine-tuning by Phi-3 (Test id, 1061 examples) and one in which, for every example, at least one first pass word was unseen in training (Test ood, 939 examples). Intuitively, if Phi-3 performance is mainly driven by memorizing fine-tuning data, introducing OOD words should produce a significant drop in model performances.

Results shown in Table 3 confirm that this is indeed the case. We find Phi-3 performances to be near-perfect on seen first pass words (FP W. ID = 0.96) in both test sets, with a major drop for OOD words (FP W. OOD = 0.20). This produces second-order effects on subsequent steps, causing the FP EM results to drop by 71 percentage points (from 0.89 to 0.18; FP EM Test Δ), while significantly impacting downstream solution accuracies. On the contrary, GPT-4o few-shot prompting performances remain nearly identical on both splits, confirming that these results are not the product of a skewed data selection process. Overall, these results strongly suggest that memorization is the main factor behind the strong rebus-solving performance of our fine-tuned LLM.

            GPT-4o                        Phi-3 (ours)
Metric      Test id   Test ood   Δ        Test id   Test ood   Δ
FP W. ID    0.52      0.51       -0.01    0.96      0.96       0.00
FP W. OOD   -         0.44       -        -         0.20       -
FP EM       0.16      0.14       -0.02    0.89      0.18       -0.71
S W. ID     0.29      0.26       -0.03    0.92      0.49       -0.43
S W. OOD    0.18      0.16       -0.02    0.63      0.20       -0.40
S EM        0.12      0.09       -0.03    0.82      0.16       -0.66

Table 3
Model performances for test subsets containing only in-domain (Test ID), or some out-of-domain (Test OOD) first pass words. W. ID and W. OOD are accuracies for ID and OOD words for first pass (FP) and solution (S) sequences. Test Δ = change in performance from Test ID to Test OOD.
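The ID/OOD partition of the test set described above amounts to a vocabulary check against the training data. A minimal Python sketch follows, under the assumption that each example stores its first pass words in a list; the field name `fp_words`, the `id` field, and the toy examples are hypothetical and do not reflect the released dataset schema.

```python
def split_id_ood(test_examples: list[dict], train_examples: list[dict]):
    """Partition test examples by whether all first pass words were seen in training."""
    train_vocab = {w.lower() for ex in train_examples for w in ex["fp_words"]}
    test_id, test_ood = [], []
    for ex in test_examples:
        all_seen = all(w.lower() in train_vocab for w in ex["fp_words"])
        # One unseen first pass word is enough to send the example to Test ood.
        (test_id if all_seen else test_ood).append(ex)
    return test_id, test_ood


# Toy data for illustration only.
train = [{"fp_words": ["talia", "luci"]}, {"fp_words": ["geni"]}]
test = [
    {"id": 1, "fp_words": ["talia", "geni"]},   # every word seen in training
    {"id": 2, "fp_words": ["luci", "salice"]},  # "salice" never seen in training
]
test_id, test_ood = split_id_ood(test, train)
print([ex["id"] for ex in test_id], [ex["id"] for ex in test_ood])  # [1] [2]
```

Because the criterion is per example rather than per word, an example in Test ood can still contain seen words, which is why Table 3 reports both W. ID and W. OOD accuracies for that subset.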

Manual Inspection We conclude by manually evaluating some generations produced by the best-performing LLMs. Table 4 presents two examples with definitions (D) and solution (S) words predicted by three LLMs, with more examples provided in Appendix C. We use naw as short-hand for “Not A Word” to mark nonsensical terms.

In the first example, Phi-3 correctly predicts all first pass and solution words. On the contrary, other models make several mistakes in the first pass, leading to incorrect solutions. Both prompted models tend to ignore first pass words when these cannot be assembled to form meaningful, length-fitting solution words. For example, for D1 GPT-4o predicts p (naw), which would lead to the solution word “SAPpTE” (naw), but the model instead predicts S8 = “Spettacolo” (show). In particular, GPT-4o appears to prioritize grammatically correct solutions at the cost of ignoring first pass words and solution key length constraints, while Claude 3.5S shows an improved ability to follow these constraints, as confirmed by Key/FP Match results of Table 2.

In the second example, the first pass word D2 = salice (willow) is OOD for Phi-3. Consequently, the model produces the incorrect prediction aro (naw), and the error is propagated to all solution words, as previously observed in the Test OOD column of Table 3. Prompted models also underperform in this example, with errors on D1 and D2 propagating to most solution words. However, we note that Claude 3.5S's incorrect predictions for D1 and D2 still satisfy the provided definitions, suggesting that access to more explicit information about the given constraints could further boost LLMs’ performance on this task.


6. Discussion and Conclusion

This work introduced a verbalized rebus-solving task and dataset for evaluating LLMs’ sequential instruction-following skills for the Italian language. We crafted a large collection of 83k verbalized rebuses by combining rebus transcriptions with crossword definitions and used it to evaluate the rebus-solving skills of state-of-the-art LLMs. Our experiments revealed the challenging nature of this task, with even the most capable prompted models achieving only 24% accuracy on solutions.

While fine-tuning a smaller LLM dramatically improved performance to 51% solution accuracy, our analysis uncovered that these gains were largely driven by memorization and do not generalize to out-of-distribution examples. These results suggest important limitations in the generalization capabilities of current systems for sequential instruction-following tasks. Our manual analysis further shows that LLMs seldom account for length constraints when solving definitions, despite the fundamental role of these cues in restricting the pool of possible words. These results suggest that search-based approaches accounting for constraints more explicitly might improve puzzle structure adherence, as previously shown by Chen et al. [39]. Other augmentation techniques employing LLM reformulation skills can also be explored to mitigate overfitting.

Future work in this area should focus on expanding similar evaluations to a wider set of languages, input modalities, and puzzle categories, creating a comprehensive benchmark to test LLMs’ puzzle-solving skills. Importantly, the task of solving visual rebuses and their more convoluted variants¹³ remains far beyond the current capabilities of vision-language models. Hence, solving these puzzles automatically can be considered an important milestone in developing multimodal AI systems for constrained multi-step reasoning tasks. Our results confirm that the challenging nature of rebuses, even in their verbalized form, makes this task valuable for assessing future progress in LLMs’ linguistic proficiency and sequential reasoning abilities. Finally, our rebus-solving LLM can facilitate future interpretability work investigating the mechanisms behind factual recall and multi-step reasoning in transformer models [40].

Limitations Our analysis was limited to a relatively small set of models, and a single prompt template obtained after minimal tuning. Further experiments are needed to verify that memorization patterns after fine-tuning remain relevant for other model sizes, prompt formats, and training regimes, particularly for full-weight training approaches.

¹³ For example, rebuses requiring first pass anagrams (anarebus) or dynamic relations derived from multi-scene analysis (stereorebus).

Acknowledgments

Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We are grateful to the Associazione Culturale “Biblioteca Enigmistica Italiana - G. Panini” for making its rebus collection freely accessible on the Eureka5 platform, and to Valeriya Zelenkova for her valuable comments on the first version of this work. We also thank the CLiC-it 2024 reviewers for their valuable feedback.

A. Additional Data Information

Dataset statistics Table 5 presents statistics for the EurekaRebus dataset and the filtered subset we use for composing verbalized rebuses. The ItaCW dataset contains a total of 125,202 definitions for 40,963 unique words, with the most frequent words having hundreds of different definitions, e.g. 173 for re (king), 155 for te (you). Definitions used for verbalization are randomly sampled from the pool of available definitions for every word.

B. Additional Results

Table 6 presents the correlations between model accuracy and the properties presented in Section 5. Table 7 presents the full ID/OOD performances for all tested models, showing consistent results with Table 3 for all prompted models. Table 8 presents Phi-3 Mini performances across rebus-solving fine-tuning steps.
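The correlation analysis summarized in Table 6 relates per-word prediction accuracy to properties such as training frequency. The sketch below shows one way to compute such a score in pure Python. The type of coefficient is not recoverable from this text, so the use of Spearman's rank correlation is an assumption, and the word list, accuracies, and function names are hypothetical toy values rather than the paper's data.

```python
from collections import Counter
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: first pass words seen in training and hypothetical per-word accuracies.
train_freq = Counter(["re", "re", "re", "te", "te", "ali", "cane"])
per_word_accuracy = {"re": 0.9, "te": 0.8, "ali": 0.4, "cane": 0.5}
words = sorted(per_word_accuracy)
rho = spearman([train_freq[w] for w in words], [per_word_accuracy[w] for w in words])
print(round(rho, 3))
```

The same function can be reused with word length or corpus frequency as the first argument; significance testing and multiple-comparison correction are not covered by this sketch.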

C. Additional Model Generations

Prompt: Solve the clues provided between parentheses to obtain a first pass, and use the solution key to obtain the rebus’ solution.
Rebus: M [Two attacking footballers] N [Used for eating icecream] [Barks and bites] NIA
Solution key: 11 5
Generation: Let’s proceed to solve the rebus step by step: [...]
Soluzione: MORTali aLluciNOgeni
[...] Phi-3 Mini.

Rebus: MOR [Musa della commedia] D1 L [Si accendono per vedere] D2 NO [Uomini di intelligenza superiore] D3
Chiave di lettura: 7 12

Step   GPT-4o         Claude 3.5S    Phi-3
D1     Talia          Talia          Talia
D2     luci           luci           luci
D3     geni           geni           genii
S7     Mortali        Mortali        Mortali
S12    allucinogeni   allucinogeni   allucinogeni

Rebus: PRI [Ricoperto di sudore] D1 MIN [Gli altari del tempio] D2 DO [Un ordigno bellico] D3 [Possono essere “di serie” in certi tornei] D4 SSO
Chiave di lettura: 5 2 8 6 2 6

Rebus: B [Una figura geometrica] D1 [La si impugna per far girare un congegno] D2 DA [Le produce il rovo] D3

First pass/Solution word distribution Figure 2 shows the distribution of first pass and solution words for the filtered EurekaRebus subset used in our work.

Statistic             EurekaRebus    ItaCW-filtered
# examples            222089         83157
# authors             8138           5046
Year range            1800-2024      1869-2024
First pass
# unique words        38977          8960
Avg./SD words/ex.     3.50/1.48      3.08/1.00
Avg./SD word len.     6.51/1.96      5.70/1.60
Avg./SD FP len.       26.45/11.19    25.74/8.73
Solution
# unique words        75718          42558
Avg./SD words/ex.     3.02/1.60      2.80/1.21
Avg./SD word len.     8.07/2.30      7.79/2.23
Avg./SD Sol. len.     19.47/8.44     18.81/6.06

Table 5
Statistics for the full EurekaRebus dataset and the crosswords-filtered subset used in this work. Avg./SD = Average/standard deviation.

Model           # Char.   Paisà Freq.   Train Freq.
First pass words
GPT-4o          -0.01     0.01          0.02
Claude-3.5      -0.02     -0.02         0.00
Phi-3 (ours)    -0.11     -0.05         0.44
Solution words
GPT-4o          -0.18     0.14          0.19
Claude-3.5      -0.15     0.08          0.13
Phi-3 (ours)    -0.02     0.08          0.22

Table 6
[...] < 1e-5 [41]

References

[1] Silver, Huang, C. J. Maddison, A. Guez, Sifre, G. van den Driessche, J. Schrittwieser, Antonoglou, Panneershelvam, M. Lanctot, Dieleman, Grewe, Nham, N. Kalchbrenner, Sutskever, Lillicrap, Leach, Kavukcuoglu, [...] 529 (2016) 484-489. doi:10.1038/nature16961.
[2] Silver, Hubert, Schrittwieser, I. Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, [...] A general reinforcement learning algorithm that [...] Science 362 (2018) 1140-1144. doi:10.1126/science.aar6404.
[3] Rozner, Potts, Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for nlp, in: M. Ranzato, [...] (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409-11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[4] Wallace, Tomlin, Xu, Yang, E. Pathak, Ginsberg, Klein, Automated crossword solving, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073-3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[5] D. Tolosani, Enimmistica, Hoepli, Milan, 1901.
[6] Miola, Che cos'è un rebus, Carocci, 2020.
[7] Bartezzaghi, Parole in gioco: Per una semiotica del gioco linguistico, Bompiani, 2017.
[8] Ichino, L'ora desiata vola: guida al mondo del [...] Milan, 2021.
[9] Manna, M. P. di Buono, J. Monti, Riddle me [...] of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97-106. URL: https://aclanthology.org/2024.games-1.11.
[10] Giadikiaroglou, Lymperaiou, G. Filandrianos, [...] language models: A survey, arXiv (2024). URL: https://arxiv.org/abs/2402.11291.
[11] M. L. Littman, G. A. Keim, Shazeer, A probabilistic approach to solving crossword puzzles, Artificial Intelligence 134 (2002) 23-55. URL: https://www.sciencedirect.com/science/article/pii/S000437020100114X. doi:https://doi.org/10.1016/S0004-3702(01)00114-X.
[12] Ernandes, G. Angelini, Gori, We[...] [...] Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11590323_37.
[13] Boda, Sadallah, Kotova, Kochmar, S. Yao, K. N. 2023, Yousefi, Betthauser, H. Hasan[...]garini, T. Röthenbacher, Klede, M. Ernandes, B. M. Eskofier, D. Z. 2023, Are llms good cryptic crossword solvers?, arXiv (2024). URL: https://arxiv.org/abs/2403.12094.
[14] Todd, Merino, Earle, Togelius, Missed [...] language models, arXiv (2024). URL: https://arxiv.org/abs/2404.11730.
[15] B. J. Anderson, J. G. Meyer, Finding the optimal human strategy for wordle using maximum correct letter probabilities and reinforcement learning, arXiv (2022). URL: https://arxiv.org/abs/2202.00557.
[16] Angelini, Ernandes, T. Iaquinta, C. Stehlé, [...] The webcrow french crossword solver, in: Intelligent Technologies for Interactive Entertainment, 2023. URL: https://link.springer.com/chapter/10.1007/978-3-031-55722-4_14.
[17] Zugarini, Rothenbacher, Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die rätselrevolution: Automated german crossword solving, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[18] Angelini, Ernandes, Gori, Solving ital[...] Conference of the Italian Association for Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11558590_40.
[19] Zugarini, Zeinalipour, S. S. Kadali, Maggini, [...] Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347-3356. URL: https://aclanthology.org/2024.lrec-main.297.
[20] Zeinalipour, Iaquinta, Zanollo, G. Angelini, Rigutini, Maggini, Gori, Italian crossword [...]tive word puzzles, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[21] Zeinalipour, Y. G. Keptig, Maggini, L. Rigutini, Gori, A turkish educational crossword puzzle generator, arXiv abs/2405.07035 (2024). URL: https://arxiv.org/abs/2405.07035v2.
[22] Basile, Lovetere, Monti, Pascucci, F. Sangati, L. Siciliani, Ghigliottin-ai@evalita2020: Evaluating artificial players for the language game "la ghigliottina" (short paper), EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://doi.org/10.4000/books.aaccademia.7488.
[23] Basile, M. de Gemmis, P. Lops, G. Semeraro, Solv[...] [...]based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13-26. doi:10.1109/TCIAIG.2014.2355859.
[24] Chen, Liao, Qi, Eustratiadis, C. Monz, Bisazza, M. de Rijke, The sifo benchmark: Inves[...] [...]ity of large language models, 2024. URL: https://arxiv.org/abs/2406.19999. arXiv:2406.19999.
[25] Hu, Yu, Chen, E. M. Ponti, Fine-tuning [...] arXiv (2024). URL: https://arxiv.org/abs/2403.07794.
[26] Brown, Mann, Ryder, Subbiah, J. D. [...] Krueger, Henighan, Child, A. Ramesh, Ziegler, Wu, Winter, Hesse, M. Chen, Sigler, Litwin, Gray, Chess, J. Clark, Berner, McCandlish, Radford, I. Sutskever, Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, [...] Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[27] Wei, Wang, Schuurmans, M. Bosma, [...] A. Agarwal, Belgrave, Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824-24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[28] Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, [...] A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, Q. C. et al., Phi-3 technical report: A highly capable language model locally on your phone, arXiv (2024). URL: https://arxiv.org/abs/2404.14219.
[29] E. J. Hu, Y. Shen, P. Wallis, Allen-Zhu, Li, Wang, Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations (ICLR 2022), OpenReview, Online, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[30] Dettmers, Pagnoni, Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, in: A. Oh, Naumann, Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 10088-10115. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf.
[31] Wolf, Debut, Sanh, Chaumond, C. De[...]towicz, J. Davison, Shleifer, P. von Platen, C. Ma, Jernite, Plu, Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: [...] Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural [...] Online, 2020, pp. 38-45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[32] OpenAI, Hello gpt-4o, Website, 2024. URL: https://openai.com/index/hello-gpt-4o.
[33] Anthropic, Claude 3.5 sonnet, Website, 2024. [...]claude-3-5-sonnet.
[34] Yang, Hui, Zheng, Yu, Zhou, Li, Liu, Huang, G. Dong, Wei, Lin, Tang, Wang, Yang, Tu, Zhang, J. Ma, Xu, Zhou, Bai, He, Lin, Dang, Lu, Bai, Tan, Zhu, Li, Liu, Ge, Deng, Zhou, Ren, Zhang, Wei, Ren, Fan, Yao, Zhang, Wan, Chu, Liu, Cui, Zhang, Fan, Qwen2 technical report, 2024. URL: https://arxiv.org/abs/2407.10671.
[35] M. AI, Introducing meta llama 3: The most capable openly available llm to date, Website, 2024. URL: https://ai.meta.com/blog/meta-llama-3.
[36] Mercorio, Mezzanzanica, Potertì, Serino, [...]ciency on the invalsi italian benchmark, 2024. URL: https://arxiv.org/abs/2406.17535.
[37] Morris, Maier, Green, From wer and ril [...] connected speech recognition, 2004.
[38] Lyding, Stemle, Borghetti, Brunello, Castagnoli, Dell'Orletta, Dittmann, Lenci, Pirrelli, The PAISÀ corpus of Italian web texts, in: F. Bildhauer, R. Schäfer (Eds.), Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 36-43. URL: https://aclanthology.org/W14-0406. doi:10.3115/v1/W14-0406.
[39] Chen, J. Liu, Jiang, Wang, Liang, [...] Automated Planning and Scheduling 32 (2022) 35-43. [...]view/19783. doi:10.1609/icaps.v32i1.19783.
[40] Ferrando, Sarti, Bisazza, M. R. Costa-jussà, A primer on the inner workings of transformer-based language models, arXiv (2024). URL: https://arxiv.org/abs/2405.00208.
[41] Bonferroni, Teoria statistica delle classi e calcolo [...] Firenze 8 (1936) 3-62.