Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti¹,*, Tommaso Caselli¹, Malvina Nissim¹ and Arianna Bisazza¹
¹ Center for Language and Cognition (CLCG), University of Groningen, Oude Kijk in 't Jatstraat 26, 9712 EK Groningen, The Netherlands

Abstract

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.

Keywords: Large language models, Sequential reasoning, Puzzle, Rebus, Crosswords, Enigmistica Italiana

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073), pp. 1-10.
* Corresponding author: g.sarti@rug.nl (G. Sarti); t.caselli@rug.nl (T. Caselli); m.nissim@rug.nl (M. Nissim); a.bisazza@rug.nl (A. Bisazza). Websites: https://gsarti.com (G. Sarti); https://cs.rug.nl/~bisazza (A. Bisazza). ORCID: 0000-0001-8715-2987 (G. Sarti); 0000-0003-2936-0256 (T. Caselli); 0000-0001-5289-0971 (M. Nissim); 0000-0003-1270-3048 (A. Bisazza).
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Complex games such as chess and Go have long been a source of inspiration to develop more flexible and robust AI systems [1, 2]. Recent developments in NLP suggested that creative language games could be exploited as promising benchmarks for quantifying the ability of large language models (LLMs) to carry out multi-step knowledge-intensive reasoning tasks under pre-specified constraints [3]. While crossword puzzles have historically been the main focus of such efforts [4], other categories of linguistic games received only marginal attention, especially for languages other than English. A prominent example of less-studied language games is the rebus, a visual puzzle combining images and graphic signs to encode a hidden phrase. Indeed, rebus solving is a complex, multi-step process requiring factual knowledge, contextual understanding, vocabulary usage, and reasoning within pre-defined constraints, a set of fundamental skills to address a variety of real-world tasks.

In this work, we conduct the first open evaluation of LLMs' rebus-solving capabilities, focusing specifically on the Italian language. We propose a novel strategy to derive text-only verbalized rebuses from transcribed intermediate rebus solutions and use it to produce a large collection with more than 80k verbalized rebuses. We then evaluate the rebus-solving skills of state-of-the-art LLMs, including open-source systems and proprietary models, via few-shot prompting. Moreover, we fine-tune a small but capable LLM on verbalized rebus solving, outperforming state-of-the-art systems by a wide margin. Finally, we conduct a fine-grained assessment of LLMs' sequential reasoning steps, explaining model performance in terms of word complexity and memorization. Beyond rebus solving, our evaluation sheds light on the limits of current LLMs in multi-step reasoning settings, highlighting challenges with their application to complex sequential instruction-following scenarios.¹

¹ Code, data and models are available on Github and Huggingface.

Figure 1: An example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. We use verbalized rebuses to test LLMs' sequential instruction following capabilities. Image from Settimana Enigmistica n. 4656, © Bresi S.r.l.
    Image elements: Ali (wings), Cane (dog), Coni (cones)
    First pass: M ali - N coni - cane NIA
    Verbalized rebus: M [Due calciatori attaccanti] (Two attacking footballers) N [Usati per mangiare il gelato] (Used for eating ice cream) [Abbaia e morde] (Barks and bites) NIA
    Solution key (# of chars/word): 11 5
    Solution: Malinconica nenia (melancholic lullaby)

2. Background and Related Work

Italian Enigmistica and Rebuses. The Italian language is characterized by a rich and long-standing tradition of puzzle games, including rebuses, dating back to the 19th century [5].² In Italian rebuses, a first pass (prima lettura) representing an intermediate solution of the puzzle is produced by combining graphemes with underlying image elements in a left-to-right direction (Figure 1). Then, the letters and words of the first pass undergo a re-segmentation (cesura) according to a solution key (chiave di lettura³), which specifies the length of words in the solution (frase risolutiva). The verbalized rebuses we introduce in this work are variants of textual rebuses (rebus descritto or verbis), where the text-based puzzle is crafted by replacing first pass words with their crossword definitions in a templated format (Figure 1).

² Refer to Miola [6], Bartezzaghi [7], Ichino [8] for a comprehensive overview of peculiarities and norms in modern Italian rebuses.
³ Referred to as diagramma in jargon.

Linguistic Puzzles as NLP Progress Metrics. Language games have recently been adopted as challenging tasks for LLM evaluation [3, 9, 10]. While works in this area have historically focused on English crosswords [11, 12, 4, 13], recent tests focus on a more diverse set of games such as the New York Times' "Connections" [14] and "Wordle" [15]. Automatic crossword solvers were also developed for French [16], German [17] and Italian [18, 19], while didactic crossword generators are available for Italian [20] and Turkish [21]. Relatedly, the Italian evaluation campaign EVALITA⁴ recently hosted two shared tasks focusing on the word-guessing game "La Ghigliottina" (The Guillotine) [22, 23]. To our knowledge, our work is the first to attempt the computational modeling and evaluation of rebus-solving systems. Importantly, language games such as rebuses are not easily translatable into other languages due to their structural and cultural elements. This makes them a scarce but valuable resource for language-specific evaluations of language processing systems.

⁴ https://www.evalita.it

LLMs as Sequential Reasoners. State-of-the-art LLMs were shown to struggle to follow sequential instructions presented in a single query [24], but their performances improved significantly with ad-hoc training [25]. This acts as an initial motivation for our rebus-solving fine-tuning experiments. In our evaluation, we also adopt few-shot prompting [26] and chain-of-thought reasoning [27], which were both shown to strongly improve LLMs' abilities when solving complex multi-step tasks.

3. Experimental Setup

Data. We begin by extracting all rebuses' first passes and solutions available on Eureka5⁵, an online repository of Italian puzzles. We refer to the resulting dataset containing 223k unique rebuses sourced from various publications as EurekaRebus. For crossword definitions, we use ItaCW [20], containing 125k unique definition-word pairs. We select only EurekaRebus examples in which all first pass words match an existing ItaCW definition to enable verbalization, maintaining 83,157 examples for our modeling experiments.⁶ Since several ItaCW words are associated with multiple definitions, we randomly sample definitions to promote diversity in the resulting verbalized rebuses. A test set of 2k examples⁷ is kept aside for evaluation, and the remaining 81k examples are used for model training.

⁵ http://www.eureka5.it, additional details in Appendix A. Rebus illustrations are not available in Eureka5.
⁶ Since verbalized rebuses are produced from textual contents only, crossword definitions may refer to different word meanings (e.g. [Two soccer players] is used to represent the word "wings" in Figure 1 despite not matching the word sense "bird wings" of the original image). This does not affect the validity of our task.
⁷ Composed of Test id and Test ood, described in Section 5.

Models. We fine-tune Phi-3 Mini 3.8B 4K [28], the most capable LLM below 4B parameters for a wide range of Italian language tasks⁸. We use quantized low-rank adapters (QLoRA; [29, 30]) for efficient fine-tuning with Unsloth⁹ and Transformers [31], training the model for 5,000 steps with a batch size of 16 over 81k examples. To compare our model's performances, we select GPT-4o [32] and Claude-3.5 Sonnet [33] as the current state-of-the-art proprietary LLMs, and the instruction-tuned variants of Qwen-2 72B [34] and LLaMA-3 70B [35] as the best-performing open-source LLMs according to the Invalsi Italian benchmark [36]. These four systems are used as untrained baselines thanks to their instruction-following abilities and prompted for rebus solving in a few-shot setting.

⁸ https://hf.co/spaces/FinancialSupport/open_ita_llm_leaderboard
⁹ https://github.com/unslothai/unsloth

Format. Table 1 presents an example in the templated format used for fine-tuning Phi-3.¹⁰ The model is prompted to reason step-by-step by 1) solving crossword definitions sequentially (definition resolution); 2) producing a first pass copying letters and definitions' words; 3) re-segmenting it into solution words based on the solution key (solution segmentation); and finally 4) producing the solution by copying re-segmented words. We automatically convert rebuses in this format by deriving the solution key from solution word lengths and dynamically infilling the available information into the template. We use a similar format for prompting experiments, with five in-context step-by-step demonstrations and an explicit instruction asking the model to stick to the previous examples' format to streamline solution parsing.

¹⁰ An English example is available in Table 9.

Prompt:
    Risolvi gli indizi tra parentesi per ottenere una prima lettura, e usa la chiave di lettura per ottenere la soluzione del rebus.
    Rebus: U [Lo è il passacavallo] LO [È fatta di vimini] F F [Decimi di chilo] S [Disusato soprabito] A [Un rampicante dei Tropici]
    Chiave di lettura: 3 6 12 8
Generation:
    Procediamo alla risoluzione del rebus passo per passo:
    (Definition resolution)
    - U = U
    - [Lo è il passacavallo] = nave
    - LO = LO
    - [È fatta di vimini] = cesta
    - F F = F F
    - [Decimi di chilo] = etti
    - S = S
    - [Disusato soprabito] = tait
    - A = A
    - [Un rampicante dei Tropici] = liana
    (First pass)
    Prima lettura: U nave LO cesta F F etti S tait A liana
    (Solution segmentation)
    Ora componiamo la soluzione seguendo la chiave risolutiva:
    3 = Una
    6 = veloce
    12 = staffettista
    8 = italiana
    (Solution)
    Soluzione: Una veloce staffettista italiana

Table 1: Template used in fine-tuning and prompting experiments with highlighted reasoning stages. Example rebus by Il Piacentino, Settimana Enigmistica n. 2942.

Metrics. For our granular evaluation of rebus-solving performance, we adopt the following set of metrics focusing on the first passes (FP) and solutions (S) generated by LLMs:

- Definition (Def.): Proportion of correctly guessed words during definition resolution.
- First Pass Words/Letters Accuracy: Proportion of correct words and letters in the generated first pass. Lower scores may indicate issues with assembling a first pass from previous information.
- First Pass Exact Match (EM): Proportion of generated first passes matching the gold reference.
- Solution Key Match: Proportion of generated solution words matching the lengths specified by the solution key. Lower scores may indicate difficulty in respecting the given length constraints.
- Solution First Pass Match: Proportion of first pass characters employed to construct solution words. Lower scores indicate issues with using generated first pass characters in the solution.¹¹
- Solution Words Accuracy: Proportion of correct words in the generated solution.
- Solution Exact Match (EM): Proportion of generated solutions matching the gold reference.

¹¹ In practice, we define this as 1 − CER(FP, S), where CER is the character error rate [37] between the two sequences (lowercased, whitespace removed) computed with Jiwer.

4. Results

Table 2 presents our evaluation results. We observe that all prompted models perform poorly on the task, with the overall best prompted system (Claude 3.5 Sonnet) obtaining the correct solution only for 24% of the 2k tested examples. Notably, open-source systems perform significantly worse than proprietary ones, producing correct first passes only for 4% of the examples, and next to no correct solutions. Our fine-tuned system largely outperforms all state-of-the-art prompted models, predicting the correct solution in 51% of cases. From first pass metrics, it is evident these results can be largely explained by the poor word-guessing capabilities of the models, which are greatly improved with fine-tuning. For prompted models, the slight decrease in scores between Def. and FP Words also highlights issues with copying predicted words in the expected format. Finally, we observe that fine-tuning strongly improves the constraint-following abilities of our system, with prompted systems being less strict in applying length and letter-choice constraints for their solutions (Key/FP Match).

Model | Setup | Def. | FP Words | FP Letters | FP EM | S Key Match | S FP Match | S Words | S EM
LLaMA-3 70B | 5-shot prompt | 0.22 | 0.20 | 0.60 | 0.04 | 0.16 | 0.51 | 0.03 | 0.00
Qwen-2 72B | 5-shot prompt | 0.28 | 0.25 | 0.76 | 0.04 | 0.20 | 0.52 | 0.04 | 0.00
GPT-4o | 5-shot prompt | 0.55 | 0.51 | 0.83 | 0.15 | 0.53 | 0.74 | 0.27 | 0.11
Claude-3.5 Sonnet | 5-shot prompt | 0.66 | 0.62 | 0.90 | 0.28 | 0.83 | 0.82 | 0.43 | 0.24
Phi-3 3.8B (ours) | fine-tuned | 0.84 | 0.84 | 1.00 | 0.56 | 0.86 | 0.94 | 0.68 | 0.51

Table 2: Fine-grained verbalized rebus solving performances of various LLMs. Bold denotes best overall performances, and underline marks best training-free results.

5. What Motivates Model Performances?

In light of the strong performances achieved by our relatively small fine-tuned system, this section conducts an in-depth investigation to identify factors motivating such performance improvements.

Word Complexity and Frequency Affect LLM Fine-tuning Performance. For every word in the first passes and solutions of test set examples, we measure LLMs' overall accuracy in predicting it for the full test set. We then correlate this score to various quantities that could motivate LLMs' performances. More specifically, we use 1) the word frequency in the training set; 2) the word frequency in Paisà [38], a large web Italian corpus; and 3) the length of the word (number of characters). We find a significant positive correlation (ρ = 0.44) between first pass word prediction accuracy and training frequency for the fine-tuned Phi-3 model, suggesting that model performance is strongly related to training coverage. Word length is also found to negatively affect our model's performance, albeit to a smaller extent (ρ = −0.11). The performance of prompted models is unrelated to both properties for first pass words, indicating that these results are the product of fine-tuning.¹²

¹² Paisà frequency is never found to correlate significantly. Full correlation results are available in Table 6.

LLM Fine-Tuning Fails to Generalize to Unseen Words. To further confirm the importance of fine-tuning word coverage in defining model performances, we evaluate our fine-tuned model in out-of-distribution settings. For this evaluation, the 2k examples of the test set from previous sections are divided into two subsets: one in which all first pass words were seen during fine-tuning by Phi-3 (Test id, 1061 examples) and one in which, for every example, at least one first pass word was unseen in training (Test ood, 939 examples). Intuitively, if Phi-3 performance is mainly motivated by memorizing fine-tuning data, introducing OOD words should produce a significant drop in model performances. Results shown in Table 3 confirm that this is indeed the case. We find Phi-3 performances to be near-perfect on seen first pass words (FP W. ID = 0.96) in both test sets, with a major drop for OOD words (FP W. OOD = 0.20). This produces second-order effects on subsequent steps, causing the FP EM results to drop by 71% (FP EM Test Δ), while significantly impacting downstream solution accuracies. On the contrary, GPT-4o few-shot prompting performances remain nearly identical on both splits, confirming that these results are not the product of a skewed data selection process. Overall, these results strongly suggest that memorization is the main factor behind the strong rebus-solving performance of our fine-tuned LLM.

Metric | GPT-4o Test id | GPT-4o Test ood | GPT-4o Δ | Phi-3 (ours) Test id | Phi-3 (ours) Test ood | Phi-3 (ours) Δ
FP W. ID | 0.52 | 0.51 | -0.01 | 0.96 | 0.96 | 0.00
FP W. OOD | - | 0.44 | - | - | 0.20 | -
FP EM | 0.16 | 0.14 | -0.02 | 0.89 | 0.18 | -0.71
S W. ID | 0.29 | 0.26 | -0.03 | 0.92 | 0.49 | -0.43
S W. OOD | 0.18 | 0.16 | -0.02 | 0.63 | 0.20 | -0.40
S EM | 0.12 | 0.09 | -0.03 | 0.82 | 0.16 | -0.66

Table 3: Model performances for test subsets containing only in-domain (Test id), or some out-of-domain (Test ood) first pass words. W. ID and W. OOD are accuracies for ID and OOD words for first pass (FP) and solution (S) sequences. Test Δ = Test id − Test ood performance.

Manual Inspection. We conclude by manually evaluating some generations produced by the best-performing LLMs.
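Before examining individual generations, it is useful to make the re-segmentation step and two of the metrics above concrete. The sketch below is a minimal stdlib-only illustration, not the authors' released code (the paper computes CER with the jiwer library; a plain Levenshtein distance is substituted here):

```python
def segment(first_pass: str, key: list[int]) -> list[str]:
    """Re-segment a first pass into solution words following the key (cesura)."""
    letters = first_pass.replace(" ", "").upper()
    assert len(letters) == sum(key), "key must cover all first-pass letters"
    words, i = [], 0
    for n in key:
        words.append(letters[i:i + n])
        i += n
    return words

def key_match(pred_words: list[str], key: list[int]) -> float:
    """Proportion of predicted solution words respecting the key's lengths."""
    return sum(len(w) == n for w, n in zip(pred_words, key)) / len(key)

def fp_match(first_pass: str, solution: str) -> float:
    """Solution First Pass Match: 1 - CER(FP, S) over lowercased,
    whitespace-free strings (stdlib edit distance instead of jiwer)."""
    a = first_pass.replace(" ", "").lower()
    b = solution.replace(" ", "").lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1 - prev[-1] / len(a)
```

On the Table 1 example, `segment("U nave LO cesta FF etti S tait A liana", [3, 6, 12, 8])` yields `['UNA', 'VELOCE', 'STAFFETTISTA', 'ITALIANA']`, which matches the key exactly (`key_match` = 1.0) and reuses every first-pass character (`fp_match` = 1.0).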
Table 4 presents two examples with definitions (D) and solution (S) words predicted by three LLMs, with more examples provided in Appendix C. We use naw as shorthand for "Not A Word" to mark nonsensical terms. In the first example, Phi-3 correctly predicts all first pass and solution words. On the contrary, other models make several mistakes in the first pass, leading to incorrect solutions. Both prompted models tend to ignore first pass words when these cannot be assembled to form sensical, length-fitting solution words. For example, for D1 GPT-4o predicts p (naw), which would lead to the solution word "SAPpTE" (naw), but S8 = "Spettacolo" (show) is predicted instead by the model. In particular, GPT-4o appears to prioritize grammatically correct solutions at the cost of ignoring first pass words and solution key length constraints, while Claude 3.5S shows an improved ability to follow these constraints, as confirmed by the Key/FP Match results of Table 2.

In the second example, the first pass word D2 = salice (willow) is OOD for Phi-3. Consequently, the model produces the incorrect prediction aro (naw), and the error is propagated to all solution words, as previously observed in the Test ood column of Table 3. Prompted models also underperform in this example, with errors on D1 and D2 propagating to most solution words. However, we note that the incorrect D1 and D2 predictions of Claude 3.5S satisfy the provided definitions, suggesting that access to more explicit information about the given constraints could further boost LLMs' performance on this task.

Rebus: SAP [La porta della breccia]D1 TE [La pinza del granchio]D2 SBA [Si legge su alcuni orologi]D3 G [Le sue coccole sono aromatiche]D4 V [Un gioco con dadi e pedine]D5 D [Sono verdi in gioventù]D6
Chiave di lettura: 8 3 2 12 7 5

Step | GPT-4o | Claude 3.5S | Phi-3
D1 | p | one | pia
D2 | chela | chela | chela
D3 | ora | data | data
D4 | ginepro | lio | ginepro
D5 | ludo | oca | oca
D6 | acerbi | anni | anni
S8 | Spettacolo | Saponate | Sappiate
S3 | che | che | che
S2 | fa | la | la
S12 | sognare | sbadataggine | sbadataggine
S7 | ogni | vocando | provoca
S5 | sera | danni | danni
Soluzione: SAPpiaTE che la SBAdataGgine proVoca Danni

Rebus: STU [Si salva otturandolo]D1 S [Ha foglie seghettate]D2 AL [Lo è l'operaio che lavora in cantiere]D3 G [Un uomo... non all'altezza]D4
Chiave di lettura: 11 7 2 7

Step | GPT-4o | Claude 3.5S | Phi-3
D1 | tappo | falla | dente
D2 | acero | ortica | aro
D3 | edile | edile | edile
D4 | nano | nano | nano
S11 | Stupaccerone | Stufallassor | Studentesaro
S7 | salendo | ticale | aledile
S2 | al | di | gi
S7 | genano | Legnano | nanano
Soluzione: STUdenteSsa liceALe di LeGnano

Table 4: Examples of LLM generations for rebuses by Slam, Nuova Enigmistica Tascabile n. 2802 (top) and Grizzly, Domenica Quiz n. 2 (bottom). Correct guesses and errors are denoted for predicted first pass definitions (D1,...,N) and solution words (Sᵢ, with i being the i-th solution key value).

6. Discussion and Conclusion

This work introduced a verbalized rebus-solving task and dataset for evaluating LLMs' sequential instruction following skills for the Italian language. We crafted a large collection of 83k verbalized rebuses by combining rebus transcriptions with crossword definitions and used it to evaluate the rebus-solving skills of state-of-the-art LLMs. Our experiments revealed the challenging nature of this task, with even the most capable prompted models achieving only 24% accuracy on solutions.

While fine-tuning a smaller LLM dramatically improved performance to 51% solution accuracy, our analysis uncovered that these gains were largely driven by memorization and do not generalize to out-of-distribution examples. These results suggest important limitations in the generalization capabilities of current systems for sequential instruction following tasks. Our manual analysis further shows that LLMs seldom account for length constraints when solving definitions, despite the fundamental role of these cues in restricting the pool of possible words. These results suggest that search-based approaches accounting for constraints more explicitly might improve puzzle structure adherence, as previously shown by Chen et al. [39]. Other augmentation techniques employing LLM reformulation skills can also be explored to mitigate overfitting.

Future work in this area should focus on expanding similar evaluations to a wider set of languages, input modalities, and puzzle categories, creating a comprehensive benchmark to test LLMs' puzzle-solving skills. Importantly, the task of solving visual rebuses and their more convoluted variants¹³ remains far beyond the current capabilities of vision-language models. Hence, solving these puzzles automatically can be considered an important milestone in developing multimodal AI systems for constrained multi-step reasoning tasks. Our results confirm that the challenging nature of rebuses, even in their verbalized form, makes this task valuable for assessing future progress in LLMs' linguistic proficiency and sequential reasoning abilities. Finally, our rebus-solving LLM can facilitate future interpretability work investigating the mechanisms behind factual recall and multi-step reasoning in transformer models [40].

¹³ For example, rebuses requiring first pass anagrams (anarebus) or dynamic relations derived from multi-scene analysis (stereorebus).

Limitations. Our analysis was limited to a relatively small set of models, and a single prompt template obtained after minimal tuning. Further experiments are needed to verify that memorization patterns after fine-tuning remain relevant for other model sizes, prompt formats, and training regimes, particularly for full-weight training approaches.

Acknowledgments

Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We are grateful to the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" for making its rebus collection freely accessible on the Eureka5 platform, and to Valeriya Zelenkova for her valuable comments on the first version of this work. We also thank the CLiC-it 2024 reviewers for their valuable feedback.

References

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489. doi:10.1038/nature16961.
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362 (2018) 1140–1144. doi:10.1126/science.aar6404.
[3] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP, in: Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409–11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[4] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL, Dublin, Ireland, 2022, pp. 3073–3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[5] D. Tolosani, Enimmistica, Hoepli, Milan, 1901.
[6] E. Miola, Che cos'è un rebus, Carocci, 2020.
[7] S. Bartezzaghi, Parole in gioco: Per una semiotica del gioco linguistico, Bompiani, 2017.
[8] P. Ichino, L'ora desiata vola: guida al mondo del rebus per solutori (ancora) poco abili, Bompiani, Milan, 2021.
[9] R. Manna, M. P. di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97–106. URL: https://aclanthology.org/2024.games-1.11.
[10] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, ArXiv (2024). URL: https://arxiv.org/abs/2402.11291.
[11] M. L. Littman, G. A. Keim, N. Shazeer, A probabilistic approach to solving crossword puzzles, Artificial Intelligence 134 (2002) 23–55. doi:10.1016/S0004-3702(01)00114-X.
[12] M. Ernandes, G. Angelini, M. Gori, Webcrow: A web-based system for crossword solving, in: AAAI Conference on Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11590323_37.
[13] A. B. Sadallah, D. Kotova, E. Kochmar, Are LLMs good cryptic crossword solvers?, ArXiv (2024). URL: https://arxiv.org/abs/2403.12094.
[14] G. Todd, T. Merino, S. Earle, J. Togelius, Missed connections: Lateral thinking puzzles for large language models, ArXiv (2024). URL: https://arxiv.org/abs/2404.11730.
[15] B. J. Anderson, J. G. Meyer, Finding the optimal human strategy for Wordle using maximum correct letter probabilities and reinforcement learning, ArXiv (2022). URL: https://arxiv.org/abs/2202.00557.
[16] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé, F. Simões, K. Zeinalipour, A. Zugarini, M. Gori, The Webcrow French crossword solver, in: Intelligent Technologies for Interactive Entertainment, 2023. URL: https://link.springer.com/chapter/10.1007/978-3-031-55722-4_14.
[17] A. Zugarini, T. Rothenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die Rätselrevolution: Automated German crossword solving, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[18] G. Angelini, M. Ernandes, M. Gori, Solving Italian crosswords using the web, in: International Conference of the Italian Association for Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11558590_40.
[19] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-Instruct: Text-based clue generation for educational crossword puzzles, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347–3356. URL: https://aclanthology.org/2024.lrec-main.297.
[20] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[21] K. Zeinalipour, Y. G. Keptig, M. Maggini, L. Rigutini, M. Gori, A Turkish educational crossword puzzle generator, ArXiv abs/2405.07035 (2024). URL: https://arxiv.org/abs/2405.07035v2.
[22] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-AI@EVALITA2020: Evaluating artificial players for the language game "La Ghigliottina" (short paper), EVALITA Evaluation of NLP and Speech Tools for Italian, December 17th, 2020 (2020). URL: https://doi.org/10.4000/books.aaccademia.7488.
[23] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13–26. doi:10.1109/TCIAIG.2014.2355859.
[24] X. Chen, B. Liao, J. Qi, P. Eustratiadis, C. Monz, A. Bisazza, M. de Rijke, The SIFo benchmark: Investigating the sequential instruction following ability of large language models, ArXiv (2024). URL: https://arxiv.org/abs/2406.19999.
[25] H. Hu, S. Yu, P. Chen, E. M. Ponti, Fine-tuning large language models with sequential instructions, ArXiv (2024). URL: https://arxiv.org/abs/2403.07794.
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[27] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[28] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, Q. C. et al., Phi-3 technical report: A highly capable language model locally on your phone, ArXiv (2024). URL: https://arxiv.org/abs/2404.14219.
[29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations (ICLR 2022), OpenReview, Online, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[30] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in: Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 10088–10115. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[32] OpenAI, Hello GPT-4o, Website, 2024. URL: https://openai.com/index/hello-gpt-4o.
[33] Anthropic, Claude 3.5 Sonnet, Website, 2024. URL: https://www.anthropic.com/news/claude-3-5-sonnet.
[34] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y.

Statistic | EurekaRebus | ItaCW-filtered
# examples | 222089 | 83157
# authors | 8138 | 5046
Year range | 1800–2024 | 1869–2024
First pass:
# unique words | 38977 | 8960
Avg./SD words/ex. | 3.50/1.48 | 3.08/1.00
Avg./SD word len. | 6.51/1.96 | 5.70/1.60
Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Avg./SD FP len. 26.45/11.19 25.74/8.73 Z. Zhang, Z. Fan, Qwen2 technical report, 2024. URL: https://arxiv.org/abs/2407.10671. Solution [35] M. AI, Introducing meta llama 3: The most capable # unique words 75718 42558 openly available llm to date, Website, 2024. URL: Avg./SD words/ex. 3.02/1.60 2.80/1.21 https://ai.meta.com/blog/meta-llama-3. Avg./SD word len. 8.07/2.30 7.79/2.23 [36] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino, Avg./SD Sol. len. 19.47/8.44 18.81/6.06 A. Seveso, Disce aut deficere: Evaluating llms profi- Table 5 ciency on the invalsi italian benchmark, 2024. URL: Statistics for the full EurekaRebus dataset and the crosswords- https://arxiv.org/abs/2406.17535. filtered subset used in this work. Avg./SD = Average/standard [37] A. Morris, V. Maier, P. Green, From wer and ril deviation. to mer and wil: improved evaluation measures for connected speech recognition., 2004. Model # Char. Paisà Freq. Train Freq. [38] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell’Orletta, H. Dittmann, A. Lenci, GPT-4o -0.01 0.01 0.02 V. Pirrelli, The PAISÀ corpus of Italian web texts, Claude-3.5 -0.02 -0.02 0.00 in: F. Bildhauer, R. Schäfer (Eds.), Proceedings of Phi-3 (ours) -0.11 -0.05 0.44 the 9th Web as Corpus Workshop (WaC-9), Associ- GPT-4o -0.18 0.14 0.19 ation for Computational Linguistics, Gothenburg, Claude-3.5 -0.15 0.08 0.13 Sweden, 2014, pp. 36–43. URL: https://aclanthology. Phi-3 (ours) -0.02 0.08 0.22 org/W14-0406. doi:10.3115/v1/W14-0406. [39] L. Chen, J. Liu, S. Jiang, C. Wang, J. Liang, Table 6 Y. Xiao, S. Zhang, R. Song, Crossword puzzle Spearman’s correlation with average word accuracies for resolution via monte carlo tree search, Proceed- metrics computed on first pass (top) and solution (bottom) words. Bold scores are significant with Bonferroni-corrected ings of the International Conference on Auto- 𝑝 < 1𝑒 − 5 [41] mated Planning and Scheduling 32 (2022) 35–43. 
URL: https://ojs.aaai.org/index.php/ICAPS/article/ view/19783. doi:10.1609/icaps.v32i1.19783. the pool of available definitions for every word. [40] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A primer on the inner workings of transformer- First pass/Solution word distribution Figure 2 based language models, Arxiv (2024). URL: https: shows the distribution of first pass and solution words //arxiv.org/abs/2405.00208. for the filtered EurekaRebus subset used in our work. [41] C. Bonferroni, Teoria statistica delle classi e calcolo delle probabilita, Pubblicazioni del R. Istituto Su- periore di Scienze Economiche e Commericiali di B. Additional Experimental Firenze 8 (1936) 3–62. Results Table 6 presents the correlations between model accu- racy and the properties presented in Section 5. Table 7 A. Additional Data Information presents the full ID/OOD performances for all tested Dataset statistics Table 5 presents statistics for the Eu- models, showing consistent results with Table 3 for all rekaRebus dataset and the filtered subset we use for com- prompted models. Table 8 presents Phi-3 Mini perfor- posing verbalized rebuses. The ItaCW dataset contains a mances across rebus-solving fine-tuning steps. total of 125,202 definitions for 40,963 unique words, with the most frequent words having hundreds of different definitions, e.g. 173 for re (king), 155 for te (you). Defini- tions used for verbalization are randomly sampled from 8 Gabriele Sarti et al. CEUR Workshop Proceedings 1–10 Word Frequency re (6091) ali (3068), in (2793) est (2365) ante (1916) tori (1748) accetta (139) Word Frequency di (8449) d’ (2910) Una (2111) a (1821) amore (684) pesante (172) importante (81) Word Figure 2: Word frequencies for words in first passes (top) and solutions (bottom) for the selected subset of EurekaRebus used for training and evaluation. Words are colored according to their length, and the most frequent examples per frequency bin are highlighted. 
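The scores reported in Table 6 are Spearman rank correlations between a word-level property (character count, corpus frequency, training frequency) and average word accuracy. As a self-contained illustration, the coefficient can be computed from ranks alone; the toy values below are made up for the example and are not the paper's data:

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-word data: longer words solved less accurately.
word_lengths   = [2, 3, 4, 6, 7, 9]
word_accuracy  = [0.95, 0.90, 0.72, 0.55, 0.41, 0.30]
rho = spearman(word_lengths, word_accuracy)  # perfectly monotone decreasing
```

Significance testing would additionally require a p-value for each rho; the paper bolds scores whose Bonferroni-corrected p falls below 1e-5, i.e. the raw significance level is divided by the number of comparisons performed.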
|           | LLaMA-3           | Qwen-2            | GPT-4o            | Claude-3.5S       | Phi-3 (ours)      |
| Metric    | ID   OOD   Δ      | ID   OOD   Δ      | ID   OOD   Δ      | ID   OOD   Δ      | ID   OOD   Δ      |
| FP W. ID  | 0.20 0.19 -0.01   | 0.26 0.25 -0.01   | 0.52 0.51 -0.01   | 0.65 0.63 -0.02   | 0.96 0.96  0.00   |
| FP W. OOD | -    0.18  -      | -    0.24  -      | -    0.44  -      | -    0.54  -      | -    0.20  -      |
| FP EM     | 0.03 0.04  0.01   | 0.03 0.05  0.02   | 0.16 0.14 -0.02   | 0.30 0.25 -0.05   | 0.89 0.18 -0.71   |
| S W. ID   | 0.03 0.04  0.01   | 0.04 0.05  0.01   | 0.29 0.26 -0.03   | 0.48 0.40 -0.08   | 0.92 0.49 -0.43   |
| S W. OOD  | 0.01 0.00 -0.01   | 0.02 0.00 -0.02   | 0.18 0.16 -0.02   | 0.41 0.30 -0.11   | 0.63 0.20 -0.40   |
| S EM      | 0.00 0.00  0.00   | 0.00 0.00  0.00   | 0.12 0.09 -0.03   | 0.27 0.22 -0.05   | 0.82 0.16 -0.66   |

Table 7: Full model performances for test subsets containing only in-domain (Test ID) or some out-of-domain (Test OOD) first pass words. W. ID and W. OOD are accuracies for ID and OOD words for first pass (FP) and solution (S) sequences. Test Δ = Test OOD − Test ID performance.

C. Additional Model Generations

Table 9 presents an English translation of the Figure 1 example using the prompt format adopted in this study. Tables 10 and 11 provide additional examples of LLM generations for tested rebuses, with the example from Table 11 (bottom) being OOD due to the manovella (crank) word in D2, and the others being ID for the fine-tuned Phi-3 Mini.

|               | First Pass (FP)             | Solution (S)                     |
| # Train Steps | Def.  Words  Letters  EM    | Key Match  FP Match  Words  EM   |
| 500           | 0.64  0.63   0.97     0.25  | 0.66       0.86      0.36   0.16 |
| 1000          | 0.74  0.74   1.00     0.38  | 0.72       0.89      0.48   0.28 |
| 1500          | 0.78  0.77   0.99     0.42  | 0.78       0.91      0.55   0.34 |
| 2000          | 0.80  0.79   1.00     0.47  | 0.81       0.93      0.59   0.40 |
| 2500          | 0.81  0.81   1.00     0.49  | 0.81       0.92      0.62   0.42 |
| 3000          | 0.82  0.82   1.00     0.51  | 0.82       0.92      0.63   0.44 |
| 3500          | 0.82  0.82   1.00     0.53  | 0.83       0.93      0.65   0.47 |
| 4000          | 0.83  0.83   1.00     0.53  | 0.85       0.93      0.66   0.47 |
| 4500          | 0.83  0.83   1.00     0.55  | 0.85       0.93      0.67   0.49 |
| 5072 (final)  | 0.84  0.84   1.00     0.56  | 0.86       0.94      0.68   0.51 |

Table 8: Fine-grained verbalized rebus solving performances of the Phi-3 Mini model on the test set including both ID and OOD examples at various fine-tuning steps. Bold denotes best overall performances.

Prompt:
Solve the clues provided between parentheses to obtain a first pass, and use the solution key to obtain the rebus' solution.
Rebus: M [Two attacking footballers] N [Used for eating ice cream] [Barks and bites] NIA
Solution key: 11 5

Generation:
Let's proceed to solve the rebus step by step:
- M = M
- [Two attacking footballers] = ali (wings)
- N = N
- [Used for eating ice cream] = coni (cones)
- [Barks and bites] = cane (dog)
- NIA = NIA
  [stage: Definition resolution]
First pass: M ali N coni cane NIA
  [stage: First pass]
Let's now compose the solution by following the solution key:
- 11 = Malinconica (melancholic)
- 5 = nenia (lullaby)
  [stage: Solution segmentation]
Solution: Malinconica nenia (melancholic lullaby)
  [stage: Solution]

Table 9: English translation of the template used in fine-tuning and prompting experiments with highlighted reasoning stages, with translated first pass and solution words in parentheses. Example rebus by A. Rizzi, Settimana Enigmistica n. 4656.

Rebus: MOR [Musa della commedia] D1 L [Si accendono per vedere] D2 NO [Uomini di intelligenza superiore] D3
Chiave di lettura (solution key): 7 12

| Step | GPT-4o       | Claude 3.5S  | Phi-3        |
| D1   | Talia        | Talia        | Talia        |
| D2   | luci         | luci         | luci         |
| D3   | geni         | geni         | genii        |
| S7   | Mortali      | Mortali      | Mortali      |
| S12  | allucinogeni | allucinogeni | allucinogeni |
Soluzione (solution): MORTali aLluciNOgeni

Table 10: Examples of LLM generations for a rebus by De Vico C., Domenica Quiz n. 5. Correct guesses and errors are denoted for predicted first pass definitions (D1,...,N) and solution words (Si, with i being the i-th solution key value).

Rebus: PRI [Ricoperto di sudore] D1 MIN [Gli altari del tempio] D2 DO [Un ordigno bellico] D3 [Possono essere "di serie" in certi tornei] D4 SSO
Chiave di lettura: 5 2 8 6 2 6

| Step | GPT-4o  | Claude 3.5S | Phi-3    |
| D1   | sudato  | madido      | madido   |
| D2   | are     | are         | are      |
| D3   | bomba   | mina        | mina     |
| D4   | teste   | teste       | teste    |
| S5   | Prima   | Prima       | Prima    |
| S2   | di      | di          | di       |
| S8   | sudare  | minaccia    | dominare |
| S6   | molto   | teste       | dominate |
| S2   | di      | di          | se       |
| S6   | testa   | dosso       | stesso   |
Soluzione: PRIma di doMINare DOmina te steSSO

Rebus: AT [Si alzano nel camping] D1 [Emoziona pescatori e navigatori] D2 [Come una nota Foresta] D3 MEN [Quadro ad olio] D4 S [Atteggiamento da modella] D5
Chiave di lettura: 9 11 2 5

| Step | GPT-4o      | Claude 3.5S | Phi-3       |
| D1   | tende       | tende       | tende       |
| D2   | marea       | mare        | rete        |
| D3   | nera        | nera        | nera        |
| D4   | dipinto     | tela        | tela        |
| D5   | posa        | posa        | posa        |
| S9   | Attenderemo | Attendere   | Attendere   |
| S11  | mareanera   | marenamente | teneramente |
| S2   | di          | la          | la          |
| S5   | posa        | posa        | sposa       |
Soluzione: ATtendere teneraMENte la Sposa

Rebus: B [Una figura geometrica] D1 [La si impugna per far girare un congegno] D2 DA [Le produce il rovo] D3
Chiave di lettura: 10 7 1' 5

| Step | GPT-4o            | Claude 3.5S | Phi-3      |
| D1   | cerchio           | rombo       | ellissi    |
| D2   | manovella         | manovella   | leva       |
| D3   | more              | more        | more       |
| S10  | Bcerchiomanovella | Bromomanov  | Bellissile |
| S7   | elladam           | vadamore    |            |
| S1'  | d'                | o'          | '          |
| S5   | amore             | more        | remo       |
Soluzione: Bellissima novella D' Amore

Table 11: Examples of LLM generations for rebuses by Baruffa, Rebus n. 12 (top), Contini C., La Settimana Enigmistica n. 4102 (mid), and Liosca, La Settimana Enigmistica n. 4581 (bottom). Correct guesses and errors are denoted for predicted first pass definitions (D1,...,N) and solution words (Si, with i being the i-th solution key value).
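The solution-segmentation stage shown above, where the model concatenates the first-pass letters and re-splits them according to the lengths in the solution key, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and apostrophe-bearing key values such as 1' are not handled.

```python
def segment_solution(first_pass, key):
    """Concatenate first-pass letters and re-split them into solution
    words whose lengths are given by the solution key."""
    letters = "".join(first_pass.split())  # drop spaces between rebus elements
    if sum(key) != len(letters):
        raise ValueError("solution key does not match first-pass length")
    words, pos = [], 0
    for n in key:
        words.append(letters[pos:pos + n])
        pos += n
    return words

# Figure 1 example: first pass "M ali N coni cane NIA" with key "11 5"
print(segment_solution("M ali N coni cane NIA", [11, 5]))
# -> ['MaliNconica', 'neNIA'], i.e. "Malinconica nenia" up to casing
```

Note that the segmentation itself is purely mechanical once the first pass is correct; the difficulty measured by the Key Match and solution-word metrics lies in the model respecting the key while producing real Italian words.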