<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gabriele</forename><surname>Sarti</surname></persName>
							<email>g.sarti@rug.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Center for Language and Cognition (CLCG)</orgName>
								<orgName type="institution">University of Groningen</orgName>
								<address>
									<addrLine>Oude Kijk in &apos;t Jatstraat 26</addrLine>
									<postCode>9712EK</postCode>
									<settlement>Groningen</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tommaso</forename><surname>Caselli</surname></persName>
							<email>t.caselli@rug.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Center for Language and Cognition (CLCG)</orgName>
								<orgName type="institution">University of Groningen</orgName>
								<address>
									<addrLine>Oude Kijk in &apos;t Jatstraat 26</addrLine>
									<postCode>9712EK</postCode>
									<settlement>Groningen</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arianna</forename><surname>Bisazza</surname></persName>
							<email>a.bisazza@rug.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Center for Language and Cognition (CLCG)</orgName>
								<orgName type="institution">University of Groningen</orgName>
								<address>
									<addrLine>Oude Kijk in &apos;t Jatstraat 26</addrLine>
									<postCode>9712EK</postCode>
									<settlement>Groningen</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Malvina</forename><surname>Nissim</surname></persName>
							<email>m.nissim@rug.nl</email>
							<affiliation key="aff0">
								<orgName type="department">Center for Language and Cognition (CLCG)</orgName>
								<orgName type="institution">University of Groningen</orgName>
								<address>
									<addrLine>Oude Kijk in &apos;t Jatstraat 26</addrLine>
									<postCode>9712EK</postCode>
									<settlement>Groningen</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0B7CE57944EC2733BD3C52282D569247</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large language models</term>
					<term>Sequential reasoning</term>
					<term>Puzzle</term>
					<term>Rebus</term>
					<term>Crosswords</term>
					<term>Enigmistica Italiana</term>
					<term>CALAMITA</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Language games can be valuable resources for testing the ability of large language models (LLMs) to conduct challenging multi-step, knowledge-intensive inferences while respecting predefined constraints. Our proposed challenge prompts LLMs to reason step-by-step to solve verbalized variants of rebus games recently introduced with the EurekaRebus dataset <ref type="bibr" target="#b0">[1]</ref>. Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem entirely text-based. We introduce a simplified task variant with word length hints and adopt a comprehensive set of metrics to obtain a granular overview of models' performance in knowledge recall, constraint adherence, and re-segmentation abilities across reasoning steps.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Challenge: Introduction and Motivation</head><p>Language games have been adopted as testbeds for measuring NLP progress in recent years <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>, with a particular focus on (cryptic) crossword solving in English <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. For the Italian language, initial efforts focused on crossword solving and generation <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref> and clue-based word guessing <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b8">9]</ref>. Recently, Sarti et al. <ref type="bibr" target="#b0">[1]</ref> introduced an extensive collection of text-adapted Italian rebus puzzles to evaluate large language models' (LLMs) knowledge and sequential reasoning abilities. Rebuses are complex puzzles combining visual elements and graphic signs to encode a hidden phrase. Italian boasts a rich and long-standing rebus tradition dating back to the 19th century <ref type="bibr" target="#b13">[14]</ref>, popularized by widely circulated magazines such as La Settimana Enigmistica 1 . The structure of Italian rebuses has, with time, been formalized into aesthetic canons <ref type="bibr" target="#b14">[15]</ref>, and their peculiarities and design principles were analyzed by several authors <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18]</ref>.</p><p>In Italian rebuses, solving begins by combining graphemes with their underlying visual elements in a left-to-right fashion, composing a first pass (prima lettura) representing an intermediate solution of the puzzle. 
Then, first pass elements are re-segmented according to the lengths specified by the solution key to produce the final solution phrase.</p><p>This work proposes to adopt the EurekaRebus dataset introduced by Sarti et al. <ref type="bibr" target="#b0">[1]</ref> to extend their evaluation of LLMs' multi-step reasoning and linguistic/cultural awareness to the systems evaluated as part of the CALAMITA evaluation campaign <ref type="bibr" target="#b18">[19]</ref>. We believe the task is particularly relevant since the crossword definitions that compose verbalized rebuses rely heavily on idiomatic expressions, wordplay, and cultural references specific to Italian. Hence, the results of this task could provide valuable insights into the linguistic and cultural competence of LLMs trained on the Italian language. Moreover, the task is especially appealing since it is framed in a templated reasoning format, enabling us to disentangle the various components required to successfully solve a verbalized rebus step-by-step. More specifically, several metrics will be employed to assess LLMs' factual recall, textual concatenation and re-segmentation capabilities, and, finally, constraint satisfaction given the provided cues.</p><p>In light of the results reported by <ref type="bibr" target="#b0">[1]</ref> for state-of-the-art proprietary LLMs, we expect all tested open-source systems to perform very poorly, with final solution accuracies well below 30%. We also note that the highest reported overall performance in previous work 2 was found by the original authors to be primarily the product of memorization. We anticipate that this challenge will highlight significant limitations in LLMs' current factual recall and multi-step reasoning ability and act as a catalyst for future improvements in these areas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Challenge: Description</head><p>The proposed challenge aims to evaluate the capabilities of existing LLMs in solving verbalized Italian rebuses via prompting at various granularity levels. More specifically, LLMs will be evaluated in a few-shot prompting setting with two fixed in-context learning examples pre-selected at random from the available pool of verbalized rebuses in EurekaRebus, in two settings:</p><p>• Regular, matching the example in Table <ref type="table">1</ref> and the original input format used by Sarti et al. <ref type="bibr" target="#b0">[1]</ref>. • Hints, in which the number of characters of every hidden word is provided alongside the definitions in the verbalized rebus to help the model identify the correct choice. This variant was not tested by Sarti et al. <ref type="bibr" target="#b0">[1]</ref>.</p><p>Refer to Section 3.3 for the respective example formats. Models will be evaluated on their performance at each step required to successfully solve the verbalized rebus and on their overall ability to produce correct final solutions.</p><p>2 Namely, 58% Solution Exact Match for a LLaMA-3.1 8B model LoRA-tuned on 80k EurekaRebus examples <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref>.</p></div>
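The two settings differ only in whether a character-count hint is appended inside each bracketed definition. As a minimal sketch (not the challenge's actual implementation; the function names and prompt-assembly details are our own assumptions), the Hints variant and the 2-shot prompt could be constructed as follows:

```python
import re

def add_length_hints(verbalized_rebus: str, word_lengths: list[int]) -> str:
    """Append a hint like ' (5)' inside each bracketed definition,
    in order of occurrence (Hints setting)."""
    lengths = iter(word_lengths)
    return re.sub(
        r"\[([^\]]+)\]",
        lambda m: f"[{m.group(1)} ({next(lengths)})]",
        verbalized_rebus,
    )

def build_prompt(task_description: str, examples: list[str],
                 rebus: str, solution_key: str) -> str:
    """Assemble a few-shot prompt: task description, the two fixed
    in-context examples, then the target puzzle."""
    shots = "\n\n".join(f"# Esempio {i + 1}:\n{ex}" for i, ex in enumerate(examples))
    return (f"{task_description}\n\n{shots}\n\n# Ora tocca a te!\n"
            f"Rebus: {rebus}\nChiave di lettura: {solution_key}")
```

Applied to the example of Table 1, `add_length_hints` turns `[Un mollusco nell'insalata di mare]` into `[Un mollusco nell'insalata di mare (5)]`, matching the bracketed hints shown in the prompt template.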
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data description</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Origin of data</head><p>The dataset used for this challenge is an extended version of EurekaRebus <ref type="bibr" target="#b0">[1]</ref>, a collection of 222,089 unique Italian rebuses extracted from the Eureka5 platform<ref type="foot" target="#foot_0">3</ref>, an open database of rebuses and other linguistic puzzles maintained by the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini"<ref type="foot" target="#foot_1">4</ref>. Among these, 83,157 were converted by the original authors into verbalized form by leveraging the crossword definitions from the ItaCW collection <ref type="bibr" target="#b9">[10]</ref>, which includes 125,202 definition-solution pairs. While Sarti et al. <ref type="bibr" target="#b0">[1]</ref> evaluated the performance of prompted and tuned LLMs on rebuses released up to June 17th, 2024, the current test set includes 168 new unseen examples released on Eureka5 after that date.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Annotation details</head><p>We employ the same procedure as Sarti et al. <ref type="bibr" target="#b0">[1]</ref> for verbalizing the available rebuses. More specifically, only rebuses whose words all appear in lowercased or camel-cased form among the ItaCW solutions are selected, and every word is replaced by sampling one of its available crossword definitions at random. <ref type="foot" target="#foot_2">5</ref> Moreover, only regular rebuses containing at least two hidden words are selected, avoiding examples requiring a single definition-solving step as well as those with more complex templates (e.g., anarebuses, which use anagrams of hidden words for the solution).</p></div>
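The filtering and sampling procedure described above can be sketched as follows. The encoding of a rebus first pass as alternating literal letter runs and hidden words is a hypothetical representation of our own; the original implementation may differ:

```python
import random

def eligible(hidden_words, definitions_by_word):
    """A rebus is verbalized only if it contains at least two hidden
    words and every hidden word appears among the ItaCW solutions."""
    return len(hidden_words) >= 2 and all(w in definitions_by_word for w in hidden_words)

def verbalize(first_pass_items, definitions_by_word, rng=random):
    """Replace each hidden word with one of its crossword definitions,
    sampled at random; leave literal letter runs untouched.
    `first_pass_items` alternates ('letters', 'AC') / ('word', 'cozza')
    pairs (a hypothetical encoding of the first pass)."""
    out = []
    for kind, value in first_pass_items:
        if kind == "word":
            out.append("[" + rng.choice(definitions_by_word[value]) + "]")
        else:
            out.append(value)
    return " ".join(out)
```

With the definitions of the example in Table 1, the first pass `AC cozza GLI edile S TO fanti` verbalizes to `AC [Un mollusco nell'insalata di mare] GLI [Lo è l'operaio che lavora in cantiere] S TO [Soldati da trincea]`.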
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Data format</head><p>Each example in the dataset consists of:</p><p>• The verbalized rebus (verbalized_rebus) containing letters from the original rebus and crossword-style definitions enclosed in square brackets. • The solution words obtained after re-segmenting the first pass according to the solution key, provided as a semicolon-separated string in order of occurrence (solution_words). • The solution of the verbalized rebus used as the final prediction target for the LLM (solution).</p><p>An example is provided in Listing 1.</p></div>
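Assuming the field layout above, a single record and a consistency check between the `solution` and `solution_words` fields might look like this. The record contents follow the example of Table 1; the checking helper is our own illustration, not part of the released dataset:

```python
def check_example(example: dict) -> bool:
    """Verify that a record's fields are mutually consistent: the
    whitespace-separated `solution` must match the semicolon-separated
    `solution_words` in order of occurrence."""
    return example["solution"].split() == example["solution_words"].split(";")

# Illustrative record, following the example rebus of Table 1
example = {
    "verbalized_rebus": "AC [Un mollusco nell'insalata di mare] GLI "
                        "[Lo è l'operaio che lavora in cantiere] S TO "
                        "[Soldati da trincea]",
    "solution_words": "accozzaglie;di;lestofanti",
    "solution": "accozzaglie di lestofanti",
}
```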
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Prompting</head><p>Table <ref type="table">1</ref> shows the 2-shot prompting template adopted for generating a templated solution with the tested LLMs.</p><p>The second in-context example used in the template, omitted for brevity, corresponds to the one shown in Listing 1.</p><p>The task description provided to the model was derived from a trial-and-error process starting from the original prompt by Sarti et al. <ref type="bibr" target="#b0">[1]</ref>. Notably, compared to the original authors, our task description describes the individual components of the rebus in more detail to provide a clearer overview of the task to the LLM. We opted for a 2-shot setting, as opposed to the 5-shot prompting employed by Sarti et al. <ref type="bibr" target="#b0">[1]</ref>, to accommodate the limited context length of some of the tested LLMs, thus ensuring that the total length after model generation does not exceed 1024 tokens (the LLaMA 3 tokenizer was used to perform this estimate). The two in-context examples remain the same as those shown here to simplify evaluation and ensure consistent results.</p><p>Verbalized rebus solving steps Table <ref type="table">1</ref> provides labels for the steps necessary to solve the verbalized rebus that are considered in this challenge. The model receives a problem input including a verbalized rebus (possibly with length hints) and a solution key (chiave di lettura). The first step involves resolving crossword definitions in order (Definition resolution), exploiting only the model's parametric knowledge. Then, the resolved words need to be infilled into the original rebus to compose the first pass, and re-segmented in the Solution segmentation step. Finally, the individual solution words are reassembled into a single solution string. 
While prompted models should achieve similar performance across all test subsets, the aforementioned division will enable further comparisons with previously trained systems.</p></div>
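The three solving steps described above (definition resolution, first-pass composition, solution segmentation) can be sketched programmatically; here a hypothetical `answers` lookup table stands in for the model's parametric knowledge, and the function is our illustrative reconstruction rather than the challenge's evaluation code:

```python
import re

def solve(verbalized_rebus: str, solution_key: str, answers: dict) -> str:
    """Sketch of the templated solving steps:
    1. Definition resolution: replace each bracketed definition with its
       answer (looked up in `answers`, standing in for model knowledge).
    2. First pass: strip spaces to obtain the letter sequence.
    3. Solution segmentation: cut the sequence at the lengths given by
       the solution key and rejoin with spaces."""
    # Step 1: resolve each [definition] to its hidden word
    first_pass = re.sub(r"\[([^\]]+)\]", lambda m: answers[m.group(1)], verbalized_rebus)
    # Step 2: concatenate the first pass into one letter sequence
    letters = first_pass.replace(" ", "").lower()
    # Step 3: re-segment according to the solution key
    words, pos = [], 0
    for n in map(int, solution_key.split()):
        words.append(letters[pos:pos + n])
        pos += n
    return " ".join(words)
```

On the example of Table 1, resolving the definitions yields the first pass `AC cozza GLI edile S TO fanti`, which the key `11 2 10` re-segments into `accozzaglie di lestofanti`.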
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Detailed data statistics</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Metrics</head><p>The challenge employs a comprehensive set of metrics adapted from the original evaluation of <ref type="bibr" target="#b0">[1]</ref>:</p></div>
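While the exact metric definitions follow <ref type="bibr" target="#b0">[1]</ref>, plausible implementations of the accuracy-style metrics reported in Table 3 (word-level accuracy, exact match on the first pass or final solution, and word length adherence to the solution key) could look as follows. These are our own sketches under stated assumptions, not the official scoring code:

```python
def word_accuracy(pred_words, gold_words):
    """Fraction of positions where the predicted word matches the gold
    one (usable for definition answers and for solution words)."""
    return sum(p == g for p, g in zip(pred_words, gold_words)) / max(len(gold_words), 1)

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings coincide (first pass / solution)."""
    return float(pred.strip().lower() == gold.strip().lower())

def length_adherence(pred_solution: str, solution_key: str) -> float:
    """Fraction of predicted solution words whose length matches the
    corresponding entry of the solution key."""
    lengths = [int(n) for n in solution_key.split()]
    words = pred_solution.split()
    return sum(len(w) == n for w, n in zip(words, lengths)) / max(len(lengths), 1)
```

For instance, a prediction resolving only two of the three definitions of the Table 1 example correctly would score a word accuracy of 2/3 while still potentially satisfying all length constraints.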
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt template</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sei un'esperto risolutore di giochi enigmistici. Il seguente gioco contiene una frase (Rebus) nella quale alcune parole sono state sostituite da indizi tra parentesi quadre. I numeri in ogni indizio rappresentano la lunghezza della parola nascosta.</head><p>Il tuo compito è quello di identificare le parole nascoste e sostituirle agli indizi nel Rebus, producendo una prima lettura dalla quale poi si deriverà una frase risolutiva. La chiave di lettura è una sequenza di numeri che rappresentano la rispettive lunghezze delle parole che compongono la frase risolutiva. La tua risposta deve essere una frase risolutiva sensata e che rispetti le lunghezze definite nella chiave di lettura.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>First example</head><formula xml:id="formula_0"># Esempio 1: Problem input</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Rebus: AC [Un mollusco nell'insalata di mare (5)] GLI [Lo è l'operaio che lavora in cantiere (5)] S TO [Soldati da trincea (5)]</head><p>Chiave di lettura: 11 2 10 Procediamo alla risoluzione del rebus passo per passo: ... (same format as the first example) The Solution Match metric will be used as the primary metric of correctness, since it captures the model's ability to fully solve the verbalized rebus. While no baseline evaluation was conducted for the new test set used in this challenge, we expect the performance of the most capable open-source systems to align with that of the 5-shot prompted LLaMA-3 70B and Qwen-2 72B models reported by Sarti et al. <ref type="bibr" target="#b0">[1]</ref>. </p><formula xml:id="formula_1">Definition resolution: -A C = A C -[Un mollusco nell'insalata di mare] = cozza -G L I = G L I -[Lo è l'</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>operaio che lavora in cantiere] = edile -S T O = S T O -[Soldati da trincea] = fanti</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Baseline results for LLaMA-3 70B and Qwen-2 72B on the original test set, adapted from Sarti et al. <ref type="bibr" target="#b0">[1]</ref>.</p><p>The results show that current models struggle to complete the task, primarily due to incorrect word guesses, with errors propagating across resolution steps and ultimately resulting in a final accuracy of 0%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations</head><p>Several limitations should be considered when interpreting the results of this challenge:</p><p>Verbalization Simplification The use of verbalized rebuses, while necessary for text-based LLMs, simplifies the original visual puzzle. It does not fully capture the complexity of solving traditional rebuses, which rely on visual cues and cultural knowledge, making verbalized rebus solving a much simpler proxy for the multi-step reasoning required by regular rebuses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Cultural Specificity</head><p>The selected rebuses and crossword definitions rely heavily on Italian-specific linguistic and cultural background. Performance on this task may not generalize to other languages or puzzle types, and it might be unrealistic to expect general-purpose LLMs to possess the specific lexicon and knowledge used for rebus solving.</p><p>Prompt Sensitivity While the selected prompt template was observed to perform well for capable proprietary LLMs in preliminary tests, there are no guarantees that the instructions provided in the prompt are sufficient for smaller open-source models to perform verbalized rebus solving proficiently. Moreover, alternative prompt formulations could lead to potentially better results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lack of Human Baseline</head><p>The challenge currently lacks a clear human performance baseline, which would be valuable for contextualizing model performance on verbalized rebus solving.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Ethical issues</head><p>While this challenge focuses on a relatively benign task of puzzle-solving, there are some ethical considerations to keep in mind. First, the dataset captures a very narrow subset of Italian language and culture. Hence, evaluation findings should not be overgeneralized to Italian language competence as a whole or to other cultures. This dataset's rebuses and crossword definitions are derived from commercially available published sources. While efforts have been made to ensure this data's exclusive, fair usage for research purposes, there may be copyright considerations to address.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Data license and copyright issues</head><p>As reported by the original EurekaRebus dataset license, the data is redistributed for research purposes only with the explicit approval of the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" (henceforth, the Association), and the rights to each entry in the EurekaRebus collection are the property of the respective copyright holders. The usage and redistribution of these data are allowed only for users providing appropriate attribution to the original copyright holders and the Association, and the creation of derivative works is permitted only for research purposes, under terms no less restrictive than the EurekaRebus license. Researchers are encouraged to contact the challenge organizers with any questions or concerns about data usage and licensing.</p><p>• The Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" for making their rebus collection freely accessible on Eureka5. • The creators of the ItaCW dataset for enabling the creation of verbalized rebuses. • The puzzle creators whose work is represented in this dataset.</p><p>Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We hope this challenge will contribute to the diffusion of the art of Italian enigmistica among computational linguistics and artificial intelligence researchers.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. Rebus by Lionello, art by Laura Neri.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 from</head><label>2</label><figDesc>Sarti et al.<ref type="bibr" target="#b0">[1]</ref> reports statistics for the full and verbalized subsets of the EurekaRebus dataset.</figDesc><table><row><cell>Train set contents The training set contains 80,158</cell></row><row><cell>examples, which are ignored for the purpose of the</cell></row><row><cell>CALAMITA campaign, given that no adaptation meth-</cell></row><row><cell>ods are evaluated.</cell></row><row><cell>Test set contents The test set contains 3,167 examples</cell></row><row><cell>divided as follows, in order of appearance:</cell></row><row><cell>• 2000 examples matching the in-domain setting</cell></row><row><cell>for models trained by [1], i.e. containing only first</cell></row><row><cell>pass words seen by all available trained models.</cell></row><row><cell>• 999 examples matching the out-of-distribution</cell></row><row><cell>setting for models trained by [1], i.e. containing</cell></row><row><cell>at least one first pass word unseen during training</cell></row><row><cell>by available trained models.</cell></row><row><cell>• 168 new verbalized rebuses added in EurekaRe-</cell></row><row><cell>bus v1.1, added to the Eureka5 platform after</cell></row><row><cell>June 17th, 2024. These can be either in-domain</cell></row><row><cell>or out-of-distribution for models trained on the</cell></row><row><cell>EurekaRebus's training set.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Statistics for the full EurekaRebus dataset and the crossword-filtered subset used in this work. Avg./SD = Average/standard deviation. Table adapted from Sarti et al. [1].</figDesc><table><row><cell>Answer</cell><cell># Ora tocca a te!</cell></row><row><cell>prefix</cell><cell></cell></row><row><cell></cell><cell>Completa il rebus seguendo il procedimento</cell></row><row><cell></cell><cell>descritto, rispondendo esattamente nello</cell></row><row><cell></cell><cell>stesso formato utilizzato dagli esempi prece-</cell></row><row><cell></cell><cell>denti.</cell></row><row><cell></cell><cell>Rebus: {{verbalized_rebus}} or {{verbal-</cell></row><row><cell></cell><cell>ized_rebus_with_length_hints}}</cell></row><row><cell></cell><cell>Chiave di lettura: {{solution_key}}</cell></row><row><cell>Table 1</cell><cell></cell></row><row><cell cols="2">2-shot prompt used for the CALAMITA evaluation. Blue text</cell></row><row><cell cols="2">represents additions for the evaluation in the Hints setting.</cell></row><row><cell cols="2">Template elements are highlighted next to the first in-context</cell></row><row><cell cols="2">example. Example rebus by Parodi E., Domenica Quiz n. 7</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Model Word Acc. FP Acc. Solution Word Acc. Solution Word Len. Solution Acc.</head><label></label><figDesc>Baseline results for 5-shot prompted LLaMA-3 70B and Qwen-2 72B models on the original test set, adapted from Sarti et al. [1].</figDesc><table><row><cell>LLaMA-3 70B</cell><cell>0.22</cell><cell>0.04</cell><cell>0.03</cell><cell>0.16</cell><cell>0.00</cell></row><row><cell>Qwen-2 72B</cell><cell>0.28</cell><cell>0.04</cell><cell>0.04</cell><cell>0.20</cell><cell>0.00</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">http://www.eureka5.it</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">http://www.enignet.it/home</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">Words in ItaCW can be associated to multiple definitions.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to express our gratitude to the following individuals and organizations:</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Homepages: https://gsarti.com (G. Sarti); https://cs.rug.nl/~bisazza (A. Bisazza); https://malvinanissim.github.io (M. Nissim). ORCID: 0000-0001-8715-2987 (G. Sarti); 0000-0003-2936-0256 (T. Caselli); 0000-0003-1270-3048 (A. Bisazza); 0000-0001-5289-0971 (M. Nissim).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Non verbis, sed rebus: Large language models are weak solvers of Italian rebuses</title>
		<author>
			<persName><forename type="first">G</forename><surname>Sarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bisazza</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2408.00584" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Montemagni</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Sprugnoli</surname></persName>
		</editor>
		<meeting>the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Puzzle solving using reasoning of large language models: A survey</title>
		<author>
			<persName><forename type="first">P</forename><surname>Giadikiaroglou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lymperaiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Filandrianos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Stamou</surname></persName>
		</author>
		<idno>ArXiv</idno>
		<ptr target="https://arxiv.org/abs/2402.11291" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Finding the optimal human strategy for Wordle using maximum correct letter probabilities and reinforcement learning</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">G</forename><surname>Meyer</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2202.00557" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">Arxiv</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Todd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Merino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Earle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Togelius</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2404.11730" />
		<title level="m">Missed connections: Lateral thinking puzzles for large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">Arxiv</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Webcrow: A web-based system for crossword solving</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Angelini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gori</surname></persName>
		</author>
		<idno type="DOI">10.1007/11590323_37</idno>
		<ptr target="https://link.springer.com/chapter/10.1007/11590323_37" />
	</analytic>
	<monogr>
		<title level="m">AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP</title>
		<author>
			<persName><forename type="first">J</forename><surname>Rozner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mahowald</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="11409" to="11421" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automated crossword solving</title>
		<author>
			<persName><forename type="first">E</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tomlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pathak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ginsberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Klein</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.219</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.219" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3073" to="3085" />
		</imprint>
	</monogr>
	<note>Volume 1: Long Papers. Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Clue-instruct: Text-based clue generation for educational crossword puzzles</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zugarini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zeinalipour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Kadali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maggini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rigutini</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.297" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="3347" to="3356" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Riddle me this: Evaluating large language models in solving word-based games</title>
		<author>
			<persName><forename type="first">R</forename><surname>Manna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Di Buono</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Monti</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.games-1.11" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024</title>
				<editor>
			<persName><forename type="first">C</forename><surname>Madge</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chamberlain</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Fort</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><surname>Kruschwitz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lukin</surname></persName>
		</editor>
		<meeting>the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="97" to="106" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Italian crossword generator: Enhancing education through interactive word puzzles</title>
		<author>
			<persName><forename type="first">K</forename><surname>Zeinalipour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Iaquinta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zanollo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Angelini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rigutini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Maggini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gori</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3596" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023)</title>
				<meeting>the 9th Italian Conference on Computational Linguistics (CLiC-it 2023)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Solving Italian crosswords using the web</title>
		<author>
			<persName><forename type="first">G</forename><surname>Angelini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gori</surname></persName>
		</author>
		<idno type="DOI">10.1007/11558590_40</idno>
		<ptr target="https://link.springer.com/chapter/10.1007/11558590_40" />
	</analytic>
	<monogr>
		<title level="m">International Conference of the Italian Association for Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Ghigliottin-AI@EVALITA2020: Evaluating artificial players for the language game &quot;la ghigliottina&quot;</title>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lovetere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Monti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pascucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Sangati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<idno type="DOI">10.4000/books.aaccademia.7488</idno>
		<ptr target="https://doi.org/10.4000/books.aaccademia.7488" />
	</analytic>
	<monogr>
		<title level="m">EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>short paper</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Solving a complex language game by using knowledge-based word associations discovery</title>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Gemmis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lops</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="DOI">10.1109/TCIAIG.2014.2355859</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Computational Intelligence and AI in Games</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="13" to="26" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Tolosani</surname></persName>
		</author>
		<title level="m">Enimmistica</title>
				<meeting><address><addrLine>Hoepli, Milan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1901">1901</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Brighenti</surname></persName>
		</author>
		<ptr target="http://win.cantodellasfinge.net/portale/leonardo/articoli/langense/pag2.asp" />
		<title level="m">I canoni di bellezza nel rebus, Labirinto - Mensile di cultura enigmistica</title>
				<imprint>
			<date type="published" when="1974">1974</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Miola</surname></persName>
		</author>
		<title level="m">Che cos&apos;è un rebus</title>
				<imprint>
			<publisher>Carocci</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Parole in gioco: Per una semiotica del gioco linguistico</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bartezzaghi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>Bompiani</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">L&apos;ora desiata vola: guida al mondo del rebus per solutori (ancora) poco abili</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ichino</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
			<publisher>Bompiani</publisher>
			<pubPlace>Milan</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</title>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Borazio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rinaldi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Scalena</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)<address><addrLine>Pisa, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024-12-06">December 4-6, 2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<orgName type="collaboration">Meta AI</orgName>
		</author>
		<ptr target="https://ai.meta.com/blog/meta-llama-3" />
		<title level="m">Introducing Meta Llama 3: The most capable openly available LLM to date</title>
				<imprint>
			<publisher>Website</publisher>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">LoRA: Low-rank adaptation of large language models</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=nZeVKeeFYf9" />
	</analytic>
	<monogr>
		<title level="m">The Tenth International Conference on Learning Representations (ICLR 2022)</title>
				<meeting><address><addrLine>OpenReview, Online</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
