=Paper=
{{Paper
|id=Vol-3878/96_main_long
|storemode=property
|title=Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses
|pdfUrl=https://ceur-ws.org/Vol-3878/96_main_long.pdf
|volume=Vol-3878
|authors=Gabriele Sarti,Tommaso Caselli,Malvina Nissim,Arianna Bisazza
|dblpUrl=https://dblp.org/rec/conf/clic-it/SartiCNB24
}}
==Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses==
Gabriele Sarti*, Tommaso Caselli, Malvina Nissim and Arianna Bisazza
Center for Language and Cognition (CLCG), University of Groningen, Oude Kijk in 't Jatstraat 26, Groningen, 9712EK, The Netherlands
Abstract
Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely attributable to memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.
Keywords: Large language models, Sequential reasoning, Puzzle, Rebus, Crosswords, Enigmistica Italiana
1. Introduction

Complex games such as chess and Go have long been a source of inspiration to develop more flexible and robust AI systems [1, 2]. Recent developments in NLP suggested that creative language games could be exploited as promising benchmarks for quantifying the ability of large language models (LLMs) to carry out multi-step knowledge-intensive reasoning tasks under pre-specified constraints [3]. While crossword puzzles have historically been the main focus of such efforts [4], other categories of linguistic games received only marginal attention, especially for languages other than English. A prominent example of less-studied language games is the rebus, a visual puzzle combining images and graphic signs to encode a hidden phrase. Indeed, rebus solving is a complex, multi-step process requiring factual knowledge, contextual understanding, vocabulary usage, and reasoning within pre-defined constraints: a set of fundamental skills to address a variety of real-world tasks.

In this work, we conduct the first open evaluation of LLMs' rebus-solving capabilities, focusing specifically on the Italian language. We propose a novel strategy to derive text-only verbalized rebuses from transcribed intermediate rebus solutions and use it to produce a large collection with more than 80k verbalized rebuses. We then evaluate the rebus-solving skills of state-of-the-art LLMs, including open-source systems and proprietary models, via few-shot prompting. Moreover, we fine-tune a small but capable LLM on verbalized rebus solving, outperforming state-of-the-art systems by a wide margin. Finally, we conduct a fine-grained assessment of LLMs' sequential reasoning steps, explaining model performance in terms of word complexity and memorization.

[Figure 1: An example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. We use verbalized rebuses to test LLMs' sequential instruction following capabilities. Image from Settimana Enigmistica n. 4656, © Bresi S.r.l. The image shows Ali (wings), Cane (dog) and Coni (cones), annotated with the following text:
First Pass: M ali - N coni - cane NIA
Verbalized Rebus: M [Due calciatori attaccanti] (Two attacking footballers) N [Usati per mangiare il gelato] (Used for eating ice cream) [Abbaia e morde] (Barks and bites) NIA
Solution key (# of chars/word): 11 5
Solution: Malinconica nenia (melancholic lullaby)]

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
Email: g.sarti@rug.nl (G. Sarti); t.caselli@rug.nl (T. Caselli); m.nissim@rug.nl (M. Nissim); a.bisazza@rug.nl (A. Bisazza)
Web: https://gsarti.com (G. Sarti); https://cs.rug.nl/~bisazza (A. Bisazza)
ORCID: 0000-0001-8715-2987 (G. Sarti); 0000-0003-2936-0256 (T. Caselli); 0000-0001-5289-0971 (M. Nissim); 0000-0003-1270-3048 (A. Bisazza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
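The cesura step shown in Figure 1, re-cutting the first pass letters according to the solution key, can be sketched in a few lines of Python. This is our own illustration, not code released with the paper:

```python
# Sketch of the "cesura" re-segmentation step (our illustration, not code
# released with the paper), applied to the Figure 1 example.

def segment_solution(first_pass: str, key: list[int]) -> list[str]:
    """Strip non-letters from the first pass, then re-cut the letter
    sequence into words whose lengths are given by the solution key."""
    letters = "".join(ch for ch in first_pass.lower() if ch.isalpha())
    assert len(letters) == sum(key), "key must cover all first-pass letters"
    words, start = [], 0
    for length in key:
        words.append(letters[start:start + length])
        start += length
    return words

# First pass "M ali - N coni - cane NIA" with solution key "11 5"
print(segment_solution("M ali - N coni - cane NIA", [11, 5]))
# ['malinconica', 'nenia']  ->  "Malinconica nenia"
```

Note that the re-segmentation itself is purely mechanical once the first pass is known; the difficulty of the puzzle lies in producing valid first pass words and choosing a cut that yields a meaningful phrase.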
Gabriele Sarti et al. CEUR Workshop Proceedings 1–10
Beyond rebus solving, our evaluation sheds light on the limits of current LLMs in multi-step reasoning settings, highlighting challenges with their application to complex sequential instruction-following scenarios.¹

2. Background and Related Work

Italian Enigmistica and Rebuses  The Italian language is characterized by a rich and long-standing tradition of puzzle games, including rebuses, dating back to the 19th century [5].² In Italian rebuses, a first pass (prima lettura) representing an intermediate solution of the puzzle is produced by combining graphemes with underlying image elements in a left-to-right direction (Figure 1). Then, the letters and words of the first pass undergo a re-segmentation (cesura) according to a solution key (chiave di lettura³), which specifies the length of the words in the solution (frase risolutiva). The verbalized rebuses we introduce in this work are variants of textual rebuses (rebus descritto or verbis), where the text-based puzzle is crafted by replacing first pass words with their crossword definitions in a templated format (Figure 1).

Linguistic Puzzles as NLP Progress Metrics  Language games have recently been adopted as challenging tasks for LLM evaluation [3, 9, 10]. While works in this area have historically focused on English crosswords [11, 12, 4, 13], recent tests focus on a more diverse set of games such as the New York Times' "Connections" [14] and "Wordle" [15]. Automatic crossword solvers were also developed for French [16], German [17] and Italian [18, 19], while didactic crossword generators are available for Italian [20] and Turkish [21]. Relatedly, the Italian evaluation campaign EVALITA⁴ recently hosted two shared tasks focusing on the word-guessing game "La Ghigliottina" (The Guillotine) [22, 23]. To our knowledge, our work is the first to attempt the computational modeling and evaluation of rebus-solving systems. Importantly, language games such as rebuses are not easily translatable into other languages due to their structural and cultural elements. This makes them a scarce but valuable resource for language-specific evaluations of language processing systems.

LLMs as Sequential Reasoners  State-of-the-art LLMs were shown to struggle to follow sequential instructions presented in a single query [24], but their performances improved significantly with ad-hoc training [25]. This acts as an initial motivation for our rebus-solving fine-tuning experiments. In our evaluation, we also adopt few-shot prompting [26] and chain-of-thought reasoning [27], which were both shown to strongly improve LLMs' abilities when solving complex multi-step tasks.

3. Experimental Setup

Data  We begin by extracting all rebuses' first passes and solutions available on Eureka5⁵, an online repository of Italian puzzles. We refer to the resulting dataset, containing 223k unique rebuses sourced from various publications, as EurekaRebus. For crossword definitions, we use ItaCW [20], containing 125k unique definition-word pairs. To enable verbalization, we select only EurekaRebus examples in which all first pass words match an existing ItaCW definition, retaining 83,157 examples for our modeling experiments.⁶ Since several ItaCW words are associated with multiple definitions, we randomly sample definitions to promote diversity in the resulting verbalized rebuses. A test set of 2k examples⁷ is kept aside for evaluation, and the remaining 81k examples are used for model training.

Models  We fine-tune Phi-3 Mini 3.8B 4K [28], the most capable LLM below 4B parameters for a wide range of Italian language tasks⁸. We use quantized low-rank adapters (QLoRA; 29, 30) for efficient fine-tuning with Unsloth⁹ and Transformers [31], training the model for 5,000 steps with a batch size of 16 over 81k examples. For comparison with our model's performances, we select GPT-4o [32] and Claude-3.5 Sonnet [33] as the current state of the art for proprietary LLMs, and the instruction-tuned variants of Qwen-2 72B [34] and LLaMA-3 70B [35] as the best-performing open-source LLMs according to the Invalsi Italian benchmark [36]. These four systems are used as untrained baselines thanks to their instruction-following abilities, and are prompted for rebus solving in a few-shot setting.

Format  Table 1 presents an example in the templated format used for fine-tuning Phi-3.¹⁰ The model is prompted to reason step-by-step by 1) solving crossword definitions sequentially (definition resolution); 2) producing a first pass copying letters and definitions' words; 3) re-segmenting it into solution words based on the solution key (solution segmentation); and finally 4) producing the solution by copying re-segmented words. We automatically convert rebuses into this format by deriving the solution key from solution word lengths and dynamically infilling the available information into the template. We use a similar format for prompting experiments, with five in-context step-by-step demonstrations and an explicit instruction asking the model to stick to the previous examples' format to streamline solution parsing.

¹ Code, data and models are available on GitHub and Hugging Face.
² Refer to Miola [6], Bartezzaghi [7], Ichino [8] for a comprehensive overview of peculiarities and norms in modern Italian rebuses.
³ Referred to as diagramma in jargon.
⁴ https://www.evalita.it
⁵ http://www.eureka5.it, additional details in Appendix A. Rebus illustrations are not available in Eureka5.
⁶ Since verbalized rebuses are produced from textual contents only, crossword definitions may refer to different word meanings (e.g. [Two soccer players] is used to represent the word "wings" in Figure 1 despite not matching the "bird wings" sense of the original image). This does not affect the validity of our task.
⁷ Composed of Test id and Test ood, described in Section 5.
⁸ https://hf.co/spaces/FinancialSupport/open_ita_llm_leaderboard
⁹ https://github.com/unslothai/unsloth
¹⁰ An English example is available in Table 9.

Prompt:
Risolvi gli indizi tra parentesi per ottenere una prima lettura, e usa la chiave di lettura per ottenere la soluzione del rebus.

Rebus: U [Lo è il passacavallo] LO [È fatta di vimini] F F [Decimi di chilo] S [Disusato soprabito] A [Un rampicante dei Tropici]
Chiave di lettura: 3 6 12 8

Generation:
Procediamo alla risoluzione del rebus passo per passo:
(Definition resolution)
- U = U
- [Lo è il passacavallo] = nave
- LO = LO
- [È fatta di vimini] = cesta
- F F = F F
- [Decimi di chilo] = etti
- S = S
- [Disusato soprabito] = tait
- A = A
- [Un rampicante dei Tropici] = liana
(First pass)
Prima lettura: U nave LO cesta F F etti S tait A liana
(Solution segmentation)
Ora componiamo la soluzione seguendo la chiave risolutiva:
3 = Una
6 = veloce
12 = staffettista
8 = italiana
(Solution)
Soluzione: Una veloce staffettista italiana

Table 1: Template used in fine-tuning and prompting experiments, with reasoning stages marked in parentheses. Example rebus by Il Piacentino, Settimana Enigmistica n. 2942.

Metrics  For our granular evaluation of rebus-solving performance, we adopt the following set of metrics focusing on the first passes (FP) and solutions (S) generated by LLMs:

• Definition (Def.): Proportion of correctly guessed words during definition resolution.
• First Pass Words/Letters Accuracy: Proportion of correct words and letters in the generated first pass. Lower scores may indicate issues with assembling a first pass from previous information.
• First Pass Exact Match (EM): Proportion of generated first passes matching the gold reference.
• Solution Key Match: Proportion of generated solution words matching the lengths specified by the solution key. Lower scores may indicate difficulty in respecting the given length constraints.
• Solution First Pass Match: Proportion of first pass characters employed to construct solution words. Lower scores indicate issues with using generated first pass characters in the solution.¹¹
• Solution Words Accuracy: Proportion of correct words in the generated solution.
• Solution Exact Match (EM): Proportion of generated solutions matching the gold reference.

¹¹ In practice, we define this as 1 − CER(FP, S), where CER is the character error rate [37] between the two sequences (lowercased, whitespace removed) computed with Jiwer.

4. Results

Table 2 presents our evaluation results. We observe that all prompted models perform poorly on the task, with the overall best prompted system (Claude 3.5 Sonnet) obtaining the correct solution for only 24% of the 2k tested examples. Notably, open-source systems perform significantly worse than proprietary ones, producing correct first passes for only 4% of the examples, and next to no correct solutions. Our fine-tuned system largely outperforms all state-of-the-art prompted models, predicting the correct solution in 51% of cases. From first pass metrics, it is evident these results can be largely explained by the poor word-guessing capabilities of the models, which are greatly improved with fine-tuning. For prompted models, the slight decrease in scores between Def. and FP Words also highlights issues with copying predicted words in the expected format. Finally, we observe that fine-tuning strongly improves the constraint-following abilities of our system, with prompted systems being less strict in applying length and letter-choice constraints to their solutions (Key/FP Match).

5. What Motivates Model Performances?

In light of the strong performances achieved by our relatively small fine-tuned system, this section conducts an in-depth investigation to identify the factors motivating such performance improvements.
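Two of the metrics above are easy to restate in code. The sketch below is our re-implementation, not the paper's evaluation code: it computes Solution Key Match and the 1 − CER formulation of Solution First Pass Match from footnote 11, with a plain Levenshtein distance standing in for the Jiwer library used by the authors.

```python
# Re-implementation sketch of two metrics (not the paper's evaluation code).
# Solution Key Match checks word-length constraints; Solution First Pass
# Match is 1 - CER(FP, S). The paper computes CER with Jiwer; a plain
# Levenshtein distance stands in here for self-containment.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def solution_key_match(solution_words: list[str], key: list[int]) -> float:
    """Proportion of solution words whose length matches the key."""
    return sum(len(w) == k for w, k in zip(solution_words, key)) / len(key)

def solution_fp_match(first_pass: str, solution: str) -> float:
    """1 - CER between the lowercased, whitespace-free sequences."""
    fp = "".join(first_pass.lower().split())
    sol = "".join(solution.lower().split())
    return max(0.0, 1 - levenshtein(fp, sol) / len(fp))

print(solution_key_match(["malinconica", "nenia"], [11, 5]))            # 1.0
print(solution_fp_match("M ali N coni cane NIA", "Malinconica nenia"))  # 1.0
```

A gold solution always re-segments the first pass letters exactly, so both scores are 1.0 for correct predictions; deviations isolate length-constraint violations from letter-copying errors.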
Model              Setup          Def.   FP Words  FP Letters  FP EM  S Key Match  S FP Match  S Words  S EM
LLaMA-3 70B        5-shot prompt  0.22   0.20      0.60        0.04   0.16         0.51        0.03     0.00
Qwen-2 72B         5-shot prompt  0.28   0.25      0.76        0.04   0.20         0.52        0.04     0.00
GPT-4o             5-shot prompt  0.55   0.51      0.83        0.15   0.53         0.74        0.27     0.11
Claude-3.5 Sonnet  5-shot prompt  0.66   0.62      0.90        0.28   0.83         0.82        0.43     0.24
Phi-3 3.8B (ours)  fine-tuned     0.84   0.84      1.00        0.56   0.86         0.94        0.68     0.51

Table 2: Fine-grained verbalized rebus solving performances of various LLMs. In the original table, bold denotes the best overall performances (achieved by the fine-tuned Phi-3 on all metrics) and underline marks the best training-free results (achieved by Claude-3.5 Sonnet).
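The Test id / Test ood partition analyzed in Section 5 can be reproduced with a simple coverage check. The following is a hypothetical sketch (the data structures are ours, not the released code): an example counts as in-domain only if every one of its first pass words appeared in the fine-tuning data.

```python
# Hypothetical sketch of the Test id / Test ood split (data structures are
# ours, not the released code): an example is in-domain only if every first
# pass word was seen in the fine-tuning data.

def split_id_ood(test_examples: list[list[str]],
                 train_words: set[str]) -> tuple[list[list[str]], list[list[str]]]:
    test_id, test_ood = [], []
    for fp_words in test_examples:
        if all(w in train_words for w in fp_words):
            test_id.append(fp_words)   # all first pass words seen in training
        else:
            test_ood.append(fp_words)  # at least one unseen word
    return test_id, test_ood

train_vocab = {"nave", "cesta", "etti", "tait", "liana"}
examples = [["nave", "cesta"], ["nave", "salice"]]  # "salice" is unseen
test_id, test_ood = split_id_ood(examples, train_vocab)
print(len(test_id), len(test_ood))  # 1 1
```

On the paper's 2k test examples, this criterion yields the 1061 / 939 example split reported in Section 5.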
Word Complexity and Frequency Affect LLM Fine-tuning Performance  For every word in the first passes and solutions of test set examples, we measure LLMs' overall accuracy in predicting it for the full test set. We then correlate this score to various quantities that could motivate LLMs' performances. More specifically, we use 1) the word frequency in the training set; 2) the word frequency in Paisà [38], a large web Italian corpus; and 3) the length of the word (number of characters). We find a significant positive correlation (ρ = 0.44) between first pass word prediction accuracy and training frequency for the fine-tuned Phi-3 model, suggesting that model performance is strongly related to training coverage. Word length is also found to negatively affect our model's performance, albeit to a smaller extent (ρ = −0.11). The performance of prompted models is unrelated to both properties for first pass words, indicating that these results are the product of fine-tuning.¹²

LLM Fine-Tuning Fails to Generalize to Unseen Words  To further confirm the importance of fine-tuning word coverage in defining model performances, we evaluate our fine-tuned model in out-of-distribution settings. For this evaluation, the 2k examples of the test set from previous sections are divided into two subsets: one in which all first pass words were seen during fine-tuning by Phi-3 (Test id, 1061 examples) and one in which, for every example, at least one first pass word was unseen in training (Test ood, 939 examples). Intuitively, if Phi-3 performance is mainly motivated by memorizing fine-tuning data, introducing OOD words should produce a significant drop in model performances. Results shown in Table 3 confirm that this is indeed the case. We find Phi-3 performances to be near-perfect on seen first pass words (FP W. ID = 0.96) in both test sets, with a major drop for OOD words (FP W. OOD = 0.20). This produces second-order effects on subsequent steps, causing the FP EM results to drop by 71% (FP EM Test Δ), while significantly impacting downstream solution accuracies. On the contrary, GPT-4o few-shot prompting performances remain nearly identical on both splits, confirming that these results are not the product of a skewed data selection process. Overall, these results strongly suggest that memorization is the main factor behind the strong rebus-solving performance of our fine-tuned LLM.

           GPT-4o                     Phi-3 (ours)
Metric     Test id  Test ood  Δ       Test id  Test ood  Δ
FP W. ID   0.52     0.51      -0.01   0.96     0.96      0.00
FP W. OOD  -        0.44      -       -        0.20      -
FP EM      0.16     0.14      -0.02   0.89     0.18      -0.71
S W. ID    0.29     0.26      -0.03   0.92     0.49      -0.43
S W. OOD   0.18     0.16      -0.02   0.63     0.20      -0.40
S EM       0.12     0.09      -0.03   0.82     0.16      -0.66

Table 3: Model performances for test subsets containing only in-domain (Test id) or some out-of-domain (Test ood) first pass words. W. ID and W. OOD are accuracies for ID and OOD words in first pass (FP) and solution (S) sequences. Test Δ = Test ood − Test id performance.

Manual Inspection  We conclude by manually evaluating some generations produced by the best-performing LLMs. Table 4 presents two examples with definitions (D) and solution (S) words predicted by three LLMs, with more examples provided in Appendix C. We use naw as shorthand for "Not A Word" to mark nonsensical terms.

In the first example, Phi-3 correctly predicts all first pass and solution words. On the contrary, the other models make several mistakes in the first pass, leading to incorrect solutions. Both prompted models tend to ignore first pass words when these cannot be assembled to form sensical, length-fitting solution words. For example, for D1 GPT-4o predicts p (naw), which would lead to the solution word "SAPpTE" (naw); S8 = "Spettacolo" (show) is predicted by the model instead. In particular, GPT-4o appears to prioritize grammatically correct solutions at the cost of ignoring first pass words and solution key length constraints, while Claude 3.5S shows an improved ability to follow these constraints, as confirmed by the Key/FP Match results of Table 2.

In the second example, the first pass word D2 = salice (willow) is OOD for Phi-3. Consequently, the model produces the incorrect prediction aro (naw), and the error is propagated to all solution words, as previously observed in the Test ood column of Table 3. Prompted models also underperform in this example, with errors on D1 and D2 propagating to most solution words. However, we note that the incorrect D1 and D2 predictions of Claude 3.5S satisfy the provided definitions, suggesting that access to more explicit information about the given constraints could further boost LLMs' performance on this task.

Rebus: SAP [La porta della breccia] D1 TE [La pinza del granchio] D2 SBA [Si legge su alcuni orologi] D3 G [Le sue coccole sono aromatiche] D4 V [Un gioco con dadi e pedine] D5 D [Sono verdi in gioventù] D6
Chiave di lettura: 8 3 2 12 7 5

Step  GPT-4o      Claude 3.5S   Phi-3
D1    p           one           pia
D2    chela       chela         chela
D3    ora         data          data
D4    ginepro     lio           ginepro
D5    ludo        oca           oca
D6    acerbi      anni          anni
S8    Spettacolo  Saponate      Sappiate
S3    che         che           che
S2    fa          la            la
S12   sognare     sbadataggine  sbadataggine
S7    ogni        vocando       provoca
S5    sera        danni         danni
Soluzione: SAPpiaTE che la SBAdataGgine proVoca Danni

Rebus: STU [Si salva otturandolo] D1 S [Ha foglie seghettate] D2 AL [Lo è l'operaio che lavora in cantiere] D3 G [Un uomo... non all'altezza] D4
Chiave di lettura: 11 7 2 7

Step  GPT-4o        Claude 3.5S   Phi-3
D1    tappo         falla         dente
D2    acero         ortica        aro
D3    edile         edile         edile
D4    nano          nano          nano
S11   Stupaccerone  Stufallassor  Studentesaro
S7    salendo       ticale        aledile
S2    al            di            gi
S7    genano        Legnano       nanano
Soluzione: STUdenteSsa liceALe di LeGnano

Table 4: Examples of LLM generations for rebuses by Slam, Nuova Enigmistica Tascabile n. 2802 (top) and Grizzly, Domenica Quiz n. 2 (bottom). Correct guesses and errors are marked in the original for predicted first pass definitions (D1, ..., DN) and solution words (Si, with i being the i-th solution key value).

6. Discussion and Conclusion

This work introduced a verbalized rebus-solving task and dataset for evaluating LLMs' sequential instruction following skills for the Italian language. We crafted a large collection of 83k verbalized rebuses by combining rebus transcriptions with crossword definitions and used it to evaluate the rebus-solving skills of state-of-the-art LLMs. Our experiments revealed the challenging nature of this task, with even the most capable prompted models achieving only 24% accuracy on solutions.

While fine-tuning a smaller LLM dramatically improved performance to 51% solution accuracy, our analysis uncovered that these gains were largely driven by memorization and do not generalize to out-of-distribution examples. These results suggest important limitations in the generalization capabilities of current systems for sequential instruction following tasks. Our manual analysis further shows that LLMs seldom account for length constraints when solving definitions, despite the fundamental role of these cues in restricting the pool of possible words. These results suggest that search-based approaches accounting for constraints more explicitly might improve puzzle structure adherence, as previously shown by Chen et al. [39]. Other augmentation techniques employing LLM reformulation skills can also be explored to mitigate overfitting.

Future work in this area should focus on expanding similar evaluations to a wider set of languages, input modalities, and puzzle categories, creating a comprehensive benchmark to test LLMs' puzzle-solving skills. Importantly, the task of solving visual rebuses and their more convoluted variants¹³ remains far beyond the current capabilities of vision-language models. Hence, solving these puzzles automatically can be considered an important milestone in developing multimodal AI systems for constrained multi-step reasoning tasks. Our results confirm that the challenging nature of rebuses, even in their verbalized form, makes this task valuable for assessing future progress in LLMs' linguistic proficiency and sequential reasoning abilities. Finally, our rebus-solving LLM can facilitate future interpretability work investigating the mechanisms behind factual recall and multi-step reasoning in transformer models [40].

Limitations  Our analysis was limited to a relatively small set of models and a single prompt template obtained after minimal tuning. Further experiments are needed to verify that memorization patterns after fine-tuning remain relevant for other model sizes, prompt formats, and training regimes, particularly for full-weight training approaches.

¹² Paisà frequency is never found to correlate significantly. Full correlation results are available in Table 6.
¹³ For example, rebuses requiring first pass anagrams (anarebus) or dynamic relations derived from multi-scene analysis (stereorebus).
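As a minimal illustration of the explicit constraint handling that such search-based approaches could enforce, a candidate solution can be validated against the first pass and the solution key. This sketch is ours (reusing the Table 1 rebus) and is not part of the paper's code:

```python
# Our sketch (not part of the paper's code) of explicit constraint checks a
# search-based solver could enforce: a candidate solution must reuse exactly
# the first pass letters, in order, and match the solution key lengths.

def is_valid_solution(first_pass: str, solution_words: list[str],
                      key: list[int]) -> bool:
    fp_letters = "".join(first_pass.lower().split())
    sol_letters = "".join(w.lower() for w in solution_words)
    lengths_ok = [len(w) for w in solution_words] == key
    return lengths_ok and sol_letters == fp_letters

# Table 1 rebus: the gold solution satisfies both constraints.
fp = "U nave LO cesta F F etti S tait A liana"
print(is_valid_solution(fp, ["Una", "veloce", "staffettista", "italiana"],
                        [3, 6, 12, 8]))  # True
print(is_valid_solution(fp, ["Una", "veloce", "staffettista", "italiano"],
                        [3, 6, 12, 8]))  # False (letters differ)
```

A solver that rejects candidates failing this check would, by construction, achieve perfect Key Match and FP Match scores, isolating the remaining difficulty to word guessing.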
Acknowledgments

Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We are grateful to the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" for making its rebus collection freely accessible on the Eureka5 platform, and to Valeriya Zelenkova for her valuable comments on the first version of this work. We also thank the CLiC-it 2024 reviewers for their valuable feedback.

References

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489. doi:10.1038/nature16961.
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362 (2018) 1140–1144. doi:10.1126/science.aar6404.
[3] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409–11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[4] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073–3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[5] D. Tolosani, Enimmistica, Hoepli, Milan, 1901.
[6] E. Miola, Che cos'è un rebus, Carocci, 2020.
[7] S. Bartezzaghi, Parole in gioco: Per una semiotica del gioco linguistico, Bompiani, 2017.
[8] P. Ichino, L'ora desiata vola: guida al mondo del rebus per solutori (ancora) poco abili, Bompiani, Milan, 2021.
[9] R. Manna, M. P. di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz, S. Lukin (Eds.), Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97–106. URL: https://aclanthology.org/2024.games-1.11.
[10] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, ArXiv (2024). URL: https://arxiv.org/abs/2402.11291.
[11] M. L. Littman, G. A. Keim, N. Shazeer, A probabilistic approach to solving crossword puzzles, Artificial Intelligence 134 (2002) 23–55. URL: https://www.sciencedirect.com/science/article/pii/S000437020100114X. doi:10.1016/S0004-3702(01)00114-X.
[12] M. Ernandes, G. Angelini, M. Gori, Webcrow: A web-based system for crossword solving, in: AAAI Conference on Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11590323_37.
[13] A. B. Sadallah, D. Kotova, E. Kochmar, Are LLMs good cryptic crossword solvers?, ArXiv (2024). URL: https://arxiv.org/abs/2403.12094.
[14] G. Todd, T. Merino, S. Earle, J. Togelius, Missed connections: Lateral thinking puzzles for large language models, ArXiv (2024). URL: https://arxiv.org/abs/2404.11730.
[15] B. J. Anderson, J. G. Meyer, Finding the optimal human strategy for Wordle using maximum correct letter probabilities and reinforcement learning, ArXiv (2022). URL: https://arxiv.org/abs/2202.00557.
[16] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé, F. Simões, K. Zeinalipour, A. Zugarini, M. Gori, The Webcrow French crossword solver, in: Intelligent Technologies for Interactive Entertainment, 2023. URL: https://link.springer.com/chapter/10.1007/978-3-031-55722-4_14.
[17] A. Zugarini, T. Rothenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die Rätselrevolution: Automated German crossword solving, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[18] G. Angelini, M. Ernandes, M. Gori, Solving Italian crosswords using the web, in: International Conference of the Italian Association for Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11558590_40.
[19] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-instruct: Text-based clue generation for educational crossword puzzles, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347–3356. URL: https://aclanthology.org/2024.lrec-main.297.
[20] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[21] K. Zeinalipour, Y. G. Keptig, M. Maggini, L. Rigutini, M. Gori, A Turkish educational crossword puzzle generator, ArXiv abs/2405.07035 (2024). URL: https://arxiv.org/abs/2405.07035v2.
[22] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-AI@EVALITA2020: Evaluating artificial players for the language game "La Ghigliottina" (short paper), EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://doi.org/10.4000/books.aaccademia.7488.
[23] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13–26. doi:10.1109/TCIAIG.2014.2355859.
[24] X. Chen, B. Liao, J. Qi, P. Eustratiadis, C. Monz, A. Bisazza, M. de Rijke, The SIFo benchmark: Investigating the sequential instruction following ability of large language models, 2024. URL: https://arxiv.org/abs/2406.19999. arXiv:2406.19999.
[25] H. Hu, S. Yu, P. Chen, E. M. Ponti, Fine-tuning large language models with sequential instructions, ArXiv (2024). URL: https://arxiv.org/abs/2403.07794.
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[27] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[28] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, et al., Phi-3 technical report: A highly capable language model locally on your phone, ArXiv (2024). URL: https://arxiv.org/abs/2404.14219.
[29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations (ICLR 2022), OpenReview, Online, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[30] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 10088–10115. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[32] OpenAI, Hello GPT-4o, Website, 2024. URL: https://openai.com/index/hello-gpt-4o.
[33] Anthropic, Claude 3.5 Sonnet, Website, 2024.
H. Lin (Eds.), Advances in Neural Information URL: https://www.anthropic.com/news/
Processing Systems, volume 33, Curran Associates, claude-3-5-sonnet.
Gabriele Sarti et al. CEUR Workshop Proceedings 1–10
[34] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Fan, Qwen2 technical report, 2024. URL: https://arxiv.org/abs/2407.10671.
[35] Meta AI, Introducing Meta Llama 3: The most capable openly available LLM to date, Website, 2024. URL: https://ai.meta.com/blog/meta-llama-3.
[36] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino, A. Seveso, Disce aut deficere: Evaluating LLMs proficiency on the INVALSI Italian benchmark, 2024. URL: https://arxiv.org/abs/2406.17535.
[37] A. Morris, V. Maier, P. Green, From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition, 2004.
[38] V. Lyding, E. Stemle, C. Borghetti, M. Brunello, S. Castagnoli, F. Dell'Orletta, H. Dittmann, A. Lenci, V. Pirrelli, The PAISÀ corpus of Italian web texts, in: F. Bildhauer, R. Schäfer (Eds.), Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 36–43. URL: https://aclanthology.org/W14-0406. doi:10.3115/v1/W14-0406.
[39] L. Chen, J. Liu, S. Jiang, C. Wang, J. Liang, Y. Xiao, S. Zhang, R. Song, Crossword puzzle resolution via Monte Carlo tree search, Proceedings of the International Conference on Automated Planning and Scheduling 32 (2022) 35–43. URL: https://ojs.aaai.org/index.php/ICAPS/article/view/19783. doi:10.1609/icaps.v32i1.19783.
[40] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A primer on the inner workings of transformer-based language models, ArXiv (2024). URL: https://arxiv.org/abs/2405.00208.
[41] C. Bonferroni, Teoria statistica delle classi e calcolo delle probabilità, Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8 (1936) 3–62.

A. Additional Data Information

Dataset statistics   Table 5 presents statistics for the EurekaRebus dataset and the filtered subset we use for composing verbalized rebuses. The ItaCW dataset contains a total of 125,202 definitions for 40,963 unique words, with the most frequent words having hundreds of different definitions, e.g. 173 for re (king) and 155 for te (you). Definitions used for verbalization are randomly sampled from the pool of available definitions for every word.

Statistic             EurekaRebus    ItaCW-filtered
# examples            222089         83157
# authors             8138           5046
Year range            1800 - 2024    1869 - 2024
First pass
# unique words        38977          8960
Avg./SD words/ex.     3.50/1.48      3.08/1.00
Avg./SD word len.     6.51/1.96      5.70/1.60
Avg./SD FP len.       26.45/11.19    25.74/8.73
Solution
# unique words        75718          42558
Avg./SD words/ex.     3.02/1.60      2.80/1.21
Avg./SD word len.     8.07/2.30      7.79/2.23
Avg./SD Sol. len.     19.47/8.44     18.81/6.06

Table 5: Statistics for the full EurekaRebus dataset and the crosswords-filtered subset used in this work. Avg./SD = Average/standard deviation.

Model          # Char.   Paisà Freq.   Train Freq.
First pass
GPT-4o         -0.01      0.01          0.02
Claude-3.5     -0.02     -0.02          0.00
Phi-3 (ours)   -0.11     -0.05          0.44
Solution
GPT-4o         -0.18      0.14          0.19
Claude-3.5     -0.15      0.08          0.13
Phi-3 (ours)   -0.02      0.08          0.22

Table 6: Spearman's correlation with average word accuracies for metrics computed on first pass (top) and solution (bottom) words. Bold scores are significant with Bonferroni-corrected p < 1e-5 [41].

First pass/Solution word distribution   Figure 2 shows the distribution of first pass and solution words for the filtered EurekaRebus subset used in our work.

B. Additional Experimental Results

Table 6 presents the correlations between model accuracy and the properties presented in Section 5. Table 7 presents the full ID/OOD performances for all tested models, showing consistent results with Table 3 for all prompted models. Table 8 presents Phi-3 Mini performances across rebus-solving fine-tuning steps.
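The Bonferroni-corrected Spearman correlations of Table 6 can be sketched in pure Python. This is an illustration under our own assumptions, not the paper's analysis code: the helper names and the dummy data are ours, and only the correlation coefficient (not the p-value) is computed here.

```python
# Illustrative sketch of the Table 6 analysis: Spearman's rank
# correlation between a word property and per-word accuracy, with a
# Bonferroni-corrected significance threshold. All names and data
# below are made up for the example.

def rankdata(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold after correcting for n_tests comparisons."""
    return alpha / n_tests

# Dummy data: word length vs. hypothetical per-word accuracy
word_len = [2, 3, 5, 7, 9, 11]
accuracy = [0.90, 0.80, 0.75, 0.50, 0.40, 0.20]
rho = spearman_rho(word_len, accuracy)  # -1.0: perfectly anti-monotonic
```

In practice a library routine such as `scipy.stats.spearmanr` would also return the raw p-value to compare against the Bonferroni threshold.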
[Figure 2: Word frequencies for words in first passes (top) and solutions (bottom) for the selected subset of EurekaRebus used for training and evaluation. Words are colored according to their length, and the most frequent examples per frequency bin are highlighted. Highlighted first-pass words: re (6091), ali (3068), in (2793), est (2365), ante (1916), tori (1748), accetta (139). Highlighted solution words: di (8449), d' (2910), Una (2111), a (1821), amore (684), pesante (172), importante (81).]
Metric      LLaMA-3            Qwen-2             GPT-4o             Claude-3.5S        Phi-3 (ours)
            ID    OOD    Δ     ID    OOD    Δ     ID    OOD    Δ     ID    OOD    Δ     ID    OOD    Δ
FP W. ID    0.20  0.19  -0.01  0.26  0.25  -0.01  0.52  0.51  -0.01  0.65  0.63  -0.02  0.96  0.96   0.00
FP W. OOD   -     0.18   -     -     0.24   -     -     0.44   -     -     0.54   -     -     0.20   -
FP EM       0.03  0.04   0.01  0.03  0.05   0.02  0.16  0.14  -0.02  0.30  0.25  -0.05  0.89  0.18  -0.71
S W. ID     0.03  0.04   0.01  0.04  0.05   0.01  0.29  0.26  -0.03  0.48  0.40  -0.08  0.92  0.49  -0.43
S W. OOD    0.01  0.00  -0.01  0.02  0.00  -0.02  0.18  0.16  -0.02  0.41  0.30  -0.11  0.63  0.20  -0.40
S EM        0.00  0.00   0.00  0.00  0.00   0.00  0.12  0.09  -0.03  0.27  0.22  -0.05  0.82  0.16  -0.66

Table 7: Full model performances for test subsets containing only in-domain (Test ID) or some out-of-domain (Test OOD) first pass words. W. ID and W. OOD are accuracies for ID and OOD words in first pass (FP) and solution (S) sequences. Test Δ = Test ID - Test OOD performance.
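As a rough reconstruction (not the released evaluation code; the function names and the position-wise alignment assumption are ours), the word-level accuracy and exact-match metrics of Table 7 can be computed as:

```python
# Sketch of the Table 7 metrics, under our own assumption that
# predicted and gold words are compared position by position.
def word_accuracy(pred: str, gold: str) -> float:
    """Fraction of gold words matched at the same position."""
    pred_words, gold_words = pred.split(), gold.split()
    correct = sum(p == g for p, g in zip(pred_words, gold_words))
    return correct / len(gold_words)

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff the whole predicted sequence equals the gold one."""
    return float(pred.strip() == gold.strip())

# Dummy example with one wrong solution word
print(word_accuracy("Mortali allucinati", "Mortali allucinogeni"))  # 0.5
print(exact_match("Mortali allucinogeni", "Mortali allucinogeni"))  # 1.0
```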
C. Additional Model Generations

Table 9 presents an English translation of the Figure 1 example using the prompt format adopted in this study. Tables 10 and 11 provide additional examples of LLM generations for tested rebuses, with the example from Table 11 (bottom) being OOD due to the manovella (crank) word in D2, and the others being ID for the fine-tuned Phi-3 Mini.
                        First Pass (FP)                  Solution (S)
# Train Steps   Def.    Words   Letters EM   Key Match   FP Match   Words   EM
500             0.64    0.63    0.97    0.25    0.66       0.86      0.36   0.16
1000            0.74    0.74    1.00    0.38    0.72       0.89      0.48   0.28
1500            0.78    0.77    0.99    0.42    0.78       0.91      0.55   0.34
2000            0.80    0.79    1.00    0.47    0.81       0.93      0.59   0.40
2500            0.81    0.81    1.00    0.49    0.81       0.92      0.62   0.42
3000            0.82    0.82    1.00    0.51    0.82       0.92      0.63   0.44
3500            0.82    0.82    1.00    0.53    0.83       0.93      0.65   0.47
4000            0.83    0.83    1.00    0.53    0.85       0.93      0.66   0.47
4500            0.83    0.83    1.00    0.55    0.85       0.93      0.67   0.49
5072 (final)    0.84    0.84    1.00    0.56    0.86       0.94      0.68   0.51

Table 8: Fine-grained verbalized rebus solving performances of the Phi-3 Mini model on the test set including both ID and OOD examples at various fine-tuning steps. Bold denotes best overall performances.
Prompt
  Solve the clues provided between parentheses to obtain a first pass, and use the solution key to obtain the rebus' solution.

  Rebus: M [Two attacking footballers] N [Used for eating ice cream] [Barks and bites] NIA
  Solution key: 11 5

Generation
  Let's proceed to solve the rebus step by step:
  - M = M
  - [Two attacking footballers] = ali (wings)
  - N = N
  - [Used for eating ice cream] = coni (cones)
  - [Barks and bites] = cane (dog)
  - NIA = NIA
  (Definition resolution)

  First pass: M ali N coni cane NIA
  (First pass)

  Let's now compose the solution by following the solution key:
  11 = Malinconica (melancholic)
  5 = nenia (lullaby)
  (Solution segmentation)

  Solution: Malinconica nenia (melancholic lullaby)
  (Solution)

Table 9: English translation of the template used in fine-tuning and prompting experiments with highlighted reasoning stages, with translated first pass and solution words in parentheses. Example rebus by A. Rizzi, Settimana Enigmistica n. 4656.

Rebus: MOR [Musa della commedia] D1 L [Si accendono per vedere] D2 NO [Uomini di intelligenza superiore] D3
Chiave di lettura: 7 12

Step   GPT-4o         Claude 3.5S    Phi-3
D1     Talia          Talia          Talia
D2     luci           luci           luci
D3     geni           geni           genii
S7     Mortali        Mortali        Mortali
S12    allucinogeni   allucinogeni   allucinogeni
Soluzione: MORTali aLluciNOgeni

Table 10: Examples of LLM generations for a rebus by De Vico C., Domenica Quiz n. 5. Correct guesses and errors are denoted for predicted first pass definitions (D1,...,N) and solution words (Si, with i being the i-th solution key value).

(Table 11, top)
Rebus: PRI [Ricoperto di sudore] D1 MIN [Gli altari del tempio] D2 DO [Un ordigno bellico] D3 [Possono essere "di serie" in certi tornei] D4 SSO
Chiave di lettura: 5 2 8 6 2 6

Step   GPT-4o    Claude 3.5S   Phi-3
D1     sudato    madido        madido
D2     are       are           are
D3     bomba     mina          mina
D4     teste     teste         teste
S5     Prima     Prima         Prima
S2     di        di            di
S8     sudare    minaccia      dominare
S6     molto     teste         dominate
S2     di        di            se
S6     testa     dosso         stesso
Soluzione: PRIma di doMINare DOmina te steSSO

(Table 11, mid)
Rebus: AT [Si alzano nel camping] D1 [Emoziona pescatori e navigatori] D2 [Come una nota Foresta] D3 MEN [Quadro ad olio] D4 S [Atteggiamento da modella] D5
Chiave di lettura: 9 11 2 5

Step   GPT-4o        Claude 3.5S   Phi-3
D1     tende         tende         tende
D2     marea         mare          rete
D3     nera          nera          nera
D4     dipinto       tela          tela
D5     posa          posa          posa
S9     Attenderemo   Attendere     Attendere
S11    mareanera     marenamente   teneramente
S2     di            la            la
S5     posa          posa          sposa
Soluzione: ATtendere teneraMENte la Sposa

(Table 11, bottom)
Rebus: B [Una figura geometrica] D1 [La si impugna per far girare un congegno] D2 DA [Le produce il rovo] D3
Chiave di lettura: 10 7 1' 5

Step   GPT-4o              Claude 3.5S   Phi-3
D1     cerchio             rombo         ellissi
D2     manovella           manovella     leva
D3     more                more          more
S10    Bcerchiomanovella   Bromomanov    Bellissile
S7                         elladam       vadamore
S1'    d'                  o'            '
S5     amore               more          remo
Soluzione: Bellissima novella D' Amore

Table 11: Examples of LLM generations for rebuses by Baruffa, Rebus n. 12 (top), Contini C., La Settimana Enigmistica n. 4102 (mid) and Liosca, La Settimana Enigmistica n. 4581 (bottom). Correct guesses and errors are denoted for predicted first pass definitions (D1,...,N) and solution words (Si, with i being the i-th solution key value).
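The solution-key step evaluated in these tables can be sketched as follows. This is a minimal illustration, not the paper's pipeline, and the function name is ours: the first-pass letters are concatenated and re-segmented into words whose lengths are given by the key.

```python
def apply_solution_key(first_pass: str, key: list) -> list:
    """Split the concatenated first-pass letters into words whose
    lengths follow the solution key (Italian: chiave di lettura)."""
    letters = first_pass.replace(" ", "")
    words, pos = [], 0
    for length in key:
        words.append(letters[pos:pos + length])
        pos += length
    return words

# Example from Table 9: "M ali N coni cane NIA" with key "11 5"
# yields "Malinconica nenia" up to capitalization.
segments = apply_solution_key("M ali N coni cane NIA", [11, 5])
print(" ".join(segments).lower())  # malinconica nenia
```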