=Paper= {{Paper |id=Vol-3878/96_main_long |storemode=property |title=Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses |pdfUrl=https://ceur-ws.org/Vol-3878/96_main_long.pdf |volume=Vol-3878 |authors=Gabriele Sarti,Tommaso Caselli,Malvina Nissim,Arianna Bisazza |dblpUrl=https://dblp.org/rec/conf/clic-it/SartiCNB24 }} ==Non Verbis, Sed Rebus: Large Language Models Are Weak Solvers of Italian Rebuses== https://ceur-ws.org/Vol-3878/96_main_long.pdf
Non Verbis, Sed Rebus:
Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti¹*, Tommaso Caselli¹, Malvina Nissim¹, Arianna Bisazza¹
¹ Center for Language and Cognition (CLCG), University of Groningen, Oude Kijk in ’t Jatstraat 26, Groningen, 9712EK, The Netherlands


Abstract
Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models’ performance. However, we find that performance gains from training are largely driven by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models’ linguistic proficiency and sequential instruction-following skills.

Keywords
Large language models, Sequential reasoning, Puzzle, Rebus, Crosswords, Enigmistica Italiana



1. Introduction

Complex games such as chess and Go have long been a source of inspiration to develop more flexible and robust AI systems [1, 2]. Recent developments in NLP suggested that creative language games could be exploited as promising benchmarks for quantifying the ability of large language models (LLMs) to carry out multi-step knowledge-intensive reasoning tasks under pre-specified constraints [3]. While crossword puzzles have historically been the main focus of such efforts [4], other categories of linguistic games have received only marginal attention, especially for languages other than English. A prominent example of less-studied language games is the rebus, a visual puzzle combining images and graphic signs to encode a hidden phrase. Indeed, rebus solving is a complex, multi-step process requiring factual knowledge, contextual understanding, vocabulary usage, and reasoning within pre-defined constraints – a set of fundamental skills needed to address a variety of real-world tasks.

In this work, we conduct the first open evaluation of LLMs’ rebus-solving capabilities, focusing specifically on the Italian language. We propose a novel strategy to derive text-only verbalized rebuses from transcribed intermediate rebus solutions and use it to produce a large collection of more than 80k verbalized rebuses. We then evaluate the rebus-solving skills of state-of-the-art LLMs, including open-source systems and proprietary models, via few-shot prompting. Moreover, we fine-tune a small but capable LLM on verbalized rebus solving, outperforming state-of-the-art systems by a wide margin. Finally, we conduct a fine-grained assessment of LLMs’ sequential reasoning steps, explaining model performance in terms of word complexity and memorization.

[Figure 1: image of a rebus whose illustrated elements are Ali (wings), Coni (cones), and Cane (dog)]
First Pass: M ali - N coni - cane NIA
Verbalized Rebus: M [Due calciatori attaccanti] (Two attacking footballers) N [Usati per mangiare il gelato] (Used for eating ice cream) [Abbaia e morde] (Barks and bites) NIA
Solution key (# of chars/word): 11 5
Solution: Malinconica nenia (melancholic lullaby)

Figure 1: An example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. We use verbalized rebuses to test LLMs’ sequential instruction following capabilities. Image from Settimana Enigmistica n. 4656, © Bresi S.r.l.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04 – 06, 2024, Pisa, Italy
* Corresponding author.
Emails: g.sarti@rug.nl (G. Sarti); t.caselli@rug.nl (T. Caselli); m.nissim@rug.nl (M. Nissim); a.bisazza@rug.nl (A. Bisazza)
Websites: https://gsarti.com (G. Sarti); https://cs.rug.nl/~bisazza (A. Bisazza)
ORCID: 0000-0001-8715-2987 (G. Sarti); 0000-0003-2936-0256 (T. Caselli); 0000-0001-5289-0971 (M. Nissim); 0000-0003-1270-3048 (A. Bisazza)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).







Beyond rebus solving, our evaluation sheds light on the limits of current LLMs in multi-step reasoning settings, highlighting challenges with their application to complex sequential instruction-following scenarios.^1

2. Background and Related Work

Italian Enigmistica and Rebuses  The Italian language is characterized by a rich and long-standing tradition of puzzle games, including rebuses, dating back to the 19th century [5].^2 In Italian rebuses, a first pass (prima lettura) representing an intermediate solution of the puzzle is produced by combining graphemes with underlying image elements in a left-to-right direction (Figure 1). Then, the letters and words of the first pass undergo a re-segmentation (cesura) according to a solution key (chiave di lettura^3), which specifies the length of the words in the solution (frase risolutiva). The verbalized rebuses we introduce in this work are variants of textual rebuses (rebus descritto or verbis), where the text-based puzzle is crafted by replacing first pass words with their crossword definitions in a templated format (Figure 1).

Linguistic Puzzles as NLP Progress Metrics  Language games have recently been adopted as challenging tasks for LLM evaluation [3, 9, 10]. While works in this area have historically focused on English crosswords [11, 12, 4, 13], recent tests focus on a more diverse set of games such as the New York Times’ “Connections” [14] and “Wordle” [15]. Automatic crossword solvers were also developed for French [16], German [17] and Italian [18, 19], while didactic crossword generators are available for Italian [20] and Turkish [21]. Relatedly, the Italian evaluation campaign EVALITA^4 recently hosted two shared tasks focusing on the word-guessing game “La Ghigliottina” (The Guillotine) [22, 23]. To our knowledge, our work is the first to attempt the computational modeling and evaluation of rebus-solving systems. Importantly, language games such as rebuses are not easily translatable into other languages due to their structural and cultural elements. This makes them a scarce but valuable resource for language-specific evaluations of language processing systems.

LLMs as Sequential Reasoners  State-of-the-art LLMs were shown to struggle to follow sequential instructions presented in a single query [24], but their performances improved significantly with ad-hoc training [25]. This acts as an initial motivation for our rebus-solving fine-tuning experiments. In our evaluation, we also adopt few-shot prompting [26] and chain-of-thought reasoning [27], which were both shown to strongly improve LLMs’ abilities when solving complex multi-step tasks.

3. Experimental Setup

Data  We begin by extracting all rebuses’ first passes and solutions available on Eureka5^5, an online repository of Italian puzzles. We refer to the resulting dataset containing 223k unique rebuses sourced from various publications as EurekaRebus. For crossword definitions, we use ItaCW [20], containing 125k unique definition-word pairs. To enable verbalization, we select only EurekaRebus examples in which all first pass words match an existing ItaCW definition, retaining 83,157 examples for our modeling experiments.^6 Since several ItaCW words are associated with multiple definitions, we randomly sample definitions to promote diversity in the resulting verbalized rebuses. A test set of 2k examples^7 is kept aside for evaluation, and the remaining 81k examples are used for model training.

Models  We fine-tune Phi-3 Mini 3.8B 4K [28], the most capable LLM below 4B parameters for a wide range of Italian language tasks^8. We use quantized low-rank adapters (QLoRA [29, 30]) for efficient fine-tuning with Unsloth^9 and Transformers [31], training the model for 5,000 steps with a batch size of 16 over 81k examples. To compare our model’s performances, we select GPT-4o [32] and Claude-3.5 Sonnet [33] as the current state of the art for proprietary LLMs, and the instruction-tuned variants of Qwen-2 72B [34] and LLaMA-3 70B [35] as the best-performing open-source LLMs according to the Invalsi Italian benchmark [36]. These four systems are used as untrained baselines thanks to their instruction-following abilities and are prompted for rebus solving in a few-shot setting.

Format  Table 1 presents an example in the templated format used for fine-tuning Phi-3.^10 The model is prompted to reason step by step by 1) solving crossword definitions sequentially (definition resolution); 2) producing a first pass copying letters and definitions’ words; 3) re-segmenting it into solution words based on the solution key (solution segmentation); and finally 4) producing the solution by copying re-segmented words. We automatically convert rebuses into this format by deriving the solution key from solution word lengths and dynamically infilling the available information into the template (a minimal sketch of this conversion is shown below). We use a similar format for prompting experiments, with five in-context step-by-step demonstrations and an explicit instruction asking the model to stick to the previous examples’ format to streamline solution parsing.

Footnotes:
^1 Code, data and models are available on GitHub and Hugging Face.
^2 Refer to Miola [6], Bartezzaghi [7], Ichino [8] for a comprehensive overview of peculiarities and norms in modern Italian rebuses.
^3 Referred to as diagramma in jargon.
^4 https://www.evalita.it
^5 http://www.eureka5.it, additional details in Appendix A. Rebus illustrations are not available in Eureka5.
^6 Since verbalized rebuses are produced from textual contents only, crossword definitions may refer to different word meanings (e.g., [Two soccer players] is used to represent the word “wings” in Figure 1 despite not matching the word sense “bird wings” of the original image). This does not affect the validity of our task.
^7 Composed of Test id and Test ood, described in Section 5.
^8 https://hf.co/spaces/FinancialSupport/open_ita_llm_leaderboard
^9 https://github.com/unslothai/unsloth
^10 An English example is available in Table 9.
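The verbalization and solution-key derivation described above can be illustrated with a short, self-contained sketch. This is not the authors’ released code: the token format (uppercase standalone graphemes, lowercase image-derived words) and the `definitions` lookup built from an ItaCW-like resource are assumptions made for illustration.

```python
import random

# Hypothetical inputs: a transcribed first pass split into tokens (uppercase
# graphemes, lowercase image words), the gold solution, and a word -> [definitions]
# lookup derived from an ItaCW-like resource.
first_pass = ["M", "ali", "N", "coni", "cane", "NIA"]
solution = "Malinconica nenia"
definitions = {"ali": ["Due calciatori attaccanti"],
               "coni": ["Usati per mangiare il gelato"],
               "cane": ["Abbaia e morde"]}

def verbalize(first_pass, solution, definitions, rng=random):
    """Replace first pass words with a randomly sampled crossword definition
    and derive the solution key from the solution word lengths."""
    parts = []
    for token in first_pass:
        if not token.isupper() and token.lower() in definitions:
            parts.append(f"[{rng.choice(definitions[token.lower()])}]")
        else:
            parts.append(token)  # keep standalone letters verbatim
    key = " ".join(str(len(w)) for w in solution.split())
    return f"Rebus: {' '.join(parts)}\nChiave di lettura: {key}"

print(verbalize(first_pass, solution, definitions))
# Rebus: M [Due calciatori attaccanti] N [Usati per mangiare il gelato] [Abbaia e morde] NIA
# Chiave di lettura: 11 5
```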




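The fine-tuning setup described in the Models paragraph can be approximated with standard Hugging Face tooling. The following is a minimal sketch, not the authors’ Unsloth-based training script: only the Phi-3 Mini 4K base model, the 5,000 training steps, the batch size of 16, and 4-bit QLoRA are taken from the paper; the LoRA rank and alpha, the assumed Phi-3 projection module names, the learning rate, and the `train_examples` list (dicts with a `text` field holding a Table 1 prompt plus gold generation) are illustrative assumptions.

```python
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the base model in 4-bit and attach low-rank adapters (QLoRA-style).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True,
                                           bnb_4bit_compute_dtype=torch.bfloat16),
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",          # assumed adapter size
    target_modules=["qkv_proj", "o_proj"],               # assumed Phi-3 module names
))

def tokenize(example):
    # `text` concatenates the verbalized rebus prompt and the gold step-by-step generation.
    return tokenizer(example["text"], truncation=True, max_length=4096)

train_ds = Dataset.from_list(train_examples).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phi3-rebus", max_steps=5000,
                           per_device_train_batch_size=16, learning_rate=2e-4,
                           logging_steps=100),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```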



Prompt:
Risolvi gli indizi tra parentesi per ottenere una prima lettura, e usa la chiave di lettura per ottenere la soluzione del rebus.

Rebus: U [Lo è il passacavallo] LO [È fatta di vimini] F F [Decimi di chilo] S [Disusato soprabito] A [Un rampicante dei Tropici]

Chiave di lettura: 3 6 12 8

Generation:
Procediamo alla risoluzione del rebus passo per passo:

Definition resolution:
- U = U
- [Lo è il passacavallo] = nave
- LO = LO
- [È fatta di vimini] = cesta
- F F = F F
- [Decimi di chilo] = etti
- S = S
- [Disusato soprabito] = tait
- A = A
- [Un rampicante dei Tropici] = liana

First pass:
Prima lettura: U nave LO cesta F F etti S tait A liana

Ora componiamo la soluzione seguendo la chiave risolutiva:

Solution segmentation:
3 = Una
6 = veloce
12 = staffettista
8 = italiana

Solution:
Soluzione: Una veloce staffettista italiana

Table 1: Template used in fine-tuning and prompting experiments with highlighted reasoning stages. Example rebus by Il Piacentino, Settimana Enigmistica n. 2942.

Metrics  For our granular evaluation of rebus-solving performance, we adopt the following set of metrics focusing on the first passes (FP) and solutions (S) generated by LLMs:

• Definition (Def.): Proportion of correctly guessed words during definition resolution.
• First Pass Words/Letter Accuracy: Proportion of correct words and letters in the generated first pass. Lower scores may indicate issues with assembling a first pass from previous information.
• First Pass Exact Match (EM): Proportion of generated first passes matching the gold reference.
• Solution Key Match: Proportion of generated solution words matching the lengths specified by the solution key. Lower scores may indicate difficulty in respecting the given length constraints.
• Solution First Pass Match: Proportion of first pass characters employed to construct solution words. Lower scores indicate issues with using generated first pass characters in the solution.^11
• Solution Words Accuracy: Proportion of correct words in the generated solution.
• Solution Exact Match (EM): Proportion of generated solutions matching the gold reference.

Footnotes:
^11 In practice, we define this as 1 − CER(FP, S), where CER is the character error rate [37] between the two sequences (lowercased, whitespace removed) computed with Jiwer.
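Computing these metrics requires recovering the intermediate steps from the model’s free-form generation. A minimal parsing sketch is shown below; the regular expressions assume the templated output of Table 1 and are illustrative, not the exact parsing code used in the experiments.

```python
import re

def parse_generation(text: str) -> dict:
    """Extract definition guesses, first pass, and solution from a templated generation."""
    # Definition resolution lines look like "- [definition] = word".
    guesses = re.findall(r"^- \[(.+?)\] = (.+)$", text, flags=re.MULTILINE)
    first_pass = re.search(r"Prima lettura:\s*(.+)", text)
    solution = re.search(r"Soluzione:\s*(.+)", text)
    return {
        "definitions": {d.strip(): w.strip() for d, w in guesses},
        "first_pass": first_pass.group(1).strip() if first_pass else "",
        "solution": solution.group(1).strip() if solution else "",
    }
```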




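Given parsed outputs and gold references, the headline metrics can then be sketched as follows. This is an illustrative implementation of the definitions above (exact match, word accuracy, key match, and the 1 − CER formulation of footnote 11 via the jiwer package), not the authors’ evaluation script.

```python
from jiwer import cer  # character error rate, as referenced in footnote 11

def norm(s: str) -> str:
    # Lowercase and drop whitespace, as specified for the FP Match metric.
    return "".join(s.lower().split())

def word_accuracy(pred: str, gold: str) -> float:
    pred_w, gold_w = pred.split(), gold.split()
    hits = sum(p.lower() == g.lower() for p, g in zip(pred_w, gold_w))
    return hits / max(len(gold_w), 1)

def key_match(pred_solution: str, key: list[int]) -> float:
    hits = sum(len(w) == k for w, k in zip(pred_solution.split(), key))
    return hits / max(len(key), 1)

def fp_match(pred_first_pass: str, pred_solution: str) -> float:
    # Solution First Pass Match: 1 - CER between first pass and solution characters.
    return 1.0 - cer(norm(pred_first_pass), norm(pred_solution))

def exact_match(pred: str, gold: str) -> float:
    return float(norm(pred) == norm(gold))
```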



4. Results

Table 2 presents our evaluation results. We observe that all prompted models perform poorly on the task, with the overall best prompted system (Claude 3.5 Sonnet) obtaining the correct solution only for 24% of the 2k tested examples. Notably, open-source systems perform significantly worse than proprietary ones, producing correct first passes only for 4% of the examples, and next to no correct solutions. Our fine-tuned system largely outperforms all state-of-the-art prompted models, predicting the correct solution in 51% of cases. From first pass metrics, it is evident that these results can be largely explained by the poor word-guessing capabilities of the models, which are greatly improved with fine-tuning. For prompted models, the slight decrease in scores between Def. and FP Words also highlights issues with copying predicted words in the expected format. Finally, we observe that fine-tuning strongly improves the constraint-following abilities of our system, with prompted systems being less strict in applying length and letter-choice constraints to their solutions (Key/FP Match).

Model               Setup           Def.   FP Words  FP Letters  FP EM   S Key Match  S FP Match  S Words  S EM
LLaMA-3 70B         5-shot prompt   0.22   0.20      0.60        0.04    0.16         0.51        0.03     0.00
Qwen-2 72B          5-shot prompt   0.28   0.25      0.76        0.04    0.20         0.52        0.04     0.00
GPT-4o              5-shot prompt   0.55   0.51      0.83        0.15    0.53         0.74        0.27     0.11
Claude-3.5 Sonnet   5-shot prompt   0.66   0.62      0.90        0.28    0.83         0.82        0.43     0.24
Phi-3 3.8B (ours)   fine-tuned      0.84   0.84      1.00        0.56    0.86         0.94        0.68     0.51

Table 2: Fine-grained verbalized rebus solving performances of various LLMs. Bold denotes best overall performances, and underline marks best training-free results.

5. What Motivates Model Performances?

In light of the strong performances achieved by our relatively small fine-tuned system, this section conducts an in-depth investigation to identify the factors motivating such performance improvements.

Word Complexity and Frequency Affect LLM Fine-tuning Performance  For every word in the first passes and solutions of test set examples, we measure LLMs’ overall accuracy in predicting it over the full test set. We then correlate this score with various quantities that could motivate LLMs’ performances. More specifically, we use 1) the word frequency in the training set; 2) the word frequency in Paisà [38], a large web Italian corpus; and 3) the length of the word (number of characters). We find a significant positive correlation (ρ = 0.44) between first pass word prediction accuracy and training frequency for the fine-tuned Phi-3 model, suggesting that model performance is strongly related to training coverage. Word length is also found to negatively affect our model’s performance, albeit to a smaller extent (ρ = −0.11). The performance of prompted models is unrelated to both properties for first pass words, indicating that these results are the product of fine-tuning.^12

LLM Fine-Tuning Fails to Generalize to Unseen Words  To further confirm the importance of fine-tuning word coverage in determining model performances, we evaluate our fine-tuned model in out-of-distribution settings. For this evaluation, the 2k examples of the test set from previous sections are divided into two subsets: one in which all first pass words were seen during fine-tuning by Phi-3 (Test id, 1061 examples) and one in which, for every example, at least one first pass word was unseen in training (Test ood, 939 examples). Intuitively, if Phi-3 performance is mainly driven by memorizing fine-tuning data, introducing OOD words should produce a significant drop in model performances. Results shown in Table 3 confirm that this is indeed the case. We find Phi-3 performances to be near-perfect on seen first pass words (FP W. ID = 0.96) in both test sets, with a major drop for OOD words (FP W. OOD = 0.20). This produces second-order effects on subsequent steps, causing the FP EM results to drop by 71% (FP EM Test Δ), while significantly impacting downstream solution accuracies. On the contrary, GPT-4o few-shot prompting performances remain nearly identical on both splits, confirming that these results are not the product of a skewed data selection process. Overall, these results strongly suggest that memorization is the main factor behind the strong rebus-solving performance of our fine-tuned LLM.

             GPT-4o                           Phi-3 (ours)
Metric       Test id   Test ood   Test Δ      Test id   Test ood   Test Δ
FP W. ID     0.52      0.51       -0.01       0.96      0.96       0.00
FP W. OOD    -         0.44       -           -         0.20       -
FP EM        0.16      0.14       -0.02       0.89      0.18       -0.71
S W. ID      0.29      0.26       -0.03       0.92      0.49       -0.43
S W. OOD     0.18      0.16       -0.02       0.63      0.20       -0.40
S EM         0.12      0.09       -0.03       0.82      0.16       -0.66

Table 3: Model performances for test subsets containing only in-domain (Test ID) or some out-of-domain (Test OOD) first pass words. W. ID and W. OOD are accuracies for ID and OOD words in first pass (FP) and solution (S) sequences. Test Δ = Test ID − Test OOD performance.

Manual Inspection  We conclude by manually evaluating some generations produced by the best-performing LLMs. Table 4 presents two examples with definitions (D) and solution (S) words predicted by three LLMs, with more examples provided in Appendix C. We use naw as shorthand for “Not A Word” to mark nonsensical terms.

In the first example, Phi-3 correctly predicts all first pass and solution words. On the contrary, the other models make several mistakes in the first pass, leading to incorrect solutions. Both prompted models tend to ignore first pass words when these cannot be assembled to form meaningful, length-fitting solution words. For example, for D1 GPT-4o predicts p (naw), which would lead to the solution word “SAPpTE” (naw); instead, the model predicts S8 = “Spettacolo” (show). In particular, GPT-4o appears to prioritize grammatically correct solutions at the cost of ignoring first pass words and solution key length constraints, while Claude 3.5S shows an improved ability to follow these constraints, as confirmed by the Key/FP Match results in Table 2.

Footnotes:
^12 Paisà frequency is never found to correlate significantly. Full correlation results are available in Table 6.
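The word-level correlation analysis described above can be sketched as follows, using scipy’s Spearman correlation. The `word_accuracy` and `train_freq` dictionaries are hypothetical stand-ins for the per-word quantities described in the text.

```python
from scipy.stats import spearmanr

# Hypothetical per-word statistics gathered over the test set:
# word -> prediction accuracy and word -> training-set frequency.
words = sorted(word_accuracy)  # assumes both dicts share the same keys
acc = [word_accuracy[w] for w in words]

rho_freq, p_freq = spearmanr(acc, [train_freq[w] for w in words])
rho_len, p_len = spearmanr(acc, [len(w) for w in words])
print(f"accuracy vs. training frequency: rho={rho_freq:.2f} (p={p_freq:.1e})")
print(f"accuracy vs. word length:        rho={rho_len:.2f} (p={p_len:.1e})")
```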




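The in-domain/out-of-distribution split used in the generalization analysis can be reproduced conceptually with a few lines. This is a sketch under the assumption that each example exposes its gold first pass words; the split criterion and subset sizes come from the paper, while the data structures are illustrative.

```python
def split_id_ood(test_examples, train_examples):
    """Assign a test example to Test ood if at least one of its first pass
    words was never seen in the fine-tuning data, and to Test id otherwise."""
    seen = {w.lower() for ex in train_examples for w in ex["first_pass_words"]}
    test_id, test_ood = [], []
    for ex in test_examples:
        if all(w.lower() in seen for w in ex["first_pass_words"]):
            test_id.append(ex)
        else:
            test_ood.append(ex)
    return test_id, test_ood  # reported sizes: 1061 and 939 examples
```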



Rebus: SAP [La porta della breccia] D1 TE [La pinza del granchio] D2 SBA [Si legge su alcuni orologi] D3 G [Le sue coccole sono aromatiche] D4 V [Un gioco con dadi e pedine] D5 D [Sono verdi in gioventù] D6
Chiave di lettura: 8 3 2 12 7 5

Step   GPT-4o       Claude 3.5S    Phi-3
D1     p            one            pia
D2     chela        chela          chela
D3     ora          data           data
D4     ginepro      lio            ginepro
D5     ludo         oca            oca
D6     acerbi       anni           anni
S8     Spettacolo   Saponate       Sappiate
S3     che          che            che
S2     fa           la             la
S12    sognare      sbadataggine   sbadataggine
S7     ogni         vocando        provoca
S5     sera         danni          danni
Soluzione: SAPpiaTE che la SBAdataGgine proVoca Danni

Rebus: STU [Si salva otturandolo] D1 S [Ha foglie seghettate] D2 AL [Lo è l’operaio che lavora in cantiere] D3 G [Un uomo... non all’altezza] D4
Chiave di lettura: 11 7 2 7

Step   GPT-4o          Claude 3.5S    Phi-3
D1     tappo           falla          dente
D2     acero           ortica         aro
D3     edile           edile          edile
D4     nano            nano           nano
S11    Stupaccerone    Stufallassor   Studentesaro
S7     salendo         ticale         aledile
S2     al              di             gi
S7     genano          Legnano        nanano
Soluzione: STUdenteSsa liceALe di LeGnano

Table 4: Examples of LLM generations for rebuses by Slam, Nuova Enigmistica Tascabile n. 2802 (top) and Grizzly, Domenica Quiz n. 2 (bottom). Correct guesses and errors are denoted for predicted first pass definitions (D1,...,N) and solution words (S_i, with i being the i-th solution key value).

In the second example, the first pass word D2 = salice (willow) is OOD for Phi-3. Consequently, the model produces the incorrect prediction aro (naw), and the error is propagated to all solution words, as previously observed in the Test OOD column of Table 3. Prompted models also underperform in this example, with errors on D1 and D2 propagating to most solution words. However, we note that the incorrect D1 and D2 predictions of Claude 3.5S satisfy the provided definitions, suggesting that access to more explicit information about the given constraints could further boost LLMs’ performance on this task.

6. Discussion and Conclusion

This work introduced a verbalized rebus-solving task and dataset for evaluating LLMs’ sequential instruction following skills for the Italian language. We crafted a large collection of 83k verbalized rebuses by combining rebus transcriptions with crossword definitions and used it to evaluate the rebus-solving skills of state-of-the-art LLMs. Our experiments revealed the challenging nature of this task, with even the most capable prompted models achieving only 24% accuracy on solutions.

While fine-tuning a smaller LLM dramatically improved performance to 51% solution accuracy, our analysis uncovered that these gains were largely driven by memorization and do not generalize to out-of-distribution examples. These results suggest important limitations in the generalization capabilities of current systems for sequential instruction following tasks. Our manual analysis further shows that LLMs seldom account for length constraints when solving definitions, despite the fundamental role of these cues in restricting the pool of possible words. These results suggest that search-based approaches accounting for constraints more explicitly might improve puzzle structure adherence, as previously shown by Chen et al. [39]. Other augmentation techniques employing LLM reformulation skills can also be explored to mitigate overfitting.

Future work in this area should focus on expanding similar evaluations to a wider set of languages, input modalities, and puzzle categories, creating a comprehensive benchmark to test LLMs’ puzzle-solving skills. Importantly, the task of solving visual rebuses and their more convoluted variants^13 remains far beyond the current capabilities of vision-language models. Hence, solving these puzzles automatically can be considered an important milestone in developing multimodal AI systems for constrained multi-step reasoning tasks. Our results confirm that the challenging nature of rebuses, even in their verbalized form, makes this task valuable for assessing future progress in LLMs’ linguistic proficiency and sequential reasoning abilities. Finally, our rebus-solving LLM can facilitate future interpretability work investigating the mechanisms behind factual recall and multi-step reasoning in transformer models [40].

Limitations  Our analysis was limited to a relatively small set of models and a single prompt template obtained after minimal tuning. Further experiments are needed to verify that memorization patterns after fine-tuning remain relevant for other model sizes, prompt formats, and training regimes, particularly for full-weight training approaches.

Footnotes:
^13 For example, rebuses requiring first pass anagrams (anarebus) or dynamic relations derived from multi-scene analysis (stereorebus).
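To make the constraint-based direction mentioned in Section 6 concrete, the sketch below checks whether a candidate first pass can be re-segmented into known words under the solution key. It only illustrates the idea of explicitly enforcing length and letter-choice constraints, for instance to prune candidate definition guesses in a search-based solver; it is not part of the paper’s experiments, and the vocabulary is a placeholder assumption.

```python
def key_segmentation(first_pass: str, key: list[int]) -> list[str] | None:
    """Re-segment the first pass letters according to the solution key.
    Returns None if the letters cannot fill the key exactly."""
    letters = "".join(first_pass.lower().split())
    if sum(key) != len(letters):  # letter-choice constraint: every character must be used
        return None
    words, start = [], 0
    for k in key:  # length constraint: cut the letter sequence according to the key
        words.append(letters[start:start + k])
        start += k
    return words

def is_valid_candidate(first_pass: str, key: list[int], vocabulary: set[str]) -> bool:
    """Constraint check a search-based solver could apply to candidate first passes."""
    words = key_segmentation(first_pass, key)
    return words is not None and all(w in vocabulary for w in words)

vocab = {"una", "veloce", "staffettista", "italiana"}  # placeholder vocabulary
print(key_segmentation("U nave LO cesta F F etti S tait A liana", [3, 6, 12, 8]))
# ['una', 'veloce', 'staffettista', 'italiana']
print(is_valid_candidate("U nave LO cesta F F etti S tait A liana", [3, 6, 12, 8], vocab))
# True
```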







Acknowledgments

Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We are grateful to the Associazione Culturale “Biblioteca Enigmistica Italiana - G. Panini” for making its rebus collection freely accessible on the Eureka5 platform, and to Valeriya Zelenkova for her valuable comments on the first version of this work. We also thank the CLiC-it 2024 reviewers for their valuable feedback.

References

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489. doi:10.1038/nature16961.
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362 (2018) 1140–1144. doi:10.1126/science.aar6404.
[3] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409–11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[4] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073–3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[5] D. Tolosani, Enimmistica, Hoepli, Milan, 1901.
[6] E. Miola, Che cos’è un rebus, Carocci, 2020.
[7] S. Bartezzaghi, Parole in gioco: Per una semiotica del gioco linguistico, Bompiani, 2017.
[8] P. Ichino, L’ora desiata vola: guida al mondo del rebus per solutori (ancora) poco abili, Bompiani, Milan, 2021.
[9] R. Manna, M. P. di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz, S. Lukin (Eds.), Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97–106. URL: https://aclanthology.org/2024.games-1.11.
[10] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, ArXiv (2024). URL: https://arxiv.org/abs/2402.11291.
[11] M. L. Littman, G. A. Keim, N. Shazeer, A probabilistic approach to solving crossword puzzles, Artificial Intelligence 134 (2002) 23–55. URL: https://www.sciencedirect.com/science/article/pii/S000437020100114X. doi:10.1016/S0004-3702(01)00114-X.
[12] M. Ernandes, G. Angelini, M. Gori, Webcrow: A web-based system for crossword solving, in: AAAI Conference on Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11590323_37.
[13] A. Sadallah, D. Kotova, E. Kochmar, Are LLMs good cryptic crossword solvers?, ArXiv (2024). URL: https://arxiv.org/abs/2403.12094.
[14] G. Todd, T. Merino, S. Earle, J. Togelius, Missed connections: Lateral thinking puzzles for large language models, ArXiv (2024). URL: https://arxiv.org/abs/2404.11730.
[15] B. J. Anderson, J. G. Meyer, Finding the optimal human strategy for Wordle using maximum correct letter probabilities and reinforcement learning, ArXiv (2022). URL: https://arxiv.org/abs/2202.00557.
[16] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé, F. Simoes, K. Zeinalipour, A. Zugarini, M. Gori, The Webcrow French crossword solver, in: Intelligent Technologies for Interactive Entertainment, 2023. URL: https://link.springer.com/chapter/10.1007/978-3-031-55722-4_14.
[17] A. Zugarini, T. Rothenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die Rätselrevolution: Automated German crossword solving, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[18] G. Angelini, M. Ernandes, M. Gori, Solving Italian crosswords using the web, in: International Conference of the Italian Association for Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11558590_40.






[19] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-Instruct: Text-based clue generation for educational crossword puzzles, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347–3356. URL: https://aclanthology.org/2024.lrec-main.297.
[20] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[21] K. Zeinalipour, Y. G. Keptig, M. Maggini, L. Rigutini, M. Gori, A Turkish educational crossword puzzle generator, ArXiv abs/2405.07035 (2024). URL: https://arxiv.org/abs/2405.07035v2.
[22] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-AI@EVALITA2020: Evaluating artificial players for the language game “La Ghigliottina” (short paper), EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://doi.org/10.4000/books.aaccademia.7488.
[23] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13–26. doi:10.1109/TCIAIG.2014.2355859.
[24] X. Chen, B. Liao, J. Qi, P. Eustratiadis, C. Monz, A. Bisazza, M. de Rijke, The SIFo benchmark: Investigating the sequential instruction following ability of large language models, 2024. URL: https://arxiv.org/abs/2406.19999. arXiv:2406.19999.
[25] H. Hu, S. Yu, P. Chen, E. M. Ponti, Fine-tuning large language models with sequential instructions, ArXiv (2024). URL: https://arxiv.org/abs/2403.07794.
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[27] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 24824–24837. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.
[28] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, et al., Phi-3 technical report: A highly capable language model locally on your phone, ArXiv (2024). URL: https://arxiv.org/abs/2404.14219.
[29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations (ICLR 2022), OpenReview, Online, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[30] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 10088–10115. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[32] OpenAI, Hello GPT-4o, Website, 2024. URL: https://openai.com/index/hello-gpt-4o.
[33] Anthropic, Claude 3.5 Sonnet, Website, 2024. URL: https://www.anthropic.com/news/claude-3-5-sonnet.






[34] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou,             Statistic              EurekaRebus           ItaCW-filtered
     C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin,        # examples                   222089              83157
     J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma,              # authors                     8138                5046
     J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu,          Year range                 1800 - 2024         1869 - 2024
     K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang,                                          First pass
     P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang,
     S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng,          # unique words                38977               8960
     X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan,              Avg./SD words/ex.           3.50/1/48          3.08/1.00
                                                                     Avg./SD word len.           6.51/1.96           5.70/1.60
     Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui,
                                                                     Avg./SD FP len.            26.45/11.19         25.74/8.73
     Z. Zhang, Z. Fan, Qwen2 technical report, 2024.
     URL: https://arxiv.org/abs/2407.10671.                                                     Solution
[35] M. AI, Introducing meta llama 3: The most capable               # unique words                 75718              42558
     openly available llm to date, Website, 2024. URL:               Avg./SD words/ex.            3.02/1.60          2.80/1.21
     https://ai.meta.com/blog/meta-llama-3.                          Avg./SD word len.            8.07/2.30          7.79/2.23
[36] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino,            Avg./SD Sol. len.           19.47/8.44         18.81/6.06
     A. Seveso, Disce aut deficere: Evaluating llms profi-
                                                                    Table 5
     ciency on the invalsi italian benchmark, 2024. URL:
                                                                    Statistics for the full EurekaRebus dataset and the crosswords-
     https://arxiv.org/abs/2406.17535.                              filtered subset used in this work. Avg./SD = Average/standard
[37] A. Morris, V. Maier, P. Green, From wer and ril                deviation.
     to mer and wil: improved evaluation measures for
     connected speech recognition., 2004.
                                                                       Model           # Char.      Paisà Freq.     Train Freq.
[38] V. Lyding, E. Stemle, C. Borghetti, M. Brunello,
     S. Castagnoli, F. Dell’Orletta, H. Dittmann, A. Lenci,            GPT-4o           -0.01            0.01           0.02
     V. Pirrelli, The PAISÀ corpus of Italian web texts,               Claude-3.5       -0.02           -0.02           0.00
     in: F. Bildhauer, R. Schäfer (Eds.), Proceedings of               Phi-3 (ours)     -0.11           -0.05           0.44
     the 9th Web as Corpus Workshop (WaC-9), Associ-                   GPT-4o           -0.18           0.14            0.19
     ation for Computational Linguistics, Gothenburg,                  Claude-3.5       -0.15           0.08            0.13
     Sweden, 2014, pp. 36–43. URL: https://aclanthology.               Phi-3 (ours)     -0.02           0.08            0.22
     org/W14-0406. doi:10.3115/v1/W14-0406.
[39] L. Chen, J. Liu, S. Jiang, C. Wang, J. Liang,                  Table 6
     Y. Xiao, S. Zhang, R. Song, Crossword puzzle                   Spearman’s correlation with average word accuracies for
     resolution via monte carlo tree search, Proceed-               metrics computed on first pass (top) and solution (bottom)
                                                                    words. Bold scores are significant with Bonferroni-corrected
     ings of the International Conference on Auto-
                                                                    𝑝 < 1𝑒 − 5 [41]
     mated Planning and Scheduling 32 (2022) 35–43.
     URL: https://ojs.aaai.org/index.php/ICAPS/article/
     view/19783. doi:10.1609/icaps.v32i1.19783.                     the pool of available definitions for every word.
[40] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà,
     A primer on the inner workings of transformer-                 First pass/Solution word distribution Figure 2
     based language models, Arxiv (2024). URL: https:               shows the distribution of first pass and solution words
     //arxiv.org/abs/2405.00208.                                    for the filtered EurekaRebus subset used in our work.
[41] C. Bonferroni, Teoria statistica delle classi e calcolo
     delle probabilita, Pubblicazioni del R. Istituto Su-
     periore di Scienze Economiche e Commericiali di                B. Additional Experimental
     Firenze 8 (1936) 3–62.                                            Results

A. Additional Data Information

Dataset statistics   Table 5 presents statistics for the EurekaRebus dataset and the filtered subset we use for composing verbalized rebuses. The ItaCW dataset contains a total of 125,202 definitions for 40,963 unique words, with the most frequent words having hundreds of different definitions, e.g. 173 for re (king) and 155 for te (you). Definitions used for verbalization are randomly sampled from the pool of available definitions for every word.
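As an illustration of this sampling step, the following sketch shows how a first pass can be verbalized by replacing each dictionary word with one of its crossword definitions. It is not the released pipeline: the DEFINITIONS dictionary below is a hypothetical stand-in populated with the translated clues shown in Table 9.

    import random

    # Hypothetical pool of crossword definitions per first-pass word,
    # here filled with the translated clues shown in Table 9.
    DEFINITIONS = {
        "ali": ["Two attacking footballers"],
        "coni": ["Used for eating ice cream"],
        "cane": ["Barks and bites"],
    }

    def verbalize_first_pass(first_pass: str, rng: random.Random) -> str:
        """Replace every dictionary word with a randomly sampled definition in
        brackets; leftover letters (e.g. 'M', 'NIA') are copied verbatim."""
        out = []
        for token in first_pass.split():
            defs = DEFINITIONS.get(token.lower())
            out.append(f"[{rng.choice(defs)}]" if defs else token)
        return " ".join(out)

    # verbalize_first_pass("M ali N coni cane NIA", random.Random(0))
    # -> "M [Two attacking footballers] N [Used for eating ice cream] [Barks and bites] NIA"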
                       Full EurekaRebus    Filtered subset
  Avg./SD words/ex.        3.02/1.60          2.80/1.21
  Avg./SD word len.        8.07/2.30          7.79/2.23
  Avg./SD sol. len.       19.47/8.44         18.81/6.06

Table 5
Statistics for the full EurekaRebus dataset and the crosswords-filtered subset used in this work. Avg./SD = Average/standard deviation.

First pass/Solution word distribution   Figure 2 shows the distribution of first pass and solution words for the filtered EurekaRebus subset used in our work.


B. Additional Experimental Results

Table 6 presents the correlations between model accuracy and the properties presented in Section 5. Table 7 presents the full ID/OOD performances for all tested models, showing consistent results with Table 3 for all prompted models. Table 8 presents Phi-3 Mini performances across rebus-solving fine-tuning steps.

  Model            # Char.   Paisà Freq.   Train Freq.
  First pass
  GPT-4o            -0.01        0.01         0.02
  Claude-3.5        -0.02       -0.02         0.00
  Phi-3 (ours)      -0.11       -0.05         0.44
  Solution
  GPT-4o            -0.18        0.14         0.19
  Claude-3.5        -0.15        0.08         0.13
  Phi-3 (ours)      -0.02        0.08         0.22

Table 6
Spearman's correlation with average word accuracies for metrics computed on first pass (top) and solution (bottom) words. Bold scores are significant with Bonferroni-corrected p < 1e-5 [41].
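A minimal sketch of how such correlations can be computed is given below, assuming per-word accuracies and a word-level property (character count, Paisà frequency, or training frequency) are available as parallel lists. The Bonferroni handling shown here simply divides the significance threshold by the number of tests and may differ in detail from the paper's exact procedure.

    from scipy.stats import spearmanr

    def correlate(word_accuracies, word_property, n_tests: int, alpha: float = 1e-5):
        """Spearman correlation between average word accuracies and a word-level
        property, flagged as significant under a Bonferroni-corrected threshold."""
        rho, p_value = spearmanr(word_accuracies, word_property)
        return rho, p_value, p_value < alpha / n_tests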






[Figure 2: two bar charts of word frequency over words. Highlighted first-pass words include re (6091), ali (3068), in (2793), est (2365), ante (1916), tori (1748), and accetta (139); highlighted solution words include di (8449), d' (2910), Una (2111), a (1821), amore (684), pesante (172), and importante (81).]

Figure 2: Word frequencies for words in first passes (top) and solutions (bottom) for the selected subset of EurekaRebus used for training and evaluation. Words are colored according to their length, and the most frequent examples per frequency bin are highlighted.


                          LLaMA-3                           Qwen-2                        GPT-4o                       Claude-3.5S                 Phi-3 (ours)
      Metric
                   Test      Test         Test      Test        Test      Test     Test       Test     Test     Test      Test      Test    Test      Test       Test
                    id       ood           Δ         id         ood        Δ        id        ood       Δ        id       ood        Δ       id       ood         Δ
      FP W. ID     0.20      0.19         -0.01     0.26        0.25      -0.01    0.52       0.51     -0.01    0.65      0.63     -0.02    0.96       0.96       0.00
      FP W. OOD      -       0.18            -        -         0.24         -       -        0.44       -        -       0.54       -        -        0.20         -
      FP EM        0.03      0.04          0.01     0.03        0.05       0.02    0.16       0.14     -0.02    0.30      0.25     -0.05    0.89       0.18      -0.71
      S W. ID      0.03      0.04          0.01     0.04        0.05       0.01    0.29       0.26     -0.03    0.48      0.40     -0.08    0.92       0.49      -0.43
      S W. OOD     0.01      0.00         -0.01     0.02        0.00      -0.02    0.18       0.16     -0.02    0.41      0.30     -0.11    0.63       0.20      -0.40
      S EM         0.00      0.00          0.00     0.00        0.00       0.00    0.12       0.09     -0.03    0.27      0.22     -0.05    0.82       0.16      -0.66

Table 7
Full model performances for test subsets containing only in-domain (Test ID) or some out-of-domain (Test OOD) first pass words. W. ID and W. OOD are accuracies for ID and OOD words for first pass (FP) and solution (S) sequences. Test Δ = Test OOD - Test ID performance.
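The word-level accuracies and exact-match scores in Table 7 can be read through a sketch like the one below; this is an illustrative reimplementation under simplifying assumptions, not the authors' evaluation code, and ood_positions is a hypothetical variable marking which gold words are out-of-domain.

    def word_accuracy(pred_words, gold_words, positions=None):
        """Fraction of gold words reproduced at the correct position, optionally
        restricted to a subset of positions (e.g. only ID or only OOD words)."""
        if positions is None:
            positions = list(range(len(gold_words)))
        hits = sum(
            i < len(pred_words) and pred_words[i].lower() == gold_words[i].lower()
            for i in positions
        )
        return hits / max(len(positions), 1)

    def exact_match(pred_words, gold_words):
        """1.0 if the whole predicted word sequence equals the gold sequence."""
        return float([w.lower() for w in pred_words] == [w.lower() for w in gold_words])

    # W. OOD on a single example: word_accuracy(pred, gold, positions=ood_positions)
    # Test Δ in Table 7 is then the OOD-subset score minus the ID-subset score.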


C. Additional Model Generations

Table 9 presents an English translation of the Figure 1 example using the prompt format adopted in this study. Tables 10 and 11 provide additional examples of LLM generations for tested rebuses, with the example from Table 11 (bottom) being OOD due to the manovella (crank) word in D2, and the others being ID for the fine-tuned Phi-3 Mini.



                          First Pass (FP)               Solution (S)
  # Train Steps   Def.    Words   Letters   EM     Key Match   FP Match   Words   EM
  500             0.64    0.63    0.97      0.25   0.66        0.86       0.36    0.16
  1000            0.74    0.74    1.00      0.38   0.72        0.89       0.48    0.28
  1500            0.78    0.77    0.99      0.42   0.78        0.91       0.55    0.34
  2000            0.80    0.79    1.00      0.47   0.81        0.93       0.59    0.40
  2500            0.81    0.81    1.00      0.49   0.81        0.92       0.62    0.42
  3000            0.82    0.82    1.00      0.51   0.82        0.92       0.63    0.44
  3500            0.82    0.82    1.00      0.53   0.83        0.93       0.65    0.47
  4000            0.83    0.83    1.00      0.53   0.85        0.93       0.66    0.47
  4500            0.83    0.83    1.00      0.55   0.85        0.93       0.67    0.49
  5072 (final)    0.84    0.84    1.00      0.56   0.86        0.94       0.68    0.51

Table 8
Fine-grained verbalized rebus solving performances of the Phi-3 Mini model on the test set including both ID and OOD examples at various fine-tuning steps. Bold denotes best overall performances.
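The Key Match and FP Match columns can be approximated with checks like the following minimal sketch; it is an illustration under simplifying assumptions, not the paper's scorer, and apostrophized key entries such as 1' are not handled.

    def key_match(solution_words, key: str) -> bool:
        """True if the candidate solution words have exactly the lengths
        prescribed by the solution key, e.g. ["Mortali", "allucinogeni"] vs "7 12"."""
        return [len(w) for w in solution_words] == [int(k) for k in key.split()]

    def first_pass_match(solution_words, first_pass: str) -> bool:
        """True if the solution re-uses exactly the letters of the first pass,
        in order, ignoring spacing and case."""
        squash = lambda s: "".join(s.split()).lower()
        return squash("".join(solution_words)) == squash(first_pass)

    # key_match(["Mortali", "allucinogeni"], "7 12")                      -> True
    # first_pass_match(["Mortali", "allucinogeni"], "MOR Talia L luci NO geni") -> True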






 Prompt
 Solve the clues provided between parentheses to obtain a first pass, and use the solution key to obtain the rebus' solution.

 Rebus: M [Two attacking footballers] N [Used for eating ice cream] [Barks and bites] NIA
 Solution key: 11 5

 Generation
 Let's proceed to solve the rebus step by step:

 Definition resolution:
   - M = M
   - [Two attacking footballers] = ali (wings)
   - N = N
   - [Used for eating ice cream] = coni (cones)
   - [Barks and bites] = cane (dog)
   - NIA = NIA

 First pass: M ali N coni cane NIA

 Let's now compose the solution by following the solution key:

 Solution segmentation:
   - 11 = Malinconica (melancholic)
   - 5 = nenia (lullaby)

 Solution: Malinconica nenia (melancholic lullaby)

Table 9
English translation of the template used in fine-tuning and prompting experiments with highlighted reasoning stages, with translated first pass and solution words in parentheses. Example rebus by A. Rizzi, Settimana Enigmistica n. 4656.
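For reference, a rough sketch of how a prompt following this template could be assembled is shown below. The wording mirrors the English translation above, whereas the experiments themselves use the Italian wording; this helper is illustrative only and not part of the released code.

    def build_prompt(verbalized_rebus: str, solution_key: str) -> str:
        """Assemble a verbalized rebus prompt following the Table 9 template."""
        return (
            "Solve the clues provided between parentheses to obtain a first pass, "
            "and use the solution key to obtain the rebus' solution.\n\n"
            f"Rebus: {verbalized_rebus}\n"
            f"Solution key: {solution_key}"
        )

    prompt = build_prompt(
        "M [Two attacking footballers] N [Used for eating ice cream] "
        "[Barks and bites] NIA",
        "11 5",
    )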
 Rebus: MOR [Musa della commedia] D1 L [Si accendono per vedere] D2 NO [Uomini di intelligenza superiore] D3
 Chiave di lettura: 7 12

    Step    GPT-4o          Claude 3.5S     Phi-3
    D1      Talia           Talia           Talia
    D2      luci            luci            luci
    D3      geni            geni            genii
    S7      Mortali         Mortali         Mortali
    S12     allucinogeni    allucinogeni    allucinogeni

    Soluzione: MORTali aLluciNOgeni

Table 10
Examples of LLM generations for a rebus by De Vico C., Domenica Quiz n. 5. Correct guesses and errors are denoted for predicted first pass definitions (D1, ..., DN) and solution words (Si, with i being the i-th solution key value).


 Rebus: PRI [Ricoperto di sudore] D1 MIN [Gli altari del tempio] D2 DO [Un ordigno bellico] D3 [Possono essere “di serie” in certi tornei] D4 SSO
 Chiave di lettura: 5 2 8 6 2 6

    Step    GPT-4o     Claude 3.5S    Phi-3
    D1      sudato     madido         madido
    D2      are        are            are
    D3      bomba      mina           mina
    D4      teste      teste          teste
    S5      Prima      Prima          Prima
    S2      di         di             di
    S8      sudare     minaccia       dominare
    S6      molto      teste          dominate
    S2      di         di             se
    S6      testa      dosso          stesso

    Soluzione: PRIma di doMINare DOmina te steSSO

 Rebus: AT [Si alzano nel camping] D1 [Emoziona pescatori e navigatori] D2 [Come una nota Foresta] D3 MEN [Quadro ad olio] D4 S [Atteggiamento da modella] D5
 Chiave di lettura: 9 11 2 5

    Step    GPT-4o         Claude 3.5S     Phi-3
    D1      tende          tende           tende
    D2      marea          mare            rete
    D3      nera           nera            nera
    D4      dipinto        tela            tela
    D5      posa           posa            posa
    S9      Attenderemo    Attendere       Attendere
    S11     mareanera      marenamente     teneramente
    S2      di             la              la
    S5      posa           posa            sposa

    Soluzione: ATtendere teneraMENte la Sposa

 Rebus: B [Una figura geometrica] D1 [La si impugna per far girare un congegno] D2 DA [Le produce il rovo] D3
 Chiave di lettura: 10 7 1' 5

    Step    GPT-4o               Claude 3.5S    Phi-3
    D1      cerchio              rombo          ellissi
    D2      manovella            manovella      leva
    D3      more                 more           more
    S10     Bcerchiomanovella    Bromomanov     Bellissile
    S7                           elladam        vadamore
    S1'     d'                   o'             '
    S5      amore                more           remo

    Soluzione: Bellissima novella D' Amore

Table 11
Examples of LLM generations for rebuses by Baruffa, Rebus n. 12 (top), Contini C., La Settimana Enigmistica n. 4102 (mid), and Liosca, La Settimana Enigmistica n. 4581 (bottom). Correct guesses and errors are denoted for predicted first pass definitions (D1, ..., DN) and solution words (Si, with i being the i-th solution key value).
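A recurrent failure mode visible above is that models ignore the letter budget imposed by the solution key. The sketch below shows how the gold first-pass letters can be re-segmented according to the key, which is roughly the operation the S rows probe; it is illustrative only, and apostrophized entries such as 1' are reduced to their numeric part.

    def segment_by_key(first_pass: str, key: str):
        """Split the letters of a first pass into chunks whose lengths follow the
        solution key, giving the letter spans of the expected solution words."""
        letters = "".join(first_pass.split())
        chunks, start = [], 0
        for entry in key.split():
            n = int(entry.rstrip("'"))          # "1'" -> 1 (apostrophe dropped)
            chunks.append(letters[start:start + n])
            start += n
        return chunks

    # segment_by_key("MOR Talia L luci NO geni", "7 12")
    # -> ["MORTali", "aLluciNOgeni"], to be re-read as "Mortali allucinogeni"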



