=Paper= {{Paper |id=Vol-3878/132_calamita_long |storemode=property |title=EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge |pdfUrl=https://ceur-ws.org/Vol-3878/132_calamita_long.pdf |volume=Vol-3878 |authors=Gabriele Sarti,Tommaso Caselli,Arianna Bisazza,Malvina Nissim |dblpUrl=https://dblp.org/rec/conf/clic-it/SartiCBN24 }} ==EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge== https://ceur-ws.org/Vol-3878/132_calamita_long.pdf
                                EurekaRebus - Verbalized Rebus Solving with LLMs:
                                A CALAMITA Challenge
                                Gabriele Sarti1,* , Tommaso Caselli1 , Arianna Bisazza1 and Malvina Nissim1
1
 Center for Language and Cognition (CLCG), University of Groningen, Oude Kijk in 't Jatstraat 26, 9712EK Groningen, The Netherlands


                                                Abstract
Language games can be valuable resources for testing the ability of large language models (LLMs) to conduct challenging
multi-step, knowledge-intensive inferences while respecting predefined constraints. Our proposed challenge prompts LLMs
to reason step-by-step to solve verbalized variants of rebus games recently introduced with the EurekaRebus dataset [1].
Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem
entirely text-based. We introduce a simplified task variant with word length hints and adopt a comprehensive set of metrics to
obtain a granular overview of models' performance in knowledge recall, constraint adherence, and re-segmentation abilities
across reasoning steps.

                                                Keywords
                                                Large language models, Sequential reasoning, Puzzle, Rebus, Crosswords, Enigmistica Italiana, CALAMITA



1. Challenge: Introduction and Motivation

Language games were adopted as testbeds for measuring NLP progress in recent years [2, 3, 4], with a particular focus on (cryptic) crossword solving in English [5, 6, 7, 8, 9]. For the Italian language, initial efforts focused on crossword solving and generation [10, 11] and clue-based word guessing [12, 13, 9]. Recently, Sarti et al. [1] introduced an extensive collection of text-adapted Italian rebus puzzles to evaluate large language models' (LLMs) knowledge and sequential reasoning abilities. Rebuses are complex puzzles combining visual elements and graphic signs to encode a hidden phrase. Italian boasts a rich and long-standing rebus tradition dating back to the 19th century [14], popularized by high-diffusion magazines such as La Settimana Enigmistica¹. The structure of Italian rebuses has, with time, been formalized into beauty canons [15], and their peculiarities and design principles were analyzed by several authors [16, 17, 18].

Figure 1: Example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. Rebus by Lionello, art by Laura Neri. The original puzzle combines the images of a timone (rudder), reti (nets), and tè (tea) with printed letters:
    First Pass: TeS timone - reti CE - N te
    Verbalized Rebus: TES [Dirige la rotta] (Directs the course) [Le difendono i portieri] (Protected by goalkeepers) CE N [Calda bevanda rilassante] (Warm relaxing drink)
    Solution key (# of chars/word): 9 9
    Solution: Testimone reticente (reticent witness)

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
g.sarti@rug.nl (G. Sarti); t.caselli@rug.nl (T. Caselli); a.bisazza@rug.nl (A. Bisazza); m.nissim@rug.nl (M. Nissim)
https://gsarti.com (G. Sarti); https://cs.rug.nl/~bisazza (A. Bisazza); https://malvinanissim.github.io (M. Nissim)
ORCID: 0000-0001-8715-2987 (G. Sarti); 0000-0003-2936-0256 (T. Caselli); 0000-0003-1270-3048 (A. Bisazza); 0000-0001-5289-0971 (M. Nissim)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
¹ https://www.lasettimanaenigmistica.com/

In Italian rebuses, solving begins by combining graphemes with their underlying visual elements in a left-to-right fashion, composing a first pass (prima lettura) representing an intermediate solution of the puzzle. Then, first pass elements are re-segmented (cesura) according to a solution key (diagramma), which specifies the length of each word in the solution (frase risolutiva). The verbalized rebuses introduced by Sarti et al. [1] are text-only versions of real rebuses published in popular outlets, derived by replacing words corresponding to visual elements with externally-sourced crossword definitions in the transcribed first passes, using a standardized format. Figure 1 provides a
simple example.
   This work proposes to adopt the EurekaRebus dataset introduced by Sarti et al. [1] to extend their evaluation of LLMs' multi-step reasoning and linguistic/cultural awareness to the systems evaluated as part of the CALAMITA evaluation campaign [19]. We believe the task is particularly relevant since the crossword definitions that compose verbalized rebuses rely heavily on idiomatic expressions, wordplay, and cultural references specific to Italian. Hence, the results of this task could provide valuable insights into the linguistic and cultural competence of LLMs trained on the Italian language. Moreover, the task is especially appealing since it is framed in a templated reasoning format, enabling us to disentangle the various components required to successfully solve a verbalized rebus step-by-step. More specifically, several metrics will be employed to assess LLMs' factual recall, textual concatenation and re-segmentation capabilities and, finally, constraint satisfaction given the provided cues.
   In light of the results reported by [1] for state-of-the-art proprietary LLMs, we expect all tested open-source systems to perform very poorly, with final solution accuracies well below 30%. We also note that the highest reported overall performance in previous work² was found by the original authors to be primarily the product of memorization. We anticipate that this challenge will highlight significant limitations in LLMs' current factual recall and multi-step reasoning ability and act as a catalyst for future improvements in these areas.

2. Challenge: Description

The proposed challenge aims to evaluate the capabilities of existing LLMs in solving verbalized Italian rebuses via prompting at various granularity levels. More specifically, LLMs will be evaluated in a few-shot prompting setting with two fixed in-context learning examples pre-selected at random from the available pool of verbalized rebuses in EurekaRebus, in two settings:

    • Regular, matching the example in Table 1 and the original input format used by Sarti et al. [1].
    • Hints, in which the number of characters for every hidden word is provided alongside definitions in the verbalized rebus to help the model identify the correct choice. This variant was not tested by Sarti et al. [1].

Refer to Section 3.3 for the respective example formats. Models will be evaluated on their performance at each step required to successfully solve the verbalized rebus and their overall ability to produce correct final solutions.

3. Data description

3.1. Origin of data

The dataset used for this challenge is an extended version of EurekaRebus [1], a collection of 222,089 unique Italian rebuses extracted from the Eureka5 platform³, an open database of rebuses and other linguistic puzzles maintained by the Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini"⁴. Among these, 83,157 were converted by the original authors into verbalized form by leveraging the crossword definitions from the ItaCW collection [10], which includes 125,202 definition-solution pairs. While Sarti et al. [1] evaluated the performances of prompted and tuned LLMs on rebuses released up to June 17th, 2024, the current test set includes 168 new unseen examples released on Eureka5 after that date.

3.2. Annotation details

We employ the same procedure as Sarti et al. [1] for verbalizing available rebuses. More specifically, only rebuses having all lowercased or camel-cased words among ItaCW solutions are selected, and every word is replaced by sampling one of the available crossword definitions for it at random.⁵ Moreover, only regular rebuses containing at least two hidden words are selected, avoiding examples requiring a single definition-solving step and those with more complex templates (e.g., anarebuses using anagrams of hidden words for the solution).

3.3. Data format

Each example in the dataset consists of:

    • The verbalized rebus (verbalized_rebus) containing letters from the original rebus and crossword-style definitions enclosed in square brackets.
    • A variant of the verbalized rebus containing length hints for definitions (verbalized_rebus_with_length_hints).
    • The solution key, composed of whitespace-separated numbers representing the word lengths in the final solution (solution_key).
    • The first pass words matching definitions in the verbalized rebus, provided in a semicolon-separated string in order of occurrence (word_guesses).
    • The first pass obtained by infilling words in place of their definitions in the verbalized rebus (first_pass).

² Namely, 58% Solution Exact Match for a LLaMA-3.1 8B model LoRA-tuned on 80k EurekaRebus examples [20, 21].
³ http://www.eureka5.it
⁴ http://www.enignet.it/home
⁵ Words in ItaCW can be associated with multiple definitions.
{
  "verbalized_rebus": "[Edificio religioso] G [Lo fa doppio l'opportunista] NP [Poco cortese, severo] NZ [Parente... molto lontana]",
  "verbalized_rebus_with_length_hints": "[Edificio religioso (6)] G [Lo fa doppio l'opportunista (5)] NP [Poco cortese, severo (4)] NZ [Parente... molto lontana (3)]",
  "solution_key": "3 1 6 3 8 2",
  "word_guesses": "chiesa;gioco;rude;ava",
  "first_pass": "chiesa G gioco NP rude NZ ava",
  "solution_words": "Chi;è;saggio;con;prudenza;va",
  "solution": "Chi è saggio con prudenza va"
}

Listing 1: Example entry for the challenge test set.
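The consistency relations among the fields of an entry can be checked mechanically. The sketch below is our own illustration, not the official data loader, using the entry from Listing 1 (length-hint variant omitted): infilling the definitions reproduces the first pass, the solution word lengths match the solution key, and joining the solution words yields the final solution.

```python
import json
import re

# Entry from Listing 1 (length-hint variant omitted for brevity)
entry = json.loads("""
{
 "verbalized_rebus": "[Edificio religioso] G [Lo fa doppio l'opportunista] NP [Poco cortese, severo] NZ [Parente... molto lontana]",
 "solution_key": "3 1 6 3 8 2",
 "word_guesses": "chiesa;gioco;rude;ava",
 "first_pass": "chiesa G gioco NP rude NZ ava",
 "solution_words": "Chi;è;saggio;con;prudenza;va",
 "solution": "Chi è saggio con prudenza va"
}
""")

def check_entry(ex):
    # 1) Infilling each [definition] with its word guess reproduces the first pass
    guesses = iter(ex["word_guesses"].split(";"))
    first_pass = re.sub(r"\[[^\]]*\]", lambda m: next(guesses), ex["verbalized_rebus"])
    assert first_pass == ex["first_pass"]
    # 2) Solution word lengths match the solution key
    words = ex["solution_words"].split(";")
    assert [len(w) for w in words] == [int(n) for n in ex["solution_key"].split()]
    # 3) Joining the solution words yields the final solution
    assert " ".join(words) == ex["solution"]
    # 4) The solution re-segments the letters of the first pass (up to casing
    #    and diacritics: "e" in the first pass becomes "è" in the solution)
    assert len("".join(ex["first_pass"].split())) == sum(map(len, words))
    return first_pass

check_entry(entry)
```

Note that the final re-segmentation step is not purely mechanical: capitalization and diacritics (e.g., "è") are introduced in the solution, so only the letter counts are compared here.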


    • The whitespace-separated solution words obtained after re-segmenting the first pass according to the solution key, provided in a semicolon-separated string in order of occurrence (solution_words).
    • The solution of the verbalized rebus used as the final prediction target for the LLM (solution).

An example is provided in Listing 1.

3.4. Prompting

Table 1 shows the 2-shot prompting template adopted for generating a templated solution with the tested LLMs. The second in-context example used in the template, omitted for brevity, corresponds to the one shown in Listing 1.
   The task description provided to the model was derived from a trial-and-error process starting from the original prompt by Sarti et al. [1]. Notably, compared to the original authors, the task description provides more detailed descriptions of individual components of the rebus to give the LLM a clearer overview of the task. We opted for a 2-shot setting, as opposed to the 5-shot prompting employed by Sarti et al. [1], to accommodate the limited context length of some of the tested LLMs, thus ensuring that the total length after model generation does not exceed 1024 tokens⁶. The two examples provided remain the same shown here to simplify evaluation and ensure consistent results.

Verbalized rebus solving steps  Table 1 provides labels for the steps necessary to solve the verbalized rebus that are considered in this challenge task. The model receives a problem input including a verbalized rebus (possibly with length hints) and a solution key (chiave di lettura). The first step involves resolving crossword definitions in order (Definition resolution), exploiting only the model's parametric knowledge to accomplish the task. Then, the resolved words need to be infilled into the original rebus to compose the first pass, and re-segmented in the Solution segmentation step. Finally, the individual solution words are reassembled into a single solution string.

3.5. Detailed data statistics

Table 2 from Sarti et al. [1] reports statistics for the full and verbalized subsets of the EurekaRebus dataset.

Train set contents  The training set contains 80,158 examples, which are ignored for the purpose of the CALAMITA campaign, since no adaptation methods are evaluated.

Test set contents  The test set contains 3,167 examples, divided as follows in order of appearance:

    • 2000 examples matching the in-domain setting for models trained by [1], i.e. containing only first pass words seen by all available trained models.
    • 999 examples matching the out-of-distribution setting for models trained by [1], i.e. containing at least one first pass word unseen during training by available trained models.
    • 168 new verbalized rebuses added in EurekaRebus v1.1, published on the Eureka5 platform after June 17th, 2024. These can be either in-domain or out-of-distribution for models trained on the EurekaRebus training set.

While prompted models should obtain similar performances across all test subsets, the aforementioned division will enable further comparisons with previously trained systems.

4. Metrics

The challenge employs a comprehensive set of metrics adapted from the original evaluation of [1], detailed below.

⁶ The LLaMA 3 tokenizer was used to perform this estimate.
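Most of these metrics reduce to proportions of exact matches at different granularities over the dataset fields. The sketch below is our own illustration of this style of scoring; the official evaluation code and its averaging choices may differ.

```python
def proportion(hits):
    """Fraction of True values in an iterable of booleans."""
    hits = list(hits)
    return sum(hits) / len(hits)

def word_guess_accuracy(pred_guesses, gold_guesses):
    # pred/gold: lists of ";"-separated word_guesses strings, one per example
    return proportion(pw == gw
                      for p, g in zip(pred_guesses, gold_guesses)
                      for pw, gw in zip(p.split(";"), g.split(";")))

def solution_word_lengths_accuracy(pred_solutions, solution_keys):
    # Fraction of generated solution words whose length matches the key
    return proportion(len(w) == n
                      for sol, key in zip(pred_solutions, solution_keys)
                      for w, n in zip(sol.split(), map(int, key.split())))

def solution_match(pred_solutions, gold_solutions):
    # Exact match between generated and gold final solutions
    return proportion(p.strip() == g.strip()
                      for p, g in zip(pred_solutions, gold_solutions))

print(solution_word_lengths_accuracy(["Chi è saggio con prudenza va"], ["3 1 6 3 8 2"]))  # 1.0
```

First Pass Accuracy follows the same exact-match pattern as `solution_match`, applied to the generated first passes instead of the final solutions.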
    • Word Guess Accuracy: Proportion of correctly guessed words during definition resolution (corresponding to the Definition metric in the original evaluation).
    • Word Guess Length Accuracy: Proportion of word guesses in definition resolution matching the correct length. This is evaluated only for the Hints setting, where the length is explicitly provided (not evaluated in previous works).
    • First Pass Accuracy: Proportion of generated first passes matching the gold reference (corresponding to the First Pass Exact Match metric in the original evaluation).
    • Solution Word Accuracy: Proportion of correct words in the generated solutions.
    • Solution Words Lengths Accuracy: Proportion of generated solution words matching the lengths specified by the solution key. Lower scores may indicate difficulty in respecting the given length constraints (corresponding to the Solution Key Match metric in the original evaluation).
    • Solution Match: Proportion of generated solutions matching the gold reference (corresponding to the Solution Exact Match metric in the original evaluation).

 Prompt template:
   Sei un'esperto risolutore di giochi enigmistici. Il seguente gioco contiene una frase (Rebus) nella quale alcune parole sono state sostituite da indizi tra parentesi quadre. I numeri in ogni indizio rappresentano la lunghezza della parola nascosta. Il tuo compito è quello di identificare le parole nascoste e sostituirle agli indizi nel Rebus, producendo una prima lettura dalla quale poi si deriverà una frase risolutiva. La chiave di lettura è una sequenza di numeri che rappresentano le rispettive lunghezze delle parole che compongono la frase risolutiva. La tua risposta deve essere una frase risolutiva sensata e che rispetti le lunghezze definite nella chiave di lettura.

 First example:
   # Esempio 1:

 Problem input:
   Rebus: AC [Un mollusco nell'insalata di mare (5)] GLI [Lo è l'operaio che lavora in cantiere (5)] S TO [Soldati da trincea (5)]
   Chiave di lettura: 11 2 10

 Definition resolution:
   Procediamo alla risoluzione del rebus passo per passo:
   - AC = AC
   - [Un mollusco nell'insalata di mare] = cozza
   - GLI = GLI
   - [Lo è l'operaio che lavora in cantiere] = edile
   - S TO = S TO
   - [Soldati da trincea] = fanti

 First pass:
   Prima lettura: AC cozza GLI edile S TO fanti

 Solution segmentation:
   Ora componiamo la soluzione seguendo la chiave risolutiva:
   11 = Accozzaglie
   2 = di
   10 = lestofanti

 Solution:
   Soluzione: Accozzaglie di lestofanti

 Second example:
   # Esempio 2:
   ... (same format as the first example)

 Answer prefix:
   # Ora tocca a te!
   Completa il rebus seguendo il procedimento descritto, rispondendo esattamente nello stesso formato utilizzato dagli esempi precedenti.
   Rebus: {{verbalized_rebus}} or {{verbalized_rebus_with_length_hints}}
   Chiave di lettura: {{solution_key}}

Table 1
2-shot prompt used for the CALAMITA evaluation. Blue text in the original PDF represents additions for the evaluation in the Hints setting. Template elements are highlighted next to the first in-context example. Example rebus by Parodi E., Domenica Quiz n. 7

 Statistic            EurekaRebus     ItaCW-filtered
 # examples               222089           83157
 # authors                  8138            5046
 Year range            1800 - 2024     1869 - 2024
                       First pass
 # unique words            38977            8960
 Avg./SD words/ex.      3.50/1.48       3.08/1.00
 Avg./SD word len.      6.51/1.96       5.70/1.60
 Avg./SD FP len.       26.45/11.19     25.74/8.73
                       Solution
 # unique words            75718           42558
 Avg./SD words/ex.      3.02/1.60       2.80/1.21
 Avg./SD word len.      8.07/2.30       7.79/2.23
 Avg./SD Sol. len.     19.47/8.44      18.81/6.06

Table 2
Statistics for the full EurekaRebus dataset and the crosswords-filtered subset used in this work. Avg./SD = average/standard deviation. Table adapted from Sarti et al. [1].

 Model          Word Acc.   FP Acc.   Solution Word Acc.   Solution Word Len.   Solution Acc.
 LLaMA-3 70B      0.22        0.04           0.03                 0.16               0.00
 Qwen-2 72B       0.28        0.04           0.04                 0.20               0.00

Table 3
Baseline results for LLaMA-3 70B and Qwen-2 72B on the original test set, adapted from Sarti et al. [1].

The Solution Match metric will be used as the primary metric of correctness, since it captures the model's ability to fully solve the verbalized rebus. While no baseline evaluation was conducted for the new test set used in this challenge, we expect the performances of the most capable open-source systems to align with those of the 5-shot prompted LLaMA-3 70B and Qwen-2 72B models reported by Sarti et al. [1], which we summarize in Table 3. The results show that current models struggle



to complete the task primarily due to incorrect word            findings should not be overgeneralized to Italian lan-
guesses, with errors propagating across resolution steps        guage competence as a whole or to other cultures. This
and ultimately resulting in a final accuracy of 0%.             dataset’s rebuses and crossword definitions are derived
                                                                from commercially available published sources. While
                                                                efforts have been made to ensure this data’s exclusive,
5. Limitations                                                  fair usage for research purposes, there may be copyright
                                                                considerations to address.
Several limitations should be considered when interpret-
ing the results of this challenge:
                                                                7. Data license and copyright
Verbalization Simplification The use of verbalized                 issues
rebuses, while necessary for text-based LLMs, simplifies
the original visual puzzle. This does not fully capture the  As reported by the original EurekaRebus dataset license,
complexity of solving traditional rebuses, which rely on     the data is redistributed for research purposes only with
visual cues and cultural knowledge, making verbalized        the explicit approval of the Associazione Culturale “Bib-
rebus solving a much simpler proxy to the multi-step         lioteca Enigmistica Italiana - G. Panini” (here onwards
reasoning required for regular rebuses.                      referred to as the Association), and the rights to each entry
                                                             in the EurekaRebus collection are the property of the re-
Cultural Specificity The selected rebuses and crossword definitions rely heavily on Italian-specific linguistic and cultural background. Performance on this task may not generalize to other languages or puzzle types, and it might be unrealistic to expect general-purpose LLMs to possess the specific lexicon and knowledge required for rebus solving.

Prompt Sensitivity While the selected prompt template was observed to perform well for capable proprietary LLMs in preliminary tests, there is no guarantee that the instructions provided in the prompt are sufficient for smaller open-source models to perform verbalized rebus solving proficiently. Moreover, alternative prompt formulations could lead to better results.

Lack of Human Baseline The challenge currently lacks a clear human performance baseline, which would be valuable for contextualizing model performance on verbalized rebus solving.

6. Ethical issues

While this challenge focuses on a relatively benign puzzle-solving task, some ethical considerations should be kept in mind. First, the dataset captures a very narrow subset of Italian language and culture. Hence, evaluation […]

[…]spective copyright holders. The usage and redistribution of these data are allowed only for users providing appropriate attribution to the original copyright holders and the Association, and the creation of derivative works is permitted only for research purposes, under terms no less restrictive than the EurekaRebus license. Researchers are encouraged to contact the challenge organizers with any questions or concerns about data usage and licensing.

Acknowledgments

We would like to express our gratitude to the following individuals and organizations:

    • The Associazione Culturale "Biblioteca Enigmistica Italiana - G. Panini" for making their rebus collection freely accessible on Eureka.
    • The creators of the ItaCW dataset for enabling the creation of verbalized rebuses.
    • The puzzle creators whose work is represented in this dataset.

Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We hope this challenge will contribute to the diffusion of the art of Italian enigmistica among computational linguistics and artificial intelligence researchers.
References

[1] G. Sarti, T. Caselli, M. Nissim, A. Bisazza, Non verbis, sed rebus: Large language models are weak solvers of Italian rebuses, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR.org, Pisa, Italy, 2024. URL: https://arxiv.org/abs/2408.00584.
[2] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, ArXiv (2024). URL: https://arxiv.org/abs/2402.11291.
[3] B. J. Anderson, J. G. Meyer, Finding the optimal human strategy for Wordle using maximum correct letter probabilities and reinforcement learning, ArXiv (2022). URL: https://arxiv.org/abs/2202.00557.
[4] G. Todd, T. Merino, S. Earle, J. Togelius, Missed connections: Lateral thinking puzzles for large language models, ArXiv (2024). URL: https://arxiv.org/abs/2404.11730.
[5] M. Ernandes, G. Angelini, M. Gori, WebCrow: A web-based system for crossword solving, in: AAAI Conference on Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11590323_37.
[6] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409–11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[7] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073–3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[8] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-Instruct: Text-based clue generation for educational crossword puzzles, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347–3356. URL: https://aclanthology.org/2024.lrec-main.297.
[9] R. Manna, M. P. di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz, S. Lukin (Eds.), Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97–106. URL: https://aclanthology.org/2024.games-1.11.
[10] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023. URL: https://ceur-ws.org/Vol-3596.
[11] G. Angelini, M. Ernandes, M. Gori, Solving Italian crosswords using the web, in: International Conference of the Italian Association for Artificial Intelligence, 2005. URL: https://link.springer.com/chapter/10.1007/11558590_40.
[12] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-AI@EVALITA2020: Evaluating artificial players for the language game "La Ghigliottina" (short paper), EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020). URL: https://doi.org/10.4000/books.aaccademia.7488.
[13] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13–26. doi:10.1109/TCIAIG.2014.2355859.
[14] D. Tolosani, Enimmistica, Hoepli, Milan, 1901.
[15] G. Brighenti, I canoni di bellezza nel rebus, Labirinto - Mensile di cultura enigmistica (1974). URL: http://win.cantodellasfinge.net/portale/leonardo/articoli/langense/pag2.asp.
[16] E. Miola, Che cos'è un rebus, Carocci, 2020.
[17] S. Bartezzaghi, Parole in gioco: Per una semiotica del gioco linguistico, Bompiani, 2017.
[18] P. Ichino, L'ora desiata vola: guida al mondo del rebus per solutori (ancora) poco abili, Bompiani, Milan, 2021.
[19] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[20] Meta AI, Introducing Meta Llama 3: The most capable openly available LLM to date, Website, 2024. URL: https://ai.meta.com/blog/meta-llama-3.
[21] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations (ICLR 2022), OpenReview, Online, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.