Non Verbis, Sed Rebus: Large Language Models are Weak Solvers of Italian Rebuses

Gabriele Sarti g.sarti@rug.nl Center for Language and Cognition (CLCG) University of Groningen

Oude Kijk in 't Jatstraat 26 9712EK Groningen The Netherlands

Tommaso Caselli t.caselli@rug.nl Center for Language and Cognition (CLCG) University of Groningen

Oude Kijk in 't Jatstraat 26 9712EK Groningen The Netherlands

Malvina Nissim m.nissim@rug.nl Center for Language and Cognition (CLCG) University of Groningen

Oude Kijk in 't Jatstraat 26 9712EK Groningen The Netherlands

Arianna Bisazza a.bisazza@rug.nl Center for Language and Cognition (CLCG) University of Groningen

Oude Kijk in 't Jatstraat 26 9712EK Groningen The Netherlands

Tenth Italian Conference on Computational Linguistics

Dec 04-06, 2024, Pisa, Italy

ISSN 1613-0073

Keywords: Large language models, Sequential reasoning, Puzzle, Rebus, Crosswords, Enigmistica Italiana

Rebuses are puzzles requiring constrained multi-step reasoning to identify a hidden phrase from a set of images and letters. In this work, we introduce a large collection of verbalized rebuses for the Italian language and use it to assess the rebus-solving capabilities of state-of-the-art large language models. While general-purpose systems such as LLaMA-3 and GPT-4o perform poorly on this task, ad-hoc fine-tuning seems to improve models' performance. However, we find that performance gains from training are largely motivated by memorization. Our results suggest that rebus solving remains a challenging test bed to evaluate large language models' linguistic proficiency and sequential instruction-following skills.

Introduction

Complex games such as chess and Go have long been a source of inspiration to develop more flexible and robust AI systems [1,2]. Recent developments in NLP suggested that creative language games could be exploited as promising benchmarks for quantifying the ability of large language models (LLMs) to carry out multi-step knowledge-intensive reasoning tasks under pre-specified constraints [3]. While crossword puzzles have been historically the main focus of such efforts [4], other categories of linguistic games received only marginal attention, especially for languages other than English. A prominent example of less-studied language games is the rebus, a visual puzzle combining images and graphic signs to encode a hidden phrase. Indeed, rebus solving is a complex, multi-step process requiring factual knowledge, contextual understanding, vocabulary usage, and reasoning within pre-defined constraints, a set of fundamental skills to address a variety of real-world tasks.

In this work, we conduct the first open evaluation of LLMs' rebus-solving capabilities, focusing specifically on the Italian language. We propose a novel strategy to derive text-only verbalized rebuses from transcribed intermediate rebus solutions and use it to produce a large collection with more than 80k verbalized rebuses. We then evaluate the rebus-solving skills of state-of-the-art LLMs, including open-source systems and proprietary models, via few-shot prompting. Moreover, we fine-tune a small but capable LLM on verbalized rebus solving, outperforming state-of-the-art systems by a wide margin. Finally, we conduct a fine-grained assessment of LLMs' sequential reasoning steps, explaining model performance in terms of word complexity and memorization.

Gabriele Sarti et al. CEUR Workshop Proceedings

1-10

Beyond rebus solving, our evaluation sheds light on the limits of current LLMs in multi-step reasoning settings, highlighting challenges with their application to complex sequential instruction-following scenarios. 1

Background and Related Work

Italian Enigmistica and Rebuses The Italian language is characterized by a rich and long-standing tradition of puzzle games, including rebuses, dating back to the 19th century [5]. 2 In Italian rebuses, a first pass (prima lettura) representing an intermediate solution of the puzzle is produced by combining graphemes with underlying image elements in a left-to-right direction (Figure 1). Then, the letters and words of the first pass undergo a re-segmentation (cesura) according to a solution key (chiave di lettura 3), which specifies the length of words in the solution (frase risolutiva). The verbalized rebuses we introduce in this work are variants of textual rebuses (rebus descritto or verbis), where the text-based puzzle is crafted by replacing first pass words with their crossword definitions in a templated format (Figure 1).
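The re-segmentation step described above can be illustrated with a minimal sketch. It assumes the first pass is given as a whitespace-separated string and the solution key as a list of word lengths; the function name `resegment` is hypothetical and not part of any released code.

```python
def resegment(first_pass: str, solution_key: list[int]) -> str:
    """Re-segment a first pass (prima lettura) according to a solution key.

    Whitespace is dropped, letters are lowercased, and the resulting
    character sequence is cut into words of the lengths given by the key.
    """
    letters = "".join(first_pass.split()).lower()
    if len(letters) != sum(solution_key):
        raise ValueError("first pass length does not match the solution key")
    words, start = [], 0
    for length in solution_key:
        words.append(letters[start:start + length])
        start += length
    return " ".join(words)


# Example from Figure 1: first pass "M ali N coni cane NIA", key 11 5.
print(resegment("M ali N coni cane NIA", [11, 5]))  # malinconica nenia
```

The sketch only checks that the total number of letters matches the key; it does not verify that the resulting words are valid Italian.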

Linguistic Puzzles as NLP Progress Metrics

Language games have recently been adopted as challenging tasks for LLM evaluation [3,9,10]. While works in this area have historically focused on English crosswords [11,12,4,13], recent tests focus on a more diverse set of games such as the New York Times' "Connections" [14] and "Wordle" [15]. Automatic crossword solvers were also developed for French [16], German [17] and Italian [18,19], while didactic crossword generators are available for Italian [20] and Turkish [21]. Relatedly, the Italian evaluation campaign EVALITA 4 recently hosted two shared tasks focusing on the word-guessing game "La Ghigliottina" (The Guillotine) [22,23]. To our knowledge, our work is the first to attempt the computational modeling and evaluation of rebus-solving systems. Importantly, language games such as rebuses are not easily translatable into other languages due to their structural and cultural elements. This makes them a scarce but valuable resource for language-specific evaluations of language processing systems.

LLMs as Sequential Reasoners

State-of-the-art LLMs were shown to struggle to follow sequential instructions presented in a single query [24], but their performance improved significantly with ad-hoc training [25]. This acts as an initial motivation for our rebus-solving

Models We fine-tune Phi-3 Mini 3.8B 4K [28], the most capable LLM below 4B parameters for a wide range of Italian language tasks. 8 We use quantized low-rank adapters (QLoRA) [29, 30] for efficient fine-tuning with Unsloth 9 and Transformers [31], training the model for 5,000 steps with a batch size of 16 over 81k examples. To compare against our model, we select GPT-4o [32] and Claude-3.5 Sonnet [33] as the current state-of-the-art for proprietary LLMs and the instruction-tuned variants of Qwen-2 72B [34] and LLaMA-3 70B [35] as the best-performing open-source LLMs according to the Invalsi Italian benchmark [36]. These four systems are used as untrained baselines thanks to their instruction-following abilities and prompted for rebus solving in a few-shot setting.

Format Table 1 presents an example in the templated format used for fine-tuning Phi-3. 10 The model is prompted to reason step-by-step by 1) solving crossword definitions sequentially (definition resolution); 2) producing a first pass copying letters and definitions' words;

Prompt

Risolvi gli indizi tra parentesi per ottenere una prima lettura, e usa la chiave di lettura per ottenere la soluzione del rebus.

Rebus: U [Lo è il passacavallo] LO [È fatta di vimini] F F [Decimi di chilo] S [Disusato soprabito] A [Un rampicante dei Tropici]

Chiave di lettura: 3 6 12 8

Generation

Procediamo alla risoluzione del rebus passo per passo:

Definition resolution:
- U = U
- [Lo è il passacavallo] = nave
- L O = L O
- [È fatta di vimini] = cesta
- F F = F F
- [Decimi di chilo] = etti
- S = S
- [Disusato soprabito] = tait
- A = A
- [Un rampicante dei Tropici] = liana

First pass:
Prima lettura: U nave LO cesta F F etti S tait A liana

Ora componiamo la soluzione seguendo la chiave risolutiva:

Solution segmentation:
3 = Una
6 = veloce
12 = staffettista
8 = italiana

Solution:
Soluzione: Una veloce staffettista italiana

Results

Table 2 presents our evaluation results. We observe that all prompted models perform poorly on the task, with the overall best prompted system (Claude 3.5 Sonnet) obtaining the correct solution only for 24% of the 2k tested examples. Notably, open-source systems perform significantly worse than proprietary ones, producing correct first passes only for 4% of the examples, and next to no correct solutions. Our fine-tuned system largely outperforms all state-of-the-art prompted models, predicting the correct solution in 51% of cases. From first pass metrics, it is evident these results can be largely explained by the poor word-guessing capabilities of the models, which are greatly improved with fine-tuning. For prompted models, the slight decrease in scores between Def. and FP Words also highlights issues with copying predicted words in the expected format. Finally, we observe that fine-tuning strongly improves the constraint-following abilities of our system, with prompted systems being less strict with applying length and letter-choice constraints for their solutions (Key/FP Match).
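As an illustration of the two constraint-following scores mentioned above, the following sketch computes FP Match as 1 − CER between the first pass and the solution (lowercased, whitespace removed), following the paper's footnote on this metric. The paper computes CER with Jiwer; here a plain Levenshtein distance is swapped in to keep the example self-contained, and the first pass is assumed to be the reference sequence. The `key_match` function is an assumption about how Key Match could be computed (share of solution words whose length matches the key), since its exact definition is not reproduced in this text.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Lowercase and remove all whitespace."""
    return "".join(text.split()).lower()


def fp_match(first_pass: str, solution: str) -> float:
    """1 - CER(FP, S), with the first pass taken as reference (assumption)."""
    ref, hyp = normalize(first_pass), normalize(solution)
    if not ref:
        raise ValueError("empty first pass")
    return 1.0 - levenshtein(ref, hyp) / len(ref)


def key_match(solution: str, solution_key: list[int]) -> float:
    """Share of key positions whose solution word has the required length
    (hypothetical definition, used here for illustration only)."""
    words = solution.split()
    hits = sum(1 for w, k in zip(words, solution_key) if len(w) == k)
    return hits / len(solution_key)


print(fp_match("M ali N coni cane NIA", "Malinconica nenia"))  # 1.0
print(key_match("Malinconica nenia", [11, 5]))                 # 1.0
```

Note that 1 − CER can become negative when the solution is much longer than the first pass; the sketch does not clip the score.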

What Motivates Model Performances?

In light of the strong performances achieved by our relatively small fine-tuned system, this section conducts an in-depth investigation to identify factors motivating such performance improvements.

2) the word frequency in Paisà [38], a large web Italian corpus; and 3) the length of the word (number of characters). We find a significant positive correlation (𝜌 = 0.44) between first pass word prediction accuracy and training frequency for the fine-tuned Phi-3 model, suggesting that model performance is strongly related to training coverage. Word length is also found to negatively affect our model's performance, albeit to a smaller extent (𝜌 = −0.11). The performance of prompted models is unrelated to both properties for first pass words, indicating that these results are the product of fine-tuning. 12

12 Paisà frequency is never found to correlate significantly. Full correlation results are available in Table 6.

LLM Fine-Tuning Fails to Generalize to Unseen Words To further confirm the importance of fine-tuning word coverage in defining model performances, we

Results in Table 3 confirm that this is indeed the case. We find Phi-3 performances to be near-perfect on seen first pass words (FP W. ID = 0.96) in both test sets, with a major drop for OOD words (FP W. OOD = 0.20). This produces second-order effects on subsequent steps, causing the FP EM results to drop by 71% (FP EM Test ∆), while significantly impacting downstream solution accuracies. On the contrary, GPT-4o few-shot prompting performances remain nearly identical on both splits, confirming that these results are not the product of a skewed data selection process. Overall, these results strongly suggest that memorization is the main factor behind the strong rebus-solving performance of our fine-tuned LLM.
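The in-domain/out-of-domain split used for this analysis can be sketched as follows. This is a minimal illustration, assuming each test example carries a list of its first pass words under a hypothetical `first_pass_words` field and that the set of first pass words seen in fine-tuning is available as `train_vocab`; neither name comes from the released data.

```python
def split_id_ood(examples: list[dict], train_vocab: set[str]) -> tuple[list[dict], list[dict]]:
    """Split test examples into Test id and Test ood.

    An example is in-domain only if ALL of its first pass words were seen
    during fine-tuning; a single unseen word makes it out-of-domain.
    """
    test_id, test_ood = [], []
    for example in examples:
        words = example["first_pass_words"]  # hypothetical field name
        if all(word in train_vocab for word in words):
            test_id.append(example)
        else:
            test_ood.append(example)
    return test_id, test_ood


# Toy data for illustration only (not taken from the dataset).
train_vocab = {"ali", "coni", "cane", "nave", "cesta"}
examples = [
    {"id": 1, "first_pass_words": ["ali", "coni", "cane"]},
    {"id": 2, "first_pass_words": ["dente", "salice"]},
    {"id": 3, "first_pass_words": ["nave", "salice"]},
]
test_id, test_ood = split_id_ood(examples, train_vocab)
print(len(test_id), len(test_ood))  # 1 2
```

Comparing a metric on the two subsets then gives the Test Δ column of Table 3.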

Manual Inspection

We conclude by manually evaluating some generations produced by the best-performing LLMs. Table 4 presents two examples with definitions (D) and solution (S) words predicted by three LLMs, with more examples provided in Appendix C. We use naw as short-hand for "Not A Word" to mark nonsensical terms.

In the first example, Phi-3 correctly predicts all first pass and solution words. On the contrary, other models make several mistakes in the first pass, leading to incorrect solutions. Both prompted models tend to ignore first pass words when these cannot be assembled to form sensical, length-fitting solution words. For example, for D1 GPT-4o predicts p (naw), which would lead to the solution word "SAPpTE" (naw), but S8 = "Spettacolo" (show) is predicted by the model instead. In particular, GPT-4o appears to prioritize grammatically correct solutions at the cost of ignoring first pass words and solution key length constraints, while Claude 3.5S shows an improved ability to follow these constraints, as confirmed by Key/FP Match results of Table 2.

Rebus: SAP [La porta della breccia] D1 TE [La pinza del granchio] D2 SBA [Si legge su alcuni orologi] D3 G [Le sue coccole sono aromatiche] D4 V [Un gioco con dadi e pedine] D5 D [Sono verdi in gioventù] D6

Chiave di lettura: 8 3 2 12 7 5

In the second example, the first pass word D2 = salice (willow) is OOD for Phi-3. Consequently, the model produces the incorrect prediction aro (naw), and the error is propagated to all solution words, as previously observed in the Test OOD column of Table 3. Prompted models also underperform in this example, with errors on D1 and D2 propagating to most solution words. However, we note that D1 and D2 incorrect predictions for Claude 3.5S satisfy the provided definitions, suggesting that access to more explicit information about the given constraints could further boost LLMs' performance on this task.

Discussion and Conclusion

This work introduced a verbalized rebus-solving task and dataset for evaluating LLMs' sequential instruction following skills for the Italian language. We crafted a large collection of 83k verbalized rebuses by combining rebus transcriptions with crossword definitions and used it to evaluate the rebus-solving skills of state-of-the-art LLMs. Our experiments revealed the challenging nature of this task, with even the most capable prompted models achieving only 24% accuracy on solutions.

While fine-tuning a smaller LLM dramatically improved performance to 51% solution accuracy, our analysis uncovered that these gains were largely driven by memorization and do not generalize to out-of-distribution examples. These results suggest important limitations in the generalization capabilities of current systems for sequential instruction following tasks. Our manual analysis further shows that LLMs seldom account for length constraints when solving definitions, despite the fundamental role of these cues in restricting the pool of possible words. These results suggest that search-based approaches accounting for constraints more explicitly might improve puzzle structure adherence, as previously shown by Chen et al. [39]. Other augmentation techniques employing LLM reformulation skills can also be explored to mitigate overfitting.

Future work in this area should focus on expanding similar evaluations to a wider set of languages, input modalities, and puzzle categories, creating a comprehensive benchmark to test LLMs' puzzle-solving skills. Importantly, the task of solving visual rebuses and their more convoluted variants 13 remains far beyond the current capabilities of vision-language models. Hence, solving these puzzles automatically can be considered an important milestone in developing multimodal AI systems for constrained multi-step reasoning tasks. Our results confirm that the challenging nature of rebuses, even in their verbalized form, makes this task valuable for assessing future progress in LLMs' linguistic proficiency and sequential reasoning abilities. Finally, our rebus-solving LLM can facilitate future interpretability work investigating the mechanisms behind factual recall and multi-step reasoning in transformer models [40].

Limitations Our analysis was limited to a relatively small set of models, and a single prompt template obtained after minimal tuning. Further experiments are needed to verify that memorization patterns after finetuning remain relevant for other model sizes, prompt formats, and training regimes, particularly for full-weight training approaches. [41] the pool of available definitions for every word.

A. Additional Data Information

Dataset statistics

First pass/Solution word distribution Figure 2 shows the distribution of first pass and solution words for the filtered EurekaRebus subset used in our work.

B. Additional Experimental Results

Table 6 presents the correlations between model accuracy and the properties presented in Section 5. Table 7 presents the full ID/OOD performances for all tested models, showing consistent results with Table 3 for all prompted models. Table 8 presents Phi-3 Mini performances across rebus-solving fine-tuning steps.
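For readers who want to reproduce the kind of analysis summarized in Table 6, the sketch below computes Spearman's rank correlation between a word property (for example training frequency) and per-word accuracy. It is a self-contained illustration in plain Python and makes no claim about the tooling used by the authors; the toy numbers are invented for the example, and the significance testing with Bonferroni correction is not covered.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average position of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: training frequency and accuracy for five hypothetical words.
train_freq = [120, 45, 3, 0, 17]
accuracy = [0.98, 0.90, 0.40, 0.10, 0.75]
print(round(spearman(train_freq, accuracy), 2))
```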

First pass: M ali N coni cane NIA
Verbalized rebus: M [Due calciatori attaccanti] (Two attacking footballers) N [Usati per mangiare il gelato] (Used for eating ice cream) [Abbaia e morde] (Barks and bites) NIA
Solution: Malinconica nenia (melancholic lullaby)

Figure 1: An example of a verbalized rebus crafted by combining a rebus first pass (intermediate solution) with crossword definitions. We use verbalized rebuses to test LLMs' sequential instruction following capabilities. Image from Settimana Enigmistica n. 4656, © Bresi S.r.l.

Table 1: Template used in fine-tuning and prompting experiments with highlighted reasoning stages. Example rebus by Il Piacentino, Settimana Enigmistica n. 2942.

Metrics For our granular evaluation of rebus-solving performance, we adopt the following set of metrics focusing on the first passes (FP) and solutions (S) generated by LLMs:
• Definition (Def.): Proportion of correctly guessed words during definition resolution.

3) re-segmenting it into solution words based on the solution key (solution segmentation); and finally 4) producing the solution by copying re-segmented words. We automatically convert rebuses in this format by deriving the solution key from solution word lengths and dynamically infilling the available information into the template. We use a similar format for prompting experiments, with five in-context step-by-step demonstrations and an explicit instruction asking the model to stick to the previous examples' format to streamline solution parsing.
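The automatic conversion into the templated format can be sketched as below. This is a hedged illustration rather than the authors' conversion script: the `Segment` representation and the `build_example` function are hypothetical, while the Italian template strings are copied from the Table 1 example. The solution key is derived from the solution word lengths, as described above.

```python
from typing import Optional

# A segment is (text, definition); definition is None for plain letters.
Segment = tuple[str, Optional[str]]

PROMPT = ("Risolvi gli indizi tra parentesi per ottenere una prima lettura, "
          "e usa la chiave di lettura per ottenere la soluzione del rebus.")


def build_example(segments: list[Segment], solution: str) -> tuple[str, str]:
    """Return (prompt, generation) strings in the step-by-step template."""
    solution_words = solution.split()
    # The solution key is the sequence of solution word lengths.
    key = " ".join(str(len(word)) for word in solution_words)
    # Words are replaced by their bracketed definitions in the rebus.
    rebus = " ".join(f"[{d}]" if d else t for t, d in segments)
    prompt = f"{PROMPT}\n\nRebus: {rebus}\n\nChiave di lettura: {key}"

    steps = [f"- [{d}] = {t}" if d else f"- {t} = {t}" for t, d in segments]
    first_pass = " ".join(t for t, _ in segments)
    cuts = [f"{len(word)} = {word}" for word in solution_words]
    generation = "\n".join([
        "Procediamo alla risoluzione del rebus passo per passo:",
        *steps,
        f"Prima lettura: {first_pass}",
        "Ora componiamo la soluzione seguendo la chiave risolutiva:",
        *cuts,
        f"Soluzione: {solution}",
    ])
    return prompt, generation


segments = [
    ("M", None), ("ali", "Due calciatori attaccanti"),
    ("N", None), ("coni", "Usati per mangiare il gelato"),
    ("cane", "Abbaia e morde"), ("NIA", None),
]
prompt, generation = build_example(segments, "Malinconica nenia")
print(prompt)
print(generation)
```

For few-shot prompting, several such prompt and generation pairs could be concatenated as demonstrations before the rebus to solve.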

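Table 2: Fine-grained verbalized rebus solving performances of various LLMs. Bold denotes best overall performances, and underline marks best training-free results. Def., Words, Letters, and EM are First Pass (FP) metrics; Key Match, FP Match, Words, and EM are Solution (S) metrics.

| Model | Setup | FP Def. | FP Words | FP Letters | FP EM | S Key Match | S FP Match | S Words | S EM |
| LLaMA-3 70B | 5-shot prompt | 0.22 | 0.20 | 0.60 | 0.04 | 0.16 | 0.51 | 0.03 | 0.00 |
| Qwen-2 72B | 5-shot prompt | 0.28 | 0.25 | 0.76 | 0.04 | 0.20 | 0.52 | 0.04 | 0.00 |
| GPT-4o | 5-shot prompt | 0.55 | 0.51 | 0.83 | 0.15 | 0.53 | 0.74 | 0.27 | 0.11 |
| Claude-3.5 Sonnet | 5-shot prompt | 0.66 | 0.62 | 0.90 | 0.28 | 0.83 | 0.82 | 0.43 | 0.24 |
| Phi-3 3.8B (ours) | fine-tuned | 0.84 | 0.84 | 1.00 | 0.56 | 0.86 | 0.94 | 0.68 | 0.51 |

| Metric | GPT-4o Test id | GPT-4o Test ood | GPT-4o Test Δ | Phi-3 (ours) Test id | Phi-3 (ours) Test ood | Phi-3 (ours) Test Δ |
| FP W. ID | 0.52 | 0.51 | -0.01 | 0.96 | 0.96 | 0.00 |
| FP W. OOD | - | 0.44 | - | - | 0.20 | - |
| FP EM | 0.16 | 0.14 | -0.02 | 0.89 | 0.18 | -0.71 |
| S W. ID | 0.29 | 0.26 | -0.03 | 0.92 | 0.49 | -0.43 |
| S W. OOD | 0.18 | 0.16 | -0.02 | 0.63 | 0.20 | -0.40 |
| S EM | 0.12 | 0.09 | -0.03 | 0.82 | 0.16 | -0.66 |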

Table 3: Model performances for test subsets containing only in-domain (Test ID), or some out-of-domain (Test OOD) first pass words. W. ID and W. OOD are accuracies for ID and OOD words for first pass (FP) and solution (S) sequences. Test Δ = Test OOD − Test ID performance.

Word Complexity and Frequency Affect LLM Fine-tuning Performance For every word in the first passes and solutions of test set examples, we measure LLMs' overall accuracy in predicting it for the full test set. We then correlate this score to various quantities that could motivate LLMs' performances. More specifically, we use 1) the word frequency in the training set;

evaluate our fine-tuned model in out-of-distribution settings. For this evaluation, the 2k examples of the test set from previous sections are divided into two subsets: one in which all first pass words were seen during finetuning by Phi-3 (Test id, 1061 examples) and one in which, for every example, at least one first pass word was unseen in training (Test ood, 939 examples). Intuitively, if Phi-3 performance is mainly motivated by memorizing fine-tuning data, introducing OOD words should produce a significant drop in model performances. Results shown in Table

Table 4: Examples of LLM generations for rebuses by Slam, Nuova Enigmistica Tascabile n. 2802 (top) and Grizzly, Domenica Quiz n. 2 (bottom). Correct guesses and errors are denoted for predicted first pass definitions (D 1,...,𝑁) and solution words (S 𝑖, with 𝑖 being the 𝑖-th solution key value).

| Step | GPT-4o | Claude 3.5S | Phi-3 |
| D1 | p | one | pia |
| D2 | chela | chela | chela |
| D3 | ora | data | data |
| D4 | ginepro | lio | ginepro |
| D5 | ludo | oca | oca |
| D6 | acerbi | anni | anni |
| S8 | Spettacolo | Saponate | Sappiate |
| S3 | che | che | che |
| S2 | fa | la | la |
| S12 | sognare | sbadataggine | sbadataggine |
| S7 | ogni | vocando | provoca |
| S5 | sera | danni | danni |

Soluzione: SAPpiaTE che la SBAdataGgine proVoca Danni

Rebus: STU [Si salva otturandolo] D1 S [Ha foglie seghettate] D2 AL [Lo è l'operaio che lavora in cantiere] D3 G [Un uomo... non all'altezza] D4

Chiave di lettura: 11 7 2 7

| Step | GPT-4o | Claude 3.5S | Phi-3 |
| D1 | tappo | falla | dente |
| D2 | acero | ortica | aro |
| D3 | edile | edile | edile |
| D4 | nano | nano | nano |
| S11 | Stupaccerone | Stufallassor | Studentesaro |
| S7 | salendo | ticale | aledile |
| S2 | al | di | gi |
| S7 | genano | Legnano | nanano |

Soluzione: STUdenteSsa liceALe di LeGnano

Table 5 presents statistics for the EurekaRebus dataset and the filtered subset we use for composing verbalized rebuses. The ItaCW dataset contains a total of 125,202 definitions for 40,963 unique words, with the most frequent words having hundreds of different definitions, e.g. 173 for re (king), 155 for te (you). Definitions used for verbalization are randomly sampled from the pool of available definitions for every word.

Table 5: Statistics for the full EurekaRebus dataset and the crosswords-filtered subset used in this work. Avg./SD = Average/standard deviation.

| Statistic | EurekaRebus | ItaCW-filtered |
| # examples | 222089 | 83157 |
| # authors | 8138 | 5046 |
| Year range | 1800-2024 | 1869-2024 |
| First pass: # unique words | 38977 | 8960 |
| First pass: Avg./SD words/ex. | 3.50/1.48 | 3.08/1.00 |
| First pass: Avg./SD word len. | 6.51/1.96 | 5.70/1.60 |
| First pass: Avg./SD FP len. | 26.45/11.19 | 25.74/8.73 |
| Solution: # unique words | 75718 | 42558 |
| Solution: Avg./SD words/ex. | 3.02/1.60 | 2.80/1.21 |
| Solution: Avg./SD word len. | 8.07/2.30 | 7.79/2.23 |
| Solution: Avg./SD Sol. len. | 19.47/8.44 | 18.81/6.06 |

| Words | Model | # Char. | Paisà Freq. | Train Freq. |
| First pass | GPT-4o | -0.01 | 0.01 | 0.02 |
| First pass | Claude-3.5 | -0.02 | -0.02 | 0.00 |
| First pass | Phi-3 (ours) | -0.11 | -0.05 | 0.44 |
| Solution | GPT-4o | -0.18 | 0.14 | 0.19 |
| Solution | Claude-3.5 | -0.15 | 0.08 | 0.13 |
| Solution | Phi-3 (ours) | -0.02 | 0.08 | 0.22 |

Table 6: Spearman's correlation with average word accuracies for metrics computed on first pass (top) and solution (bottom) words. Bold scores are significant with Bonferroni-corrected 𝑝 < 1e−5.

1 Code, data and models are available on Github and Huggingface.
2 Refer to Miola [6], Bartezzaghi [7], Ichino [8] for a comprehensive overview of peculiarities and norms in modern Italian rebuses.
3 Referred to as diagramma in jargon.
4 https://www.evalita.it
5 http://www.eureka5.it, additional details in Appendix A. Rebus illustrations are not available in Eureka5.
6 Since verbalized rebuses are produced from textual contents only, crossword definitions may refer to different word meanings (e.g. [Two soccer players] is used to represent the word "wings" in Figure 1 despite not matching the word sense "bird wings" of the original image). This does not affect the validity of our task.
7 Composed by Test id and Test ood, described in Section 5.
8 https://hf.co/spaces/FinancialSupport/open_ita_llm_leaderboard
9 https://github.com/unslothai/unsloth
10 An English example is available in Table 9.
11 In practice, we define this as 1 − CER(FP, S), where CER is the character error rate [37] between the two sequences (lowercased, whitespace removed) computed with Jiwer.
13 For example, rebuses requiring first pass anagrams (anarebus) or dynamic relations derived from multi-scene analysis (stereorebus).

Acknowledgments

Gabriele Sarti and Arianna Bisazza acknowledge the support of the Dutch Research Council (NWO) for the project InDeep (NWA.1292.19.399). Arianna Bisazza is further supported by the NWO Talent Programme (VI.Vidi.221C.009). We are grateful to the Associazione Culturale "Biblioteca Enigmistica Italiana -G. Panini" for making its rebus collection freely accessible on the Eureka5 platform, and to Valeriya Zelenkova for her valuable comments on the first version of this work. We also thank the CLiC-it 2024 reviewers for their valuable feedback.

C. Additional Model Generations

Table 9 presents an English translation of the Figure 1 example using the prompt format adopted in this study.

Tables 10 and 11 provide additional examples of LLM generations for tested rebuses, with the example from Table 11 (bottom) being OOD due to the manovella (crank) word in D2, and the others being ID for the fine-tuned model.

Prompt

Solve the clues provided between parentheses to obtain a first pass, and use the solution key to obtain the rebus' solution.

Rebus: M [Two attacking footballers] N [Used for eating icecream] [Barks and bites] NIA

Solution key: 11 5

Generation

Let's proceed to solve the rebus step by step:

- M = M
- [Two attacking footballers] = ali (wings)
- N = N
- [Used for eating icecream] = coni (cones)
- [Barks and bites] = cane (dog)
- N I A = N I A

References

[1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016). doi:10.1038/nature16961.
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and go through self-play, Science 362 (2018). doi:10.1126/science.aar6404.
[3] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for nlp, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021.
[4] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022. doi:10.18653/v1/2022.acl-long.219.
[5] D. Tolosani, Enimmistica, Hoepli, Milan, 1901.
[6] E. Miola, Che cos'è un rebus, Carocci, 2020.
[7] S. Bartezzaghi, Parole in gioco: Per una semiotica del gioco linguistico, Bompiani, 2017.
[8] P. Ichino, L'ora desiata vola: guida al mondo del rebus per solutori (ancora) poco abili, Bompiani, Milan, 2021.
[9] R. Manna, M. P. Di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz, S. Lukin (Eds.), Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024.
[10] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, ArXiv (2024).
[11] M. L. Littman, G. A. Keim, N. Shazeer, A probabilistic approach to solving crossword puzzles, Artificial Intelligence 134 (2002) 23. doi:10.1016/S0004-3702(01)00114-X.
[12] M. Ernandes, G. Angelini, M. Gori, Webcrow: A web-based system for crossword solving, in: AAAI Conference on Artificial Intelligence, 2005. doi:10.1007/11590323_37.
[13] A. Sadallah, D. Kotova, E. Kochmar, Are llms good cryptic crossword solvers?, ArXiv (2024).
[14] G. Todd, T. Merino, S. Earle, J. Togelius, Missed connections: Lateral thinking puzzles for large language models, Arxiv (2024).
[15] B. J. Anderson, J. G. Meyer, Finding the optimal human strategy for wordle using maximum correct letter probabilities and reinforcement learning, Arxiv (2022).
[16] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé, F. Simoes, K. Zeinalipour, A. Zugarini, M. Gori, The webcrow french crossword solver, in: Intelligent Technologies for Interactive Entertainment, 2023. doi:10.1007/978-3-031-55722-4_14.
[17] A. Zugarini, T. Röthenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die rätselrevolution: Automated german crossword solving, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023.
[18] G. Angelini, M. Ernandes, M. Gori, Solving italian crosswords using the web, in: International Conference of the Italian Association for Artificial Intelligence, 2005. doi:10.1007/11558590_40.
[19] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-instruct: Text-based clue generation for educational crossword puzzles, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024.
[20] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles, in: Proceedings of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), 2023.
[21] K. Zeinalipour, Y. G. Keptig, M. Maggini, L. Rigutini, M. Gori, A turkish educational crossword puzzle generator, ArXiv abs/2405.07035 (2024).
[22] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-ai@evalita2020: Evaluating artificial players for the language game "la ghigliottina" (short paper), in: EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020, 2020. doi:10.4000/books.aaccademia.7488.
[23] P. Basile, M. De Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016). doi:10.1109/TCIAIG.2014.2355859.
[24] X. Chen, B. Liao, J. Qi, P. Eustratiadis, C. Monz, A. Bisazza, M. De Rijke, The sifo benchmark: Investigating the sequential instruction following ability of large language models, 2024.
[25] H. Hu, S. Yu, P. Chen, E. M. Ponti, Fine-tuning large language models with sequential instructions, Arxiv (2024).
[26] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020.
[27] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022.
[28] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, J. Bao, H. Behl, A. Benhaim, M. Bilenko, J. Bjorck, S. Bubeck, et al., Phi-3 technical report: A highly capable language model locally on your phone, Arxiv (2024).
[29] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations (ICLR 2022), OpenReview, Online, 2022.
[30] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Q. Liu, D. Schlangen (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020. doi:10.18653/v1/2020.emnlp-demos.6.
[32] OpenAI, Hello gpt-4o, Website, 2024.
[33] Anthropic, Claude 3.5 sonnet, Website, 2024.
[34] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao, Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, Z. Fan, https://arxiv.org/abs/2407.10671
<date type="published" when="2024">2024</date> </imprint> </monogr> <note type="report_type">Qwen2 technical report</note> </biblStruct> <biblStruct xml:id="b35"> <monogr> <author> <persName><forename type="first">M</forename><surname>Ai</surname></persName> </author> <ptr target="https://ai.meta.com/blog/meta-llama-3" /> <title level="m">Introducing meta llama 3: The most capable openly available llm to date Website 2024 Disce aut deficere: Evaluating llms proficiency on the invalsi italian benchmark FMercorio MMezzanzanica DPotertì ASerino ASeveso 2024 From wer and ril to mer and wil: improved evaluation measures for connected speech recognition AMorris VMaier PGreen 2004 The PAISÀ corpus of Italian web texts VLyding EStemle CBorghetti MBrunello SCastagnoli FDell'orletta HDittmann ALenci VPirrelli 10.3115/v1/W14-0406 Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics FBildhauer RSchäfer the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics

Gothenburg, Sweden

2014 Crossword puzzle resolution via monte carlo tree search LChen JLiu SJiang CWang JLiang YXiao SZhang RSong 10.1609/icaps.v32i1.19783 Proceedings of the International Conference on Automated Planning and Scheduling the International Conference on Automated Planning and Scheduling 2022 32 JFerrando GSarti ABisazza MRCosta A primer on the inner workings of transformerbased language models 2024 Arxiv -jussà Teoria statistica delle classi e calcolo delle probabilita, Pubblicazioni del R. Istituto CBonferroni Superiore di Scienze Economiche e Commericiali di Firenze 8 1936