<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Manifold Learning for Italian Crosswords and Beyond</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristiano Ciaccio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Sarti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Language and Cognition (CLCG), University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Answering crossword puzzle clues presents a challenging retrieval task that requires matching linguistically rich and often ambiguous clues with appropriate solutions. While traditional retrieval-based strategies can commonly be used to address this issue, wordplay and other lateral thinking strategies limit the effectiveness of conventional lexical and semantic approaches. In this work, we address the clue answering task as an information retrieval problem, exploiting the potential of encoder-based Transformer models to learn a shared latent space between clues and solutions. In particular, we propose for the first time a collection of siamese and asymmetric dual encoder architectures trained to capture the complex properties and relations characterizing crossword clues and their solutions for the Italian language. After comparing various architectures for this task, we show that the strong retrieval capabilities of these systems extend to neologisms and dictionary terms, suggesting their potential use in linguistic analyses beyond the scope of language games.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Games</kwd>
        <kwd>Crosswords</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <p>Language games have emerged as compelling benchmarks for evaluating the reasoning capabilities of language models (LMs), offering structured challenges that require diverse cognitive skills, including wordplay comprehension, lateral thinking, and cultural knowledge integration [2, 3, 4, 5]. Among popular language games, crossword puzzles stand out as particularly challenging, demanding not only linguistic competence but also extensive world knowledge, cultural awareness, and lateral thinking skills [6, 7, 8, 9].</p>
      <p>ORCID: 0009-0001-6113-4761 (C. Ciaccio); 0000-0001-8715-2987 (G. Sarti); 0000-0002-0736-5411 (A. Miaschi); 0000-0003-3454-9387 (F. Dell'Orletta). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>Figure 1: An example of a symmetric-style crossword puzzle. The grid was populated using clues taken from the test set. The correct solution, which was autonomously found leveraging our system, is in Appendix A.</p>
      <p>While recent advances in Large Language Models have shown impressive performance on many natural language understanding tasks, their effectiveness on language games remains constrained by fundamental limitations in accessing linguistic and culturally-relevant knowledge, in particular for less-resourced non-English languages [5].</p>
      <p>Before the advent of modern language models, most approaches to crossword solving relied on retrieval-based methods and shallow lexical and semantic features to identify relevant information [10, 11]. For example, [12] proposed a retrieval model that exploited lexical resources and similarity metrics to match clues to candidate answers in Italian. In a subsequent work [13], the authors introduced SACRY, a system that incorporated syntactic information and ranking strategies to improve clue-answer matching. Importantly, fill-in-the-blank clues and clues representing anagrams or linguistic games, including the use of wordplay, homophones and other devices, are often omitted by such systems: for instance, the clue "procedimenti lenti" plays on the polysemanticity of lenti (in Italian, either "slow", masc. plur., or "lenses"), and could have ottici (opticians) as a valid solution. These kinds of subtle connections hinder the viability of traditional retrieval systems in the context of crossword games.</p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073).</p>
      <p>Recent advances in cross-modal learning, particularly in vision-language models such as CLIP [14, 15], have demonstrated the effectiveness of dual encoder architectures in learning shared representations across different modalities. These approaches typically employ separate encoders for each modality, training them to project inputs into a common latent space where semantically related items cluster together. Inspired by these successes, we propose adapting this paradigm to the domain of language games, specifically focusing on the relationship between crossword clues and their solutions.</p>
      <p>In this work, we evaluate several dual encoder architectures designed to learn effective representations for crossword puzzle elements (see Figure 1 for an example of a crossword puzzle). Our approaches treat clues and solutions as distinct "modalities" that can be embedded into a shared latent space. The clue encoder must understand various forms of wordplay, cultural references, and linguistic devices, while the solution encoder must represent semantic, lexical and grammatical characteristics of the words. By training these encoders jointly with a contrastive objective, we create a retrieval system specifically optimized for the complexities of crossword puzzles. Our contributions are threefold: (1) we formalize the problem of specialized retrieval for language games and demonstrate the limitations of generic retrieval approaches in this domain; (2) we introduce and evaluate multiple dual encoder architectures tailored for Italian crossword puzzles, exploring different design choices and training strategies; (3) we demonstrate the utility of our learned representations for solution ranking and explore their generalization capabilities to neologisms. Our experimental results show that domain-specific models significantly outperform generic alternatives, suggesting that specialized retrieval mechanisms are essential for effectively ranking plausible alternatives in this domain.</p>
      <p>In the following sections, we describe in detail the architecture of our model (Section 2.1), the datasets used for the experiments (Section 2.2), the encoder models employed (Section 2.3), the experimental setting (Section 2.4), and the evaluation strategy adopted to assess the system's performance (Section 2.5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Approach</title>
      <p>Our approach formalizes crossword clue answering as an information retrieval problem. Given a clue q from the set Q = {q1, …, qn} and a matching solution w from the finite set of all available solution words W = {w1, …, wm}, our system scores the similarity of a subset of candidates w* ∈ W with q to produce a similarity-based ranking. Inspired by CLIP's approach [14], we opted for a dual encoder architecture [16], composed of two pre-trained transformer encoders [17] (referred to as towers) which are fine-tuned on clue-solution pairs with a contrastive learning objective to learn a joint embedding space between clues and words.</p>
      <sec id="sec-2-1">
        <title>2.1. Model's Architecture</title>
        <p>To explore the effectiveness of our approach, we experiment with different encoder-based models for initializing the encoder towers, each fine-tuned and tested on a dataset of Italian crossword clues. As shown by Dong et al. [18], to effectively learn a shared parameter space using a dual encoder, there are two main architectural options: (a) the Siamese Dual Encoder (SDE) and (b) the Asymmetric Dual Encoder (ADE) with a shared linear projection. Both consist of two pre-trained Transformer encoders, in our case a clue encoder f1 and a solution encoder f2, trained to produce representations c = f1(q) and s = f2(w) by average pooling, where both c, s ∈ ℝ^d. These are linearly projected into a shared feature space ℝ^k in order to maximize the cosine similarity between positive pairs (c, s+) and minimize it for negative ones (c, s−). The distinction between SDE and ADE lies in the parameter sharing: while in SDE the two encoders f1 and f2 have tied parameters (θ1 = θ2), in ADE the two encoder towers have untied parameters (θ1 ≠ θ2) but share a final layer norm and the linear transformation W: ℝ^d → ℝ^k, which is essential to achieve an effectively shared space. Having separate encoders can be advantageous when modeling different modalities and distributions, since it allows the two encoders to specialize independently on the specific nuances of the input types they process. To assess which of the two architectures is better suited for our task, we conduct preliminary experiments on both and compare their results in Section 3.1.</p>
      </sec>
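      <p>As an illustrative sketch (not the released implementation), the SDE/ADE distinction of Section 2.1 can be reduced to toy NumPy matrices standing in for the pre-trained towers: tied parameters for SDE, untied parameters for ADE, with a shared projection W in both cases. All sizes and values below are invented.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4  # encoder hidden size and shared-space size (toy values)

def mean_pool(token_states):
    # average pooling over the token axis, as in the paper
    return token_states.mean(axis=0)

# toy stand-ins for the two pre-trained towers: a single linear map each
theta_1 = rng.normal(size=(d, d))  # clue tower parameters
theta_2 = rng.normal(size=(d, d))  # solution tower parameters
W = rng.normal(size=(d, k))        # shared linear projection to R^k

def encode(token_states, theta):
    # pool, apply the tower, then project into the shared space
    return mean_pool(token_states) @ theta @ W

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clue_tokens = rng.normal(size=(5, d))  # 5 "token states" for a clue
word_tokens = rng.normal(size=(2, d))  # 2 "token states" for a solution

# SDE: tied towers (same theta); ADE: untied towers, shared W
c_sde, s_sde = encode(clue_tokens, theta_1), encode(word_tokens, theta_1)
c_ade, s_ade = encode(clue_tokens, theta_1), encode(word_tokens, theta_2)

print(round(cosine(c_sde, s_sde), 3), round(cosine(c_ade, s_ade), 3))
```

Training would then push cosine(c, s) up for positive pairs and down for negatives, as described above.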
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <p>For training our dual encoders, we employ the ItaCW crossword dataset [19], containing 125k unique definition-word pairs. We expand this collection with additional clue-solution pairs found on the web, and deduplicate the resulting set of entries, obtaining a total of 416,407 samples. Code, models and datasets are released at https://github.com/snizio/Crossword-Space.</p>
        <p>In addition to the original crossword dataset, to evaluate the out-of-distribution performance of our system we also consider word-definition pairs automatically extracted from the Italian Wiktionary, neologisms from the ONLI (Osservatorio Neologico della Lingua Italiana, https://www.iliesi.cnr.it/ONLI/), and the 100-neologisms dataset. The usage of dictionary data is twofold: (a) to understand whether augmenting the train set with word-definition pairs can enhance downstream performance on the crossword data; and (b) to assess the extent to which models trained on word-clue pairs can be used to answer dictionary definitions. Since some word-definition pairs maintain the same inferential relation that occurs for most clue-solution pairs (excluding nuanced and specific crossword cases), augmenting the dataset with these resources allows us to assess the performance variations and generalization to different linguistic settings that exhibit the same input-output structure of crosswords, offering a natural extension to the main dataset. When augmenting the dataset with dictionary definitions, all inflected forms are dropped.</p>
        <p>On the other hand, the ONLI and the 100-neologisms datasets will be used to test the robustness and generalization of our systems, therefore simulating a scenario where a novel term appears in a crossword, as is often the case. The ONLI covers a wide range of neologisms appearing in national and local newspapers, thus strictly related to Italian culture, including newly coined or derived formations, internationalisms, foreignerisms, technical terms and some authorial neologisms until 2019, while the 100-neologisms dataset consists of lemmas extracted from various online dictionaries (lexicalized after 2020) that focus mostly on politics and COVID-19 social dynamics and contain several foreignerisms. After merging all the data sources, we split the resulting dataset into 90% train, 5% validation and 5% test (see Table 1).</p>
      </sec>
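      <p>A minimal sketch of the deduplication and 90/5/5 split described above, with invented toy pairs in place of the real ItaCW and web-crawled data:</p>

```python
import random

# invented toy clue-solution pairs standing in for ItaCW + web data;
# the raw list contains exact duplicates, as the merged collection does
pairs = [(f"definizione {i}", f"parola{i}") for i in range(100)] * 2

deduped = sorted(set(pairs))  # deduplicate the merged collection
random.seed(13)
random.shuffle(deduped)

n = len(deduped)
train = deduped[:int(0.90 * n)]               # 90% train
valid = deduped[int(0.90 * n):int(0.95 * n)]  # 5% validation
test = deduped[int(0.95 * n):]                # 5% test
print(len(train), len(valid), len(test))      # 90 5 5
```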
      <sec id="sec-2-3">
        <title>2.3. Models</title>
        <p>As backbone models, we choose several pre-trained encoders available for the Italian language, varying in parameter size and pre-training approach. Specifically, we picked the encoders of IT5-small (35M) and IT5-base (110M) from the IT5 family [21] of encoder-decoders pre-trained on the Italian cleaned split of the mC4 [22]; Italian-ModernBERT-base (135M, DeepMount00/Italian-ModernBERT-base) and Italian-ModernBERT-base-embed-mmarco-triplet (135M, nickprock/Italian-ModernBERT-base-embed-mmarco-triplet), both based on the ModernBERT architecture [23] and pretrained on Italian, with the latter being finetuned in a sentence-transformer fashion [24] on the mMARCO dataset [25]; lastly, we employed paraphrase-multilingual-mpnet-base-v2 [26] (278M, sentence-transformers/paraphrase-multilingual-mpnet-base-v2), a multilingual model based on XLM-RoBERTa already tuned as a sentence embedder.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Experimental setting</title>
        <p>We begin by comparing ADE and SDE architectures to assess the optimal approach for our clue answering task. Subsequently, each model is trained across two dataset configurations: the first one consists of using only a subset of the crossword dataset as the training set, while the second one also introduces a split of the Italian Wiktionary in the training data. On the other hand, the evaluation is always performed on a held-out test set composed of crossword clues, dictionary, ONLI and 100-neologisms test sets.</p>
        <p>We train our SDE and ADE architectures to minimize the symmetric InfoNCE loss used in CLIP [14] with in-batch negatives. During training, for each step, we mine the hard negatives that have the highest similarity to the positive target, where n is the batch size and μ ∈ [0, 1] is a fraction that determines how many of the hardest negatives are kept [27]. Formally, let c_i ∈ ℝ^k be the normalized embedding of the i-th clue, s_i ∈ ℝ^k the normalized embedding of the i-th solution word, τ = exp(t) a learnable temperature parameter, and N_i the indices of the top (n − 1) · μ hardest negatives. The clue-to-solution contrastive loss ℒ_c→s is defined as:</p>
        <p>ℒ_c→s = −(1/n) Σ_{i=1..n} log [ exp(τ · cos(c_i, s_i)) / (exp(τ · cos(c_i, s_i)) + Σ_{j ∈ N_i} exp(τ · cos(c_i, s_j))) ]</p>
        <p>Similarly, the solution-to-clue loss ℒ_s→c is:</p>
        <p>ℒ_s→c = −(1/n) Σ_{i=1..n} log [ exp(τ · cos(s_i, c_i)) / (exp(τ · cos(s_i, c_i)) + Σ_{j ∈ N_i} exp(τ · cos(s_j, c_i))) ]</p>
      </sec>
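      <p>A toy NumPy re-implementation of the symmetric objective with the hard-negative fraction μ may clarify the formulas above; the batch, dimensions and default hyperparameter values are invented for illustration:</p>

```python
import numpy as np

def symmetric_infonce(C, S, tau=10.0, mu=0.5):
    """Symmetric InfoNCE with in-batch negatives, keeping only the
    top (n-1)*mu hardest negatives per anchor (toy re-implementation)."""
    n = C.shape[0]
    C = C / np.linalg.norm(C, axis=1, keepdims=True)  # normalize rows
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = C @ S.T                                     # cosine matrix
    keep = max(1, round((n - 1) * mu))                # negatives kept
    losses = []
    for M in (sim, sim.T):                            # c->s, then s->c
        per_anchor = []
        for i in range(n):
            pos = M[i, i]
            negs = np.delete(M[i], i)
            hardest = np.sort(negs)[::-1][:keep]      # highest similarity
            logits = np.concatenate([[pos], hardest]) * tau
            logits -= logits.max()                    # numerical stability
            p = np.exp(logits)
            per_anchor.append(-np.log(p[0] / p.sum()))
        losses.append(float(np.mean(per_anchor)))
    return 0.5 * (losses[0] + losses[1])              # average of the two

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 16))
S = C + 0.01 * rng.normal(size=(8, 16))  # nearly aligned positive pairs
print(symmetric_infonce(C, S))            # low loss for aligned pairs
```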
      <sec id="sec-2-7">
        <title>2.4.1. Training and inference details</title>
        <p>The final symmetric contrastive loss is the average of the two losses: ℒ = (ℒ_c→s + ℒ_s→c) / 2. The training setup is the same across all models, architectures and dataset configurations. Each model is trained for a maximum of six epochs with a batch size of 256 using AdamW [28] with a linearly decaying learning rate. The hard negatives fraction μ decays linearly during training from 0.8 to 0.05 (for detailed hyperparameters see Appendix B).</p>
        <p>Before the test phase, all available solution words are encoded into their relative embeddings, normalized and stored into a vector database. During inference, for a normalized clue embedding c, the retrieval is performed leveraging the FAISS library [29] by inner product on the stored embedding matrix E ∈ ℝ^{|W| × k}, where |W| = 106,988 is the cardinality of the finite set of available solution words and k is the embedding dimension.</p>
        <p>Baselines. In order to further assess the performance of our models, we include and compare several baselines based on two main approaches: (a) clues to clues (c2c), where, given an input clue, the most similar clues and their corresponding solutions are retrieved from the training set, as commonly done in the crossword solving literature [13, 30, 31]; and (b) clues to solutions (c2s), where solutions are retrieved by directly comparing the given clue against the set of all possible solutions. For c2c we computed the similarity scores between clues using (1) Levenshtein distance (c2c-lev), (2) BM25 (c2c-BM25) and (3) the cosine similarities between clue representations obtained with paraphrase-multilingual-mpnet-base-v2 (c2c-MPNet) as a standalone sentence embedder and without any finetuning. For the c2s baseline, we rank the answers by cosine similarity between the clue and all solutions using, as mentioned before, paraphrase-multilingual-mpnet-base-v2 (c2s-MPNet). To ensure a fair comparison between models and baselines, the c2c retrieval is conducted against the clues in the training set, augmented with dictionary definitions.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.5. Evaluation</title>
        <p>To evaluate the retrieval performance of our trained models, we adopt the following standard metrics. Accuracy@1/10/100/1000 is the accuracy in retrieving the correct solution word given the corresponding clue, considering the top 1/10/100/1000 most similar words retrieved by our system as valid. Mean Reciprocal Rank (MRR) represents how well a system ranks the first relevant result by averaging the reciprocal ranks of the first relevant item across all queries. To simulate a more realistic crossword puzzle solving scenario, we also report metrics for candidate words retrieved from the filtered set W_ℓ ⊆ W containing only words with the same character length ℓ as the target word, formally W_ℓ = {w ∈ W ∣ len(w) = ℓ}. We append an asterisk when reporting metrics that include this filtering process (e.g. Acc@10* or MRR*).</p>
      </sec>
      <sec id="sec-3">
        <title>3. Results</title>
        <p>We begin by comparing the two architectures under evaluation, SDE and ADE, and then report the performance of all tested models on all datasets using the best performing architecture.</p>
        <sec id="sec-3-1">
          <title>3.1. Siamese vs. Asymmetric Encoders</title>
          <p>Table 2: Test results for ADE and SDE architectures across the four tested domains. Top scores per dataset are marked in bold.</p>
          <p>Table 2 reports our test results for the paraphrase-multilingual-mpnet-base-v2 model, the largest we trained, which guided our choice between the siamese and asymmetric architecture variants. Interestingly, the asymmetric architecture shows a substantial gain in performance only for crossword clues, and especially in ranking terms (Acc@1 +13%, MRR +10%), while being outperformed by SDE in all other linguistic settings, although with a narrower gap. We hypothesize that, due to the peculiar inference links that relate clues and target words, an asymmetric architecture could be better at enriching representations with input/output nuances separately, rather than jointly as in SDE models. Indeed, many puzzles feature clues with wordplay intended to be taken metaphorically or in other non-literal senses. For example, a correct answer for the clue "half a dance" might be can (half of the dance named cancan). In this setting, an encoder specialized in enriching the representation of the clue with dance names might be necessary to achieve good performance. On the other hand, for dictionary-like entries, there is no sufficient need to develop uniquely independent representations (as shown by the ADE performance drop), since word-definition pairs are typically symmetric in meaning and structure. In these settings, the same encoder can effectively capture both sides of the pair, benefiting from shared parameters that reinforce semantic alignment. Given that our primary interest in this work lies in crosswords, we adopt the ADE architecture with a shared linear projection for the subsequent evaluations.</p>
        </sec>
        <sec id="sec-3-2">
          <title>3.2. Main results</title>
          <p>Table 3 shows the results of all models across the various test sets.</p>
          <p>Crosswords. MPNet-base, ModernSBert and IT5-base strongly outperform all baselines, especially at higher candidate sizes and when applying length filtering ("*"). Overall, the MPNet-base yields the best result, suggesting that model size has a positive effect on improving task performance. In terms of MRR, ModernSBert is the second-best performer, substantially outperforming its only pre-trained counterpart, ModernBert, underscoring the additional value of using models that have already undergone a sentence finetuning phase for boosting retrieval performance. All baselines leveraging the c2c approach are superior when confronted with IT5-small and ModernBert, especially in terms of MRR.</p>
        </sec>
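        <p>The retrieval-plus-evaluation pipeline of Sections 2.4 and 2.5 (inner-product search over normalized embeddings, Accuracy@k, MRR, and the length-filtered "*" variants) can be sketched as follows; the vocabulary and embeddings are invented, and a brute-force dot product stands in for the FAISS index:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["roma", "cane", "ottici", "lenti", "can"]  # toy solution set
E = rng.normal(size=(len(vocab), 8))
E /= np.linalg.norm(E, axis=1, keepdims=True)       # normalized, as stored

def rank(clue_emb, allowed=None):
    """Vocabulary indices sorted by inner product, optionally filtered."""
    c = clue_emb / np.linalg.norm(clue_emb)
    order = np.argsort(-(E @ c))
    return [int(i) for i in order if allowed is None or int(i) in allowed]

def acc_at_k(ranks, k):
    # fraction of queries whose target appears in the top k
    return sum(r < k for r in ranks) / len(ranks)

def mrr(ranks):
    # mean reciprocal rank over all queries (ranks are 0-based)
    return sum(1.0 / (r + 1) for r in ranks) / len(ranks)

targets = [2, 4]                                    # "ottici", "can"
ranks, ranks_star = [], []
for t in targets:
    clue = E[t] + 0.05 * rng.normal(size=8)         # noisy clue embedding
    ranks.append(rank(clue).index(t))
    same_len = {i for i, w in enumerate(vocab) if len(w) == len(vocab[t])}
    ranks_star.append(rank(clue, allowed=same_len).index(t))  # "*" setting

print(acc_at_k(ranks, 1), acc_at_k(ranks, 3), round(mrr(ranks), 3))
```

Length filtering can only improve (or preserve) a target's rank, which is why the starred metrics are upper-bounded below by the unfiltered ones.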
      </sec>
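      <p>The grid-filling formulation used for the automated-solving experiment (Section 4), i.e. one disjunction over candidate answers per clue, conjoined across clues, with equality constraints at crossings, can be illustrated without an SMT solver. Below, a tiny brute-force search over a hypothetical mini-grid stands in for Z3, and all slot and candidate data are invented:</p>

```python
from itertools import product

# slot -> list of (row, col) cells it occupies: two across, one down
slots = {
    "1A": [(0, 0), (0, 1), (0, 2)],
    "2A": [(1, 0), (1, 1), (1, 2)],
    "1D": [(0, 0), (1, 0)],
}
# hypothetical retrieved candidate answers per clue
candidates = {
    "1A": ["ape", "ora"],
    "2A": ["remo", "rete", "red"],  # wrong-length words get rejected
    "1D": ["or", "ab"],
}

def consistent(assignment):
    cells = {}
    for slot, word in assignment.items():
        if len(word) != len(slots[slot]):
            return False                      # length constraint
        for cell, ch in zip(slots[slot], word):
            if cells.setdefault(cell, ch) != ch:
                return False                  # crossing-equality constraint
    return True

def solve():
    names = list(slots)
    # disjunction over candidates per slot, conjunction across slots
    for words in product(*(candidates[s] for s in names)):
        assignment = dict(zip(names, words))
        if consistent(assignment):
            return assignment
    return None

print(solve())  # {'1A': 'ora', '2A': 'red', '1D': 'or'}
```

An SMT solver explores the same conjunction of per-slot disjunctions far more efficiently than this exhaustive product, which is why Z3 scales to real grids.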
      <sec id="sec-2-11">
        <title>3.2. Main results (continued)</title>
        <p>Interestingly, incorporating dictionary data into the training set yields only moderate overall gains on crossword clues and does not significantly impact the results, further emphasizing that definitions and crossword clues originate from different linguistic distributions.</p>
        <p>Dictionary. All models, and especially baselines, severely drop in performance when dealing with dictionary data. Furthermore, the ranking changes: IT5-base obtains higher results than the multilingual MPNet-base, despite having half of the parameters. As expected, enhancing the training set with dictionary samples yields substantial gains across all models; in particular, the MPNet-base improves more than the IT5-base, resulting in similar scores for both models.</p>
        <p>ONLI. For ONLI neologisms, all c2c baselines continue to decline, while c2s-MPNet gains significantly with respect to crossword clues and dictionary definitions. IT5-base achieves the best results, with a substantial gap from the MPNet-base. As in the dictionary setting, augmenting the dataset with dictionary definitions yields improvements, although more moderate. ONLI neologisms are retrieved better than dictionary words, even when augmenting the dataset. One hypothesis for this phenomenon is that crossword clues are more aligned with the definitions of neologisms, as they may reflect similar linguistic strategies: both crossword clues, particularly those involving wordplay, and journalistic neologism definitions often rely on compositionality. For example, clues such as "half a dance" or "prefix meaning new" require the decomposition and reinterpretation of word parts, similarly to how many neologisms in ONLI are defined through transparent compounds or affix-based constructions (e.g., mafiocracy = mafia + -cracy). This shared reliance on compositionality may partially explain why models trained on crossword clues generalize better to ONLI neologisms than to standard dictionary definitions, which are often more rigid and semantically grounded.</p>
        <p>Neologisms. Models perform poorly in this setting. However, they still widely outperform all c2c baselines, which are almost fully incapable of retrieving correct answers. Interestingly, the simple c2s-MPNet approach yields strong results, achieving top Acc@1 and Acc@1* scores. Overall, IT5-base achieves the best results, beating the c2s baseline from Acc@10 onwards, followed by the multilingual MPNet-base. As for ONLI and Dictionary, all models benefit substantially from training on dictionary definitions and, especially, the MPNet-base in this configuration becomes the top performer in terms of Acc@10*, Acc@100, Acc@100* and Acc@1000.</p>
      </sec>
      <sec id="sec-3-2-1">
        <title>3.2.1. Discussion</title>
        <p>Overall, we observe an interesting trend concerning baselines: while all c2c (clues to clues) approaches perform reasonably well on crosswords, their performance drastically drops when dealing with dictionary terms and neologisms. On the other hand, the c2s-MPNet baseline, which directly confronts clues and solutions during retrieval, exhibits an inverse trend, performing better with definition-like clues than with crossword clues. These results further corroborate the hypothesis that clues and definitions have a different relation to target words: words and definitions are more semantically aligned, from a distributional point of view, than crossword clues and solutions. Furthermore, the extremely low performance of c2c baselines on neologisms confirms that clues-to-clues mappings are insufficient to handle lexical innovation in crossword puzzles. This supports our initial motivation for a joint latent space that leverages rich distributed representations, enabling the modeling of unseen clues and solutions for the task of crossword retrieval. Finally, the majority of our trained systems achieved better results than baselines on crossword clues, with the biggest and multilingual model, MPNet-base, achieving the best results, closely followed by the IT5-base. For neologisms in particular, the better performance of the monolingual IT5-base encoder despite its smaller parameter count suggests that language-specific training might benefit retrieval in domains heavily influenced by culture and language-specific lexical innovation dynamics.</p>
      </sec>
      <sec id="sec-4">
        <title>4. Analysis and Applications</title>
        <p>This section provides further explorations of applications and properties of our crossword embedding systems.</p>
        <p>Examples Analysis. Table 4 reports some examples of the Top-2 retrieved answers across baselines, models and test sets. For this purpose, we manually selected cases showing the limitations of traditional baselines, e.g. crossword clues carrying a non-literal meaning. For example, the cryptic-style clue "Lido senza pari" (transl. "Beach without even") requires interpreting even as referring to the characters in even positions inside the word lido. Baselines do not capture this meaning nuance, while some of our models arrive at the correct solution, despite the well-known problem of character awareness in character-blind models [32, 33]. Another interesting case involves neologisms: baselines are unable to retrieve the correct answers since they represent a fringe minority in the available pool of definitions and solutions. On the other hand, our models, especially the monolingual IT5, show signs of generalization and were able to retrieve the correct answers despite not being trained on them.</p>
        <p>Automated Crossword Solving. Despite not being the main focus of this article, we tried to leverage our system to automatically solve crossword puzzles as a concrete application of crossword clue answering. Figures 1 and 3 show an example of a crossword puzzle, built entirely from clues in the test sets, automatically filled using the Z3 SMT (Satisfiability Modulo Theories) solver [34], leveraging candidates retrieved by the MPNet-base model. Specifically, by treating crossword puzzles as a satisfiability problem, we can define a set of first-order logical constraints that must be satisfied across all variables (grid cells) to find valid solutions: each clue corresponds to a sequence of grid variables constrained to match one of its candidate answers, forming a disjunctive (OR) group. These candidate-level constraints are then combined conjunctively (AND) across all clues. Additionally, for intersecting cells, equality constraints are enforced to ensure character consistency between overlapping horizontal and vertical words. The final formula, composed of these conjunctive and disjunctive logical statements, is passed to the solver, which searches for a globally consistent solution that satisfies all constraints simultaneously. Despite the complexity of this approach, which requires that each candidate set contains the correct solution, our biggest model, MPNet-base, was able to entirely solve some small-to-medium grids using a candidate size 10 ≤ k ≤ 50, confirming the effectiveness of our system. We posit that a strategy iterating Z3 solving attempts over progressively larger candidate sizes could provide a strong baseline for crossword solving systems with a given computational budget, and we leave such an assessment to future work.</p>
      </sec>
      <sec id="sec-5">
        <title>5. Conclusion and Future Work</title>
        <p>In this work, we introduced and evaluated dual encoder architectures for retrieving solutions of Italian crossword clues by learning a shared latent space between clues and solutions.</p>
      </sec>
    <p>Our experiments demonstrated that the Asymmetric Dual Encoder (ADE) architecture, with its independent encoders for clues and solutions, outperformed the Siamese Dual Encoder (SDE) in handling the nuanced and often non-literal relationships characteristic of crossword puzzles. Our results also highlighted the limitations of traditional retrieval-based approaches (e.g., clues-to-clues methods), particularly when testing their generalization towards neologisms' definitions. In contrast, our dual encoder-based models, especially the larger and multilingual MPNet-base and the monolingual IT5-base, exhibited signs of generalization across diverse linguistic settings, including newly coined terms and culturally specific references. This underscores the importance of leveraging rich distributed representations to model the complex interplay between clues and solutions.</p>
    <p>In future work, it could be interesting to explore ensemble methods that combine traditional information retrieval approaches with dual encoder models, including clues-to-clues retrieval techniques, to leverage their complementary strengths. Training a cross-encoder reranker on top of retrieved candidate solutions may also prove beneficial, as it would enable the exploitation of contextual relationships between clues and solutions, an approach that is standard in retrieval-based systems. Moreover, conducting a detailed linguistic analysis of clues, examining categories, frequency distributions, and other properties, could provide deeper insights into their characteristics. Finally, extending the methodology toward an automatic completion system for crossword puzzle grids represents a promising direction for supporting full puzzle solving.</p>
    <p>For the automated solving experiments, we partially modified the implementation found at https://github.com/pncnmnp/Crossword-Solver.</p>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the FAIR - Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU, the PRIN PNRR 2022 Project EKEEL - Empowering Knowledge Extraction to Empower Learners (P20227PEPK) and the XAI-CARE-PNRR-MAD-2022-12376692 project under the NRRP MUR program funded by the NextGenerationEU. Partial support was also received by the project "Understanding and Enhancing Preference Alignment in Large Language Models Through Controlled Text Generation" (IsCc8_ALIGNLLM), funded by CINECA under the ISCRA initiative, for the availability of HPC resources and support.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] C. Bosco, E. Ježek, M. Polignano, M. Sanguinetti, Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), in: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), 2025.
[2] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13–26. doi:10.1109/TCIAIG.2014.2355859.
[3] R. Manna, M. P. di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz, S. Lukin (Eds.), Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97–106. URL: https://aclanthology.org/2024.games-1.11.
[4] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 11574–11591. URL: https://aclanthology.org/2024.emnlp-main.646. doi:10.18653/v1/2024.emnlp-main.646.
[5] G. Sarti, T. Caselli, M. Nissim, A. Bisazza, Non verbis, sed rebus: Large language models are weak solvers of Italian rebuses, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 888–897. URL: https://aclanthology.org/2024.clicit-1.96/.
[6] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073–3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[7] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409–11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[8] S. Saha, S. Chakraborty, S. Saha, U. Garain, Language models are crossword solvers, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings
of the 2025 Conference of the Nations of the Amer- F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli
icas Chapter of the Association for Computational (Eds.), Proceedings of the 9th Italian Conference on
Linguistics: Human Language Technologies (Vol- Computational Linguistics, Venice, Italy,
Novemume 1: Long Papers), Association for Computa- ber 30 - December 2, 2023, volume 3596 ofCEUR
tional Linguistics, Albuquerque, New Mexico, 2025, Workshop Proceedings, CEUR-WS.org, 2023. URL:
pp. 2074–2090. URL: https://aclanthology.org/2025. https://ceur-ws.org/Vol-3596/paper9.p.df
naacl-long.104./ [16] D. Gillick, A. Presta, G. S. Tomar, End-to-end
[9] A. Sadallah, D. Kotova, E. Kochmar, What makes retrieval in continuous space, arXiv preprint
cryptic crosswords challenging for LLMs?, in: arXiv:1811.08008 (2018).</p>
        <p>O. Rambow, L. Wanner, M. Apidianaki, H. Al- [17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Khalifa, B. D. Eugenio, S. Schockaert (Eds.), Pro- Pre-training of deep bidirectional transformers for
ceedings of the 31st International Conference on language understanding, in: J. Burstein, C.
DoComputational Linguistics, Association for Com- ran, T. Solorio (Eds.), Proceedings of the 2019
Conputational Linguistics, Abu Dhabi, UAE, 2025, ference of the North American Chapter of the
Aspp. 5102–5114. URL: https://aclanthology.org/2025. sociation for Computational Linguistics: Human
coling-main.342/. Language Technologies, Volume 1 (Long and Short
[10] M. Ernandes, G. Angelini, M. Gori, We- Papers), Association for Computational Linguistics,
bcrow: A web-based system for crossword solv- Minneapolis, Minnesota, 2019, pp. 4171–4186. URL:
ing, in: AAAI Conference on Artificial Intelligence, https://aclanthology.org/N19-14.2d3o/i:10.18653/
2005. URL: https://link.springer.com/chapter/10. v1/N19-1423.</p>
        <p>1007/11590323_37. [18] Z. Dong, J. Ni, D. Bikel, E. Alfonseca, Y. Wang,
[11] G. Angelini, M. Ernandes, M. Gori, Solving ital- C. Qu, I. Zitouni, Exploring dual encoder
archiian crosswords using the web, in: International tectures for question answering, in: Y.
GoldConference of the Italian Association for Artificial berg, Z. Kozareva, Y. Zhang (Eds.), Proceedings
Intelligence, 2005. URLh:ttps://link.springer.com/ of the 2022 Conference on Empirical Methods
chapter/10.1007/11558590_40. in Natural Language Processing, Association for
[12] G. Barlacchi, M. Nicosia, A. Moschitti, A retrieval Computational Linguistics, Abu Dhabi, United
model for automatic resolutionof crossword puz- Arab Emirates, 2022, pp. 9414–9419. URL:https:
zles in italian language, in: Proceedings of the //aclanthology.org/2022.emnlp-main.64.0d/oi:10.
First Italian Conference on Computational Linguis- 18653/v1/2022.emnlp-main.640.
tics CLiC-it 2014 &amp; and of the Fourth Internation[a1l9] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini,
Workshop EVALITA 2014: 9-11 December 2014, L. Rigutini, M. Maggini, M. Gori, Italian crossword
Pisa, Pisa University Press, 2014, pp. 33–37. generator: Enhancing education through
interac[13] A. Moschitti, M. Nicosia, G. Barlacchi, SACRY: tive word puzzles, in: Proceedings of the 9th Italian
Syntax-based automatic crossword puzzle resolu- Conference on Computational Linguistics (CLiC-it
tion sYstem, in: H.-H. Chen, K. Markert (Eds.), 2023), 2023. URL: https://ceur-ws.org/Vol-35.96
Proceedings of ACL-IJCNLP 2015 System Demon- [20] C. Ciaccio, A. Miaschi, F. Dell’Orletta, Evaluating
strations, Association for Computational Linguis- lexical proficiency in neural language models, in:
tics and The Asian Federation of Natural Language Proceedings of the 63rd Annual Meeting of the
AsProcessing, Beijing, China, 2015, pp. 79–84. URL: sociation for Computational Linguistics (Volume
https://aclanthology.org/P15-40.1d4o/i:10.3115/ 1: Long Papers), Association for Computational
v1/P15-4014. Linguistics, Vienna, Austria, 2025.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, [21] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for
G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, Italian language understanding and generation, in:
J. Clark, G. Krueger, I. Sutskever, Learning trans- N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti,
ferable visual models from natural language su- N. Xue (Eds.), Proceedings of the 2024 Joint
Inpervision, in: M. Meila, T. Zhang (Eds.), Pro- ternational Conference on Computational
Linguisceedings of the 38th International Conference tics, Language Resources and Evaluation
(LRECon Machine Learning, volume 139 ofProceed- COLING 2024), ELRA and ICCL, Torino, Italia, 2024,
ings of Machine Learning Research, PMLR, 2021, pp. 9422–9433. URL: https://aclanthology.org/2024.
pp. 8748–8763. URL: https://proceedings.mlr.press/ lrec-main.823./
v139/radford21a.htm.l [22] L. Xue, N. Constant, A. Roberts, M. Kale, R.
Al[15] F. Bianchi, G. Attanasio, R. Pisoni, S. Terragni, Rfou, A. Siddhant, A. Barua, C. Rafel, mT5:
G. Sarti, D. Balestri, Contrastive language- A massively multilingual pre-trained text-to-text
image pre-training for the italian language, in: transformer, in: Proceedings of the 2021
Conference of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, Association for
Computational Linguistics, Online, 2021, pp. 483–498.</p>
        <p>URL: https://aclanthology.org/2021.naacl-mai.n.41
doi:10.18653/v1/2021.naacl-main.41.
[23] B. Warner, A. Chafin, B. Clavié, O. Weller, O.
Hallström, S. Taghadouini, A. Gallagher, R. Biswas,
F. Ladhak, T. Aarsen, et al., Smarter, better, faster,
longer: A modern bidirectional encoder for fast,
memory eficient, and long context finetuning and Figure 4: Solution for the autonomously solved crossword
inference, arXiv preprint arXiv:2412.13663 (2024). puzzle in Figure 1.
[24] N. Reimers, I. Gurevych, Sentence-BERT: Sentence
embeddings using Siamese BERT-networks, in: the 7th International Joint Conference on Natural
K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of Language Processing (Volume 2: Short Papers),
Asthe 2019 Conference on Empirical Methods in Nat- sociation for Computational Linguistics, Beijing,
ural Language Processing and the 9th International China, 2015, pp. 199–204. URL: https://aclanthology.
Joint Conference on Natural Language Processing org/P15-2033/. doi:10.3115/v1/P15-2033.
(EMNLP-IJCNLP), Association for Computational[32] L. Edman, H. Schmid, A. Fraser, CUTE:
MeaLinguistics, Hong Kong, China, 2019, pp. 3982–3992. suring LLMs’ understanding of their tokens, in:
URL: https://aclanthology.org/D19-14.10d/oi:10. Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.),
Pro18653/v1/D19-1410. ceedings of the 2024 Conference on Empirical
[25] L. Bonifacio, I. Campiotti, R. de Alencar Lotufo, Methods in Natural Language Processing,
AsR. F. Nogueira, mmarco: A multilingual version sociation for Computational Linguistics, Miami,
of MS MARCO passage ranking dataset, CoRR Florida, USA, 2024, pp. 3017–3026. URL: https:
abs/2108.13897 (2021). URL: https://arxiv.org/abs/ //aclanthology.org/2024.emnlp-main.17.7d/oi:10.
2108.13897. arXiv:2108.13897. 18653/v1/2024.emnlp-main.177.
[26] N. Reimers, I. Gurevych, Making monolingual [33] C. Ciaccio, M. Sartor, A. Miaschi, F. Dell’Orletta,
Besentence embeddings multilingual using knowl- yond the spelling miracle: Investigating substring
edge distillation, in: B. Webber, T. Cohn, Y. He, awareness in character-blind language models, in:
Y. Liu (Eds.), Proceedings of the 2020 Conference Proceedings of the 63rd Annual Meeting of the
Ason Empirical Methods in Natural Language Process- sociation for Computational Linguistics (Volume
ing (EMNLP), Association for Computational Lin- 1: Long Papers), Association for Computational
guistics, Online, 2020, pp. 4512–4525. URL:https: Linguistics, Vienna, Austria, 2025.
//aclanthology.org/2020.emnlp-main.36.5d/oi:10. [34] L. de Moura, N. Bjørner, Z3: An eficient smt solver,
18653/v1/2020.emnlp-main.365. in: C. R. Ramakrishnan, J. Rehof (Eds.), Tools and
[27] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Con- Algorithms for the Construction and Analysis of
trastive learning with hard negative samples, Inter- Systems, Springer Berlin Heidelberg, Berlin,
Heinational Conference on Learning Representations delberg, 2008, pp. 337–340.</p>
        <p>(2021).
[28] I. Loshchilov, F. Hutter, Decoupled weight decay
regularization, arXiv preprint arXiv:1711.0510A1 . Solved crossword puzzle
(2017).
[29] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szil- Figure4 report the solution of the crossword presented
vasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, in Figure1.</p>
        <p>The faiss library, arXiv preprint arXiv:2401.08281
(2024).
[30] A. Zugarini, M. Ernandes, A multi-strategy apB- . Further details on the
proach to crossword clue answer retrieval and rank- hyperparameters
ing, in: CLiC-it, 2021.
[31] A. Severyn, M. Nicosia, G. Barlacchi, A. MoschittiB, oth the siamese and asymmetric architectures were
deDistributional neural networks for automatic resisgon-ed using PyTorch and the training was conducted on
lution of crossword puzzles, in: C. Zong, M. Strubetwo Nvidia GeForce RTX 4090 GPUs. For the
asymmet(Eds.), Proceedings of the 53rd Annual Meeting ofric architecture we leverage parallelization by assigning
the Association for Computational Linguistics anedach encoder to a diferent GPU. Each model was trained
to produce representations of dimensionality equals to
768. We used the default betas andAdamW parameters.</p>
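A minimal sketch of this setup is given below, assuming PyTorch. The toy bag-of-embeddings encoders and all names here are illustrative stand-ins, not the paper's actual Transformer encoders; the per-encoder GPU assignment falls back to CPU when two GPUs are unavailable.

```python
import torch
from torch import nn

EMB_DIM = 768  # dimensionality of the shared latent space

class ToyEncoder(nn.Module):
    """Stand-in encoder: embedding lookup + mean pooling over tokens."""
    def __init__(self, vocab_size=1000, dim=EMB_DIM):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, ids):          # ids: (batch, seq_len)
        return self.emb(ids).mean(dim=1)  # (batch, dim)

class AsymmetricDualEncoder(nn.Module):
    """Two separate encoders, one for clues and one for solutions."""
    def __init__(self):
        super().__init__()
        # One encoder per GPU when two are available (as in the appendix),
        # otherwise everything stays on CPU.
        two_gpus = torch.cuda.device_count() >= 2
        self.dev_clue = torch.device("cuda:0" if two_gpus else "cpu")
        self.dev_sol = torch.device("cuda:1" if two_gpus else "cpu")
        self.clue_enc = ToyEncoder().to(self.dev_clue)
        self.sol_enc = ToyEncoder().to(self.dev_sol)

    def forward(self, clue_ids, sol_ids):
        # L2-normalised representations, so similarity is a dot product.
        c = nn.functional.normalize(self.clue_enc(clue_ids.to(self.dev_clue)), dim=-1)
        s = nn.functional.normalize(self.sol_enc(sol_ids.to(self.dev_sol)), dim=-1)
        return c, s.to(c.device)

model = AsymmetricDualEncoder()
opt = torch.optim.AdamW(model.parameters())  # default betas and weight decay

clues = torch.randint(0, 1000, (4, 64))  # max clue length: 64 tokens
sols = torch.randint(0, 1000, (4, 16))   # max solution length: 16 tokens
c, s = model(clues, sols)
print(c.shape, s.shape)
```

Moving the solution embeddings back to the clue device at the end keeps the in-batch similarity matrix (and gradients) on a single device.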
        <p>Table 5 reports the specific hyperparameters used with each model. Due to limited computational resources, we did not perform an extensive hyperparameter optimization; rather, we relied on the configurations suggested by the models’ creators. The maximum token lengths for clues and solutions were set to 64 and 16, respectively.</p>
        <p>The learnable temperature parameter was initialized to the equivalent of 0.07 and clipped as done in the CLIP paper. During batch generation, in order to avoid false negatives during hard batch mining, each batch cannot contain the same solution two or more times.</p>
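These two constraints can be sketched as follows. The helper names and the greedy batching strategy are illustrative assumptions, since the exact procedure is not spelled out here; the temperature follows the CLIP convention of learning the log of the logit scale, initialized to log(1/0.07) and clipped at 100.

```python
import math
import random

# CLIP-style learnable temperature: the log logit scale is initialized to
# log(1/0.07) and clipped so the scale never exceeds 100.
log_temp = math.log(1 / 0.07)   # initial value of the learnable parameter
MAX_LOG_TEMP = math.log(100.0)  # clipping bound used in the CLIP paper

def clipped_scale(log_temp):
    """Return the (clipped) multiplicative logit scale."""
    return math.exp(min(log_temp, MAX_LOG_TEMP))

def make_batches(pairs, batch_size, seed=0):
    """Greedy batching that never repeats a solution within a batch,
    avoiding false negatives during in-batch hard negative mining."""
    rng = random.Random(seed)
    pool = list(pairs)
    rng.shuffle(pool)
    batches = []
    while pool:
        batch, seen, rest = [], set(), []
        for clue, sol in pool:
            if sol not in seen and len(batch) < batch_size:
                batch.append((clue, sol))
                seen.add(sol)
            else:
                rest.append((clue, sol))
        batches.append(batch)
        pool = rest
    return batches

# Hypothetical clue/solution pairs with a repeated solution ("ROMA").
pairs = [("Capitale d'Italia", "ROMA"), ("Città eterna", "ROMA"),
         ("Fiume di Firenze", "ARNO"), ("Vocale ripetuta", "AA")]
for b in make_batches(pairs, batch_size=3):
    assert len({sol for _, sol in b}) == len(b)  # no duplicate solutions
```

The repeated solution is deferred to a later batch rather than dropped, so every training pair is still seen each epoch.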
        <p>During training, we kept track of the model’s performance on the validation dataset and picked the checkpoint with the lowest validation loss.</p>
      </sec>
      <sec id="sec-2-13">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly to draft content, paraphrase and reword, improve the writing style, check grammar and spelling, and assist with formatting. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>