<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Manifold Learning for Italian Crosswords and Beyond</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristiano Ciaccio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriele Sarti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Language and Cognition (CLCG), University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “A. Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Answering crossword puzzle clues presents a challenging retrieval task that requires matching linguistically rich and often ambiguous clues with appropriate solutions. While traditional retrieval-based strategies can commonly be used to address this issue, wordplay and other lateral thinking strategies limit the effectiveness of conventional lexical and semantic approaches. In this work, we address the clue answering task as an information retrieval problem, exploiting the potential of encoder-based Transformer models to learn a shared latent space between clues and solutions. In particular, we propose for the first time a collection of siamese and asymmetric dual encoder architectures trained to capture the complex properties and relations characterizing crossword clues and their solutions for the Italian language. After comparing various architectures for this task, we show that the strong retrieval capabilities of these systems extend to neologisms and dictionary terms, suggesting their potential use in linguistic analyses beyond the scope of language games.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Games</kwd>
        <kwd>Crosswords</kwd>
        <kwd>Semantic Similarity</kwd>
        <kwd>Embeddings</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <p>Language games have emerged as compelling benchmarks for evaluating the reasoning capabilities of language models (LMs), offering structured challenges that require diverse cognitive skills, including wordplay comprehension, lateral thinking, and cultural knowledge integration [2, 3, 4, 5]. Among popular language games, crossword puzzles stand out as particularly challenging, demanding not only linguistic competence but also extensive world knowledge, cultural awareness, and lateral thinking skills [6, 7, 8, 9].</p>
      <p>ORCID: 0009-0001-6113-4761 (C. Ciaccio); 0000-0001-8715-2987 (G. Sarti); 0000-0002-0736-5411 (A. Miaschi); 0000-0003-3454-9387 (F. Dell'Orletta). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License.</p>
      <p>Figure 1: An example of a symmetric-style crossword puzzle. The grid was populated using clues taken from the test set. The correct solution, which was autonomously found leveraging our system, is in Appendix A.</p>
      <p>While recent advances in Large Language Models have shown impressive performance on many natural language understanding tasks, their effectiveness on language games remains constrained by fundamental limitations in accessing linguistic and culturally-relevant knowledge, in particular for less-resourced non-English languages [5].</p>
      <p>Before the advent of modern language models, most approaches to crossword solving relied on retrieval-based methods and shallow lexical and semantic features to identify relevant information [10, 11]. For example, [12] proposed a retrieval model that exploited lexical resources and similarity metrics to match clues to candidate answers in Italian. In a subsequent work [13], the authors introduced SACRY, a system that incorporated syntactic information and ranking strategies to improve clue-answer matching. Importantly, fill-in-the-blank clues and clues representing anagrams or linguistic games, including the use of wordplay, homophones and other devices, are often omitted by such systems: for instance, the clue "procedimenti lenti" plays on the polysemanticity of lenti (in Italian, either "slow", masc. plur., or "lenses"), and could have ottici (opticians) as a valid solution. These kinds of subtle connections hinder the viability of traditional retrieval systems in the context of crossword games.</p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073).</p>
      <p>Recent advances in cross-modal learning, particularly in vision-language models such as CLIP [14, 15], have demonstrated the effectiveness of dual encoder architectures in learning shared representations across different modalities. These approaches typically employ separate encoders for each modality, training them to project inputs into a common latent space where semantically related items cluster together. Inspired by these successes, we propose adapting this paradigm to the domain of language games, specifically focusing on the relationship between crossword clues and their solutions.</p>
      <p>In this work, we evaluate several dual encoder architectures designed to learn effective representations for crossword puzzle elements (see Figure 1 for an example of a crossword puzzle). Our approaches treat clues and solutions as distinct "modalities" that can be embedded into a shared latent space. The clue encoder must understand various forms of wordplay, cultural references, and linguistic devices, while the solution encoder must represent semantic, lexical and grammatical characteristics of the words. By training these encoders jointly with a contrastive objective, we create a retrieval system specifically optimized for the complexities of crossword puzzles. Our contributions are threefold: (1) we formalize the problem of specialized retrieval for language games and demonstrate the limitations of generic retrieval approaches in this domain; (2) we introduce and evaluate multiple dual encoder architectures tailored for Italian crossword puzzles, exploring different design choices and training strategies; (3) we demonstrate the utility of our learned representations for solution ranking and explore their generalization capabilities to neologisms. Our experimental results show that domain-specific models significantly outperform generic alternatives, suggesting that specialized retrieval mechanisms are essential for effectively ranking plausible alternatives in this domain.</p>
      <p>In the following sections, we describe in detail the architecture of our model (Section 2.1), the datasets used for the experiments (Section 2.2), the encoder models employed (Section 2.3), the experimental setting (Section 2.4), and the evaluation strategy adopted to assess the system's performance (Section 2.5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Approach</title>
      <p>Our approach formalizes crossword clue answering as an information retrieval problem. Given a clue q from the set Q = {q1, …, qn} and a matching solution w from the finite set of all available solution words W = {w1, …, wm}, our system scores the similarity of a subset of candidates w* ∈ W with q to produce a similarity-based ranking. Inspired by CLIP's approach [14], we opted for a dual encoder architecture [16], composed of two pre-trained transformer encoders [17] (referred to as towers) which are fine-tuned on clue-solution pairs with a contrastive learning objective to learn a joint embedding space between clues and words.</p>
      <sec id="sec-2-1">
        <title>2.1. Model's Architecture</title>
        <p>To explore the effectiveness of our approach, we experiment with different encoder-based models for initializing the encoder towers, each fine-tuned and tested on a dataset of Italian crossword clues. As shown by Dong et al. [18], to effectively learn a shared parameter space using a dual encoder, there are two main architectural options: (a) the Siamese Dual Encoder (SDE) and (b) the Asymmetric Dual Encoder (ADE) with a shared linear projection. Both consist of two pre-trained Transformer encoders, in our case a clue encoder f1 and a solution encoder f2, trained to produce representations c = f1(q) and s = f2(w) by average pooling, where both c, s ∈ ℝ^d. These are linearly projected into a shared feature space ℝ^k in order to maximize the cosine similarity between positive pairs (c, s+) and minimize it for negative ones (c, s−). The distinction between SDE and ADE lies in the parameter sharing: while in SDE the two encoders f1 and f2 have tied parameters (θ1 = θ2), in ADE the two encoder towers have untied parameters (θ1 ≠ θ2) but share a final layer norm and the linear transformation W: ℝ^d → ℝ^k, which is essential to achieve an effectively shared space. Having separate encoders can be advantageous when modeling different modalities and distributions, since it allows the two encoders to specialize independently on the specific nuances of the input types they process. To assess which of the two architectures is better suited for our task, we conduct preliminary experiments on both and compare their results in Section 3.1.</p>
      </sec>
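      <p>As an illustrative sketch (not the released implementation), the SDE/ADE distinction of Section 2.1 can be reduced to toy NumPy matrices standing in for the pre-trained towers: tied parameters for SDE, untied parameters for ADE, with a shared projection W in both cases. All sizes and values below are invented.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4  # encoder hidden size and shared-space size (toy values)

def mean_pool(token_states):
    # average pooling over the token axis, as in the paper
    return token_states.mean(axis=0)

# toy stand-ins for the two pre-trained towers: a single linear map each
theta_1 = rng.normal(size=(d, d))  # clue tower parameters
theta_2 = rng.normal(size=(d, d))  # solution tower parameters
W = rng.normal(size=(d, k))        # shared linear projection to R^k

def encode(token_states, theta):
    # pool, apply the tower, then project into the shared space
    return mean_pool(token_states) @ theta @ W

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

clue_tokens = rng.normal(size=(5, d))  # 5 "token states" for a clue
word_tokens = rng.normal(size=(2, d))  # 2 "token states" for a solution

# SDE: tied towers (same theta); ADE: untied towers, shared W
c_sde, s_sde = encode(clue_tokens, theta_1), encode(word_tokens, theta_1)
c_ade, s_ade = encode(clue_tokens, theta_1), encode(word_tokens, theta_2)

print(round(cosine(c_sde, s_sde), 3), round(cosine(c_ade, s_ade), 3))
```

Training would then push cosine(c, s) up for positive pairs and down for negatives, as described above.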
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <p>For training our dual encoders, we employ the ItaCW crossword dataset [19], containing 125k unique definition-word pairs. We expand this collection with additional clue-solution pairs found on the web, and deduplicate the resulting set of entries, obtaining a total of 416,407 samples. Code, models and datasets are released at https://github.com/snizio/Crossword-Space.</p>
        <p>In addition to the original crossword dataset, to evaluate the out-of-distribution performance of our system we also consider word-definition pairs automatically extracted from the Italian Wiktionary, neologisms from the ONLI (Osservatorio Neologico della Lingua Italiana, https://www.iliesi.cnr.it/ONLI/), and the 100-neologisms dataset. The usage of dictionary data is twofold: (a) to understand whether augmenting the train set with word-definition pairs can enhance downstream performance on the crossword data; and (b) to assess the extent to which models trained on word-clue pairs can be used to answer dictionary definitions. Since some word-definition pairs maintain the same inferential relation that occurs for most clue-solution pairs (excluding nuanced and specific crossword cases), augmenting the dataset with these resources allows us to assess the performance variations and generalization to different linguistic settings that exhibit the same input-output structure of crosswords, offering a natural extension to the main dataset. When augmenting the dataset with dictionary definitions, all inflected forms are dropped.</p>
        <p>On the other hand, the ONLI and the 100-neologisms datasets will be used to test the robustness and generalization of our systems, therefore simulating a scenario where a novel term appears in a crossword, as is often the case. The ONLI covers a wide range of neologisms appearing in national and local newspapers, thus strictly related to Italian culture, including newly coined or derived formations, internationalisms, foreignerisms, technical terms and some authorial neologisms until 2019, while the 100-neologisms dataset consists of lemmas extracted from various online dictionaries (lexicalized after 2020) that focus mostly on politics and COVID-19 social dynamics and contain several foreignerisms. After merging all the data sources, we split the resulting dataset into 90% train, 5% validation and 5% test (see Table 1).</p>
      </sec>
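      <p>A minimal sketch of the deduplication and 90/5/5 split described above, with invented toy pairs in place of the real ItaCW and web-crawled data:</p>

```python
import random

# invented toy clue-solution pairs standing in for ItaCW + web data;
# the raw list contains exact duplicates, as the merged collection does
pairs = [(f"definizione {i}", f"parola{i}") for i in range(100)] * 2

deduped = sorted(set(pairs))  # deduplicate the merged collection
random.seed(13)
random.shuffle(deduped)

n = len(deduped)
train = deduped[:int(0.90 * n)]               # 90% train
valid = deduped[int(0.90 * n):int(0.95 * n)]  # 5% validation
test = deduped[int(0.95 * n):]                # 5% test
print(len(train), len(valid), len(test))      # 90 5 5
```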
      <sec id="sec-2-3">
        <title>2.3. Models</title>
        <p>As backbone models, we choose several pre-trained encoders available for the Italian language, varying in parameter size and pre-training approach. Specifically, we picked the encoders of IT5-small (35M) and IT5-base (110M) from the IT5 family [21] of encoder-decoders pre-trained on the Italian cleaned split of the mC4 [22]; Italian-ModernBERT-base (135M, DeepMount00/Italian-ModernBERT-base) and Italian-ModernBERT-base-embed-mmarco-triplet (135M, nickprock/Italian-ModernBERT-base-embed-mmarco-triplet), both based on the ModernBERT architecture [23] and pretrained on Italian, with the latter being finetuned in a sentence-transformer fashion [24] on the mMARCO dataset [25]; lastly, we employed paraphrase-multilingual-mpnet-base-v2 [26] (278M, sentence-transformers/paraphrase-multilingual-mpnet-base-v2), a multilingual model based on XLM-RoBERTa already tuned as a sentence embedder.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Experimental setting</title>
        <p>We begin by comparing ADE and SDE architectures to assess the optimal approach for our clue answering task. Subsequently, each model is trained across two dataset configurations: the first one consists of using only a subset of the crossword dataset as the training set, while the second one also introduces a split of the Italian Wiktionary in the training data. On the other hand, the evaluation is always performed on a held-out test set composed of crossword clues, dictionary, ONLI and 100-neologisms test sets.</p>
        <p>We train our SDE and ADE architectures to minimize the symmetric InfoNCE loss used in CLIP [14] with in-batch negatives. During training, for each step, we mine the hard negatives that have the highest similarity to the positive target, where n is the batch size and μ ∈ [0, 1] is a fraction that determines how many of the hardest negatives are kept [27]. Formally, let c_i ∈ ℝ^k be the normalized embedding of the i-th clue, s_i ∈ ℝ^k the normalized embedding of the i-th solution word, τ = exp(t) a learnable temperature parameter, and N_i the indices of the top (n − 1) · μ hardest negatives. The clue-to-solution contrastive loss ℒ_c→s is defined as:</p>
        <p>ℒ_c→s = −(1/n) Σ_{i=1..n} log [ exp(τ · cos(c_i, s_i)) / (exp(τ · cos(c_i, s_i)) + Σ_{j ∈ N_i} exp(τ · cos(c_i, s_j))) ]</p>
        <p>Similarly, the solution-to-clue loss ℒ_s→c is:</p>
        <p>ℒ_s→c = −(1/n) Σ_{i=1..n} log [ exp(τ · cos(s_i, c_i)) / (exp(τ · cos(s_i, c_i)) + Σ_{j ∈ N_i} exp(τ · cos(s_j, c_i))) ]</p>
      </sec>
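      <p>A toy NumPy re-implementation of the symmetric objective with the hard-negative fraction μ may clarify the formulas above; the batch, dimensions and default hyperparameter values are invented for illustration:</p>

```python
import numpy as np

def symmetric_infonce(C, S, tau=10.0, mu=0.5):
    """Symmetric InfoNCE with in-batch negatives, keeping only the
    top (n-1)*mu hardest negatives per anchor (toy re-implementation)."""
    n = C.shape[0]
    C = C / np.linalg.norm(C, axis=1, keepdims=True)  # normalize rows
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = C @ S.T                                     # cosine matrix
    keep = max(1, round((n - 1) * mu))                # negatives kept
    losses = []
    for M in (sim, sim.T):                            # c->s, then s->c
        per_anchor = []
        for i in range(n):
            pos = M[i, i]
            negs = np.delete(M[i], i)
            hardest = np.sort(negs)[::-1][:keep]      # highest similarity
            logits = np.concatenate([[pos], hardest]) * tau
            logits -= logits.max()                    # numerical stability
            p = np.exp(logits)
            per_anchor.append(-np.log(p[0] / p.sum()))
        losses.append(float(np.mean(per_anchor)))
    return 0.5 * (losses[0] + losses[1])              # average of the two

rng = np.random.default_rng(0)
C = rng.normal(size=(8, 16))
S = C + 0.01 * rng.normal(size=(8, 16))  # nearly aligned positive pairs
print(symmetric_infonce(C, S))            # low loss for aligned pairs
```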
      <sec id="sec-2-7">
        <title>2.4.1. Training and inference details</title>
        <p>The final symmetric contrastive loss is the average of the two losses: ℒ = (ℒ_c→s + ℒ_s→c) / 2. The training setup is the same across all models, architectures and dataset configurations. Each model is trained for a maximum of six epochs with a batch size of 256 using AdamW [28] with a linearly decaying learning rate. The hard negatives fraction μ decays linearly during training from 0.8 to 0.05 (for detailed hyperparameters see Appendix B).</p>
        <p>Before the test phase, all available solution words are encoded into their relative embeddings, normalized and stored into a vector database. During inference, for a normalized clue embedding c, the retrieval is performed leveraging the FAISS library [29] by inner product on the stored embedding matrix E ∈ ℝ^{|W| × k}, where |W| = 106,988 is the cardinality of the finite set of available solution words and k is the embedding dimension.</p>
        <p>Baselines. In order to further assess the performance of our models, we include and compare several baselines based on two main approaches: (a) clues to clues (c2c), where, given an input clue, the most similar clues and their corresponding solutions are retrieved from the training set, as commonly done in the crossword solving literature [13, 30, 31]; and (b) clues to solutions (c2s), where solutions are retrieved by directly comparing the given clue against the set of all possible solutions. For c2c we computed the similarity scores between clues using (1) Levenshtein distance (c2c-lev), (2) BM25 (c2c-BM25) and (3) the cosine similarities between clue representations obtained with paraphrase-multilingual-mpnet-base-v2 (c2c-MPNet) as a standalone sentence embedder and without any finetuning. For the c2s baseline, we rank the answers by cosine similarity between the clue and all solutions using, as mentioned before, paraphrase-multilingual-mpnet-base-v2 (c2s-MPNet). To ensure a fair comparison between models and baselines, the c2c retrieval is conducted against the clues in the training set, augmented with dictionary definitions.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.5. Evaluation</title>
        <p>To evaluate the retrieval performance of our trained models, we adopt the following standard metrics. Accuracy@1/10/100/1000 is the accuracy in retrieving the correct solution word given the corresponding clue, considering the top 1/10/100/1000 most similar words retrieved by our system as valid. Mean Reciprocal Rank (MRR) represents how well a system ranks the first relevant result by averaging the reciprocal ranks of the first relevant item across all queries. To simulate a more realistic crossword puzzle solving scenario, we also report metrics for candidate words retrieved from the filtered set W_ℓ ⊆ W containing only words with the same character length ℓ as the target word, formally W_ℓ = {w ∈ W ∣ len(w) = ℓ}. We append an asterisk when reporting metrics that include this filtering process (e.g. Acc@10* or MRR*).</p>
      </sec>
      <sec id="sec-3">
        <title>3. Results</title>
        <p>We begin by comparing the two architectures under evaluation, SDE and ADE, and then report the performance of all tested models on all datasets using the best performing architecture.</p>
        <sec id="sec-3-1">
          <title>3.1. Siamese vs. Asymmetric Encoders</title>
          <p>Table 2: Test results for ADE and SDE architectures across the four tested domains. Top scores per dataset are marked in bold.</p>
          <p>Table 2 reports our test results for the paraphrase-multilingual-mpnet-base-v2 model, the largest we trained, which guided our choice between the siamese and asymmetric architecture variants. Interestingly, the asymmetric architecture shows a substantial gain in performance only for crossword clues, and especially in ranking terms (Acc@1 +13%, MRR +10%), while being outperformed by SDE in all other linguistic settings, although with a narrower gap. We hypothesize that, due to the peculiar inference links that relate clues and target words, an asymmetric architecture could be better at enriching representations with input/output nuances separately, rather than jointly as in SDE models. Indeed, many puzzles feature clues with wordplay intended to be taken metaphorically or in other non-literal senses. For example, a correct answer for the clue "half a dance" might be can (half of the dance named cancan). In this setting, an encoder specialized in enriching the representation of the clue with dance names might be necessary to achieve good performance. On the other hand, for dictionary-like entries, there is no sufficient need to develop uniquely independent representations (as shown by the ADE performance drop), since word-definition pairs are typically symmetric in meaning and structure. In these settings, the same encoder can effectively capture both sides of the pair, benefiting from shared parameters that reinforce semantic alignment. Given that our primary interest in this work lies in crosswords, we adopt the ADE architecture with a shared linear projection for the subsequent evaluations.</p>
        </sec>
        <sec id="sec-3-2">
          <title>3.2. Main results</title>
          <p>Table 3 shows the results of all models across the various test sets.</p>
          <p>Crosswords. MPNet-base, ModernSBert and IT5-base strongly outperform all baselines, especially at higher candidate sizes and when applying length filtering ("*"). Overall, the MPNet-base yields the best result, suggesting that model size has a positive effect on improving task performance. In terms of MRR, ModernSBert is the second-best performer, substantially outperforming its only pre-trained counterpart, ModernBert, underscoring the additional value of using models that have already undergone a sentence finetuning phase for boosting retrieval performance. All baselines leveraging the c2c approach are superior when confronted with IT5-small and ModernBert, especially in terms of MRR.</p>
        </sec>
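        <p>The retrieval-plus-evaluation pipeline of Sections 2.4 and 2.5 (inner-product search over normalized embeddings, Accuracy@k, MRR, and the length-filtered "*" variants) can be sketched as follows; the vocabulary and embeddings are invented, and a brute-force dot product stands in for the FAISS index:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["roma", "cane", "ottici", "lenti", "can"]  # toy solution set
E = rng.normal(size=(len(vocab), 8))
E /= np.linalg.norm(E, axis=1, keepdims=True)       # normalized, as stored

def rank(clue_emb, allowed=None):
    """Vocabulary indices sorted by inner product, optionally filtered."""
    c = clue_emb / np.linalg.norm(clue_emb)
    order = np.argsort(-(E @ c))
    return [int(i) for i in order if allowed is None or int(i) in allowed]

def acc_at_k(ranks, k):
    # fraction of queries whose target appears in the top k
    return sum(r < k for r in ranks) / len(ranks)

def mrr(ranks):
    # mean reciprocal rank over all queries (ranks are 0-based)
    return sum(1.0 / (r + 1) for r in ranks) / len(ranks)

targets = [2, 4]                                    # "ottici", "can"
ranks, ranks_star = [], []
for t in targets:
    clue = E[t] + 0.05 * rng.normal(size=8)         # noisy clue embedding
    ranks.append(rank(clue).index(t))
    same_len = {i for i, w in enumerate(vocab) if len(w) == len(vocab[t])}
    ranks_star.append(rank(clue, allowed=same_len).index(t))  # "*" setting

print(acc_at_k(ranks, 1), acc_at_k(ranks, 3), round(mrr(ranks), 3))
```

Length filtering can only improve (or preserve) a target's rank, which is why the starred metrics are upper-bounded below by the unfiltered ones.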
      </sec>
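      <p>The grid-filling formulation used for the automated-solving experiment (Section 4), i.e. one disjunction over candidate answers per clue, conjoined across clues, with equality constraints at crossings, can be illustrated without an SMT solver. Below, a tiny brute-force search over a hypothetical mini-grid stands in for Z3, and all slot and candidate data are invented:</p>

```python
from itertools import product

# slot -> list of (row, col) cells it occupies: two across, one down
slots = {
    "1A": [(0, 0), (0, 1), (0, 2)],
    "2A": [(1, 0), (1, 1), (1, 2)],
    "1D": [(0, 0), (1, 0)],
}
# hypothetical retrieved candidate answers per clue
candidates = {
    "1A": ["ape", "ora"],
    "2A": ["remo", "rete", "red"],  # wrong-length words get rejected
    "1D": ["or", "ab"],
}

def consistent(assignment):
    cells = {}
    for slot, word in assignment.items():
        if len(word) != len(slots[slot]):
            return False                      # length constraint
        for cell, ch in zip(slots[slot], word):
            if cells.setdefault(cell, ch) != ch:
                return False                  # crossing-equality constraint
    return True

def solve():
    names = list(slots)
    # disjunction over candidates per slot, conjunction across slots
    for words in product(*(candidates[s] for s in names)):
        assignment = dict(zip(names, words))
        if consistent(assignment):
            return assignment
    return None

print(solve())  # {'1A': 'ora', '2A': 'red', '1D': 'or'}
```

An SMT solver explores the same conjunction of per-slot disjunctions far more efficiently than this exhaustive product, which is why Z3 scales to real grids.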
      <sec id="sec-2-11">
        <title>3.2. Main results (continued)</title>
        <p>Interestingly, incorporating dictionary data into the training set yields only moderate overall gains on crossword clues and does not significantly impact the results, further emphasizing that definitions and crossword clues originate from different linguistic distributions.</p>
        <p>Dictionary. All models, and especially baselines, severely drop in performance when dealing with dictionary data. Furthermore, the ranking changes: IT5-base obtains higher results than the multilingual MPNet-base, despite having half of the parameters. As expected, enhancing the training set with dictionary samples yields substantial gains across all models; in particular, the MPNet-base improves more than the IT5-base, resulting in similar scores for both models.</p>
        <p>ONLI. For ONLI neologisms, all c2c baselines continue to decline, while c2s-MPNet gains significantly with respect to crossword clues and dictionary definitions. IT5-base achieves the best results, with a substantial gap from the MPNet-base. As in the dictionary setting, augmenting the dataset with dictionary definitions yields improvements, although more moderate. ONLI neologisms are retrieved better than dictionary words, even when augmenting the dataset. One hypothesis for this phenomenon is that crossword clues are more aligned with the definitions of neologisms, as they may reflect similar linguistic strategies: both crossword clues, particularly those involving wordplay, and journalistic neologism definitions often rely on compositionality. For example, clues such as "half a dance" or "prefix meaning new" require the decomposition and reinterpretation of word parts, similarly to how many neologisms in ONLI are defined through transparent compounds or affix-based constructions (e.g., mafiocracy = mafia + -cracy). This shared reliance on compositionality may partially explain why models trained on crossword clues generalize better to ONLI neologisms than to standard dictionary definitions, which are often more rigid and semantically grounded.</p>
        <p>Neologisms. Models perform poorly in this setting. However, they still widely outperform all c2c baselines, which are almost fully incapable of retrieving correct answers. Interestingly, the simple c2s-MPNet approach yields strong results, achieving top Acc@1 and Acc@1* scores. Overall, IT5-base achieves the best results, beating the c2s baseline from Acc@10 onwards, followed by the multilingual MPNet-base. As for ONLI and Dictionary, all models benefit substantially from training on dictionary definitions and, especially, the MPNet-base in this configuration becomes the top performer in terms of Acc@10*, Acc@100, Acc@100* and Acc@1000.</p>
      </sec>
      <sec id="sec-3-2-1">
        <title>3.2.1. Discussion</title>
        <p>Overall, we observe an interesting trend concerning baselines: while all c2c (clues to clues) approaches perform reasonably well on crosswords, their performance drastically drops when dealing with dictionary terms and neologisms. On the other hand, the c2s-MPNet baseline, which directly confronts clues and solutions during retrieval, exhibits an inverse trend, performing better with definition-like clues than with crossword clues. These results further corroborate the hypothesis that clues and definitions have a different relation to target words: words and definitions are more semantically aligned, from a distributional point of view, than crossword clues and solutions. Furthermore, the extremely low performance of c2c baselines on neologisms confirms that clues-to-clues mappings are insufficient to handle lexical innovation in crossword puzzles. This supports our initial motivation for a joint latent space that leverages rich distributed representations, enabling the modeling of unseen clues and solutions for the task of crossword retrieval. Finally, the majority of our trained systems achieved better results than baselines on crossword clues, with the biggest and multilingual model, MPNet-base, achieving the best results, closely followed by the IT5-base. For neologisms in particular, the better performance of the monolingual IT5-base encoder despite its smaller parameter count suggests that language-specific training might benefit retrieval in domains heavily influenced by culture and language-specific lexical innovation dynamics.</p>
      </sec>
      <sec id="sec-4">
        <title>4. Analysis and Applications</title>
        <p>This section provides further explorations of applications and properties of our crossword embedding systems.</p>
        <p>Examples Analysis. Table 4 reports some examples of the Top-2 retrieved answers across baselines, models and test sets. For this purpose, we manually selected cases showing the limitations of traditional baselines, e.g. crossword clues carrying a non-literal meaning. For example, the cryptic-style clue "Lido senza pari" (transl. "Beach without even") requires interpreting even as referring to the characters in even positions inside the word lido. Baselines do not capture this meaning nuance, while some of our models arrive at the correct solution, despite the well-known problem of character awareness in character-blind models [32, 33]. Another interesting case involves neologisms: baselines are unable to retrieve the correct answers since they represent a fringe minority in the available pool of definitions and solutions. On the other hand, our models, especially the monolingual IT5, show signs of generalization and were able to retrieve the correct answers despite not being trained on them.</p>
        <p>Automated Crossword Solving. Despite not being the main focus of this article, we tried to leverage our system to automatically solve crossword puzzles as a concrete application of crossword clue answering. Figures 1 and 3 show an example of a crossword puzzle, built entirely from clues in the test sets, automatically filled using the Z3 SMT (Satisfiability Modulo Theories) solver [34], leveraging candidates retrieved by the MPNet-base model. Specifically, by treating crossword puzzles as a satisfiability problem, we can define a set of first-order logical constraints that must be satisfied across all variables (grid cells) to find valid solutions: each clue corresponds to a sequence of grid variables constrained to match one of its candidate answers, forming a disjunctive (OR) group. These candidate-level constraints are then combined conjunctively (AND) across all clues. Additionally, for intersecting cells, equality constraints are enforced to ensure character consistency between overlapping horizontal and vertical words. The final formula, composed of these conjunctive and disjunctive logical statements, is passed to the solver, which searches for a globally consistent solution that satisfies all constraints simultaneously. Despite the complexity of this approach, which requires that each candidate set contains the correct solution, our biggest model, MPNet-base, was able to entirely solve some small-to-medium grids using a candidate size 10 ≤ k ≤ 50, confirming the effectiveness of our system. We posit that a strategy iterating Z3 solving attempts over progressively larger candidate sizes could provide a strong baseline for crossword solving systems with a given computational budget, and we leave such an assessment to future work.</p>
      </sec>
      <sec id="sec-5">
        <title>5. Conclusion and Future Work</title>
        <p>In this work, we introduced and evaluated dual encoder architectures for retrieving solutions of Italian crossword clues by learning a shared latent space between clues and solutions.</p>
      </sec>
    <p>Our experiments demonstrated that the Asymmetric Dual Encoder (ADE) architecture, with its independent encoders for clues and solutions, outperformed the Siamese Dual Encoder (SDE) in handling the nuanced and often non-literal relationships characteristic of crossword puzzles. Our results also highlighted the limitations of traditional retrieval-based approaches (e.g., clues-to-clues methods), particularly when testing their generalization towards neologisms' definitions. In contrast, our dual encoder-based models, especially the larger and multilingual MPNet-base and the monolingual IT5-base, exhibited signs of generalization across diverse linguistic settings, including newly coined terms and culturally specific references. This underscores the importance of leveraging rich distributed representations to model the complex interplay between clues and solutions.</p>
    <p>In future work, it could be interesting to explore ensemble methods that combine traditional information retrieval approaches with dual encoder models, including clues-to-clues retrieval techniques, to leverage their complementary strengths. Training a cross-encoder reranker on top of retrieved candidate solutions may also prove beneficial, as it would enable the exploitation of contextual relationships between clues and solutions, an approach that is standard in retrieval-based systems. Moreover, conducting a detailed linguistic analysis of clues, examining categories, frequency distributions, and other properties, could provide deeper insights into their characteristics. Finally, extending the methodology toward an automatic completion system for crossword puzzle grids represents a promising direction for supporting full puzzle solving.</p>
    <p>For the automated solving experiments, we partially modified the implementation found at https://github.com/pncnmnp/Crossword-Solver.</p>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the FAIR - Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU, the PRIN PNRR 2022 Project EKEEL - Empowering Knowledge Extraction to Empower Learners (P20227PEPK) and the XAI-CARE-PNRR-MAD-2022-12376692 project under the NRRP MUR program funded by the NextGenerationEU. Partial support was also received by the project "Understanding and Enhancing Preference Alignment in Large Language Models Through Controlled Text Generation" (IsCc8_ALIGNLLM), funded by CINECA under the ISCRA initiative, for the availability of HPC resources and support.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] C. Bosco, E. Ježek, M. Polignano, M. Sanguinetti, Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), in: Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), 2025.
[2] P. Basile, M. de Gemmis, P. Lops, G. Semeraro, Solving a complex language game by using knowledge-based word associations discovery, IEEE Transactions on Computational Intelligence and AI in Games 8 (2016) 13–26. doi:10.1109/TCIAIG.2014.2355859.
[3] R. Manna, M. P. di Buono, J. Monti, Riddle me this: Evaluating large language models in solving word-based games, in: C. Madge, J. Chamberlain, K. Fort, U. Kruschwitz, S. Lukin (Eds.), Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 97–106. URL: https://aclanthology.org/2024.games-1.11.
[4] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, G. Stamou, Puzzle solving using reasoning of large language models: A survey, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 11574–11591. URL: https://aclanthology.org/2024.emnlp-main.646. doi:10.18653/v1/2024.emnlp-main.646.
[5] G. Sarti, T. Caselli, M. Nissim, A. Bisazza, Non verbis, sed rebus: Large language models are weak solvers of Italian rebuses, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 888–897. URL: https://aclanthology.org/2024.clicit-1.96/.
[6] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073–3085. URL: https://aclanthology.org/2022.acl-long.219. doi:10.18653/v1/2022.acl-long.219.
[7] J. Rozner, C. Potts, K. Mahowald, Decrypting cryptic crosswords: Semantically complex wordplay puzzles as a target for NLP, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 11409–11421. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/5f1d3986fae10ed2994d14ecd89892d7-Paper.pdf.
[8] S. Saha, S. Chakraborty, S. Saha, U. Garain, Language models are crossword solvers, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings
of the 2025 Conference of the Nations of the Amer- F. Boschetti, G. E. Lebani, B. Magnini, N. Novielli
icas Chapter of the Association for Computational (Eds.), Proceedings of the 9th Italian Conference on
Linguistics: Human Language Technologies (Vol- Computational Linguistics, Venice, Italy,
Novemume 1: Long Papers), Association for Computa- ber 30 - December 2, 2023, volume 3596 ofCEUR
tional Linguistics, Albuquerque, New Mexico, 2025, Workshop Proceedings, CEUR-WS.org, 2023. URL:
pp. 2074–2090. URL: https://aclanthology.org/2025. https://ceur-ws.org/Vol-3596/paper9.p.df
naacl-long.104./ [16] D. Gillick, A. Presta, G. S. Tomar, End-to-end
[9] A. Sadallah, D. Kotova, E. Kochmar, What makes retrieval in continuous space, arXiv preprint
cryptic crosswords challenging for LLMs?, in: arXiv:1811.08008 (2018).</p>
        <p>O. Rambow, L. Wanner, M. Apidianaki, H. Al- [17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Khalifa, B. D. Eugenio, S. Schockaert (Eds.), Pro- Pre-training of deep bidirectional transformers for
ceedings of the 31st International Conference on language understanding, in: J. Burstein, C.
DoComputational Linguistics, Association for Com- ran, T. Solorio (Eds.), Proceedings of the 2019
Conputational Linguistics, Abu Dhabi, UAE, 2025, ference of the North American Chapter of the
Aspp. 5102–5114. URL: https://aclanthology.org/2025. sociation for Computational Linguistics: Human
coling-main.342/. Language Technologies, Volume 1 (Long and Short
[10] M. Ernandes, G. Angelini, M. Gori, We- Papers), Association for Computational Linguistics,
bcrow: A web-based system for crossword solv- Minneapolis, Minnesota, 2019, pp. 4171–4186. URL:
ing, in: AAAI Conference on Artificial Intelligence, https://aclanthology.org/N19-14.2d3o/i:10.18653/
2005. URL: https://link.springer.com/chapter/10. v1/N19-1423.</p>
        <p>1007/11590323_37. [18] Z. Dong, J. Ni, D. Bikel, E. Alfonseca, Y. Wang,
[11] G. Angelini, M. Ernandes, M. Gori, Solving ital- C. Qu, I. Zitouni, Exploring dual encoder
archiian crosswords using the web, in: International tectures for question answering, in: Y.
GoldConference of the Italian Association for Artificial berg, Z. Kozareva, Y. Zhang (Eds.), Proceedings
Intelligence, 2005. URLh:ttps://link.springer.com/ of the 2022 Conference on Empirical Methods
chapter/10.1007/11558590_40. in Natural Language Processing, Association for
[12] G. Barlacchi, M. Nicosia, A. Moschitti, A retrieval Computational Linguistics, Abu Dhabi, United
model for automatic resolutionof crossword puz- Arab Emirates, 2022, pp. 9414–9419. URL:https:
zles in italian language, in: Proceedings of the //aclanthology.org/2022.emnlp-main.64.0d/oi:10.
First Italian Conference on Computational Linguis- 18653/v1/2022.emnlp-main.640.
tics CLiC-it 2014 &amp; and of the Fourth Internation[a1l9] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini,
Workshop EVALITA 2014: 9-11 December 2014, L. Rigutini, M. Maggini, M. Gori, Italian crossword
Pisa, Pisa University Press, 2014, pp. 33–37. generator: Enhancing education through
interac[13] A. Moschitti, M. Nicosia, G. Barlacchi, SACRY: tive word puzzles, in: Proceedings of the 9th Italian
Syntax-based automatic crossword puzzle resolu- Conference on Computational Linguistics (CLiC-it
tion sYstem, in: H.-H. Chen, K. Markert (Eds.), 2023), 2023. URL: https://ceur-ws.org/Vol-35.96
Proceedings of ACL-IJCNLP 2015 System Demon- [20] C. Ciaccio, A. Miaschi, F. Dell’Orletta, Evaluating
strations, Association for Computational Linguis- lexical proficiency in neural language models, in:
tics and The Asian Federation of Natural Language Proceedings of the 63rd Annual Meeting of the
AsProcessing, Beijing, China, 2015, pp. 79–84. URL: sociation for Computational Linguistics (Volume
https://aclanthology.org/P15-40.1d4o/i:10.3115/ 1: Long Papers), Association for Computational
v1/P15-4014. Linguistics, Vienna, Austria, 2025.
[14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, [21] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for
G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, Italian language understanding and generation, in:
J. Clark, G. Krueger, I. Sutskever, Learning trans- N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti,
ferable visual models from natural language su- N. Xue (Eds.), Proceedings of the 2024 Joint
Inpervision, in: M. Meila, T. Zhang (Eds.), Pro- ternational Conference on Computational
Linguisceedings of the 38th International Conference tics, Language Resources and Evaluation
(LRECon Machine Learning, volume 139 ofProceed- COLING 2024), ELRA and ICCL, Torino, Italia, 2024,
ings of Machine Learning Research, PMLR, 2021, pp. 9422–9433. URL: https://aclanthology.org/2024.
pp. 8748–8763. URL: https://proceedings.mlr.press/ lrec-main.823./
v139/radford21a.htm.l [22] L. Xue, N. Constant, A. Roberts, M. Kale, R.
Al[15] F. Bianchi, G. Attanasio, R. Pisoni, S. Terragni, Rfou, A. Siddhant, A. Barua, C. Rafel, mT5:
G. Sarti, D. Balestri, Contrastive language- A massively multilingual pre-trained text-to-text
image pre-training for the italian language, in: transformer, in: Proceedings of the 2021
Conference of the North American Chapter of the
Association for Computational Linguistics:
Human Language Technologies, Association for
Computational Linguistics, Online, 2021, pp. 483–498.</p>
        <p>URL: https://aclanthology.org/2021.naacl-mai.n.41
doi:10.18653/v1/2021.naacl-main.41.
[23] B. Warner, A. Chafin, B. Clavié, O. Weller, O.
Hallström, S. Taghadouini, A. Gallagher, R. Biswas,
F. Ladhak, T. Aarsen, et al., Smarter, better, faster,
longer: A modern bidirectional encoder for fast,
memory eficient, and long context finetuning and Figure 4: Solution for the autonomously solved crossword
inference, arXiv preprint arXiv:2412.13663 (2024). puzzle in Figure 1.
[24] N. Reimers, I. Gurevych, Sentence-BERT: Sentence
embeddings using Siamese BERT-networks, in: the 7th International Joint Conference on Natural
K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of Language Processing (Volume 2: Short Papers),
Asthe 2019 Conference on Empirical Methods in Nat- sociation for Computational Linguistics, Beijing,
ural Language Processing and the 9th International China, 2015, pp. 199–204. URL: https://aclanthology.
Joint Conference on Natural Language Processing org/P15-2033/. doi:10.3115/v1/P15-2033.
(EMNLP-IJCNLP), Association for Computational[32] L. Edman, H. Schmid, A. Fraser, CUTE:
MeaLinguistics, Hong Kong, China, 2019, pp. 3982–3992. suring LLMs’ understanding of their tokens, in:
URL: https://aclanthology.org/D19-14.10d/oi:10. Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.),
Pro18653/v1/D19-1410. ceedings of the 2024 Conference on Empirical
[25] L. Bonifacio, I. Campiotti, R. de Alencar Lotufo, Methods in Natural Language Processing,
AsR. F. Nogueira, mmarco: A multilingual version sociation for Computational Linguistics, Miami,
of MS MARCO passage ranking dataset, CoRR Florida, USA, 2024, pp. 3017–3026. URL: https:
abs/2108.13897 (2021). URL: https://arxiv.org/abs/ //aclanthology.org/2024.emnlp-main.17.7d/oi:10.
2108.13897. arXiv:2108.13897. 18653/v1/2024.emnlp-main.177.
[26] N. Reimers, I. Gurevych, Making monolingual [33] C. Ciaccio, M. Sartor, A. Miaschi, F. Dell’Orletta,
Besentence embeddings multilingual using knowl- yond the spelling miracle: Investigating substring
edge distillation, in: B. Webber, T. Cohn, Y. He, awareness in character-blind language models, in:
Y. Liu (Eds.), Proceedings of the 2020 Conference Proceedings of the 63rd Annual Meeting of the
Ason Empirical Methods in Natural Language Process- sociation for Computational Linguistics (Volume
ing (EMNLP), Association for Computational Lin- 1: Long Papers), Association for Computational
guistics, Online, 2020, pp. 4512–4525. URL:https: Linguistics, Vienna, Austria, 2025.
//aclanthology.org/2020.emnlp-main.36.5d/oi:10. [34] L. de Moura, N. Bjørner, Z3: An eficient smt solver,
18653/v1/2020.emnlp-main.365. in: C. R. Ramakrishnan, J. Rehof (Eds.), Tools and
[27] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Con- Algorithms for the Construction and Analysis of
trastive learning with hard negative samples, Inter- Systems, Springer Berlin Heidelberg, Berlin,
Heinational Conference on Learning Representations delberg, 2008, pp. 337–340.</p>
        <p>(2021).
[28] I. Loshchilov, F. Hutter, Decoupled weight decay
regularization, arXiv preprint arXiv:1711.0510A1 . Solved crossword puzzle
(2017).
[29] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szil- Figure4 report the solution of the crossword presented
vasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, in Figure1.</p>
        <p>The faiss library, arXiv preprint arXiv:2401.08281
(2024).
[30] A. Zugarini, M. Ernandes, A multi-strategy apB- . Further details on the
proach to crossword clue answer retrieval and rank- hyperparameters
ing, in: CLiC-it, 2021.
[31] A. Severyn, M. Nicosia, G. Barlacchi, A. MoschittiB, oth the siamese and asymmetric architectures were
deDistributional neural networks for automatic resisgon-ed using PyTorch and the training was conducted on
lution of crossword puzzles, in: C. Zong, M. Strubetwo Nvidia GeForce RTX 4090 GPUs. For the
asymmet(Eds.), Proceedings of the 53rd Annual Meeting ofric architecture we leverage parallelization by assigning
the Association for Computational Linguistics anedach encoder to a diferent GPU. Each model was trained
to produce representations of dimensionality equals to
768. We used the default betas andAdamW parameters.</p>
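A minimal sketch of this setup is given below, assuming PyTorch. The toy bag-of-embeddings encoders and all names here are illustrative stand-ins, not the paper's actual Transformer encoders; the per-encoder GPU assignment falls back to CPU when two GPUs are unavailable.

```python
import torch
from torch import nn

EMB_DIM = 768  # dimensionality of the shared latent space

class ToyEncoder(nn.Module):
    """Stand-in encoder: embedding lookup + mean pooling over tokens."""
    def __init__(self, vocab_size=1000, dim=EMB_DIM):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, ids):          # ids: (batch, seq_len)
        return self.emb(ids).mean(dim=1)  # (batch, dim)

class AsymmetricDualEncoder(nn.Module):
    """Two separate encoders, one for clues and one for solutions."""
    def __init__(self):
        super().__init__()
        # One encoder per GPU when two are available (as in the appendix),
        # otherwise everything stays on CPU.
        two_gpus = torch.cuda.device_count() >= 2
        self.dev_clue = torch.device("cuda:0" if two_gpus else "cpu")
        self.dev_sol = torch.device("cuda:1" if two_gpus else "cpu")
        self.clue_enc = ToyEncoder().to(self.dev_clue)
        self.sol_enc = ToyEncoder().to(self.dev_sol)

    def forward(self, clue_ids, sol_ids):
        # L2-normalised representations, so similarity is a dot product.
        c = nn.functional.normalize(self.clue_enc(clue_ids.to(self.dev_clue)), dim=-1)
        s = nn.functional.normalize(self.sol_enc(sol_ids.to(self.dev_sol)), dim=-1)
        return c, s.to(c.device)

model = AsymmetricDualEncoder()
opt = torch.optim.AdamW(model.parameters())  # default betas and weight decay

clues = torch.randint(0, 1000, (4, 64))  # max clue length: 64 tokens
sols = torch.randint(0, 1000, (4, 16))   # max solution length: 16 tokens
c, s = model(clues, sols)
print(c.shape, s.shape)
```

Moving the solution embeddings back to the clue device at the end keeps the in-batch similarity matrix (and gradients) on a single device.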
        <p>Table 5 reports the specific hyperparameters used with each model. Due to limited computational resources, we did not perform an extensive hyperparameter optimization; rather, we relied on the configurations suggested by the models’ creators. The maximum token lengths for clues and solutions were set to 64 and 16, respectively.</p>
        <p>The learnable temperature parameter was initialized to the equivalent of 0.07 and clipped as done in the CLIP paper. During batch generation, in order to avoid false negatives during hard batch mining, each batch cannot contain the same solution two or more times.</p>
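These two constraints can be sketched as follows. The helper names and the greedy batching strategy are illustrative assumptions, since the exact procedure is not spelled out here; the temperature follows the CLIP convention of learning the log of the logit scale, initialized to log(1/0.07) and clipped at 100.

```python
import math
import random

# CLIP-style learnable temperature: the log logit scale is initialized to
# log(1/0.07) and clipped so the scale never exceeds 100.
log_temp = math.log(1 / 0.07)   # initial value of the learnable parameter
MAX_LOG_TEMP = math.log(100.0)  # clipping bound used in the CLIP paper

def clipped_scale(log_temp):
    """Return the (clipped) multiplicative logit scale."""
    return math.exp(min(log_temp, MAX_LOG_TEMP))

def make_batches(pairs, batch_size, seed=0):
    """Greedy batching that never repeats a solution within a batch,
    avoiding false negatives during in-batch hard negative mining."""
    rng = random.Random(seed)
    pool = list(pairs)
    rng.shuffle(pool)
    batches = []
    while pool:
        batch, seen, rest = [], set(), []
        for clue, sol in pool:
            if sol not in seen and len(batch) < batch_size:
                batch.append((clue, sol))
                seen.add(sol)
            else:
                rest.append((clue, sol))
        batches.append(batch)
        pool = rest
    return batches

# Hypothetical clue/solution pairs with a repeated solution ("ROMA").
pairs = [("Capitale d'Italia", "ROMA"), ("Città eterna", "ROMA"),
         ("Fiume di Firenze", "ARNO"), ("Vocale ripetuta", "AA")]
for b in make_batches(pairs, batch_size=3):
    assert len({sol for _, sol in b}) == len(b)  # no duplicate solutions
```

The repeated solution is deferred to a later batch rather than dropped, so every training pair is still seen each epoch.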
        <p>During training, we kept track of the model’s performance on the validation dataset and picked the checkpoint with the lowest validation loss.</p>
      </sec>
      <sec id="sec-2-13">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly to draft content, paraphrase and reword, improve the writing style, check grammar and spelling, and assist with formatting. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>