<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Surprisal and Crossword Clues Difficulty: Evaluating Linguistic Processing between LLMs and Humans</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Iaquinta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achille Fusco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamyar Zeinalipour</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory for Neurocognition</institution>
          ,
          <addr-line>Epistemology, and Theoretical Syntax - NeTS-IUSS Pavia</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Firenze</institution>
          ,
          <addr-line>Piazza S. Marco 4, 50121 Firenze</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Siena (UNISI)</institution>
          ,
          <addr-line>Via Roma 56, 53100 Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University School for Advanced Studies IUSS Pavia</institution>
          ,
          <addr-line>Piazza della Vittoria 15, 27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>Crossword clue difficulty is traditionally judged by human setters, leaving automated puzzle generators without an objective yardstick. We model difficulty as the Surprisal of the answer given the clue, estimating it with token probabilities from large language models. Comparing three causal LLMs (Llama-3-8B, Llama-2-7B, and Ita-GPT-2-121M) with 60 human solvers on 160 hand-balanced clues, Surprisal correlates negatively with accuracy (r = -0.62 for nominal clues). These results show that language-model Surprisal captures some of the cognitive load humans experience, and that language-specific training and model scale both matter; the metric therefore enables adaptive crossword generation and provides a new test-bed for probing the alignment between human and model linguistic processing.</p>
      </abstract>
      <kwd-group>
        <kwd>surprisal</kwd>
        <kwd>llm</kwd>
        <kwd>gpt</kwd>
        <kwd>crossword</kwd>
        <kwd>education</kwd>
        <kwd>linguistic games</kwd>
        <kwd>puzzle</kwd>
        <kwd>crossword difficulty</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Crossword (CW) puzzles are among the most popular
language games, captivating millions through newspapers,
mobile apps, voice assistants, and even televised
competitions [1, 2]. The enduring appeal of crosswords across
formats stems from the careful calibration of clue
difficulty, which can range from accessible, beginner-friendly
prompts to highly intricate, expert-level challenges.</p>
      <p>Despite advancements in automated puzzle
generation, state-of-the-art systems like Dr. Fill [3] and
the Berkeley Crossword Solver [1], while capable of
outperforming many human solvers, still lack a reliable,
objective measure to assess the challenge posed by the
clues they generate. Traditional heuristics, such as clue
length, grid density, historical solve statistics, and letter</p>
      <p>[Table: Microcategory / Macrocategory / Accuracy / RTs (log10) / Surprisal; sample rows: bare_NP:rel, nominal, 0.526, 4.214, 5.207; def_DP, nominal, 1.0, 3.973, 3.926]</p>
      <sec id="sec-1-8">
        <p>With large language models (LLMs), which naturally compute token probabilities, Surprisal becomes readily accessible. Recent studies further emphasize the influence of model scale and training domain on the alignment between model-derived Surprisal and human cognitive patterns [10, 11]. Notably, despite its potential, Surprisal has yet to be explored specifically as a metric for crossword difficulty.</p>
        <p>Given the increasing prevalence and sophistication of automated CW generation systems, there is now a pressing need for a principled, data-driven metric capable of accurately gauging puzzle difficulty. Such a metric could facilitate adaptive tutoring tools, ensure fairness in online competitions, and provide richer psycholinguistic experimentation frameworks. In this paper, we propose and investigate token-level Surprisal, delivered by LLMs, as an innovative and robust candidate for objectively quantifying crossword puzzle difficulty. The current research represents the first attempt to apply the surprisal metric in the context of crossword puzzles, marking a novel approach to defining crossword difficulty through computational-linguistic measures. To guide our investigation and evaluate the viability of token-level Surprisal as an effective measure, we formulate a central research question, from which we derive four specific, actionable research questions (RQs) designed to systematically unpack the predictive capabilities of Surprisal. Our main contributions are:
• Fine-grained linguistic taxonomy and benchmark: a curated set of 160 Italian clues spanning 20 syntactic categories, solved by 60 native speakers (2,880 judgments), provides accuracy and solving-time gold standards.
• Surprisal estimation framework: five generic concatenation rules turn any clue-answer pair into a well-formed sentence with the answer in final position; open-source code computes multi-token Surprisal from any causal LM.
• Empirical findings: (i) Surprisal correlates strongly and negatively with accuracy (best r = -0.57) but only weakly with raw solving times, and more strongly after log transform; (ii) Ita-GPT-2 and Llama-3 outperform larger, non-specialised models; (iii) predictive strength is category-dependent, with metalinguistic and copular clues remaining challenging; (iv) picking the right concatenation rule per category boosts correlation by up to 0.15 points.
• Recipe for adaptive generation: a demonstrator workflow assigns category-specific Surprisal thresholds, selects clues at the desired difficulty, and sketches integration with full-grid generation.
• Open resources: all data, annotation scripts, Surprisal code, and analysis notebooks are released to foster reproducibility and future research on cognitively informed puzzle generation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Surprisal as a Psycholinguistic Metric</title>
        <p>In recent years, Surprisal has been employed to evaluate LLM performance in psycholinguistic studies, in correlation with online processing measures taken from corpora, such as Reading Times (RTs) [12, 13, 14, 15, 16] and Event-Related Potentials (ERPs) [17, 18]. A key issue in comparing the linguistic competence of LLMs and humans consists in understanding to what degree LLMs represent Natural Language (NL) in a human-like way. Human linguistic competence does not rely on probability alone [19, 20] and is structure-driven, in contrast to the data-driven training of LLMs [21, 22], which tend to underestimate syntax with respect to human processing, in virtue of their different mechanisms of learning and understanding [13]. In this scenario, Surprisal represents a 'neutral' measure that can also account for differences deriving from various linguistic sources within a probabilistic framework [23]. The difference between language in models and in humans remains a central and extremely relevant point in all comparative studies and in the analysis of results. Following this line of research, we investigate whether the same correlation between processing difficulty and Surprisal values also holds for CW clue-answer pairs. No prior work supplies a token-level, psycholinguistically grounded metric for per-clue difficulty. We import LLM Surprisal, validate it against 60 human solvers, and show how it plugs into adaptive generation workflows.</p>
        <p>Figure 1: Methodology overview. Colour-coded blocks show data (blue), processing (grey), models (orange) and results (green); arrows trace the workflow.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LLMs and Cognitive Alignment</title>
        <p>Large language models (LLMs) supply token probabilities out of the box, enabling fine-grained surprisal estimates. Layer-wise activations in GPT-, BERT- and Llama-style models predict fMRI and MEG responses to naturalistic text with striking accuracy [24, 25]. Model scale and training data modulate that alignment: bigger is not always better for eye-movement predictivity, whereas deeper layers in larger models often map best to slower neural signals [26]. Tokenisation also matters: sub-word splits can blur the link between model surprise and human lexical access; aggregating sub-tokens or using morphologically aware tokenisers improves fit [27]. By comparing three Italian-capable LLMs (Ita-GPT-2, Llama-2, Llama-3), we contribute new evidence on how family, size and training regime affect cognitive alignment in a puzzle-solving context.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Crossword Solving &amp; Generation</title>
        <p>AI interest in crosswords began with the probabilistic solver Proverb [28] and the web-based WebCrow system [29]. Dr. Fill later recast clue filling as a single weighted CSP [3], while subsequent systems introduced neural rerankers and hybrid IR-NLP pipelines [30]. Large language models now push solver accuracy above 90% on New York Times puzzles [31].</p>
        <p>Grid construction and clue writing pose a different challenge. Early generators searched word-list constraints for Italian crosswords and beyond [32, 33], later adapting to Malay [34], Spanish [35] and Indian languages for education [36]. More recently, Zeinalipour and collaborators have spearheaded a multilingual, education-oriented research programme: Italian educational grids [37], the WebCrow French solver [38], Arabic generators, including both the clue-focused ArabIcros [39] and a text-to-puzzle pipeline [40], a Turkish generator [41], and the ClueInstruct dataset for pedagogy-centred clues [42]. Together, these works illustrate a fast-growing ecosystem of LLM-driven solvers and generators that operate across languages and educational settings.</p>
        <p>Despite this progress, no prior work proposes an objective, cognitively grounded difficulty metric. Published systems label puzzles informally ("easy", "hard") or rely on surface heuristics (grid density, answer length). By linking LLM-derived surprisal to human accuracy and solving times, our study closes this evaluation gap and enables adaptive puzzle generation across languages.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our four-step pipeline (Fig. 1) is: (1) scrape, clean, and tag approximately 125,000 Italian clue-answer pairs into 20 syntactic categories; (2) turn each pair into a sentence via five lightweight templates and compute answer-level surprisal with Llama-3, Llama-2, and Ita-GPT-2; (3) obtain a human baseline from 60 native speakers solving 160 balanced clues, yielding accuracy and log-transformed solving times; and (4) correlate surprisal with those measures and use category-specific thresholds to power an adaptive crossword generator.</p>
      <sec id="sec-3-1">
        <p>A first qualitative data analysis was carried out using Regular Expressions (RegEx) and Part-of-Speech (PoS) tagging to extract examples of different syntactic constructions and check whether their distribution was significant. The extraction was then improved using the Python library spaCy [43]: the dataset was parsed with spaCy's nlp pipeline, which identifies the head node of each clue. We identified 20 pertinent clue typologies for our experiment, summarized in Table 3. For further details see the original work on CW linguistic analysis [44].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>3.1. Data and Preprocessing</title>
        <p>To evaluate the difficulty of crossword puzzles, we leveraged a comprehensive collection of Italian CW clues and answers. The sources of the clue-answer pairs are both websites that release solutions for CW clues, https://www.dizy.com/ and https://www.cruciverba.it/, which we scraped with dedicated scripts, and PDF versions of well-known Italian CW publications such as Settimana Enigmistica and Repubblica, which we converted into clue-answer pairs. The various sources were then cleaned and merged, and duplicates were removed. The resulting dataset consists of 125,600 entries corresponding to unique clue-answer pairs. It includes clues related to different domains, such as history, geography, literature, and pop culture, and it contains a diverse array of linguistic features, including grammatical structures, syntactic patterns, and lexical elements.</p>
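        <p>As an illustration, the merging and de-duplication step described above can be sketched as follows (a minimal sketch; the function name and the normalisation policy are our own assumptions, not the authors' released code):</p>

```python
def merge_and_dedup(*sources):
    # Merge clue-answer pairs from several scraped sources and drop
    # duplicates, keeping the first occurrence. Normalisation here is
    # limited to case and surrounding whitespace.
    seen = set()
    merged = []
    for source in sources:
        for clue, answer in source:
            key = (clue.strip().lower(), answer.strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append((clue.strip(), answer.strip()))
    return merged
```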
        <p>3.2. Linguistic Classification</p>
        <p>The dataset of Italian clue-answer pairs has been syntactically analysed, and different clue constructions have been categorized with the aim of investigating what kinds of structural operations can be applied to derive CW clues from well-formed sentences. Being based on the syntax of clue-answer pairs, the classification presented is language-dependent on Italian.</p>
        <p>In general terms, clues have been initially distinguished into clausal and non-clausal structures, depending on the presence or absence of an inflected verb in the matrix clause; secondly, non-clausal clues can be articulated into different structures varying in the nature of their heads: Noun Phrases (NP), Determiner Phrases (DP), Prepositional Phrases (PP), Adjectival Phrases (AdjP) and Adverbial Phrases (AdvP). Clausal clues, on the other side, represent syntactically relevant items in virtue of the presence of an inflected verb in the matrix clause, and they can be categorized on that basis. These include clauses with verbal or nominal predicates (i.e. copular sentences), and relative clauses. These main categories differentiate internally, and some subcategories can accordingly be defined. Once the significant syntactic structures have been outlined, we can proceed with the classification of our unstructured corpus. It is important to highlight that the proposed categorization is based on the generative grammar approach; thus, in the computation of classification rules, we considered the difference between the parser output (dependencies) and our hierarchical categorization. Categories have been identified on the basis of the type of head, and then further specified by additional features (if any), as in the case of DP, which can be of type definite or indefinite.</p>
        <p>The research question that guides our experiment is whether LLM token probabilities can be used to predict the difficulty of a clue-answer pair. The underlying assumption is that Surprisal, as a complexity metric, correlates with online measures of processing difficulty. For this reason, we can consider Surprisal in relation to the measures that we took as indices of the difficulty of a CW clue, which is expected to be visible in:
• Response Times (RTs): how long it takes to solve the clue, i.e. reading, guessing and typing the answer;
• Accuracy: how accurate the answer is.</p>
        <p>Consequently, a trivial answer would have low Surprisal, which means a high probability; vice versa, we can consider high Surprisal, or low probability of the target word, as indicating a non-obvious, original answer. Several psycholinguistic studies investigate language processing in next-word prediction, but no use of CW data has been found for this task. Finding the answer word, given a definition, could be considered a type of next-word prediction task. In this case, not only the probability of the word must be considered, but also Accuracy: the right choice of the exact word needed to fill the grid characterizes a CW task. The current experimental proposal is configured as an explorative approach to a psycholinguistic treatment of CW language, and as an attempt to investigate LLMs' ability to grasp different levels of surprise and linguistic originality in CW clues. The experimental setup consists of two different paths, the results of which will be compared:
• Human Experiment: the first step consists of a Solving Task to test participants and collect human responses. The absence of already annotated corpora for CW language limits the number of tested items, for reasons of time and because they are hand-designed.
• LLMs Surprisal Calculation: this limitation is not encountered on the LLMs side.</p>
        <sec id="sec-4-1-1">
          <title>Macrocategory Typologies</title>
          <p>cop:missSubj (copular): copular sentence with subject omission
cop:clitic (copular): copular sentence with a clitic in object position
cop:pron (copular): copular sentence with a pronoun in object position
act:missSubj (verbal predicate): active verbal sentence with subject omission
act:clitic (verbal predicate): active verbal sentence with a clitic in object position
act:pron (verbal predicate): active verbal sentence with a pronoun in object position
pass:missSubj (verbal predicate): passive sentence with subject omission
pass:other (verbal predicate): other kinds of passive sentences
imp_refl:missSubj (verbal predicate): active sentence with impersonal pronoun or reflexive verb, with subject omission
imp_refl:other (verbal predicate): other kinds of active sentence with impersonal pronoun or reflexive verb
inf_VP (infinitive): infinitival verb phrases (VP)
bare_NP (nominal): bare noun phrases (NP)
bare_NP:rel (nominal): bare NP followed by a relative clause
def_DP (nominal): definite determiner phrases (DP)
def_DP:rel (nominal): DP followed by a relative clause
ind_DP (nominal): indefinite DP
PP (prepositional): prepositional phrases
adjP (adjectival): adjectival phrases
adjP:pron (adjectival): adjectival phrases with pronoun
two-letters answer (metalinguistic)</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Examples</title>
          <p>Fu Cancelliere della Germania dal 1949 al 1963 = Adenauer
Venere ne era la dea = bellezza
È celebre quella di Trinità dei Monti = scalinata
Risiede in uno spazio geografico determinato = abitante
La segue il medico = ammalata
Quelli d'America hanno per capitale Washington = Stati uniti
È detta Il Continente Bianco = Antartide
Vi furono ritrovati noti bronzi = Riace
Si reca spesso al catasto = geometra
Che si riferisce all'Università = accademico
Investire di un grado = nominare
Infuso paglierino = tè
Cilindri commestibili che vengono affettati = polpettoni
Il conto delle spese da farsi = preventivo
Lo Stato di cui fanno parte le Isole Azzorre = Portogallo
Una brutta abitudine perdonabile = vizietto
Davanti a Rodrigo = Don
Probo, retto = onesto
Pittoresco quello siciliano = carretto
Il centro di Matera = TE</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.1. Solving Task</title>
        <p>Starting from our reference dataset, a set of clue-answer pairs was selected, consisting of 8 items for each of the 20 categories presented in 3.2. The resulting 160 items were organized into four lists, all equally representative of the categories. Each subject was presented with one of these four lists and asked to solve 40 CW clues. 60 Italian native speakers were recruited
for the experiment. Participants were presented with a
clue, and they had to guess the solution, having at their
disposal only the length of the answer, represented as a
grid, and its initial letter. No time constraint was given
during the experiment. For each subject and each item
(2880 data points) in the experimental list we collected:
• The string representing the given answer.
• RT (response time) was measured as the interval
in milliseconds between the appearance of the
crossword clue and the submission of the answer.
This includes reading, comprehension, and typing
time.</p>
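        <p>The two per-item measures can be computed as in the following sketch (a hypothetical helper, assuming exact string match for accuracy and the base-10 log transform of RTs used in the analyses):</p>

```python
import math

def item_measures(responses):
    # responses: (given_answer, correct_answer, rt_ms) tuples for one item.
    # Returns mean accuracy and mean log10 response time.
    n = len(responses)
    accuracy = sum(g.strip().lower() == c.strip().lower()
                   for g, c, _ in responses) / n
    mean_log_rt = sum(math.log10(rt) for _, _, rt in responses) / n
    return accuracy, mean_log_rt
```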
        <p>Results will be presented in the following sections.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.2. LLMs Surprisal Calculation</title>
        <p>To assess how predictable crossword answers are for a
language model, we use the notion of surprisal, defined as
the negative logarithm of a token’s predicted probability.</p>
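        <p>In code, these surprisal quantities reduce to a few lines (a minimal sketch with toy probabilities standing in for model outputs; function names are illustrative, not the authors' released code):</p>

```python
import math

def token_surprisal(p):
    # Surprisal of a single token: the negative log of its probability.
    return -math.log(p)

def answer_surprisal(token_probs):
    # Multi-token Answer Surprisal: the sum of per-token surprisals, each
    # probability conditioned on the clue and the preceding answer tokens
    # (as delivered by a causal LM).
    return sum(token_surprisal(p) for p in token_probs)

def surprisal_difference(s_clue_given_answer, s_clue_alone):
    # Surprisal Difference: surprisal of the clue after the answer minus
    # surprisal of the clue in isolation; negative values mean the answer
    # made the clue more predictable.
    return s_clue_given_answer - s_clue_alone
```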
        <p>In the case of full-word (single-token) answers, we compute:</p>
        <p>AnswerSurprisal = - log P(answer | clue) (1)</p>
        <p>When the answer spans several tokens a_1 ... a_n, we sum the token-level terms:</p>
        <p>AnswerSurprisal = - Σ_{i=1..n} log P(a_i | clue, a_1 ... a_{i-1}) (2)</p>
        <p>This captures the cumulative surprisal of all the answer tokens, assuming the clue and the previous answer tokens have already been processed.</p>
        <p>In some cases, however, the format of the input may place the answer at the beginning of the sequence rather than at the end, recalling a topicalized structure [45, 46, 47, 48, 49]. Interestingly, given how the clues are phrased (as definitions or comments), the most general structure would actually be that of topic + comment, in which the comment (the clue) provides relevant information about the answer, which accordingly represents the topic. This structure constitutes the most suitable concatenation strategy in line with CW puzzle logic. For such reverse concatenations (e.g., answer + clue), however, standard Answer Surprisal is no longer applicable, because causal models, in virtue of their incremental, progressive nature, cannot condition on future tokens. To address this, we introduce a complementary measure: Surprisal Difference. It is used with all the concatenation rules that do not permit the standard Answer Surprisal, such as the Topic-based rule: rules that place the answer at the end use AnswerSurprisal, while rules that place the answer at the beginning use SurprisalDifference as their surprisal score.</p>
        <p>Surprisal Difference compares the surprisal of the clue in isolation with the surprisal of the same clue following the answer. It captures how much the presence of the answer facilitates (or reduces the unexpectedness of) the clue:</p>
        <p>SurprisalDiff = S(clue | answer) - S(clue) (3)</p>
        <p>where S(·) denotes surprisal. This difference provides an interpretable surprisal-based signal even when the answer appears before the clue, a configuration that, as said, arises in certain experimental concatenation schemes. The assumption is that if the answer helps predict the clue, the clue's surprisal should be lower when preceded by the answer.</p>
        <p>Both Answer Surprisal and Surprisal Difference rely on the autoregressive, left-to-right prediction behavior of causal models. For each concatenation strategy, the suitable Surprisal measure is calculated. To ensure linguistically accurate tokenization and probability estimates, we use models that are pre-trained or fine-tuned on Italian data.</p>
        <p>Complete sentences composed of clue and answer are given as input to the models; we therefore face the issue of concatenating clue and answer into grammatical and coherent structures, without substantially modifying the clue's style, syntactic characterization and meaning, and with the answer as the final word, so as to calculate its Surprisal value after the context represented by the clue. In most cases, the answer maintains a synonymy relationship with the clue, which can often be expressed using the Italian adverb cioè ('that is'). This allows for an automatic concatenation of clue-answer pairs, forming sentences where the answer appears as the final word, such as &lt;clue&gt; cioè &lt;answer&gt;.</p>
        <p>To analyze how different concatenation strategies impact Surprisal values, various concatenation rules have been applied to the dataset, ensuring that each clue-answer pair is formatted appropriately for model evaluation. The employed concatenation rules are:
Cioè rule: &lt;clue&gt; cioè ART &lt;answer&gt;
Subject-based rule: ART &lt;answer&gt; &lt;clue&gt;
Topic-based rule: ART &lt;answer&gt;, &lt;clue&gt;
Copular rule: ART &lt;answer&gt; VERB(TO BE) &lt;clue&gt;
Inverse-copular rule: &lt;clue&gt; VERB(TO BE) ART &lt;answer&gt;
Prompt rule: Sei un cruciverbista esperto. Ti verrà fornita una definizione a cui dovrai rispondere correttamente. La definizione è: &lt;clue&gt;. La risposta ha &lt;answer length&gt; lettere, inizia con &lt;answer's first letter&gt;, &lt;answer&gt; ('You are an expert crossword solver. You will be given a definition you must answer correctly. The definition is: &lt;clue&gt;. The answer has &lt;answer length&gt; letters and starts with &lt;answer's first letter&gt;, &lt;answer&gt;')</p>
        <p>These different formulations allow for a comparative analysis of Surprisal variations across clue structures, ensuring that the most effective concatenation strategy can be identified for each category.
1 meta-llama/Meta-Llama-3-8B
2 meta-llama/Llama-2-7b-hf
3 GroNLP/gpt2-small-italian</p>
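        <p>The non-prompt concatenation rules can be rendered as simple string templates, as in the sketch below (assuming a fixed article and copula for illustration; the actual pipeline would select ART and inflect the copula to agree with the answer):</p>

```python
def concatenate(clue, answer, rule, art="il", copula="è"):
    # String templates for five of the concatenation rules above.
    # ART and the copula are fixed here purely for illustration.
    templates = {
        "cioe":            f"{clue} cioè {art} {answer}",
        "subject":         f"{art} {answer} {clue}",
        "topic":           f"{art} {answer}, {clue}",
        "copular":         f"{art} {answer} {copula} {clue}",
        "inverse_copular": f"{clue} {copula} {art} {answer}",
    }
    return templates[rule]
```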
        <p>For each item in the dataset, the model calculates the probability of each token; the tokens composing the answer are then used to estimate the Surprisal of the answer given the other tokens. High Surprisal values at the answer's final word tell us that the answer is unexpected in that context, and consequently harder to guess. Different types of Surprisal are thus defined by how the data are labelled, i.e. by the different concatenation rules. This opens the door to fine-grained investigation in different directions: one rule could work better than the others with some categories in enabling the model to make more reliable predictions, and specific rules could be elaborated for each structure of clue-answer pair, in order to make input items as realistic as possible and hence improve the model's performance in predicting human responses. To evaluate the models' performance in predicting Accuracy and RTs, Surprisal values will be compared with the results collected in the human experiment. The comparison should highlight:
• a positive correlation between Surprisal and RTs;
• a negative correlation between Surprisal and Accuracy.
Different Surprisal values have been calculated with different models and different concatenation rules. The Pearson coefficient will tell us more about the correlation between these variables, human data and Surprisal (for the three models employed). For both Accuracy and RTs we will have:
• a global comparison, which tells us whether each model's Surprisal output is in a significant correlation with human measures;
• the correlations between Surprisal and Accuracy or RTs for each category, to observe whether more relevant correlations hold for some of the categories.</p>
        <p>5. Experimental Results</p>
        <p>The experimental results focus on the correlation between Surprisal values and human performance in solving CW clue-answer pairs. We tested this approach on three models: Llama-3-8B (1), Llama-2-7B (2), and Ita-GPT-2 Medium-121M (3). The mean Accuracy of participants in the human experiment was 0.63.</p>
        <p>To examine the relationship between Surprisal values and human Accuracy, we first conducted a Pearson correlation analysis using mean per-item accuracy scores. The results revealed a negative correlation, consistent with our hypothesis that higher Surprisal values correspond to more difficult clues. Among the tested models, Llama-3 and Ita-GPT-2 yielded higher Pearson coefficients, which may reflect Llama-3's extensive multilingual capacity and Ita-GPT-2's fine-tuning on Italian. Figure 2 illustrates the correlation between Surprisal and Accuracy for the three models on a representative concatenation rule. In addition, Tables 11, 12, and 13 in the Appendix report a Generalized Linear Mixed Model (GLMM) analysis, which incorporates individual variability without aggregating accuracy values. This analysis further confirms Surprisal as a significant predictor of Accuracy, and therefore of clue difficulty.</p>
        <p>We also investigated the relationship between surprisal and response times (RTs) using a series of Linear Mixed Models (LMMs) fitted separately for each concatenation type. RTs were log-transformed to correct for positive skew and stabilize variance, in line with standard psycholinguistic practice. This transformation helped reduce the impact of outliers and enabled the use of parametric modeling techniques. In each model, surprisal was included as a fixed effect, and subject-specific intercepts were modeled as random effects to account for baseline variation across participants. The results consistently showed a statistically significant positive relationship between surprisal and log-transformed RTs across all concatenation types, as summarized in Table 4 for Llama-3 and, for the other two models, in the Appendix (Tables 14, 15). This indicates that clues with higher surprisal values led to longer response times, supporting the hypothesis that surprisal reflects processing difficulty. Although the magnitude of the effect varied by concatenation rule, all coefficients were positive, and confidence intervals did not include zero.</p>
        <p>These findings demonstrate that surprisal is a robust predictor of reading latency in the crossword task, even under minimal context and with sparse surface cues. Importantly, this effect emerges despite the lack of explicit time pressure, suggesting that surprisal exerts an automatic influence on processing effort. While the overall pattern is clear, future research could further refine the temporal precision of RTs by decomposing the overall response into distinct phases: logging (i) the time to initiate typing, (ii) the typing duration, and (iii) the post-completion delay would help distinguish comprehension time from motor and decision-related delays. This would allow a more direct mapping between linguistic difficulty and behavioral latency, providing an even clearer picture of the cognitive processes involved.</p>
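        <p>For reference, the per-item Pearson correlation used in these analyses can be computed without external libraries; the sketch below uses toy values, not the study's data:</p>

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length
    # sequences (e.g. per-item Surprisal vs. mean accuracy).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```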
        <sec id="sec-4-3-1">
          <title>5.1.1. Correlation in Different Categories</title>
          <p>To further investigate how Surprisal correlates with human performance across different types of clues, we analyzed the correlation separately for the different macro-categories and individual categories. The results are visualized in Figure 3 for the Ita-GPT-2 model. Our findings indicate that the strength of the correlation between Surprisal and Accuracy varies significantly depending on the type of clue. In particular, two categories showed notably weak correlations:
• Metalinguistic Clues: This category exhibited no correlation between Surprisal and Accuracy.</p>
          <p>A likely explanation is the difficulty transformers face when processing metalinguistic cues, such as wordplays and abbreviations. Since these models rely on token probabilities, and not on single characters, they struggle to accurately predict non-standard or unconventional relationships between clues and answers, which are common in metalinguistic clues.
• Copular Clues: The correlation was also absent for copular structures. One probable reason is that the cioè concatenation rule does not naturally fit the syntactic structure of these clues. Copular constructions often require a more flexible paraphrasing strategy, rather than a simple equivalence statement, leading to suboptimal Surprisal estimations.</p>
          <p>Other categories, particularly nominal and verbal predicate structures, displayed stronger correlations, suggesting that Surprisal works better for categories where the clue-answer relationship is more straightforwardly semantic rather than dependent on linguistic nuances like wordplay or syntactic constraints.</p>
          <p>A more robust analysis with GLMMs, to account for individual variability, will require more data for each category. We leave this further effort to future experimental work.</p>
          <p>We estimated token-level Surprisal with three causal LLMs (Ita-GPT-2-121M, Llama-2-7B, Llama-3-8B).</p>
          <p>Answers to the research questions
1. RQ1: Higher Surprisal predicts lower solver accuracy (best r = −0.57) and longer log-RTs, showing that information-theoretic “surprise” mirrors cognitive load.
2. RQ2: Language match beats raw size: the Italian-specific Ita-GPT-2 and the multilingual Llama-3 surpass the larger, English-leaning Llama-2.
3. RQ3: No single template suffices. Topic–comment placement works best for nominal and verbal clues, the cioè rule for many adjectival/infinitival ones, while copular and metalinguistic items need ad-hoc rewrites; selecting the best rule per macro-category adds up to 0.15 r-points.
4. RQ4: Category-specific Surprisal thresholds separate “easy”, “medium” and “hard” clues, enabling an adaptive generator that targets any solver level.</p>
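<p>The threshold idea behind RQ4 can be sketched as a per-category binning rule. The cut-off values and category names below are invented for illustration, not the thresholds estimated in the study:</p>

```python
import bisect

# Hypothetical per-category surprisal cut-offs (bits): [easy/medium, medium/hard].
THRESHOLDS = {
    "bare_NP": [8.0, 12.0],
    "metalinguistic": [10.0, 15.0],
}
LABELS = ["easy", "medium", "hard"]

def difficulty(category, surprisal):
    """Map a clue's surprisal to a difficulty band using its category's cut-offs."""
    cuts = THRESHOLDS[category]
    return LABELS[bisect.bisect_left(cuts, surprisal)]

print(difficulty("bare_NP", 5.0))          # easy
print(difficulty("bare_NP", 9.5))          # medium
print(difficulty("metalinguistic", 16.0))  # hard
```

An adaptive generator could then sample only clues whose predicted band matches the target solver level.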
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>5.2. Effect of Concatenation Strategies</title>
        <p>We also explored the impact of different concatenation strategies on model performance. The concatenation method influenced Surprisal values differently across clue categories. Some structures benefited from the cioè rule, while others yielded more reliable Surprisal estimates under a different approach.</p>
        <p>Table 5 shows, for each macro category, the concatenation that yields the best correlation results and its value. These results highlight the importance of category-specific approaches when applying Surprisal-based difficulty estimation.</p>
        <p>Main finding. LLM-derived Surprisal is a reliable, fine-grained predictor of human crossword difficulty, explaining more than half of the variance in accuracy for the most common clue types.</p>
        <p>Limitations. (i) Italian-only data; other languages may need new tokenisers. (ii) The 160-item set limits power for rare structures. (iii) RTs blend reading, reasoning and typing; keystroke logs would isolate comprehension latency. (iv) Only decoder-style LLMs were tested; encoder–decoder or retrieval-augmented models might align differently. (v) Clues were scored in isolation, ignoring cross-checks within full grids.</p>
        <p>5.3. Summary of Findings</p>
        <p>Overall, our findings confirm that Surprisal serves as
a useful predictor of CW puzzle difficulty, particularly
when considering Accuracy as a measure of challenge.</p>
        <p>However, its predictive power for solving times remains
limited, likely due to the nature of short CW clues. The
choice of concatenation strategy also plays a crucial
role in model performance, suggesting that tailored
approaches could further refine Surprisal-based difficulty
estimations.</p>
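<p>Tailoring the template per macro category, as Table 5 does, amounts to a simple arg-max over correlation strength. The category names match the paper's labels, but the coefficient values here are placeholders:</p>

```python
# Hypothetical |r| values per (category, concatenation rule); placeholders only.
results = {
    "def_DP": {"concatenation_topic_art": 0.61, "concatenation_cioè_art": 0.48},
    "inf_VP": {"concatenation_subj_art": 0.57, "concatenation_cop": 0.41},
}

def best_rule(category):
    """Return the concatenation rule with the strongest correlation for a category."""
    rules = results[category]
    return max(rules, key=rules.get)

for cat in results:
    print(cat, "->", best_rule(cat))
```

With real coefficients in place of the placeholders, this selection is what yields the per-category gains reported under RQ3.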
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>This paper provides the first cognitively grounded, automatic gauge of crossword-clue difficulty. We compiled a 160-item Italian benchmark (2,880 human judgements), converted each clue–answer pair into well-formed sentences with five templates, and estimated token-level Surprisal with three causal LLMs (Ita-GPT-2-121M, Llama-2-7B, Llama-3-8B). Anchoring puzzle evaluation in probabilistic language theory links NLP, psycholinguistics and game AI, promising crosswords that scale from novice amusement to expert challenge while offering a fresh lens on
human–machine language alignment.</p>
      <sec id="sec-5-1">
        <title>Future work</title>
        <p>1. Scale the benchmark to thousands of clues, multiple languages and complete grids.
2. Log richer behaviour (eye-tracking, keystrokes, EEG) to separate processing stages.
3. Probe new architectures and character-level tokenisers for closer cognitive fidelity.
4. Fuse Surprisal with real-time solver profiles for personalised tutoring.
5. Couple Surprisal-based clue ranking with constraint-based fills to deliver fully adaptive crosswords.</p>
        <p>7. Appendices</p>
        <p>In the following section we report the complete results for all LLMs and concatenation rules, divided by macro category and language model. The appendix contains one correlation table for each model; see the individual captions.</p>
        <p>Table 5. Best correlation coefficients (r) and p-values for each macro category and concatenation type (Ita-GPT-2 Medium-121M). Categories: inf_VP, pass:other, metalinguistic, imp_refl:missSubj, def_DP, cop:missSubj, PP, cop:pron, ind_DP, cop:clitic, bare_NP:rel, adjP:pron, bare_NP, adjP, act:pron, act:missSubj, def_DP:rel, imp_refl:other, act:clitic, pass:missSubj. Concatenation types: concatenation_subj_art, concatenation_cop, concatenation_cioè_art, concatenation_topic_art, concatenation_inv_cop, concatenation_prompt.</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to paraphrase and reword text and to check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>