<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Surprisal and Crossword Clues Difficulty: Evaluating Linguistic Processing between LLMs and Humans</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Iaquinta</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achille Fusco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamyar Zeinalipour</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Chesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory for Neurocognition</institution>
          ,
          <addr-line>Epistemology, and Theoretical Syntax - NeTS-IUSS Pavia</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Firenze</institution>
          ,
          <addr-line>Piazza S. Marco 4, 50121 Firenze</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Siena (UNISI)</institution>
          ,
          <addr-line>Via Roma 56, 53100 Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University School for Advanced Studies IUSS Pavia</institution>
          ,
          <addr-line>Piazza della Vittoria 15, 27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>Crossword clue difficulty is traditionally judged by human setters, leaving automated puzzle generators without an objective yardstick. We model difficulty as the Surprisal of the answer given the clue, estimating it with token probabilities from large language models. Comparing three causal LLMs (Llama-3-8B, Llama-2-7B, and Ita-GPT-2-121M) with 60 human solvers on 160 hand-balanced clues, Surprisal correlates negatively with accuracy (r = -0.62 for nominal clues). These results show that language-model Surprisal captures some of the cognitive load humans experience, and that language-specific training and model scale both matter; the metric therefore enables adaptive crossword generation and provides a new test-bed for probing the alignment between human and model linguistic processing.</p>
      </abstract>
      <kwd-group>
        <kwd>surprisal</kwd>
        <kwd>llm</kwd>
        <kwd>gpt</kwd>
        <kwd>crossword</kwd>
        <kwd>education</kwd>
        <kwd>linguistic games</kwd>
        <kwd>puzzle</kwd>
        <kwd>crossword difficulty</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Crossword (CW) puzzles are among the most popular
language games, captivating millions through newspapers,
mobile apps, voice assistants, and even televised
competitions [1, 2]. The enduring appeal of crosswords across
formats stems from the careful calibration of clue
difficulty, which can range from accessible, beginner-friendly
prompts to highly intricate, expert-level challenges.</p>
      <p>Despite advancements in automated puzzle
generation, state-of-the-art systems like Dr. Fill [3] and
the Berkeley Crossword Solver [1], while capable of
outperforming many human solvers, still lack a reliable,
objective measure to assess the challenge posed by the
clues they generate. Traditional heuristics, such as clue
length, grid density, historical solve statistics, and letter</p>
      <p>[Table: Microcategory / Macrocategory / Accuracy / RTs (log10) / Surprisal; sample rows: bare_NP:rel, nominal, 0.526, 4.214, 5.207; def_DP, nominal, 1.0, 3.973, 3.926]</p>
      <sec id="sec-1-8">
        <p>With large language models (LLMs), which naturally compute token probabilities, Surprisal becomes readily accessible. Recent studies further emphasize the influence of model scale and training domain on the alignment between model-derived Surprisal and human cognitive patterns [10, 11]. Notably, despite its potential, Surprisal has yet to be explored specifically as a metric for crossword difficulty.</p>
        <p>Given the increasing prevalence and sophistication of automated CW generation systems, there is now a pressing need for a principled, data-driven metric capable of accurately gauging puzzle difficulty. Such a metric could facilitate adaptive tutoring tools, ensure fairness in online competitions, and provide richer psycholinguistic experimentation frameworks. In this paper, we propose and investigate token-level Surprisal, delivered by LLMs, as an innovative and robust candidate for objectively quantifying crossword puzzle difficulty. The current research represents the first attempt to apply the surprisal metric in the context of crossword puzzles, marking a novel approach to defining crossword difficulty through computational-linguistic measures. To guide our investigation and evaluate the viability of token-level Surprisal as an effective measure, we formulate a central research question, from which we derive four specific, actionable research questions (RQs) designed to systematically unpack the predictive capabilities of Surprisal. Our main contributions are:
• Fine-grained linguistic taxonomy and benchmark: a curated set of 160 Italian clues spanning 20 syntactic categories, solved by 60 native speakers (2,880 judgments), provides accuracy and solving-time gold standards.
• Surprisal estimation framework: five generic concatenation rules turn any clue-answer pair into a well-formed sentence with the answer in final position; open-source code computes multi-token Surprisal from any causal LM.
• Empirical findings: (i) Surprisal correlates strongly and negatively with accuracy (best r = -0.57) but only weakly with raw solving times, and more strongly after log transform; (ii) Ita-GPT-2 and Llama-3 outperform larger, non-specialised models; (iii) predictive strength is category-dependent, with metalinguistic and copular clues remaining challenging; (iv) picking the right concatenation rule per category boosts correlation by up to 0.15 points.
• Recipe for adaptive generation: a demonstrator workflow assigns category-specific Surprisal thresholds, selects clues at the desired difficulty, and sketches integration with full-grid generation.
• Open resources: all data, annotation scripts, Surprisal code, and analysis notebooks are released to foster reproducibility and future research on cognitively informed puzzle generation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Surprisal as a Psycholinguistic Metric</title>
        <p>In recent years, Surprisal has been employed to evaluate LLM performance in psycholinguistic studies, in correlation with online processing measures taken from corpora, such as Reading Times (RTs) [12, 13, 14, 15, 16] and Event-Related Potentials (ERPs) [17, 18]. A key issue in comparing the linguistic competence of LLMs and humans consists in understanding to what degree LLMs represent Natural Language (NL) in a human-like way. Human linguistic competence does not rely on probability alone [19, 20] and is structure-driven, in contrast to the data-driven training of LLMs [21, 22], which tend to underestimate syntax with respect to human processing, in virtue of their different mechanisms of learning and understanding [13]. In this scenario, Surprisal represents a 'neutral' measure that can also account for differences deriving from various linguistic sources within a probabilistic framework [23]. The difference between language in models and in humans remains a central and extremely relevant point in all comparative studies and in the analysis of results. Following this line of research, we investigate whether the same correlation between processing difficulty and Surprisal values also holds for CW clue-answer pairs. No prior work supplies a token-level, psycholinguistically grounded metric for per-clue difficulty. We import LLM Surprisal, validate it against 60 human solvers, and show how it plugs into adaptive generation workflows.</p>
        <p>Figure 1: Methodology overview. Colour-coded blocks show data (blue), processing (grey), models (orange) and results (green); arrows trace the workflow.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LLMs and Cognitive Alignment</title>
        <p>Large language models (LLMs) supply token probabilities out of the box, enabling fine-grained surprisal estimates. Layer-wise activations in GPT-, BERT- and Llama-style models predict fMRI and MEG responses to naturalistic text with striking accuracy [24, 25]. Model scale and training data modulate that alignment: bigger is not always better for eye-movement predictivity, whereas deeper layers in larger models often map best to slower neural signals [26]. Tokenisation also matters: sub-word splits can blur the link between model surprise and human lexical access; aggregating sub-tokens or using morphologically aware tokenisers improves fit [27]. By comparing three Italian-capable LLMs (Ita-GPT-2, Llama-2, Llama-3), we contribute new evidence on how family, size and training regime affect cognitive alignment in a puzzle-solving context.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Crossword Solving &amp; Generation</title>
        <p>AI interest in crosswords began with the probabilistic solver Proverb [28] and the web-based WebCrow system [29]. Dr. Fill later recast clue filling as a single weighted CSP [3], while subsequent systems introduced neural rerankers and hybrid IR-NLP pipelines [30]. Large language models now push solver accuracy above 90% on New York Times puzzles [31].</p>
        <p>Grid construction and clue writing pose a different challenge. Early generators searched word-list constraints for Italian crosswords and beyond [32, 33], later adapting to Malay [34], Spanish [35] and Indian languages for education [36]. More recently, Zeinalipour and collaborators have spearheaded a multilingual, education-oriented research programme: Italian educational grids [37], the WebCrow French solver [38], Arabic generators, including both the clue-focused ArabIcros [39] and a text-to-puzzle pipeline [40], a Turkish generator [41], and the ClueInstruct dataset for pedagogy-centred clues [42]. Together, these works illustrate a fast-growing ecosystem of LLM-driven solvers and generators that operate across languages and educational settings.</p>
        <p>Despite this progress, no prior work proposes an objective, cognitively grounded difficulty metric. Published systems label puzzles informally ("easy", "hard") or rely on surface heuristics (grid density, answer length). By linking LLM-derived surprisal to human accuracy and solving times, our study closes this evaluation gap and enables adaptive puzzle generation across languages.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our four-step pipeline (Fig. 1) is: (1) scrape, clean, and tag approximately 125,000 Italian clue-answer pairs into 20 syntactic categories; (2) turn each pair into a sentence via five lightweight templates and compute answer-level surprisal with Llama-3, Llama-2, and Ita-GPT-2; (3) obtain a human baseline from 60 native speakers solving 160 balanced clues, yielding accuracy and log-transformed solving times; and (4) correlate surprisal with those measures and use category-specific thresholds to power an adaptive crossword generator.</p>
      <sec id="sec-3-1">
        <p>A first qualitative data analysis was carried out using Regular Expressions (RegEx) and Part-of-Speech (PoS) tagging to extract examples of different syntactic constructions and check whether their distribution was significant. The extraction was then improved using the Python library spaCy [43]: the dataset was parsed with spaCy's nlp pipeline, which identifies the head node of each clue. We identified 20 pertinent clue typologies for our experiment, summarized in Table 3. For further details see the original work on CW linguistic analysis [44].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>3.1. Data and Preprocessing</title>
        <p>To evaluate the difficulty of crossword puzzles, we leveraged a comprehensive collection of Italian CW clues and answers. The sources of the clue-answer pairs are both websites that release solutions for CW clues, https://www.dizy.com/ and https://www.cruciverba.it/, which we scraped with dedicated scripts, and PDF versions of well-known Italian CW publications such as Settimana Enigmistica and Repubblica, which we converted into clue-answer pairs. The various sources were then cleaned and merged, and duplicates were removed. The resulting dataset consists of 125,600 entries corresponding to unique clue-answer pairs. It includes clues related to different domains, such as history, geography, literature, and pop culture, and it contains a diverse array of linguistic features, including grammatical structures, syntactic patterns, and lexical elements.</p>
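        <p>As an illustration, the merging and de-duplication step described above can be sketched as follows (a minimal sketch; the function name and the normalisation policy are our own assumptions, not the authors' released code):</p>

```python
def merge_and_dedup(*sources):
    # Merge clue-answer pairs from several scraped sources and drop
    # duplicates, keeping the first occurrence. Normalisation here is
    # limited to case and surrounding whitespace.
    seen = set()
    merged = []
    for source in sources:
        for clue, answer in source:
            key = (clue.strip().lower(), answer.strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append((clue.strip(), answer.strip()))
    return merged
```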
        <p>3.2. Linguistic Classification</p>
        <p>The dataset of Italian clue-answer pairs has been syntactically analysed, and different clue constructions have been categorized with the aim of investigating what kinds of structural operations can be applied to derive CW clues from well-formed sentences. Being based on the syntax of clue-answer pairs, the classification presented is language-dependent on Italian.</p>
        <p>In general terms, clues have been initially distinguished into clausal and non-clausal structures, depending on the presence or absence of an inflected verb in the matrix clause; secondly, non-clausal clues can be articulated into different structures varying in the nature of their heads: Noun Phrases (NP), Determiner Phrases (DP), Prepositional Phrases (PP), Adjectival Phrases (AdjP) and Adverbial Phrases (AdvP). Clausal clues, on the other side, represent syntactically relevant items in virtue of the presence of an inflected verb in the matrix clause, and they can be categorized on that basis. These include clauses with verbal or nominal predicates (i.e. copular sentences), and relative clauses. These main categories differentiate internally, and some subcategories can accordingly be defined. Once the significant syntactic structures have been outlined, we can proceed with the classification of our unstructured corpus. It is important to highlight that the proposed categorization is based on the generative grammar approach; thus, in the computation of classification rules, we considered the difference between the parser output (dependencies) and our hierarchical categorization. Categories have been identified on the basis of the type of head, and then further specified by additional features (if any), as in the case of DP, which can be of type definite or indefinite.</p>
        <p>The research question that guides our experiment is whether LLM token probabilities can be used to predict the difficulty of a clue-answer pair. The underlying assumption is that Surprisal, as a complexity metric, correlates with online measures of processing difficulty. For this reason, we can consider Surprisal in relation to the measures that we took as indices of the difficulty of a CW clue, which is expected to be visible in:
• Response Times (RTs): how long it takes to solve the clue, i.e. reading, guessing and typing the answer;
• Accuracy: how accurate the answer is.</p>
        <p>Consequently, a trivial answer would have low Surprisal, which means a high probability; vice versa, we can consider high Surprisal, or low probability of the target word, as indicating a non-obvious, original answer. Several psycholinguistic studies investigate language processing in next-word prediction, but no use of CW data has been found for this task. Finding the answer word, given a definition, could be considered a type of next-word prediction task. In this case, not only the probability of the word must be considered, but also Accuracy: the right choice of the exact word needed to fill the grid characterizes a CW task. The current experimental proposal is configured as an explorative approach to a psycholinguistic treatment of CW language, and as an attempt to investigate LLMs' ability to grasp different levels of surprise and linguistic originality in CW clues. The experimental setup consists of two different paths, the results of which will be compared:
• Human Experiment: the first step consists of a Solving Task to test participants and collect human responses. The absence of already annotated corpora for CW language limits the number of tested items, for reasons of time and because they are hand-designed.
• LLMs Surprisal Calculation: this limitation is not encountered on the LLMs side.</p>
        <sec id="sec-4-1-1">
          <title>Macrocategory Typologies</title>
          <p>cop:missSubj (copular): copular sentence with subject omission
cop:clitic (copular): copular sentence with a clitic in object position
cop:pron (copular): copular sentence with a pronoun in object position
act:missSubj (verbal predicate): active verbal sentence with subject omission
act:clitic (verbal predicate): active verbal sentence with a clitic in object position
act:pron (verbal predicate): active verbal sentence with a pronoun in object position
pass:missSubj (verbal predicate): passive sentence with subject omission
pass:other (verbal predicate): other kinds of passive sentences
imp_refl:missSubj (verbal predicate): active sentence with impersonal pronoun or reflexive verb, with subject omission
imp_refl:other (verbal predicate): other kinds of active sentence with impersonal pronoun or reflexive verb
inf_VP (infinitive): infinitival verb phrases (VP)
bare_NP (nominal): bare noun phrases (NP)
bare_NP:rel (nominal): bare NP followed by a relative clause
def_DP (nominal): definite determiner phrases (DP)
def_DP:rel (nominal): DP followed by a relative clause
ind_DP (nominal): indefinite DP
PP (prepositional): prepositional phrases
adjP (adjectival): adjectival phrases
adjP:pron (adjectival): adjectival phrases with pronoun
two-letters answer (metalinguistic)</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Examples</title>
          <p>Fu Cancelliere della Germania dal 1949 al 1963 = Adenauer
Venere ne era la dea = bellezza
È celebre quella di Trinità dei Monti = scalinata
Risiede in uno spazio geografico determinato = abitante
La segue il medico = ammalata
Quelli d'America hanno per capitale Washington = Stati uniti
È detta Il Continente Bianco = Antartide
Vi furono ritrovati noti bronzi = Riace
Si reca spesso al catasto = geometra
Che si riferisce all'Università = accademico
Investire di un grado = nominare
Infuso paglierino = tè
Cilindri commestibili che vengono affettati = polpettoni
Il conto delle spese da farsi = preventivo
Lo Stato di cui fanno parte le Isole Azzorre = Portogallo
Una brutta abitudine perdonabile = vizietto
Davanti a Rodrigo = Don
Probo, retto = onesto
Pittoresco quello siciliano = carretto
Il centro di Matera = TE</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.1. Solving Task</title>
        <p>Starting from our reference dataset, a set of clue-answer pairs was selected, consisting of 8 items for each of the 20 categories presented in 3.2. The resulting 160 items were organized into four lists, all equally representative of the categories. Each subject was presented with one of these four lists and asked to solve 40 CW clues. 60 Italian native speakers were recruited
for the experiment. Participants were presented with a
clue, and they had to guess the solution, having at their
disposal only the length of the answer, represented as a
grid, and its initial letter. No time constraint was given
during the experiment. For each subject and each item
(2880 data points) in the experimental list we collected:
• The string representing the given answer.
• RT (response time) was measured as the interval
in milliseconds between the appearance of the
crossword clue and the submission of the answer.
This includes reading, comprehension, and typing
time.</p>
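        <p>The two per-item measures can be computed as in the following sketch (a hypothetical helper, assuming exact string match for accuracy and the base-10 log transform of RTs used in the analyses):</p>

```python
import math

def item_measures(responses):
    # responses: (given_answer, correct_answer, rt_ms) tuples for one item.
    # Returns mean accuracy and mean log10 response time.
    n = len(responses)
    accuracy = sum(g.strip().lower() == c.strip().lower()
                   for g, c, _ in responses) / n
    mean_log_rt = sum(math.log10(rt) for _, _, rt in responses) / n
    return accuracy, mean_log_rt
```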
        <p>Results will be presented in the following sections.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.2. LLMs Surprisal Calculation</title>
        <p>To assess how predictable crossword answers are for a
language model, we use the notion of surprisal, defined as
the negative logarithm of a token’s predicted probability.</p>
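        <p>In code, these surprisal quantities reduce to a few lines (a minimal sketch with toy probabilities standing in for model outputs; function names are illustrative, not the authors' released code):</p>

```python
import math

def token_surprisal(p):
    # Surprisal of a single token: the negative log of its probability.
    return -math.log(p)

def answer_surprisal(token_probs):
    # Multi-token Answer Surprisal: the sum of per-token surprisals, each
    # probability conditioned on the clue and the preceding answer tokens
    # (as delivered by a causal LM).
    return sum(token_surprisal(p) for p in token_probs)

def surprisal_difference(s_clue_given_answer, s_clue_alone):
    # Surprisal Difference: surprisal of the clue after the answer minus
    # surprisal of the clue in isolation; negative values mean the answer
    # made the clue more predictable.
    return s_clue_given_answer - s_clue_alone
```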
        <p>In the case of full-word (single-token) answers, we compute:</p>
        <p>AnswerSurprisal = - log P(answer | clue) (1)</p>
        <p>When the answer spans several tokens a_1 ... a_n, we sum the token-level terms:</p>
        <p>AnswerSurprisal = - Σ_{i=1..n} log P(a_i | clue, a_1 ... a_{i-1}) (2)</p>
        <p>This captures the cumulative surprisal of all the answer tokens, assuming the clue and the previous answer tokens have already been processed.</p>
        <p>In some cases, however, the format of the input may place the answer at the beginning of the sequence rather than at the end, recalling a topicalized structure [45, 46, 47, 48, 49]. Interestingly, given how the clues are phrased (as definitions or comments), the most general structure would actually be that of topic + comment, in which the comment (the clue) provides relevant information about the answer, which accordingly represents the topic. This structure constitutes the most suitable concatenation strategy in line with CW puzzle logic. For such reverse concatenations (e.g., answer + clue), however, standard Answer Surprisal is no longer applicable, because causal models, in virtue of their incremental, progressive nature, cannot condition on future tokens. To address this, we introduce a complementary measure: Surprisal Difference. It is used with all the concatenation rules that do not permit the standard Answer Surprisal, such as the Topic-based rule: rules that place the answer at the end use AnswerSurprisal, while rules that place the answer at the beginning use SurprisalDifference as their surprisal score.</p>
        <p>Surprisal Difference compares the surprisal of the clue in isolation with the surprisal of the same clue following the answer. It captures how much the presence of the answer facilitates (or reduces the unexpectedness of) the clue:</p>
        <p>SurprisalDiff = S(clue | answer) - S(clue) (3)</p>
        <p>where S(·) denotes surprisal. This difference provides an interpretable surprisal-based signal even when the answer appears before the clue, a configuration that, as said, arises in certain experimental concatenation schemes. The assumption is that if the answer helps predict the clue, the clue's surprisal should be lower when preceded by the answer.</p>
        <p>Both Answer Surprisal and Surprisal Difference rely on the autoregressive, left-to-right prediction behavior of causal models. For each concatenation strategy, the suitable Surprisal measure is calculated. To ensure linguistically accurate tokenization and probability estimates, we use models that are pre-trained or fine-tuned on Italian data.</p>
        <p>Complete sentences composed of clue and answer are given as input to the models; we therefore face the issue of concatenating clue and answer into grammatical and coherent structures, without substantially modifying the clue's style, syntactic characterization and meaning, and with the answer as the final word, so as to calculate its Surprisal value after the context represented by the clue. In most cases, the answer maintains a synonymy relationship with the clue, which can often be expressed using the Italian adverb cioè ('that is'). This allows for an automatic concatenation of clue-answer pairs, forming sentences where the answer appears as the final word, such as &lt;clue&gt; cioè &lt;answer&gt;.</p>
        <p>To analyze how different concatenation strategies impact Surprisal values, various concatenation rules have been applied to the dataset, ensuring that each clue-answer pair is formatted appropriately for model evaluation. The employed concatenation rules are:
Cioè rule: &lt;clue&gt; cioè ART &lt;answer&gt;
Subject-based rule: ART &lt;answer&gt; &lt;clue&gt;
Topic-based rule: ART &lt;answer&gt;, &lt;clue&gt;
Copular rule: ART &lt;answer&gt; VERB(TO BE) &lt;clue&gt;
Inverse-copular rule: &lt;clue&gt; VERB(TO BE) ART &lt;answer&gt;
Prompt rule: Sei un cruciverbista esperto. Ti verrà fornita una definizione a cui dovrai rispondere correttamente. La definizione è: &lt;clue&gt;. La risposta ha &lt;answer length&gt; lettere, inizia con &lt;answer's first letter&gt;, &lt;answer&gt; ('You are an expert crossword solver. You will be given a definition you must answer correctly. The definition is: &lt;clue&gt;. The answer has &lt;answer length&gt; letters and starts with &lt;answer's first letter&gt;, &lt;answer&gt;')</p>
        <p>These different formulations allow for a comparative analysis of Surprisal variations across clue structures, ensuring that the most effective concatenation strategy can be identified for each category.
1 meta-llama/Meta-Llama-3-8B
2 meta-llama/Llama-2-7b-hf
3 GroNLP/gpt2-small-italian</p>
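        <p>The non-prompt concatenation rules can be rendered as simple string templates, as in the sketch below (assuming a fixed article and copula for illustration; the actual pipeline would select ART and inflect the copula to agree with the answer):</p>

```python
def concatenate(clue, answer, rule, art="il", copula="è"):
    # String templates for five of the concatenation rules above.
    # ART and the copula are fixed here purely for illustration.
    templates = {
        "cioe":            f"{clue} cioè {art} {answer}",
        "subject":         f"{art} {answer} {clue}",
        "topic":           f"{art} {answer}, {clue}",
        "copular":         f"{art} {answer} {copula} {clue}",
        "inverse_copular": f"{clue} {copula} {art} {answer}",
    }
    return templates[rule]
```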
        <p>For each item in the dataset, the model calculates the probability of each token; the tokens composing the answer are then used to estimate the Surprisal of the answer given the other tokens. High Surprisal values at the answer's final word tell us that the answer is unexpected in that context, and consequently harder to guess. Different types of Surprisal are thus defined by how the data are labelled, i.e. by the different concatenation rules. This opens the door to fine-grained investigation in different directions: one rule could work better than the others with some categories in enabling the model to make more reliable predictions, and specific rules could be elaborated for each structure of clue-answer pair, in order to make input items as realistic as possible and hence improve the model's performance in predicting human responses. To evaluate the models' performance in predicting Accuracy and RTs, Surprisal values will be compared with the results collected in the human experiment. The comparison should highlight:
• a positive correlation between Surprisal and RTs;
• a negative correlation between Surprisal and Accuracy.
Different Surprisal values have been calculated with different models and different concatenation rules. The Pearson coefficient will tell us more about the correlation between these variables, human data and Surprisal (for the three models employed). For both Accuracy and RTs we will have:
• a global comparison, which tells us whether each model's Surprisal output is in a significant correlation with human measures;
• the correlations between Surprisal and Accuracy or RTs for each category, to observe whether more relevant correlations hold for some of the categories.</p>
        <p>5. Experimental Results</p>
        <p>The experimental results focus on the correlation between Surprisal values and human performance in solving CW clue-answer pairs. We tested this approach on three models: Llama-3-8B (1), Llama-2-7B (2), and Ita-GPT-2 Medium-121M (3). The mean Accuracy of participants in the human experiment was 0.63.</p>
        <p>To examine the relationship between Surprisal values and human Accuracy, we first conducted a Pearson correlation analysis using mean per-item accuracy scores. The results revealed a negative correlation, consistent with our hypothesis that higher Surprisal values correspond to more difficult clues. Among the tested models, Llama-3 and Ita-GPT-2 yielded higher Pearson coefficients, which may reflect Llama-3's extensive multilingual capacity and Ita-GPT-2's fine-tuning on Italian. Figure 2 illustrates the correlation between Surprisal and Accuracy for the three models on a representative concatenation rule. In addition, Tables 11, 12, and 13 in the Appendix report a Generalized Linear Mixed Model (GLMM) analysis, which incorporates individual variability without aggregating accuracy values. This analysis further confirms Surprisal as a significant predictor of Accuracy, and therefore of clue difficulty.</p>
        <p>We also investigated the relationship between surprisal and response times (RTs) using a series of Linear Mixed Models (LMMs) fitted separately for each concatenation type. RTs were log-transformed to correct for positive skew and stabilize variance, in line with standard psycholinguistic practice. This transformation helped reduce the impact of outliers and enabled the use of parametric modeling techniques. In each model, surprisal was included as a fixed effect, and subject-specific intercepts were modeled as random effects to account for baseline variation across participants. The results consistently showed a statistically significant positive relationship between surprisal and log-transformed RTs across all concatenation types, as summarized in Table 4 for Llama-3 and, for the other two models, in the Appendix (Tables 14, 15). This indicates that clues with higher surprisal values led to longer response times, supporting the hypothesis that surprisal reflects processing difficulty. Although the magnitude of the effect varied by concatenation rule, all coefficients were positive, and confidence intervals did not include zero.</p>
        <p>These findings demonstrate that surprisal is a robust predictor of reading latency in the crossword task, even under minimal context and with sparse surface cues. Importantly, this effect emerges despite the lack of explicit time pressure, suggesting that surprisal exerts an automatic influence on processing effort. While the overall pattern is clear, future research could further refine the temporal precision of RTs by decomposing the overall response into distinct phases: logging (i) the time to initiate typing, (ii) the typing duration, and (iii) the post-completion delay would help distinguish comprehension time from motor and decision-related delays. This would allow a more direct mapping between linguistic difficulty and behavioral latency, providing an even clearer picture of the cognitive processes involved.</p>
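        <p>For reference, the per-item Pearson correlation used in these analyses can be computed without external libraries; the sketch below uses toy values, not the study's data:</p>

```python
import math

def pearson_r(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length
    # sequences (e.g. per-item Surprisal vs. mean accuracy).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```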
        <sec id="sec-4-3-1">
          <title>5.1.1. Correlation in Different Categories</title>
          <p>To further investigate how Surprisal correlates with human performance across different types of clues, we analyzed the correlation separately for the different macro-categories and individual categories. The results are visualized in Figure 3 for the Ita-GPT-2 model. Our findings indicate that the strength of the correlation between Surprisal and Accuracy varies significantly depending on the type of clue. In particular, two categories showed notably weak correlations:
• Metalinguistic Clues: This category exhibited no correlation between Surprisal and Accuracy.</p>
          <p>A likely explanation is the difficulty transformers face when processing metalinguistic cues, such as wordplays and abbreviations. Since these models rely on token probabilities, and not on single characters, they struggle to accurately predict non-standard or unconventional relationships between clues and answers, which are common in metalinguistic clues.
• Copular Clues: The correlation was also absent for copular structures. One probable reason is that the cioè concatenation rule does not naturally fit the syntactic structure of these clues. Copular constructions often require a more flexible paraphrasing strategy, rather than a simple equivalence statement, leading to suboptimal Surprisal estimations.</p>
          <p>Other categories, particularly nominal and verbal predicate structures, displayed stronger correlations, suggesting that Surprisal works better for categories where the clue-answer relationship is more straightforwardly semantic rather than dependent on linguistic nuances like wordplay or syntactic constraints.</p>
          <p>A more robust analysis with GLMMs, to account for individual variability, will require more data for each category. We leave this further effort to future experimental work.</p>
          <p>We estimated token-level Surprisal with three causal LLMs (Ita-GPT-2-121M, Llama-2-7B, Llama-3-8B).</p>
          <p>Answers to the research questions
1. RQ1: Higher Surprisal predicts lower solver accuracy (best r = −0.57) and longer log-RTs, showing that information-theoretic “surprise” mirrors cognitive load.
2. RQ2: Language match beats raw size: the Italian-specific Ita-GPT-2 and the multilingual Llama-3 surpass the larger, English-leaning Llama-2.
3. RQ3: No single template suffices. Topic–comment placement works best for nominal and verbal clues, the cioè rule for many adjectival/infinitival ones, while copular and metalinguistic items need ad-hoc rewrites; selecting the best rule per macro-category adds up to 0.15 r-points.
4. RQ4: Category-specific Surprisal thresholds separate “easy”, “medium” and “hard” clues, enabling an adaptive generator that targets any solver level.</p>
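<p>The threshold idea behind RQ4 can be sketched as a per-category binning rule. The cut-off values and category names below are invented for illustration, not the thresholds estimated in the study:</p>

```python
import bisect

# Hypothetical per-category surprisal cut-offs (bits): [easy/medium, medium/hard].
THRESHOLDS = {
    "bare_NP": [8.0, 12.0],
    "metalinguistic": [10.0, 15.0],
}
LABELS = ["easy", "medium", "hard"]

def difficulty(category, surprisal):
    """Map a clue's surprisal to a difficulty band using its category's cut-offs."""
    cuts = THRESHOLDS[category]
    return LABELS[bisect.bisect_left(cuts, surprisal)]

print(difficulty("bare_NP", 5.0))          # easy
print(difficulty("bare_NP", 9.5))          # medium
print(difficulty("metalinguistic", 16.0))  # hard
```

An adaptive generator could then sample only clues whose predicted band matches the target solver level.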
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>5.2. Effect of Concatenation Strategies</title>
        <p>We also explored the impact of different concatenation strategies on model performance. The concatenation method influenced Surprisal values differently across clue categories. Some structures benefited from the cioè rule, while others yielded more reliable Surprisal estimates under a different approach.</p>
        <p>Table 5 shows, for each macro category, the concatenation that yields the best correlation results and its value. These results highlight the importance of category-specific approaches when applying Surprisal-based difficulty estimation.</p>
        <p>Main finding. LLM-derived Surprisal is a reliable, fine-grained predictor of human crossword difficulty, explaining more than half of the variance in accuracy for the most common clue types.</p>
        <p>Limitations. (i) Italian-only data; other languages may need new tokenisers. (ii) The 160-item set limits power for rare structures. (iii) RTs blend reading, reasoning and typing; keystroke logs would isolate comprehension latency. (iv) Only decoder-style LLMs were tested; encoder–decoder or retrieval-augmented models might align differently. (v) Clues were scored in isolation, ignoring cross-checks within full grids.</p>
        <p>5.3. Summary of Findings</p>
        <p>Overall, our findings confirm that Surprisal serves as
a useful predictor of CW puzzle difficulty, particularly
when considering Accuracy as a measure of challenge.</p>
        <p>However, its predictive power for solving times remains
limited, likely due to the nature of short CW clues. The
choice of concatenation strategy also plays a crucial
role in model performance, suggesting that tailored
approaches could further refine Surprisal-based difficulty
estimations.</p>
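<p>Tailoring the template per macro category, as Table 5 does, amounts to a simple arg-max over correlation strength. The category names match the paper's labels, but the coefficient values here are placeholders:</p>

```python
# Hypothetical |r| values per (category, concatenation rule); placeholders only.
results = {
    "def_DP": {"concatenation_topic_art": 0.61, "concatenation_cioè_art": 0.48},
    "inf_VP": {"concatenation_subj_art": 0.57, "concatenation_cop": 0.41},
}

def best_rule(category):
    """Return the concatenation rule with the strongest correlation for a category."""
    rules = results[category]
    return max(rules, key=rules.get)

for cat in results:
    print(cat, "->", best_rule(cat))
```

With real coefficients in place of the placeholders, this selection is what yields the per-category gains reported under RQ3.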
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>This paper provides the first cognitively grounded, automatic gauge of crossword-clue difficulty. We compiled a 160-item Italian benchmark (2,880 human judgements), converted each clue–answer pair into well-formed sentences with five templates, and estimated token-level Surprisal with three causal LLMs (Ita-GPT-2-121M, Llama-2-7B, Llama-3-8B). Anchoring puzzle evaluation in probabilistic language theory links NLP, psycholinguistics and game AI, promising crosswords that scale from novice amusement to expert challenge while offering a fresh lens on
human–machine language alignment.</p>
      <sec id="sec-5-1">
        <title>Future work</title>
        <p>1. Scale the benchmark to thousands of clues, multiple languages and complete grids.
2. Log richer behaviour (eye-tracking, keystrokes, EEG) to separate processing stages.
3. Probe new architectures and character-level tokenisers for closer cognitive fidelity.
4. Fuse Surprisal with real-time solver profiles for personalised tutoring.
5. Couple Surprisal-based clue ranking with constraint-based fills to deliver fully adaptive crosswords.</p>
        <p>7. Appendices</p>
        <p>In the following section we report the complete results for all LLMs and concatenation rules, divided by macro category and language model. The appendix contains one correlation table for each model; see the individual captions.</p>
        <p>Table 5. Best correlation coefficients (r) and p-values for each macro category and concatenation type (Ita-GPT-2 Medium-121M). Categories: inf_VP, pass:other, metalinguistic, imp_refl:missSubj, def_DP, cop:missSubj, PP, cop:pron, ind_DP, cop:clitic, bare_NP:rel, adjP:pron, bare_NP, adjP, act:pron, act:missSubj, def_DP:rel, imp_refl:other, act:clitic, pass:missSubj. Concatenation types: concatenation_subj_art, concatenation_cop, concatenation_cioè_art, concatenation_topic_art, concatenation_inv_cop, concatenation_prompt.</p>
        <p>Declaration on Generative AI</p>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to paraphrase and reword text and to check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>