=Paper=
{{Paper
|id=Vol-3878/138_calamita_short
|storemode=property
|title=ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/138_calamita_short.pdf
|volume=Vol-3878
|authors=Andrea Zugarini,Kamyar Zeinalipour,Achille Fusco,Asya Zanollo
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZugariniZFZ24
}}
==ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge==
Andrea Zugarini¹·∗, Kamyar Zeinalipour², Achille Fusco³ and Asya Zanollo³
¹ expert.ai, Siena, Italy
² University of Siena, DIISM, Via Roma 56, 53100 Siena, Italy
³ IUSS Pavia, Piazza della Vittoria 15, 27100 Pavia (PV)
Abstract
This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate knowledge
and reasoning capabilities of large language models through crossword clue-answering. The challenge consists of two tasks:
a standard question-answering format where the LLM has to solve crossword clues, and a variation of it, where the model
receives hints about the word lengths of the answers, which are expected to help models with reasoning abilities. To construct
the ECWCA dataset, synthetic clues were generated based on entities and facts extracted from Italian Wikipedia. Generated
clues were then selected manually in order to ensure high-quality examples with factually correct and unambiguous clues.
Keywords
Educational Crosswords Dataset, Large Language Models, CALAMITA
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
Email: azugarini@expert.ai (A. Zugarini); kamyar.zeinalipour2@unisi.it (K. Zeinalipour); achille.fusco@iusspavia.it (A. Fusco); zanolloasya@gmail.com (A. Zanollo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Challenge: Introduction and Motivation
Crossword puzzles are well-known linguistic games that are usually played for entertainment, but they are also applied in education as a tool to assess the knowledge, reasoning skills and linguistic abilities of students [1, 2, 3]. Large Language Models (LLMs) [4, 5, 6] have shown impressive abilities and strong knowledge about the world. Recently, language models have been used extensively both to solve [7, 8, 9, 10, 11] and to create [12, 13] crossword clues for educational purposes.
In this challenge, instead, we use educational crossword clues to build a benchmark that assesses the clue-answering skills of LLMs on popular entities and facts about the world. We refer to it as ECWCA, standing for Educational CrossWord Clues Answering. ECWCA is an Italian benchmark presented at CALAMITA [14], designed to include entities and facts that are popular in Italian culture.

2. Challenge: Description
In this challenge, we evaluate the knowledge of LLMs by testing them on crossword clue-answering tasks. We propose two slightly different tasks. The first is essentially a Question Answering problem, where the question is a clue and we expect the LLM to reply with the correct answer. In the second, the goal is the same, but we assist the model with hints about the length of the words in the answer. These hints reduce the number of possible answers, so models with reasoning skills are expected to take advantage of them.
To build ECWCA, we created a dataset of synthetic clues grounded on entities and facts extracted from Italian Wikipedia pages. Clue-answer pairs were generated following the methodology of clue-instruct [13]. In a nutshell, we create multiple clues for a given answer. The generation is grounded on a text about the given answer and on a topic. A sketch of the method is outlined in Figure 1. Since the approach produces multiple definitions for a single answer, and not all of them may be of sufficient quality, we perform a manual selection step to keep only high-quality clues.

Figure 1: Sketch of clue-instruct method, in four stages: (a) Data Retrieval, (b) Data Screening, (c) Craft the Prompt, (d) Clues Generation. Picture taken from [13].

3. Data description
3.1. Origin of data
The dataset was constructed following the clue-instruct [13] approach. Clue-instruct addresses a clue generation problem: the task is to generate multiple clues given an answer, its context and its category. Here, instead, we exploit the approach to build a QA dataset of clue-answer pairs. This happens in two steps: first we generate a set of examples, each consisting of an answer and its generated clues (as in clue-instruct); then we manually select the most suitable clue-answer pairs (see Section 3.2 for further details).
In order to construct the examples with clue-instruct,
we identified the most visited Italian Wikipedia (https://it.wikipedia.org/) pages. To count visits, we considered the period between September 10, 2023 and May 31, 2024 and gathered statistics from the Wikimedia APIs (wikimedia.org). We considered the page title as the answer. Titles with non-alphabetic characters, or with fewer than two or more than 20 characters, were excluded. From the remaining pages, we extracted the content. Unlike clue-instruct, we did not have category information available, so we generated it by querying GPT-4o [6], asking it to choose the category of the answer, given its page content, from a set of 20 predefined categories. We then randomly sampled the pages and asked GPT-4o to create three clues for each answer. Finally, those examples underwent the manual selection process, which keeps only one clue out of the three. The dataset is publicly available at https://huggingface.co/datasets/azugarini/crossword-clues-QA.

3.2. Annotation details
The clue-instruct method produces three different clues for each given answer and its context. To select only one clue we add a human selection step. In this way we avoid multiple occurrences of the same answer and guarantee high-quality definitions and answers.
The example selection process was carried out by three native Italian-speaking annotators. Examples were split into 18 chunks of 100 examples each, equally distributed among the annotators.
Each example was presented with the answer, the three generated clues and the Wikipedia page paragraph that was used to create the clues. Annotators were tasked with selecting the best clue, if any, based on the following criteria:

Truthfulness and Accuracy. It was imperative that the content of the selected clue was factually correct. Annotators cross-checked the clue against the provided Wikipedia page content to ensure that it did not contain misleading or false information, thereby ensuring the integrity of the dataset.

Answerability. Annotators were instructed to choose a clue that could be answered without a high degree of ambiguity. The focus was on clues that provided enough information to infer the correct answer with confidence. Clues that left room for multiple interpretations or guesses were rejected. For example, generic definitions such as 'a large mammal' do not fit this criterion, since many species match the description.

No clue-answer overlap. Clues including the answer or a significant portion of it were discarded.

In cases where more than one clue satisfied all the criteria, annotators were directed to select the clue that provided the most relevant information with the greatest clarity and simplicity. When no clue matched the criteria, the whole example was discarded.

3.3. Data format
Each example includes the clue-answer pair, the word length hint, some additional metadata (such as the category and the page views) and the URL of the Wikipedia page whose content was used to generate the clue. More precisely, there are the following columns: clue, answer, answer_len, url, content, views, category, length_hint, raw_entity. A few examples are showcased in Table 1, where, for the sake of simplicity, we only report the clue-answer pair, the hint and the category of the example.

3.4. Example of prompts used for zero or/and few shots
We defined two different prompts, one with and the other without indications about the word lengths of the answer. The two prompts are presented in Figure 4 and Figure 3, respectively.
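As an illustration of how the two prompt variants of Section 3.4 could be assembled, here is a minimal Python sketch. The Italian wording and the two demonstrations follow Figures 3 and 4; the function name `build_prompt`, the constant names, and the exact blank-line layout are assumptions made for this example and are not taken from the authors' code.

```python
# Hypothetical helper: fills the 2-shot prompt with or without length hints.
HEADER = "Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba."
HINT_NOTE = (
    "Ti verrà data una definizione corredata da un suggerimento, una sequenza di "
    "numeri indicante di quanti caratteri è composta ciascuna parola della risposta."
)
TASK = "Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro."

# The two static demonstrations: (clue, length hint, answer).
SHOTS = [
    ("Protagonista di Titanic al fianco di Kate Winslet", "(8,8)", "leonardo dicaprio"),
    ("capitale dell'Impero romano d'Occidente nel 313 d.C.", "(6)", "milano"),
]


def _example(clue, hint, answer, with_hint):
    """Render one DEFINIZIONE / (SUGGERIMENTO) / RISPOSTA group."""
    lines = [f"DEFINIZIONE: {clue}"]
    if with_hint:
        lines.append(f"SUGGERIMENTO: {hint}")
    # An empty answer leaves "RISPOSTA:" open for the model to complete.
    lines.append(f"RISPOSTA: {answer}".rstrip())
    return "\n".join(lines)


def build_prompt(clue, length_hint=None):
    """Return the prompt without hints (length_hint=None) or with hints."""
    with_hint = length_hint is not None
    intro = [HEADER] + ([HINT_NOTE] if with_hint else []) + [TASK]
    parts = ["\n".join(intro), "Esempi:"]
    parts += [_example(c, h, a, with_hint) for c, h, a in SHOTS]
    parts.append("Ora tocca a te:")
    parts.append(_example(clue, length_hint, "", with_hint))
    return "\n\n".join(parts)
```

Calling `build_prompt(clue)` gives the layout without hints, while `build_prompt(clue, "(9)")` adds a `SUGGERIMENTO` line to the demonstrations and to the final query.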
Table 1: Some examples of generated clues in the dataset, their answers, the hint suggesting the character length of each word in the answer and the category representing the topic of the clue.

Clue | Length Hint | Category | Answer
Sovrana che instaurò rapporti con Giulio Cesare e Marco Antonio | (9) | History | Cleopatra
Autore de I Malavoglia e Mastro-don Gesualdo | (8,5) | Literature | Giovanni Verga
Pilota austriaco tre volte campione del mondo di Formula 1 | (4,5) | Sports | Niki Lauda
Attore canadese protagonista di Blade Runner 2049 | (4,7) | Entertainment | Ryan Gosling
Opera divisa in tre cantiche: Inferno, Purgatorio e Paradiso | (6,8) | Literature | Divina Commedia
Stato dell’Oceania con capitale Canberra | (9) | Geography | Australia
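The Length Hint column of Table 1 can be derived from the answer alone, and the title filter of Section 3.1 is a simple predicate on the page title. The Python sketch below shows one possible reading of both. The function names are hypothetical. Allowing spaces between words and counting only letters towards the 2 to 20 character limit are assumptions, since the paper does not spell out these details.

```python
def is_valid_title(title: str) -> bool:
    """Title filter in the spirit of Section 3.1.

    Assumptions (not stated in the paper): spaces between words are allowed,
    and the 2-20 character limit counts letters only.
    """
    words = title.split()
    if not words or not all(w.isalpha() for w in words):
        return False
    n_letters = sum(len(w) for w in words)
    return 2 <= n_letters <= 20


def length_hint(answer: str) -> str:
    """Per-word character counts in the '(8,5)' format shown in Table 1."""
    return "(" + ",".join(str(len(w)) for w in answer.split()) + ")"
```

For instance, `length_hint("Giovanni Verga")` yields `(8,5)`, matching the second row of Table 1, while a title such as "Blade Runner 2049" is rejected because it contains digits.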
Figure 2: Page views distribution (the very few examples above one million visits were excluded). Histogram; x-axis: # Views (0 to 1e6), y-axis: Count.

Task without hints. We construct a 2-shot prompt (Figure 3) for the task. First, we instruct the model to act as an expert in solving crossword clues, without any additional hints related to the structure of the answer (such as word lengths). The format is clear and concise, focusing on the core task: resolving the crossword definition and providing only the solution. Then, two static demonstration examples illustrate to the model how to approach the task. Finally, following the same layout, we present a new clue and expect the model to complete it with the answer.

    Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.
    Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro.

    Esempi:

    DEFINIZIONE: Protagonista di Titanic al fianco di Kate Winslet
    RISPOSTA: leonardo dicaprio

    DEFINIZIONE: capitale dell'Impero romano d'Occidente nel 313 d.C.
    RISPOSTA: milano

    Ora tocca a te:

    DEFINIZIONE: {clue}
    RISPOSTA:

Figure 3: Prompt task without hints.

Task with word length hints. This prompt (see Figure 4) is very similar to the first one, but introduces a hint indicating the word lengths of the expected answer. The hint is a constraint that reduces the number of valid answers, indicating both how many words there are and how long each one is; ideally, therefore, it should aid the language model.

    Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.
    Ti verrà data una definizione corredata da un suggerimento, una sequenza di numeri indicante di quanti caratteri è composta ciascuna parola della risposta.
    Trova la risposta alla definizione.
    Ritorna solo la risposta, nient'altro.

    Esempi:

    DEFINIZIONE: Protagonista di Titanic al fianco di Kate Winslet
    SUGGERIMENTO: (8,8)
    RISPOSTA: leonardo dicaprio

    DEFINIZIONE: capitale dell'Impero romano d'Occidente nel 313 d.C.
    SUGGERIMENTO: (6)
    RISPOSTA: milano

    Ora tocca a te:

    DEFINIZIONE: {clue}
    SUGGERIMENTO: {length_hint}
    RISPOSTA:

Figure 4: Prompt task with word length hints.

3.5. Detailed data statistics
Overall we collected 1,171 clue-answer pairs belonging to 16 different categories. The distribution of answers among categories is outlined in Figure 5. Most of the examples belong to the Entertainment topic; indeed, the dataset includes many actors, TV shows, movies and fictional characters. Sports, Geography, History and Society are also well represented, whereas the remaining categories are less frequent, with some, such as Applied Science, Philosophy and Education, being rare.
The pages from which clue-answer pairs were built have about 234 thousand views each on average, ranging from a minimum of 1,108 up to almost five million views. However, only a few examples exceed one million views and the vast majority have fewer than half a million, as we can observe from Figure 2.

Figure 5: Distribution of the examples across the categories. Bar chart; x-axis: Category, y-axis: Count.

4. Metrics
To evaluate the performance on the tasks we rely on the following metrics: Edit Distance (ED), Exact Match (EM), and average F1 score on words (F1).

Edit Distance. Edit Distance (also known as Levenshtein Distance) measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into another. In this context, ED measures how close the generated response is to the ground truth answer. A lower ED indicates better performance, as it signifies that the predicted text is more similar to the target text.

Exact Match. Exact Match (EM) is a binary metric that evaluates whether the generated answer exactly matches the ground truth. We report the EM score averaged over all examples as a percentage, which corresponds to the percentage of correctly predicted answers.

F1 score. The F1 score evaluates how well the predicted words overlap with the ground truth answer. For example, if the ground truth is "leonardo dicaprio" and the model predicts "dicaprio", the model has perfect precision but imperfect recall (50%), resulting in a 66.67% F1 score.

Figure 6: ED, EM and F1 score as a function of the number of page views for Llama3.1 models (Llama3.1 8B, Llama3.1 8B-instruct, Llama3.1 70B-instruct). Three panels (ED, EM, F1 Score); x-axis: # Views, grouped in the intervals [10^3, 10^4), [10^4, 10^5), [10^5, 10^6), [10^6, ∞).
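To make the three metrics of Section 4 concrete, and to show how examples could be grouped into the power-of-ten page-view intervals on the x-axis of Figure 6, here is a minimal Python sketch. It is not the authors' evaluation code. The lowercasing and whitespace normalisation in `_norm`, the multiset word overlap used for F1, and all function names are assumptions made for this example.

```python
from collections import Counter


def _norm(text: str) -> str:
    # Assumption: compare case-insensitively with collapsed whitespace.
    return " ".join(text.lower().split())


def edit_distance(pred: str, gold: str) -> int:
    """Levenshtein distance via the standard two-row dynamic programme."""
    a, b = _norm(pred), _norm(gold)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def exact_match(pred: str, gold: str) -> int:
    """1 if the normalised strings are identical, else 0."""
    return int(_norm(pred) == _norm(gold))


def word_f1(pred: str, gold: str) -> float:
    """Word-level F1; overlap is counted as a multiset intersection."""
    p, g = _norm(pred).split(), _norm(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def views_bin(views: int) -> str:
    """Interval label for a page-view count.

    Assumes views >= 1000, as in the dataset (minimum 1,108); smaller
    values would fall into the first interval.
    """
    for exp in (3, 4, 5):
        if views < 10 ** (exp + 1):
            return f"[10^{exp}, 10^{exp + 1})"
    return "[10^6, inf)"
```

With these definitions, predicting "dicaprio" for the gold answer "leonardo dicaprio" gives precision 1, recall 0.5 and an F1 of two thirds, in line with the 66.67% example in the text. Averaging `exact_match` and `word_f1` over all examples and multiplying by 100 gives percentage scores.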
Table 2: Performance on the task with and without word length hints.

Model | Hint | ED ↓ | EM | F1
Llama3 8B | No | 11.43 | 14.82 | 16.37
Llama3 8B | Yes | 11.52 | 10.82 | 11.91
Llama3 8B-instruct | No | 11.43 | 14.82 | 16.37
Llama3 8B-instruct | Yes | 12.07 | 14.48 | 16.07
Llama3.1 8B | No | 6.99 | 34.16 | 37.35
Llama3.1 8B | Yes | 8.01 | 25.72 | 27.51
Llama3.1 8B-instruct | No | 7.31 | 39.69 | 44.47
Llama3.1 8B-instruct | Yes | 6.14 | 40.80 | 44.58
Llama3.1 70B-instruct | No | 3.32 | 66.61 | 70.16
Llama3.1 70B-instruct | Yes | 3.27 | 67.89 | 71.24

Preliminary Results. We establish baseline results on ECWCA by testing some of the models in the Llama family. In particular, we consider Llama3 8B and Llama3.1 8B in both instructed and non-instructed versions, and Llama3.1 70B-instruct, to observe how model size affects the results. Table 2 reports the performance of the LLMs on the two tasks (with and without word-length hints), evaluated with the metrics defined above. We can observe that Llama3.1 8B consistently outperforms its predecessor across all the metrics, both with and without hints. The gap between the smaller LLMs and Llama3.1 70B-instruct is remarkable, confirming once again that larger LLMs retain much more knowledge.
Word-length hints, instead, generally do not help the models, and actually harm the performance of non-instructed models. For example, the F1 score of Llama3.1 8B drops significantly, from 37.35 without hints to 27.51 with hints, and EM similarly decreases from 34.16 to 25.72. The Llama3.1 instructed models are not harmed in this way: for them the hints lead to a small improvement in all the metrics, whereas Llama3 8B-instruct degrades slightly (F1 from 16.37 to 16.07). Only for Llama3.1 70B-instruct do we observe a statistically significant improvement. This may suggest that constraints are beneficial only for models with stronger understanding capabilities.
In Figure 6, we show how the performance of the Llama3.1 family models varies with respect to the number of page views. We group examples into intervals, then compute the metrics on each of them. Edit distance shows no significant trend. EM and F1 increase on more visited pages for the 8B models, whereas the behaviour of the 70B model seems uncorrelated with the number of views. This suggests that the larger number of weights in the 70B model stores broader and deeper knowledge about world facts and entities, covering less popular ones as well, whereas smaller LLMs embody only the most popular factual knowledge seen during training.

5. Limitations
Large Language Models have all been exposed to vast amounts of data. The clues proposed in this dataset were created from Wikipedia pages that were almost certainly seen by the LLMs during training. Clues also generally adhere closely to the page content, since they were created from it. Indeed, one of the goals of the benchmark is to assess the memorization capabilities of LLMs on facts that they are likely to know well. However, the proposed dataset is new, hence it could not have been part of the training set of such LLMs.

6. Data license and copyright issues
Data is released under the apache-2.0 license.

References
[1] R. Nickerson, Crossword puzzles and lexical memory, in: Attention and performance VI, Routledge, 1977, pp. 699–718.
[2] E. Yuriev, B. Capuano, J. L. Short, Crossword puzzles for chemistry education: learning goals beyond vocabulary, Chemistry education research and practice 17 (2016) 532–554.
[3] C. Sandiuc, A. Balagiu, The use of crossword puzzles as a strategy to teach maritime english vocabulary, Scientific Bulletin "Mircea cel Batran" Naval Academy 23 (2020) 236A–242.
[4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[6] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[7] A. Zugarini, M. Ernandes, A multi-strategy approach to crossword clue answer retrieval and ranking (2021).
[8] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, arXiv preprint arXiv:2205.09665 (2022).
[9] A. Zugarini, T. Rothenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die rätselrevolution: Automated german crossword solving, in: CLiC-it, 2023.
[10] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé, F. Simões, K. Zeinalipour, A. Zugarini, M. Gori, The webcrow french crossword solver, in: International Conference on Intelligent Technologies for Interactive Entertainment, Springer, 2023, pp. 193–209.
[11] S. Saha, S. Chakraborty, S. Saha, U. Garain, Language models are crossword solvers, arXiv preprint arXiv:2406.09043 (2024).
[12] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles (2023).
[13] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-instruct: Text-based clue generation for educational crossword puzzles, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347–3356. URL: https://aclanthology.org/2024.lrec-main.297.
[14] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.