=Paper=
{{Paper
|id=Vol-3878/138_calamita_short
|storemode=property
|title=ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/138_calamita_short.pdf
|volume=Vol-3878
|authors=Andrea Zugarini,Kamyar Zeinalipour,Achille Fusco,Asya Zanollo
|dblpUrl=https://dblp.org/rec/conf/clic-it/ZugariniZFZ24
}}
==ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge==
<pdf width="1500px">https://ceur-ws.org/Vol-3878/138_calamita_short.pdf</pdf>
<pre>
ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge

Andrea Zugarini (1,∗), Kamyar Zeinalipour (2), Achille Fusco (3) and Asya Zanollo (3)

(1) expert.ai, Siena, Italy
(2) University of Siena, DIISM, Via Roma 56, 53100 Siena, Italy
(3) IUSS Pavia, Piazza della Vittoria 15, 27100 Pavia (PV)


Abstract
This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate the
knowledge and reasoning capabilities of large language models through crossword clue-answering. The challenge
consists of two tasks: a standard question-answering format where the LLM has to solve crossword clues, and a
variation of it, where the model receives hints about the word lengths of the answers, which is expected to
help models with reasoning abilities. To construct the ECWCA dataset, synthetic clues were generated based on
entities and facts extracted from Italian Wikipedia. Generated clues were then selected manually in order to
ensure high-quality examples with factually correct and unambiguous clues.

Keywords
Educational Crosswords Dataset, Large Language Models, CALAMITA


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
∗ Corresponding author.
Email: azugarini@expert.ai (A. Zugarini); kamyar.zeinalipour2@unisi.it (K. Zeinalipour);
achille.fusco@iusspavia.it (A. Fusco); zanolloasya@gmail.com (A. Zanollo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).


1. Challenge: Introduction and Motivation

Crossword puzzles are well-known linguistic games that are usually used for entertainment, but they are also
applied in education as a tool to assess knowledge, reasoning skills and linguistic abilities of students
[1, 2, 3]. Large Language Models (LLMs) [4, 5, 6] have shown impressive abilities and strong knowledge about
the world. Recently, Language Models have been extensively used to both solve [7, 8, 9, 10, 11] and create
crossword clues [12, 13] for educational purposes.
   In this challenge, instead, we use educational crossword clues to build a benchmark that assesses the
clue-answering skills of LLMs on popular entities and facts about the world. We refer to it as ECWCA, standing
for Educational CrossWord Clues Answering. ECWCA is an Italian benchmark presented as part of CALAMITA [14],
designed to include entities and facts that are popular in Italian culture.


2. Challenge: Description

In this challenge, we evaluate the knowledge abilities of LLMs by testing them on crossword clue-answering
tasks. We propose two slightly different tasks in the challenge. The first one is essentially a Question
Answering problem, where the question is a clue and we expect the LLM to reply with the correct answer. In the
second case, the goal is analogous, but we assist the model with hints related to the length of the words in
the answer. These hints reduce the number of possible answers, so models with reasoning skills are expected to
take advantage of them.
   To build ECWCA, we created a dataset of synthetic clues grounded on entities and facts extracted from
Italian Wikipedia pages. Clue-answer pairs were generated following the same methodology as clue-instruct [13].
In a nutshell, we create multiple clues for a given answer. The generation is grounded on a text about the
given answer and on a topic. A sketch of the method is outlined in Figure 1. Since the approach produces
multiple definitions for a single answer, and the quality may not be good enough for all of them, we perform a
manual selection step to preserve only high-quality clues.

[Figure 1: Sketch of clue-instruct method, in four stages: (a) Data Retrieval, (b) Data Screening,
(c) Craft the Prompt, (d) Clues Generation. Picture taken from [13].]


3. Data description

3.1. Origin of data

The dataset was constructed following the clue-instruct [13] approach. clue-instruct addresses a clue
generation problem: the task is to generate multiple clues given a certain answer, its context and its
category. Here, instead, we exploit the approach to build a QA dataset of clue-answer pairs. This happens in
two steps: first we generate a set of examples, each consisting of an answer and the generated clues (as in
clue-instruct), then we manually select the most suitable clue-answer pairs (see Section 3.2 for further
details).
   In order to construct the examples with clue-instruct,
we identified the most visited Italian Wikipedia (https://it.wikipedia.org/) pages. To count visits, we
considered the period between September 10, 2023 and May 31, 2024 and gathered statistics from the Wikimedia
APIs (wikimedia.org). We considered the page title as the answer. Titles with non-alphabetic characters, or
with fewer than two or more than 20 characters, were excluded. For the remaining pages, we extracted the
content. Unlike clue-instruct, we did not have category information available, so we generated it by querying
GPT-4o [6], asking it to choose the category of the answer, given its page content, from a set of 20
predefined categories. We then randomly sampled the pages and prompted GPT-4o to create three clues for each
answer. Finally, those examples underwent the manual selection process, which keeps at most one clue among the
three. The dataset is publicly available (https://huggingface.co/datasets/azugarini/crossword-clues-QA).

3.2. Annotation details

The clue-instruct method produces three different clues for each given answer and its context. To select only
one clue we add a human selection step. This avoids multiple occurrences of the same answer and guarantees
high-quality definitions and answers.
   The example selection process was carried out by three native Italian-speaking annotators. Examples were
split into 18 chunks of 100 examples each, equally distributed among the annotators.
   Each example was presented with the answer, the three generated clues and the Wikipedia page paragraph that
was used to create the clues. Annotators were tasked with selecting the best clue, if any, based on the
following criteria:

Truthfulness and Accuracy. It was imperative that the content of the selected clue was factually correct.
Annotators cross-verified the accuracy of the clue against the provided Wikipedia page content to ensure that
it did not contain misleading or false information, thereby ensuring the integrity of the dataset.

Answerability. Annotators were instructed to choose a clue that could be answered without a high degree of
ambiguity. The focus was on clues that provided enough information to infer the correct answer with
confidence. Clues that left room for multiple interpretations or guesses were rejected. For example, generic
definitions, such as 'a large mammal', do not fit this criterion, since many species match them.

No clue-answer overlap. Clues including the answer or a significant portion of it should be discarded.

   In cases where more than one clue satisfied all the criteria, annotators were directed to select the clue
that provided the most relevant information with the greatest clarity and simplicity. When no clue matched the
criteria, the whole example was discarded.

3.3. Data format

Each example includes the clue-answer pair, the word length hint, some additional metadata (such as the
category and the page views) and the URL of the Wikipedia page whose content was used to generate the clue.
More precisely, there are the following columns: clue, answer, answer_len, url, content, views, category,
length_hint, raw_entity. A few examples are showcased in Table 1, where for the sake of simplicity we only
report the clue-answer pair, the hint and the category of the example.

3.4. Example of prompts used for zero or/and few shots

We defined two different prompts, one with and the other without indications about the word lengths of the
answer. The two prompts are presented in Figure 4 and Figure 3, respectively.
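
As an illustration of how the two prompts of Figure 3 and Figure 4 could be filled in for a given example, the
following minimal sketch stores them as Python templates. The template text is copied from the figures, but the
exact line breaks inside the prompts are an assumption (the figures wrap lines for layout), and the names
PROMPT_NO_HINT, PROMPT_WITH_HINT and build_prompt are hypothetical, not taken from the authors' code.

```python
# Sketch only: prompt wording comes from Figures 3 and 4; the line breaks and all
# identifiers below are illustrative assumptions.

PROMPT_NO_HINT = """Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.
Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro.

Esempi:

DEFINIZIONE: Protagonista di Titanic al fianco di Kate Winslet
RISPOSTA: leonardo dicaprio

DEFINIZIONE: capitale dell'Impero romano d'Occidente nel 313 d.C.
RISPOSTA: milano

Ora tocca a te:

DEFINIZIONE: {clue}
RISPOSTA:"""

PROMPT_WITH_HINT = """Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.
Ti verrà data una definizione corredata da un suggerimento, una sequenza di numeri indicante di quanti caratteri è composta ciascuna parola della risposta.
Trova la risposta alla definizione.
Ritorna solo la risposta, nient'altro.

Esempi:

DEFINIZIONE: Protagonista di Titanic al fianco di Kate Winslet
SUGGERIMENTO: (8,8)
RISPOSTA: leonardo dicaprio

DEFINIZIONE: capitale dell'Impero romano d'Occidente nel 313 d.C.
SUGGERIMENTO: (6)
RISPOSTA: milano

Ora tocca a te:

DEFINIZIONE: {clue}
SUGGERIMENTO: {length_hint}
RISPOSTA:"""


def build_prompt(clue, length_hint=None):
    """Return the 2-shot prompt for a clue, with the hint variant if a hint is given."""
    if length_hint is None:
        return PROMPT_NO_HINT.format(clue=clue)
    return PROMPT_WITH_HINT.format(clue=clue, length_hint=length_hint)
```

For instance, build_prompt("Stato dell'Oceania con capitale Canberra", "(9)") yields the prompt with hints
for the last example of Table 1, ending with an empty RISPOSTA: line for the model to complete.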
Table 1
Some examples of generated clues in the dataset, their answers, the hint suggesting the character length of each word in the
answer and the category representing the topic of the clue.
                                    Clue                               Length Hint       Category              Answer
    Sovrana che instaurò rapporti con Giulio Cesare e Marco Antonio        (9)            History             Cleopatra
              Autore de I Malavoglia e Mastro-don Gesualdo                (8,5)          Literature        Giovanni Verga
       Pilota austriaco tre volte campione del mondo di Formula 1         (4,5)            Sports            Niki Lauda
           Attore canadese protagonista di Blade Runner 2049              (4,7)        Entertainment        Ryan Gosling
       Opera divisa in tre cantiche: Inferno, Purgatorio e Paradiso       (6,8)          Literature       Divina Commedia
                Stato dell’Oceania con capitale Canberra                   (9)          Geography             Australia
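
The length hints shown in Table 1 list the number of characters of each word in the answer. A minimal sketch
of how such a hint could be derived, together with the title filter described in Section 3.1, is given below.
The function names are hypothetical, and the filter encodes assumptions that the paper does not spell out:
spaces between words are allowed (multi-word answers such as "Giovanni Verga" are in the dataset), any Unicode
letter counts as alphabetic, and the length limits apply to the whole title including spaces.

```python
def length_hint(answer: str) -> str:
    """Return a hint such as "(8,5)": the length of each whitespace-separated word."""
    return "(" + ",".join(str(len(word)) for word in answer.split()) + ")"


def is_valid_title(title: str, min_len: int = 2, max_len: int = 20) -> bool:
    """Keep titles made only of letters and spaces, with 2 to 20 characters.

    Assumption: limits are checked on the full title, spaces included.
    """
    if not (min_len <= len(title) <= max_len):
        return False
    if title.strip() == "":
        return False
    return all(ch.isalpha() or ch == " " for ch in title)


if __name__ == "__main__":
    for answer in ["Cleopatra", "Giovanni Verga", "Divina Commedia"]:
        print(answer, length_hint(answer), is_valid_title(answer))
```

On the answers of Table 1 this reproduces the reported hints, e.g. (9) for Cleopatra and (6,8) for Divina
Commedia, while a title such as "Blade Runner 2049" would be rejected because it contains digits.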


Task without hints. We construct a 2-shot prompt (Figure 3) for the task. First, we instruct the model to act
as an expert in solving crossword clues without any additional hints related to the structure of the answer
(such as word lengths). The format is clear and concise, focusing on the core task: resolving the crossword
definition and providing only the solution. Then, two static demonstration examples are showcased to
illustrate to the model how to approach the task. Finally, following the same layout, we present a new clue
and expect the model to complete it with the answer.

    Sei un esperto di enigmistica. Devi risolvere
    definizioni di cruciverba.
    Trova la risposta alla definizione. Ritorna solo la
    risposta, nient'altro.

    Esempi:

    DEFINIZIONE: Protagonista di Titanic al fianco di
    Kate Winslet
    RISPOSTA: leonardo dicaprio

    DEFINIZIONE: capitale dell'Impero romano d'Occidente
    nel 313 d.C.
    RISPOSTA: milano

    Ora tocca a te:

    DEFINIZIONE: {clue}
    RISPOSTA:

Figure 3: Prompt task without hints.

Task with word length hints. This prompt (see Figure 4) is very similar to the first one, but introduces a
hint indicating the word lengths of the expected answer. The hint is a constraint that reduces the number of
valid answers, giving indications on both how many words there are and their lengths; therefore, ideally, it
should aid the language model.

    Sei un esperto di enigmistica. Devi risolvere
    definizioni di cruciverba.
    Ti verrà data una definizione corredata da un
    suggerimento, una sequenza di numeri indicante di
    quanti caratteri è composta ciascuna parola della
    risposta.
    Trova la risposta alla definizione.
    Ritorna solo la risposta, nient'altro.

    Esempi:

    DEFINIZIONE: Protagonista di Titanic al fianco
    di Kate Winslet
    SUGGERIMENTO: (8,8)
    RISPOSTA: leonardo dicaprio

    DEFINIZIONE: capitale dell'Impero romano
    d'Occidente nel 313 d.C.
    SUGGERIMENTO: (6)
    RISPOSTA: milano

    Ora tocca a te:

    DEFINIZIONE: {clue}
    SUGGERIMENTO: {length_hint}
    RISPOSTA:

Figure 4: Prompt task with word length hints.


3.5. Detailed data statistics

Overall we collected 1,171 clue-answer pairs belonging to 16 different categories. The distribution of answers
among categories is outlined in Figure 5. Most of the examples belong to the Entertainment category: the
dataset includes many actors, TV shows, movies and fictional characters. Sports, Geography, History and
Society are also well represented, whereas the remaining categories are less frequent, with some, like Applied
Science, Philosophy and Education, being rare.
   The pages from which clue-answer pairs were built have about 234 thousand views each on average, ranging
from a minimum of 1,108 up to almost five million views. However, only a few examples exceed one million
views and the vast majority have fewer than half a million, as we can observe from Figure 2.

[Figure 2: Page views distribution (the very few examples above one million visits were excluded).
Histogram of Count against # Views.]

[Figure 5: Distribution of the examples across the categories. Bar chart of Count against Category.]

[Figure 6: ED, EM and F1 score performance varying with respect to the number of page views for Llama3.1
models. Three panels (ED, EM, F1 Score) against # Views, binned as [10^3, 10^4), [10^4, 10^5), [10^5, 10^6),
[10^6, ), with one curve each for Llama3.1 8B, Llama3.1 8B-instruct and Llama3.1 70B-instruct.]


4. Metrics

To evaluate the performance on the tasks we rely on the following metrics: Edit Distance (ED), Exact Match
(EM), and average F1 score on words (F1).

Edit Distance. Edit Distance (also known as Levenshtein Distance) measures the minimum number of
single-character edits (insertions, deletions, or substitutions) required to change one sequence into another.
In this context, ED measures how close the generated
response is to the ground truth answer. A lower ED indicates better performance, as it signifies that the
predicted text is more similar to the target text.

Exact Match. Exact Match (EM) is a binary metric that evaluates whether the generated answer exactly matches
the ground truth. We report the EM score averaged over all the examples as a percentage, which corresponds to
the percentage of correctly predicted answers.

F1 score. The F1 score evaluates how well the predicted words overlap with the ground truth answer. For
example, if the ground truth is "leonardo dicaprio" and the model predicts "dicaprio", the model would have
perfect precision, but imperfect recall (50%), resulting in a 66.67% F1 score.
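
The three metrics can be sketched in a few lines of Python. This is an illustrative implementation, not the
official evaluation code of the challenge: the function names are hypothetical, and the normalization step
(lowercasing and collapsing whitespace before computing EM and F1) is an assumption suggested by the lowercase
answers in the prompt examples.

```python
from collections import Counter


def normalize(text: str) -> str:
    # Assumption: comparison is case-insensitive and ignores extra whitespace.
    return " ".join(text.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def exact_match(prediction: str, truth: str) -> bool:
    return normalize(prediction) == normalize(truth)


def word_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred_words = normalize(prediction).split()
    true_words = normalize(truth).split()
    overlap = sum((Counter(pred_words) & Counter(true_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(true_words)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(edit_distance("milan", "milano"))
    print(exact_match("Milano", "milano"))
    print(round(100 * word_f1("dicaprio", "leonardo dicaprio"), 2))
```

With the example from the text, the prediction "dicaprio" against "leonardo dicaprio" has precision 1 and
recall 0.5, so F1 = 2 * 1 * 0.5 / 1.5, i.e. about 66.67%.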
Table 2
Performance on the task with and without word length hints.

 Model                    Hint    ED ↓      EM      F1
 Llama3 8B                 No     11.43    14.82   16.37
 Llama3 8B                Yes     11.52    10.82   11.91
 Llama3 8B-instruct        No     11.43    14.82   16.37
 Llama3 8B-instruct       Yes     12.07    14.48   16.07
 Llama3.1 8B               No     6.99     34.16   37.35
 Llama3.1 8B              Yes     8.01     25.72   27.51
 Llama3.1 8B-instruct      No     7.31     39.69   44.47
 Llama3.1 8B-instruct     Yes     6.14     40.80   44.58
 Llama3.1 70B-instruct     No     3.32     66.61   70.16
 Llama3.1 70B-instruct    Yes     3.27     67.89   71.24

Preliminary Results. We establish baseline results on ECWCA, testing some of the models in the Llama family.
In particular, we consider Llama3 8B and Llama3.1 8B in both instructed and non-instructed versions, and
Llama3.1 70B-instruct, to observe how model size affects the results. Table 2 illustrates the performance of
the LLMs on the two tasks (with and without word-length hints), evaluated with the metrics defined in
Section 4. We can observe that Llama3.1 8B consistently outperforms its predecessor across all the metrics,
both with and without hints. The gap between the smaller LLMs and Llama3.1 70B-instruct is remarkable,
indicating once again that larger LLMs retain much more knowledge.
   Word-length hints, instead, generally do not help the models, and actually harm the performance of
non-instructed models. For example, the F1 score of Llama3.1 8B drops significantly, from 37.35 without hints
to 27.51 with hints, and similarly EM decreases from 34.16 to 25.72. Instructed models, instead, are largely
unaffected: with hints, Llama3 8B-instruct is marginally worse, while the Llama3.1 instructed models improve
slightly on all the metrics. Only for Llama3.1 70B-instruct can we observe some statistically significant
improvement. This may suggest that constraints are beneficial only for models with stronger understanding
capabilities.
   In Figure 6, we show how the performance of the Llama3.1 family models varies with respect to the number of
page views. We group examples in intervals, then we compute the metrics on each of them. Edit distance shows
no significant trend. EM and F1 exhibit an increasing trend on more visited pages for the 8B models, whereas
the 70B model has a behaviour that seems uncorrelated with the number of views. This suggests that the larger
number of weights of the 70B model stores a broader and deeper knowledge about world facts and entities,
covering also less popular ones, whereas smaller LLMs embody only the most popular factual knowledge seen
during training.


5. Limitations

Large Language Models have all been exposed to vast amounts of data. The clues proposed in this dataset were
created from Wikipedia pages that were definitely seen by the LLMs during training. Clues also generally
adhere closely to the page content, since they were created from it. Indeed, one of the goals of the benchmark
is to assess the memorization capabilities of LLMs on facts that are likely to be well known to them. However,
the proposed dataset is new, hence it could not have been part of the training set of such LLMs.


6. Data license and copyright issues

Data is released under the Apache-2.0 license.


References

[1] R. Nickerson, Crossword puzzles and lexical memory, in: Attention and performance VI, Routledge,
     1977, pp. 699–718.
[2] E. Yuriev, B. Capuano, J. L. Short, Crossword puzzles for chemistry education: learning goals beyond
     vocabulary, Chemistry education research and practice 17 (2016) 532–554.
[3] C. Sandiuc, A. Balagiu, The use of crossword puzzles as a strategy to teach maritime english
     vocabulary, Scientific Bulletin "Mircea cel Batran" Naval Academy 23 (2020) 236A–242.
[4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
     G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information
     processing systems 33 (2020) 1877–1901.
[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
     E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint
     arXiv:2302.13971 (2023).
[6] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt,
     S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[7] A. Zugarini, M. Ernandes, A multi-strategy approach to crossword clue answer retrieval and
     ranking (2021).
[8] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword
     solving, arXiv preprint arXiv:2205.09665 (2022).
[9] A. Zugarini, T. Rothenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die rätselrevolution:
     Automated german crossword solving, in: CLiC-it, 2023.
[10] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé,
     F. Simões, K. Zeinalipour, A. Zugarini, M. Gori, The
     webcrow french crossword solver, in: International
     Conference on Intelligent Technologies for Interac-
     tive Entertainment, Springer, 2023, pp. 193–209.
[11] S. Saha, S. Chakraborty, S. Saha, U. Garain, Lan-
     guage models are crossword solvers, arXiv preprint
     arXiv:2406.09043 (2024).
[12] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini,
     L. Rigutini, M. Maggini, M. Gori, Italian crossword
     generator: Enhancing education through interac-
     tive word puzzles (2023).
[13] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Mag-
     gini, M. Gori, L. Rigutini, Clue-instruct: Text-based
     clue generation for educational crossword puzzles,
     in: Proceedings of the 2024 Joint International Con-
     ference on Computational Linguistics, Language
     Resources and Evaluation (LREC-COLING 2024),
     ELRA and ICCL, Torino, Italia, 2024, pp. 3347–3356.
     URL: https://aclanthology.org/2024.lrec-main.297.
[14] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Fran-
     cis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Ri-
     naldi, D. Scalena, CALAMITA: Challenge the Abili-
     ties of LAnguage Models in ITAlian, in: Proceed-
     ings of the 10th Italian Conference on Computa-
     tional Linguistics (CLiC-it 2024), Pisa, Italy, Decem-
     ber 4 - December 6, 2024, CEUR Workshop Proceed-
     ings, CEUR-WS.org, 2024.

</pre>