
1 IUSS Pavia, Piazza della Vittoria 15, 27100 Pavia, Italy


ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge

Andrea Zugarini

azugarini@expert.ai 0 1 3

Kamyar Zeinalipour

kamyar.zeinalipour2@unisi.it 0 1 2

Achille Fusco

achille.fusco@iusspavia.it 0 1

Asya Zanollo

zanolloasya@gmail.com 0 1

Educational Crosswords Dataset, Large Language Models, CALAMITA

0 CLiC-it 2024: Tenth Italian Conference on Computational Linguistics

2 University of Siena, DIISM, Via Roma 56, 53100 Siena, Italy

3 expert.ai, Siena, Italy


This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate the knowledge and reasoning capabilities of large language models through crossword clue-answering. The challenge consists of two tasks: a standard question-answering format, where the LLM has to solve crossword clues, and a variation of it, where the model receives hints about the word lengths of the answers, which are expected to help models with reasoning abilities. To construct the ECWCA dataset, synthetic clues were generated based on entities and facts extracted from Italian Wikipedia. Generated clues were then selected manually in order to ensure high-quality examples with factually correct and unambiguous clues.


1. Challenge: Introduction and Motivation

Crossword puzzles are well-known linguistic games that are usually used for entertainment, but they are also applied in education as a tool to assess the knowledge, reasoning skills and linguistic abilities of students [1, 2, 3]. Large Language Models (LLMs) [4, 5, 6] have shown impressive abilities and strong knowledge about the world. Recently, Language Models have been extensively used both to solve [7, 8, 9, 10, 11] and to create crossword clues [12, 13] for educational purposes.

In this challenge, instead, we use educational crossword clues to build a benchmark that assesses the clue-answering skills of LLMs on popular entities and facts about the world. We refer to it as ECWCA, standing for Educational CrossWord Clues Answering. ECWCA is an Italian benchmark presented at [14], designed to include entities and facts that are popular in Italian culture.

2. Challenge: Description

In this challenge, we evaluate the knowledge abilities of LLMs by testing them on crossword clue-answering tasks. We propose two slightly different tasks in the challenge. The first one is essentially a Question Answering problem, where the question is a clue and we expect the model to return the answer. The second one is a variation of it, in which the model also receives a hint about the word lengths of the answer.

To build ECWCA, we created a dataset of synthetic clues, which we use to build a QA dataset of clue-answer pairs. This happens in two steps: first we generate a set of examples, each consisting of an answer and its generated clues (as in clue-instruct), then we manually select the most suitable clue-answer pairs (see Section 3.2 for further details).

In order to construct the examples with clue-instruct, we identified the most visited Italian Wikipedia (https://it.wikipedia.org/) pages. To count visits, we considered the period between September 10, 2023 and May 31, 2024 and gathered statistics from the Wikimedia APIs (wikimedia.org). We considered the page title as the answer. Titles with non-alphabetic characters, with fewer than two characters or with more than 20 were excluded. For the remaining pages, we extracted their content. Differently from clue-instruct, we did not have the category information available; therefore we generated it by querying GPT-4o [6], asking it to choose the category of the answer, given its page content, from a set of 20 predefined categories. We then randomly sampled the pages and prompted GPT-4o to create three clues for each answer. Finally, those examples underwent the manual selection process, which keeps only one clue among the three. The dataset is publicly available (https://huggingface.co/datasets/azugarini/crossword-clues-QA).

[Figure: (a) Data Retrieval, (b) Data Screening, (c) Craft the Prompt, (d) Clues Generation.]

3.2. Annotation details

The clue-instruct method produces three different clues for each given answer and its context. To select only one clue we add a human selection step. Doing so, we avoid multiple occurrences of the same answer. Moreover, we guarantee high-quality definitions and answers.

The example selection process was carried out by three native Italian-speaking annotators. Examples were split into 18 chunks of 100 examples each, equally distributed among the annotators.

Each example was presented with the answer, the three generated clues and the Wikipedia page paragraph that was used to create the clues. Annotators were tasked with selecting the best clue, if any, based on the following criteria:

Truthfulness and Accuracy. It was imperative that the content of the selected clue was factually correct. Annotators cross-verified the accuracy of the clue against the provided Wikipedia page content to ensure that it did not contain misleading or false information, thereby ensuring the integrity of the dataset.

Answerability. Annotators were instructed to choose a clue that could be answered without a high degree of ambiguity. The focus was on clues that provided enough information to infer the correct answer with confidence. Clues that left room for multiple interpretations or guesses were rejected. For example, a generic definition such as 'a large mammal' does not meet this criterion, since many possible species fit it.

No clue-answer overlap. Clues including the answer or a significant portion of it should be discarded.

In cases where more than one clue satisfied all the criteria, annotators were directed to select the clue that provided the most relevant information with the most clarity and simplicity. When no clue matched the criteria, the whole example was discarded.
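As an illustration, the title screening rule and the "no clue-answer overlap" criterion can be approximated in a few lines of Python. This is a minimal sketch and not the authors' pipeline: the function names, the decision to allow spaces in multi-word titles, counting spaces toward the 20-character limit, and the `min_len` threshold are all assumptions introduced here. In the paper the overlap criterion was applied by human annotators, so an automatic check of this kind could at most pre-flag candidates for review.

```python
import re


def is_valid_title(title: str) -> bool:
    """Screening rule: alphabetic titles with 2 to 20 characters.

    Assumptions (not stated in the paper): spaces between words are
    allowed, and they count toward the character limit.
    """
    stripped = title.strip()
    if not 2 <= len(stripped) <= 20:
        return False
    return all(ch.isalpha() or ch == " " for ch in stripped)


def overlaps_answer(clue: str, answer: str, min_len: int = 4) -> bool:
    """Flag a clue that contains the whole answer, or any answer word
    of at least `min_len` characters (hypothetical threshold)."""
    clue_l = clue.casefold()
    answer_l = answer.casefold()
    # Whole answer appearing as a complete word sequence in the clue.
    if re.search(r"\b" + re.escape(answer_l) + r"\b", clue_l):
        return True
    # Any sufficiently long answer word appearing as a clue word.
    clue_words = set(re.findall(r"\w+", clue_l))
    return any(w in clue_words for w in answer_l.split() if len(w) >= min_len)


if __name__ == "__main__":
    print(is_valid_title("Leonardo DiCaprio"))
    print(overlaps_answer("Attore protagonista di Titanic", "Titanic"))
```

A stricter variant could ignore spaces when counting characters; the source text does not say which convention was used.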

3.3. Data format

Each example includes the clue-answer pair, the word length hint, some additional metadata (such as the category and the page views) and the URL of the Wikipedia page whose content was used to generate the clue. More precisely, there are the following columns: clue, answer, answer_len, url, content, views, category, length_hint, raw_entity. A few examples are showcased in Table 1 where, for the sake of simplicity, we only report the clue-answer pair, the hint and the category of the example.

3.4. Example of prompts used for zero and/or few shots

We defined two different prompts, one with and the other without indications about the word lengths of the answer. The two prompts are presented in Figure 4 and Figure 3, respectively.

Task without hints. We construct a 2-shot prompt (Figure 3) for the task. First, we instruct the model to act as an expert in solving crossword clues, without any additional hints related to the structure of the answer (such as word lengths). The format is clear and concise, focusing on the core task: resolving the crossword definition and providing only the solution. Then, two static demonstration examples illustrate to the model how to approach the task. Finally, following the same layout, we present a new clue and expect the model to complete it with the answer.

Task with word length hints. This prompt (see Figure 4) is very similar to the first one, but introduces a hint indicating the word lengths of the expected answer. The hint is a constraint that reduces the number of valid answers by indicating both how many words there are and how long each one is; ideally, therefore, it should aid the language model.
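The two prompt variants can be assembled programmatically, as in the sketch below. The Italian instruction strings and the two demonstrations are taken from Figures 3 and 4; everything else is an assumption made for illustration: the names `length_hint` and `build_prompt`, the space-separated hint format, and the choice to append the hint in parentheses after the clue (the source does not show how the hint is rendered inside the prompt). The final clue in the usage example is likewise a made-up one.

```python
from typing import Optional

HEADER = "Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba."
HINT_NOTE = (
    "Ti verrà data una definizione corredata da un suggerimento, una sequenza "
    "di numeri indicante di quanti caratteri è composta ciascuna parola della risposta."
)
TASK = "Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro."

# The two static demonstrations shown in Figure 3.
SHOTS = [
    ("Protagonista di Titanic al fianco di Kate Winslet", "leonardo dicaprio"),
    ("capitale dell'Impero romano d'Occidente nel 313 d.C.", "milano"),
]


def length_hint(answer: str) -> str:
    """Character count of each word in the answer (assumed hint format)."""
    return " ".join(str(len(word)) for word in answer.split())


def format_clue(clue: str, hint: Optional[str]) -> str:
    # Assumption: the hint is appended in parentheses after the clue.
    return f"DEFINIZIONE: {clue} ({hint})" if hint else f"DEFINIZIONE: {clue}"


def build_prompt(clue: str, hint: Optional[str] = None) -> str:
    """Build the 2-shot prompt; pass `hint` to get the word-length variant."""
    with_hints = hint is not None
    lines = [HEADER]
    if with_hints:
        lines.append(HINT_NOTE)
    lines += [TASK, "Esempi:"]
    for shot_clue, shot_answer in SHOTS:
        shot_hint = length_hint(shot_answer) if with_hints else None
        lines.append(format_clue(shot_clue, shot_hint))
        lines.append(f"RISPOSTA: {shot_answer}")
    # The new clue follows the same layout and is left open for completion.
    lines.append(format_clue(clue, hint))
    lines.append("RISPOSTA:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Fiume che attraversa Roma", hint=length_hint("tevere")))
```

Ending the prompt with an open `RISPOSTA:` mirrors the completion-style layout described above, which suits both instructed and non-instructed models.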

Figure 3: prompt for the task without hints.

Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.

Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro.

Esempi:

DEFINIZIONE: Protagonista di Titanic al fianco di Kate Winslet

RISPOSTA: leonardo dicaprio

DEFINIZIONE: capitale dell'Impero romano d'Occidente nel 313 d.C.

RISPOSTA: milano

Figure 4: prompt for the task with word length hints.

Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.

Ti verrà data una definizione corredata da un suggerimento, una sequenza di numeri indicante di quanti caratteri è composta ciascuna parola della risposta.

Trova la risposta alla definizione.

Ritorna solo la risposta, nient'altro.

3.5. Detailed data statistics

Overall we collected 1,171 clue-answer pairs belonging to 16 different categories. The distribution of answers among categories is outlined in Figure 5. Most of the examples belong to the Entertainment topic; indeed, the dataset includes many actors, TV shows, movies and fictional [...]

4. Metrics

To evaluate the performance on the tasks we rely on the following metrics: Edit Distance (ED), Exact Match (EM), and average F1 score on words (F1).

Edit Distance. Edit Distance (also known as Levenshtein Distance) measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into another. In this context, ED measures how close the generated response is to the ground truth answer. A lower ED indicates better performance, as it signifies that the predicted text is more similar to the target text.

Preliminary Results. We establish baseline results on ECWCA, testing some of the models in the Llama family.

In particular, we consider Llama3 8B and Llama3.1 8B in both instructed and non-instructed versions, and Llama3.1 70B-instruct, to observe how model size affects the results. Table 2 illustrates the performance of the LLMs on the two tasks (with and without word-length hints), both evaluated with the metrics defined above. We can observe that Llama3.1 8B consistently outperforms its predecessor across all the metrics, both with and without hints. The gap between the smaller LLMs and Llama3.1 70B-instruct is remarkable, confirming once again that larger LLMs retain much more knowledge.

Word-length hints, instead, generally do not help the models, and actually harm performance in non-instructed models. For example, the F1 score of Llama3.1 8B drops significantly, from 37.35 without hints to 27.51 with hints, and similarly EM decreases from 34.16 to 25.72. Instructed models, instead, are not harmed by the hints, which lead to a small increase in all the metrics. Only for Llama3.1 70B-instruct can we observe a statistically significant improvement. This may suggest that constraints are beneficial only for models with stronger understanding capabilities.

In Figure 6, we show how the performance of the Llama3.1 family models varies with respect to the number of page views. We group examples into intervals of page views, then compute the metrics on each group. Edit distance shows no significant trend, whereas EM and F1 exhibit an increasing trend on more visited pages for the 8B models; the 70B model, instead, behaves in a way that seems uncorrelated with the number of views. This suggests that the larger number of weights in the 70B model stores broader and deeper knowledge about world facts and entities, covering also less popular ones, whereas smaller LLMs embody only the most popular factual knowledge seen during training.
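To make the three metrics of this section concrete, the sketch below implements them in plain Python. Only Edit Distance is defined in the text; the definitions of EM and word-level F1 used here are assumptions, namely a case-insensitive, whitespace-normalized string comparison and a token-overlap F1 of the kind commonly used in question answering. The function names are hypothetical, and scores are returned in the range 0 to 1 rather than as percentages.

```python
from collections import Counter
from typing import Dict, List


def normalize(text: str) -> str:
    """Assumed normalization: case-fold and collapse whitespace."""
    return " ".join(text.casefold().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def word_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold words."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def evaluate(preds: List[str], golds: List[str]) -> Dict[str, float]:
    """Average each metric over aligned prediction/answer lists."""
    n = len(golds)
    pairs = list(zip(preds, golds))
    return {
        "ED": sum(edit_distance(normalize(p), normalize(g)) for p, g in pairs) / n,
        "EM": sum(exact_match(p, g) for p, g in pairs) / n,
        "F1": sum(word_f1(p, g) for p, g in pairs) / n,
    }


if __name__ == "__main__":
    print(evaluate(["milano", "roma"], ["milano", "romolo"]))
```

The per-interval analysis by page views amounts to calling `evaluate` separately on the predictions and answers that fall into each interval.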

5. Limitations

Large Language Models have all been exposed to vast amounts of data. The clues proposed in this dataset were created from Wikipedia pages that were almost certainly seen by the LLMs during training. Clues also generally adhere closely to the pages' content, since they were created from it. Indeed, one of the goals of the benchmark is to assess the models' memorization of facts that are likely to be well known to them. However, the proposed dataset is new, hence the clue-answer pairs themselves could not have been part of the training set of such LLMs.

6. Data license and copyright issues

The data is released under the Apache-2.0 license.

[1] Nickerson, Crossword puzzles and lexical memory, in: Attention and performance, Routledge, 1977, pp. 699-718.

[2] Yuriev, Capuano, J. L. Short, Crossword puzzles for chemistry education: learning goals beyond vocabulary, Chemistry education research and practice 17 (2016) 532-554.

[3] Sandiuc, Balagiu, The use of crossword puzzles as a strategy to teach maritime english vocabulary, Scientific Bulletin "Mircea cel Batran" Naval Academy 23 (2020) 236A-242.

[4] Brown, Mann, Ryder, Subbiah, J. D. Kaplan, Dhariwal, Neelakantan, Shyam, Sastry, Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877-1901.

[5] Touvron, Lavril, Izacard, Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).

[6] Achiam, Adler, Agarwal, Ahmad, Akkaya, F. L. Aleman, Almeida, Altenschmidt, Altman, Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).

[7] Zugarini, Ernandes, A multi-strategy approach to crossword clue answer retrieval and ranking (2021).

[8] Wallace, Tomlin, Xu, Yang, Pathak, Ginsberg, Klein, Automated crossword solving, arXiv preprint arXiv:2205.09665 (2022).

[9] Zugarini, Rothenbacher, Klede, Ernandes, B. M. Eskofier, Zanca, Die rätselrevolution: Automated german crossword solving, in: CLiC-it, 2023.

[10] Angelini, Ernandes, Iaquinta, Stehlé, Simões, Zeinalipour, Zugarini, Gori, The webcrow french crossword solver, in: International Conference on Intelligent Technologies for Interactive Entertainment, Springer, 2023, pp. 193-209.

[11] Saha, Chakraborty, Saha, Garain, Language models are crossword solvers, arXiv preprint arXiv:2406.09043 (2024).

[12] Zeinalipour, Iaquinta, Zanollo, Angelini, Rigutini, Maggini, Gori, Italian crossword generator: Enhancing education through interactive word puzzles (2023).

[13] Zugarini, Zeinalipour, S. S. Kadali, Maggini, Gori, L. Rigutini, Clue-instruct: Text-based clue generation for educational crossword puzzles, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 3347-3356. URL: https://aclanthology.org/2024.lrec-main.297.

[14] Attanasio, Basile, Borazio, Croce, Francis, Gili, E. Musacchio, Nissim, Patti, Rinaldi, Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.