ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge

Andrea Zugarini azugarini@expert.ai expert.ai

Siena Italy

Kamyar Zeinalipour kamyar.zeinalipour2@unisi.it University of Siena

DIISM, Via Roma 56 53100 Siena Italy

Achille Fusco achille.fusco@iusspavia.it IUSS Pavia

Piazza della Vittoria 15 27100 Pavia

Asya Zanollo zanolloasya@gmail.com IUSS Pavia

Piazza della Vittoria 15 27100 Pavia

This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate the knowledge and reasoning capabilities of large language models through crossword clue answering. The challenge consists of two tasks: a standard question-answering format in which the LLM has to solve crossword clues, and a variation in which the model receives hints about the word lengths of the answers, which is expected to help models with reasoning abilities. To construct the ECWCA dataset, synthetic clues were generated from entities and facts extracted from Italian Wikipedia. The generated clues were then manually selected to ensure high-quality examples with factually correct and unambiguous clues.

Keywords: Educational Crosswords, Dataset, Large Language Models, CALAMITA

ISSN 1613-0073

1. Challenge: Introduction and Motivation

LLM to reply with the correct answer. In the second case, the goal is analogous, but we assist the model with hints about the length of each word in the answer. These hints reduce the number of possible answers, so models with reasoning skills are expected to take advantage of them.

To build ECWCA, we created a dataset of synthetic clues grounded in entities and facts extracted from Italian Wikipedia pages. Clue-answer pairs were generated following the methodology of clue-instruct [13]. In a nutshell, we create multiple clues for a given answer; the generation is conditioned on a text passage about the answer and on a topic. A sketch of the method is outlined in Figure 1. Since the approach produces multiple definitions for a single answer, and not all of them may be of sufficient quality, we perform a manual selection step to retain only high-quality clues.

Data description

Origin of data

The dataset was constructed following the clue-instruct [13] approach. Clue-instruct addressed a clue generation problem: the task was to generate multiple clues given an answer, its context and its category. Here, instead, we exploit the approach to build a QA dataset of clue-answer pairs. This happens in two steps: first, we generate a set of examples, each consisting of an answer and its generated clues (as in clue-instruct); then, we manually select the most suitable clue-answer pairs (see Section 3.2 for further details).

In order to construct the examples with clue-instruct, we identified the most visited Italian Wikipedia¹ pages. To count visits, we considered the period between September 10, 2023 and May 31, 2024 and gathered statistics from the Wikimedia APIs². We used the page title as the answer. Titles containing non-alphabetic characters, or with fewer than two or more than 20 characters, were excluded. For the remaining pages, we extracted the content. Unlike clue-instruct, we did not have category information available, so we generated it by querying GPT-4o [6], asking it to choose the category of the answer, given its page content, from a set of 20 predefined categories. We then randomly sampled pages and prompted GPT-4o to create three clues for each answer. Finally, these examples underwent the manual selection process, which kept only one clue among the three. The dataset is publicly available³.
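The title filter described above can be illustrated with a short sketch. It is not the authors' code: the function name `is_candidate_answer` is hypothetical, and two details are assumptions, namely that spaces between words are tolerated (answers in the dataset can span several words) and that the length limits are applied to the whole title, spaces included.

```python
import re

MIN_LEN = 2   # titles with fewer than two characters are excluded
MAX_LEN = 20  # titles with more than 20 characters are excluded

# Assumption: "alphabetic" covers any Unicode letter (e.g. accented Italian
# vowels), and single spaces between words are allowed.
_LETTER_WORDS = re.compile(r"[^\W\d_]+(?: [^\W\d_]+)*")


def is_candidate_answer(title: str) -> bool:
    """Return True if a page title passes the length and alphabet filters."""
    title = title.strip()
    if not (MIN_LEN <= len(title) <= MAX_LEN):
        return False
    # fullmatch rejects digits, punctuation and any other non-letter symbol
    return _LETTER_WORDS.fullmatch(title) is not None


if __name__ == "__main__":
    for title in ["Leonardo DiCaprio", "Città", "A", "Serie A 2023", "AC/DC"]:
        print(title, is_candidate_answer(title))
```

Under different assumptions (for instance, counting only letters towards the length limit) the accepted set of titles would change slightly, so the sketch should be read as one plausible reading of the rule.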

Annotation details

The clue-instruct method produces three different clues for each given answer and its context. To select only one clue, we add a human selection step. In doing so, we avoid multiple occurrences of the same answer and guarantee high-quality definitions and answers. The example selection process was carried out by three native Italian-speaking annotators. Examples were split into 18 chunks of 100 examples each, equally distributed among the annotators.

Each example was presented with the answer, the three generated clues and the Wikipedia page paragraph that was used to create the clues. Annotators were tasked with selecting the best one, if any, based on the following criteria:

Truthfulness and Accuracy.

It was imperative that the content of the selected clue was factually correct. Annotators cross-verified the accuracy of the clue against the provided Wikipedia page content to ensure that it did not contain misleading or false information, thereby ensuring the integrity of the dataset.

¹ https://it.wikipedia.org/
² wikimedia.org
³ https://huggingface.co/datasets/azugarini/crossword-clues-QA

Answerability.

Annotators were instructed to choose a clue that could be answered without a high degree of ambiguity. The focus was on clues that provided enough information to infer the correct answer with confidence. Clues that left room for multiple interpretations or guesses were rejected. For example, a generic definition such as 'a large mammal' does not meet this criterion, since many species fit it.

No clue-answer overlap.

Clues that included the answer, or a significant portion of it, were discarded.

In cases where more than one clue satisfied all the criteria, annotators were directed to select the clue that provided the most relevant information with the greatest clarity and simplicity. When no clue matched the criteria, the whole example was discarded.

Data format

Each example includes the clue-answer pair, the word length hint, some additional metadata (such as the category and the page views) and the URL of the Wikipedia page whose content was used to generate the clue. More precisely, the dataset has the following columns: clue, answer, answer_len, url, content, views, category, length_hint, raw_entity. A few examples are showcased in Table 1, where, for the sake of simplicity, we only report the clue-answer pair, the hint and the category of each example.

Example of prompts used for zero and/or few shots

We defined two different prompts, one with and one without indications about the word lengths of the answer. The two prompts are presented in Figure 4 and Figure 3, respectively.

Table 1: Some examples of generated clues in the dataset, their answers, the hint giving the character length of each word in the answer, and the category representing the topic of the clue.

Task without hints. We construct a 2-shot prompt (Figure 3) for the task. First, we instruct the model to act as an expert in solving crossword clues, without any additional hints about the structure of the answer (such as word lengths). The format is clear and concise, focusing on the core task: resolving the crossword definition and providing only the solution. Then, two static demonstration examples are shown to illustrate to the model how to approach the task. Finally, following the same layout, we present a new clue and expect the model to complete it with the answer.

Task with word length hints. This prompt (see Figure 4) is very similar to the first one, but introduces a hint indicating the word lengths of the expected answer. The hint is a constraint that reduces the number of valid answers by indicating both how many words there are and how long each one is; ideally, therefore, it should aid the language model.

characters. Sports, Geography, History and Society are also well represented, whereas the remaining categories are less frequent, with some, like Applied Science, Philosophy and Education, being rare.
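To make the two prompt variants described in this section concrete, the sketch below assembles a 2-shot prompt with or without word length hints. It is only an illustration: the instruction text, the two demonstrations, the field labels and the hint format (comma-separated word lengths) are all hypothetical placeholders, since the actual wording is the one given in the paper's prompt figures.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical instruction and demonstrations (NOT the paper's actual prompt).
INSTRUCTION = "You are an expert crossword solver. Reply with the solution only."
DEMOS: Sequence[Tuple[str, str]] = (
    ("Capital of Italy", "roma"),
    ("Author of the Divine Comedy", "dante alighieri"),
)


def length_hint(answer: str) -> str:
    """Character length of every word, e.g. 'dante alighieri' -> '5, 9'."""
    return ", ".join(str(len(word)) for word in answer.split())


def satisfies_hint(candidate: str, hint: str) -> bool:
    """Check whether a candidate answer is compatible with a hint."""
    return length_hint(candidate) == hint


def build_prompt(clue: str, hint: Optional[str] = None) -> str:
    """Assemble a 2-shot prompt; pass `hint` for the variant with word lengths."""
    lines = [INSTRUCTION, ""]
    for demo_clue, demo_answer in DEMOS:
        lines.append(f"Clue: {demo_clue}")
        if hint is not None:
            # Demonstrations show their own hints so the layout stays uniform.
            lines.append(f"Word lengths: {length_hint(demo_answer)}")
        lines.append(f"Answer: {demo_answer}")
        lines.append("")
    lines.append(f"Clue: {clue}")
    if hint is not None:
        lines.append(f"Word lengths: {hint}")
    lines.append("Answer:")  # the model is expected to complete this line
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Actor who played Jack in Titanic",
                       hint=length_hint("leonardo dicaprio")))
```

The helper `satisfies_hint` shows why the hint acts as a constraint: any candidate whose word count or word lengths differ from the hint can be ruled out before it is proposed as an answer.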

Detailed data statistics

The pages from which clue-answer pairs were built have about 234 thousand views each on average, ranging from a minimum of 1,108 to almost five million views. However, only a few examples exceed one million views, and the vast majority have fewer than half a million, as we can observe from Figure 2.

Metrics

To evaluate the performance on the tasks we rely on the following metrics: Edit Distance (ED), Exact Match (EM), and average F1 score on words (F1).

Edit Distance. Edit Distance (also known as Levenshtein Distance) measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one sequence into another. In this context, ED measures how close the generated response is to the ground truth answer. A lower ED indicates better performance, as it signifies that the predicted text is more similar to the target text.
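As an illustration of the definition above, the following is a minimal sketch of the standard dynamic-programming computation of the Levenshtein distance. The function name `edit_distance` is chosen here for illustration; the evaluation code actually used for the challenge may rely on a library implementation instead.

```python
def edit_distance(prediction: str, reference: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn `prediction` into `reference`."""
    # previous[j] = distance between the processed prefix of `prediction`
    # and the first j characters of `reference`
    previous = list(range(len(reference) + 1))
    for i, p_char in enumerate(prediction, start=1):
        current = [i]  # turning i characters into the empty string: i deletions
        for j, r_char in enumerate(reference, start=1):
            cost = 0 if p_char == r_char else 1
            current.append(min(
                previous[j] + 1,         # delete p_char
                current[j - 1] + 1,      # insert r_char
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("dicaprio", "leonardo dicaprio"))
```

Keeping only two rows of the table makes the memory cost linear in the length of the reference, which is more than sufficient for answers of at most a few dozen characters.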

Exact Match. Exact Match (EM) is a binary metric that evaluates whether the generated answer exactly matches the ground truth. We report the EM score averaged over all examples as a percentage, which corresponds to the percentage of correctly predicted answers.
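The two answer-level scores named in the list of metrics, Exact Match and the word-level F1, can be sketched as follows. This is an illustrative implementation rather than the official scorer: the `normalize` step (lowercasing and collapsing whitespace) is an assumption, as is the use of a multiset intersection to count overlapping words.

```python
from collections import Counter


def normalize(text: str) -> str:
    # Assumption: comparison ignores case and extra whitespace.
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> int:
    """1 if the normalized strings are identical, 0 otherwise."""
    return int(normalize(prediction) == normalize(reference))


def word_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred_words = normalize(prediction).split()
    ref_words = normalize(reference).split()
    # Multiset intersection: each reference word can be matched only once.
    overlap = sum((Counter(pred_words) & Counter(ref_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(ref_words)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Ground truth "leonardo dicaprio", prediction "dicaprio":
    # precision 1.0, recall 0.5
    print(exact_match("dicaprio", "leonardo dicaprio"))
    print(round(100 * word_f1("dicaprio", "leonardo dicaprio"), 2))
```

Averaging `exact_match` and `word_f1` over all examples and multiplying by 100 gives scores on the same percentage scale as the reported results.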

F1 score. The F1 score evaluates how well the predicted words overlap with the ground truth answer. For example, if the ground truth is "leonardo dicaprio" and the model predicts "dicaprio", the model has perfect precision but imperfect recall (50%), resulting in a 66.67% F1 score.

Preliminary Results. We establish baseline results on ECWCA by testing some of the models in the Llama family.

In particular, we consider Llama3 8B and Llama3.1 8B in both instructed and non-instructed versions, and Llama3.1 70B-instruct, to observe how model size affects the results. Table 2 reports the performance of the LLMs on the two tasks (with and without word-length hints), evaluated with the metrics defined above. We can observe that Llama3.1 8B consistently outperforms its predecessor across all the metrics, both with and without hints. The gap between the smaller LLMs and Llama3.1 70B-instruct is remarkable, confirming once again that larger LLMs retain much more knowledge. Word-length hints, instead, generally do not help the models, and actually harm the performance of non-instructed models. For example, the F1 score of Llama3.1 8B drops significantly, from 37.35 without hints to 27.51 with hints, and EM similarly decreases from 34.16 to 25.72. Instructed models, in contrast, are not affected in this way, and the hints lead to a small increase in all the metrics. Only for Llama3.1 70B-instruct can we observe a statistically significant improvement. This may suggest that constraints are beneficial only for models with stronger understanding capabilities.

In Figure 6, we show how the performance of the Llama3.1 family models varies with respect to the number of page views. We group examples into intervals of page views ([10^3, 10^4), [10^4, 10^5), [10^5, 10^6) and [10^6, ∞)), then we compute the metrics on each of them. Edit distance shows no significant trend, whereas EM and F1 increase on more visited pages for the 8B models; the 70B model, in contrast, behaves in a way that seems uncorrelated with the number of views. This suggests that the larger number of weights in the 70B model stores broader and deeper knowledge about world facts and entities, covering less popular ones too, whereas smaller LLMs embody only the most popular factual knowledge seen during training.
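The per-interval analysis described above can be sketched as below. The decade-wide intervals are an assumption made for illustration (powers of ten, with everything from one million views upwards merged into a single open-ended bin), and the names `view_bin` and `mean_score_by_views` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def view_bin(views: int) -> str:
    """Map a page-view count to a decade-wide interval label."""
    if views < 1:
        raise ValueError("views must be a positive integer")
    # For positive integers, digits - 1 equals floor(log10(views)) and
    # avoids floating-point issues at exact powers of ten.
    exponent = min(len(str(views)) - 1, 6)
    if exponent == 6:
        return "[10^6, inf)"
    return f"[10^{exponent}, 10^{exponent + 1})"


def mean_score_by_views(rows: Iterable[Tuple[int, float]]) -> Dict[str, float]:
    """Average a per-example score (e.g. EM or F1) inside each view interval."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for views, score in rows:
        buckets[view_bin(views)].append(score)
    return {label: sum(scores) / len(scores) for label, scores in buckets.items()}


if __name__ == "__main__":
    # Toy (views, exact-match) pairs, invented purely for illustration.
    toy = [(1108, 1.0), (5000, 0.0), (234000, 1.0), (2000000, 1.0)]
    print(mean_score_by_views(toy))
```

Logarithmic bins are a natural choice here because page views span more than three orders of magnitude, so equal-width bins would leave almost all examples in the first interval.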

Limitations

Large Language Models have all been exposed to vast amounts of data. The clues proposed in this dataset were created from Wikipedia pages that were definitely seen by the LLMs during training. Clues also generally adhere closely to the page content, since they were created from it. Indeed, one of the goals of the benchmark is to assess the models' memorization of facts that are likely to be well known to them. However, the proposed dataset is new, hence it could not have been part of the training set of such LLMs.

Data license and copyright issues

Data is released under the Apache-2.0 license.

Figure 1: Sketch of clue-instruct method. Picture taken from [13].

Figure 2: Page views distribution (the very few examples above one million visits were excluded).

Figure 3: Prompt task without hints.

Figure 4: Prompt task with word length hints.

Figure 5: Distribution of the examples across the categories.

Figure 6: ED, EM and F1 score performance varying with respect to the number of page views for Llama3.1 models.

Table 2: Performance on the task with and without word length hints.

Model                    Hint   ED ↓    EM      F1
Llama3 8B                No     11.43   14.82   16.37
Llama3 8B                Yes    11.52   10.82   11.91
Llama3 8B-instruct       No     11.43   14.82   16.37
Llama3 8B-instruct       Yes    12.07   14.48   16.07
Llama3.1 8B              No      6.99   34.16   37.35
Llama3.1 8B              Yes     8.01   25.72   27.51
Llama3.1 8B-instruct     No      7.31   39.69   44.47
Llama3.1 8B-instruct     Yes     6.14   40.80   44.58
Llama3.1 70B-instruct    No      3.32   66.61   70.16
Llama3.1 70B-instruct    Yes     3.27   67.89   71.24

References

[1] R. Nickerson, Crossword puzzles and lexical memory, in: Attention and Performance VI, Routledge, 1977.

[2] E. Yuriev, B. Capuano, J. L. Short, Crossword puzzles for chemistry education: learning goals beyond vocabulary, Chemistry Education Research and Practice 17 (2016).

[3] C. Sandiuc, A. Balagiu, The use of crossword puzzles as a strategy to teach maritime English vocabulary, Scientific Bulletin "Mircea cel Batran" Naval Academy 23 (2020).

[4] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020).

[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).

[6] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).

[7] A. Zugarini, M. Ernandes, A multi-strategy approach to crossword clue answer retrieval and ranking, 2021.

[8] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, arXiv preprint arXiv:2205.09665 (2022).

[9] A. Zugarini, T. Rothenbacher, K. Klede, M. Ernandes, B. M. Eskofier, D. Zanca, Die Rätselrevolution: Automated German crossword solving, in: CLiC-it, 2023.

[10] G. Angelini, M. Ernandes, T. Iaquinta, C. Stehlé, F. Simões, K. Zeinalipour, A. Zugarini, M. Gori, The WebCrow French crossword solver, in: International Conference on Intelligent Technologies for Interactive Entertainment, Springer, 2023.

[11] S. Saha, S. Chakraborty, S. Saha, U. Garain, Language models are crossword solvers, arXiv preprint arXiv:2406.09043 (2024).

[12] K. Zeinalipour, T. Iaquinta, A. Zanollo, G. Angelini, L. Rigutini, M. Maggini, M. Gori, Italian crossword generator: Enhancing education through interactive word puzzles, 2023.

[13] A. Zugarini, K. Zeinalipour, S. S. Kadali, M. Maggini, M. Gori, L. Rigutini, Clue-instruct: Text-based clue generation for educational crossword puzzles, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024.

[14] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, December 4 - December 6, 2024.