<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Zugarini</string-name>
          <email>azugarini@expert.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamyar Zeinalipour</string-name>
          <email>kamyar.zeinalipour2@unisi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achille Fusco</string-name>
          <email>achille.fusco@iusspavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asya Zanollo</string-name>
          <email>zanolloasya@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IUSS Pavia</institution>
          ,
          <addr-line>Piazza della Vittoria 15, 27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Siena, DIISM</institution>
          ,
          <addr-line>Via Roma 56, 53100 Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>expert.ai</institution>
          ,
          <addr-line>Siena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents ECWCA (Educational CrossWord Clues Answering), a novel challenge designed to evaluate the knowledge and reasoning capabilities of large language models through crossword clue-answering. The challenge consists of two tasks: a standard question-answering format, where the LLM has to solve crossword clues, and a variation of it, where the model receives hints about the word lengths of the answers, which is expected to help models with reasoning abilities. To construct the ECWCA dataset, synthetic clues were generated based on entities and facts extracted from Italian Wikipedia. Generated clues were then selected manually to ensure high-quality examples with factually correct and unambiguous clues.</p>
      </abstract>
      <kwd-group>
        <kwd>Educational Crosswords Dataset</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>CALAMITA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-3">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>
        Crossword puzzles are well-known linguistic games that
are usually used for entertainment, but they are also
applied in education as a tool to assess the knowledge, reasoning
skills and linguistic abilities of students [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Large
Language Models (LLMs) [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] have shown impressive
abilities and strong knowledge about the world. Recently,
Language Models have been extensively used to both
solve [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">7, 8, 9, 10, 11</xref>
        ] and create crossword clues [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]
for educational purposes.
      </p>
      <p>
        In this challenge, we instead use educational
crossword clues to build a benchmark that assesses the
clue-answering skills of LLMs on popular entities and facts about
the world. We refer to it as ECWCA, standing for
Educational CrossWord Clues Answering. ECWCA is an
Italian benchmark presented at CALAMITA [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], designed to include
entities and facts that are popular in Italian culture.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2. Challenge: Description</title>
      <p>In this challenge, we evaluate the knowledge abilities
of LLMs by testing them on crossword clue-answering
tasks. We propose two slightly different tasks in the
challenge. The first one is essentially a Question Answering
problem, where the question is a clue and we expect the
model to provide the answer. The second one is a variation of it,
where the model also receives a hint about the word lengths
of the answer.</p>
      <p>To build ECWCA, we created a dataset of synthetic
clue-answer pairs. This happens in two steps: first, we generate
a set of examples, each consisting of an answer and its generated
clues (as in clue-instruct [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]); then, we manually select the most suitable
clue-answer pairs (see Section 3.2 for further details).</p>
      <p>[Figure: pipeline stages (a) Data Retrieval, (b) Data Screening, (c) Craft the Prompt, (d) Clues Generation.]</p>
      <sec id="sec-4-8">
        <title>3. Data</title>
        <p>
          In order to construct the examples with clue-instruct,
we identified the most visited Italian Wikipedia pages (https://it.wikipedia.org/).
To count visits, we considered the period between
September 10, 2023 and May 31, 2024 and gathered statistics from the
Wikimedia APIs (wikimedia.org). We took the page title as the
answer. Titles containing non-alphabetic characters, or with fewer
than two or more than 20 characters, were excluded. We then
extracted the content of the remaining pages. Unlike
clue-instruct, we did not have category information available,
so we generated it by querying
GPT-4o [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], asking it to choose the category of the answer,
given its page content, from a set of 20 predefined categories.
We then randomly sampled the pages and
asked GPT-4o to create three clues for each answer.
Finally, the examples underwent a manual
selection process that kept only one clue among the
three. The dataset is publicly available (https://huggingface.co/datasets/azugarini/crossword-clues-QA).
        </p>
        <sec id="sec-4-8-1">
          <title>3.2. Annotation details</title>
          <p>The clue-instruct method produces three different clues
for each given answer and its context. To select only
one clue, we add a human selection step. This avoids
multiple occurrences of the same answer and
guarantees high-quality definitions
and answers.</p>
          <p>The example selection process was carried out by three
native Italian-speaking annotators. Examples were split
into 18 chunks of 100 examples each, equally distributed
among the annotators.</p>
          <p>Each example was presented with the answer, the
three generated clues and the Wikipedia page paragraph
that was used to create the clues. Annotators were
tasked with selecting the best clue, if any, based on the
following criteria:</p>
          <p>Truthfulness and Accuracy. It was imperative
that the content of the selected clue was factually
correct. Annotators cross-checked the accuracy of
the clue against the provided Wikipedia page content
to ensure that it did not contain misleading or false
information, thereby ensuring the integrity of the dataset.</p>
          <p>Answerability. Annotators were instructed to
choose a clue that could be answered without a high
degree of ambiguity. The focus was on clues that
provided enough information to infer the correct answer
with confidence. Clues that left room for multiple
interpretations or guesses were rejected. For example,
generic definitions, such as 'a large mammal', do not
fit this criterion, since many possible species
fit such a clue.</p>
          <p>No clue-answer overlap. Clues including the
answer or a significant portion of it should be discarded.</p>
          <p>In cases where more than one clue satisfied all the
criteria, annotators were directed to select the clue that
provided the most relevant information with the greatest clarity
and simplicity. When no clue matched the criteria, the
whole example was discarded.</p>
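          <p>As a rough illustration, two of the mechanical checks mentioned in the text, the page-title filter applied before clue generation and the "no clue-answer overlap" criterion, could be sketched as follows. This is only a sketch under stated assumptions: the function names, the choice to allow spaces inside titles, the accent normalisation and the four-letter threshold standing in for "a significant portion" are hypothetical and are not taken from the original pipeline, where overlap was judged by human annotators.</p>
          <preformat>
```python
import re
import unicodedata

MIN_LEN, MAX_LEN = 2, 20  # length bounds stated in the text


def is_valid_title(title: str) -> bool:
    """Keep titles made only of letters, within the length bounds.

    Assumption: spaces are allowed and counted, since answers may
    consist of several words.
    """
    if len(title) not in range(MIN_LEN, MAX_LEN + 1):
        return False
    return all(ch.isalpha() or ch == " " for ch in title)


def tokens(text: str) -> list:
    """Lowercase, strip accents and split into alphabetic tokens."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[a-z]+", plain)


def has_answer_overlap(clue: str, answer: str, min_token_len: int = 4) -> bool:
    """Flag clues containing the whole answer or one of its longer words.

    The threshold min_token_len is a hypothetical stand-in for
    "a significant portion" of the answer.
    """
    clue_tokens = set(tokens(clue))
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return False
    if all(tok in clue_tokens for tok in answer_tokens):
        return True
    return any(
        len(tok) >= min_token_len and tok in clue_tokens
        for tok in answer_tokens
    )


# The first call uses a clue taken from the prompt examples in the paper;
# the second clue is made up to show a rejected case.
print(has_answer_overlap(
    "Protagonista di Titanic al fianco di Kate Winslet", "leonardo dicaprio"))
print(has_answer_overlap(
    "Film con Leonardo DiCaprio e Kate Winslet", "leonardo dicaprio"))
```
          </preformat>
          <p>A check of this kind can only pre-screen candidates; truthfulness and answerability still require human judgement.</p>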
        </sec>
        <sec id="sec-4-8-2">
          <title>3.3. Data format</title>
          <p>Each example includes the clue-answer pair, the
word-length hint, some additional metadata (such as the
category and the page views) and the URL of
the Wikipedia page whose content was used
to generate the clue. More precisely, the
columns are: clue, answer, answer_len,
url, content, views, category, length_hint,
raw_entity. A few examples are shown in Table 1,
where, for simplicity, we report only the
clue-answer pair, the hint and the category of each
example.</p>
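          <p>To make the relation between the answer and its hint concrete, the sketch below derives a word-length hint from an answer. It rests on assumptions: the exact string format of length_hint (here, word lengths separated by spaces) and the meaning of answer_len (here, letters only, ignoring spaces) are not specified in the text, and the helper names are hypothetical.</p>
          <preformat>
```python
def length_hint(answer: str) -> str:
    """Return the length of each word in the answer, separated by spaces."""
    return " ".join(str(len(word)) for word in answer.split())


def make_record(clue: str, answer: str) -> dict:
    """Assemble the columns that can be derived from the pair alone."""
    return {
        "clue": clue,
        "answer": answer,
        # Assumption: answer_len counts letters only, ignoring spaces.
        "answer_len": len(answer.replace(" ", "")),
        # Assumption: the hint is a space-separated list of word lengths.
        "length_hint": length_hint(answer),
    }


record = make_record(
    "Protagonista di Titanic al fianco di Kate Winslet",
    "leonardo dicaprio",
)
print(record["length_hint"])  # 8 8
```
          </preformat>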
        </sec>
        <sec id="sec-4-8-3">
          <title>3.4. Examples of prompts used for zero- and/or few-shot settings</title>
          <p>We defined two different prompts, one with and the other
without indications about the word lengths of the answer.
The two prompts are presented in Figure 4 and Figure 3,
respectively.</p>
          <p>Task without hints. We construct a 2-shot prompt
(Figure 3) for the task. First, we instruct the model to
act as an expert in solving crossword clues, without any
additional hints about the structure of the answer
(such as word lengths). The format is clear and concise,
focusing on the core task: resolving the crossword
definition and providing only the solution. Then, two static
demonstration examples show
the model how to approach the task. Finally, following
the same layout, we present a new clue and expect the
model to complete it with the answer.</p>
          <p>Task with word length hints. This prompt (see
Figure 4) is very similar to the first one, but introduces a
hint indicating the word lengths of the expected answer.
The hint is a constraint that reduces the number of valid
answers, indicating both how many words
there are and how long each one is; ideally, therefore, it should
aid the language model.</p>
          <p>Prompt for the task without hints (Figure 3):</p>
          <preformat>Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.
Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro.

Esempi:
DEFINIZIONE: Protagonista di Titanic al fianco di Kate Winslet
RISPOSTA: leonardo dicaprio
DEFINIZIONE: capitale dell'Impero romano d'Occidente nel 313 d.C.
RISPOSTA: milano</preformat>
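          <p>The assembly of the 2-shot prompt for the task without hints can be sketched as below. The instruction text and the two demonstrations are copied from the prompt above; the function name, the exact placement of blank lines and the sample clue in the usage line are assumptions made for illustration.</p>
          <preformat>
```python
SYSTEM = "Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba."
TASK = "Trova la risposta alla definizione. Ritorna solo la risposta, nient'altro."

# The two static demonstrations shown in the prompt.
SHOTS = [
    ("Protagonista di Titanic al fianco di Kate Winslet", "leonardo dicaprio"),
    ("capitale dell'Impero romano d'Occidente nel 313 d.C.", "milano"),
]


def build_prompt(clue: str) -> str:
    """Build the 2-shot prompt and leave the final answer slot open."""
    lines = [SYSTEM, TASK, "", "Esempi:"]
    for shot_clue, shot_answer in SHOTS:
        lines.append(f"DEFINIZIONE: {shot_clue}")
        lines.append(f"RISPOSTA: {shot_answer}")
    # The new clue follows the same layout; the model completes the answer.
    lines.append(f"DEFINIZIONE: {clue}")
    lines.append("RISPOSTA:")
    return "\n".join(lines)


# Hypothetical clue, not taken from the dataset.
prompt = build_prompt("Fiume che attraversa Firenze")
print(prompt)
```
          </preformat>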
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>To evaluate the performance on the tasks we rely on the
following metrics: Edit Distance (ED), Exact Match (EM),
and average F1 score on words (F1).</p>
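      <p>A minimal sketch of the three metrics named above is given here. The edit distance is the standard Levenshtein distance. The Exact Match and word-level F1 definitions follow the common question-answering convention (comparison after lowercasing and trimming, F1 over overlapping words) and are an assumption, as is every function name; the official scoring code may normalise answers differently.</p>
      <preformat>
```python
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalisation: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def exact_match(prediction: str, gold: str) -> float:
    """1.0 when the normalised strings coincide, 0.0 otherwise."""
    return float(normalize(prediction) == normalize(gold))


def word_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred_words = normalize(prediction).split()
    gold_words = normalize(gold).split()
    gold_counts = Counter(gold_words)
    overlap = sum(
        min(count, gold_counts[word])
        for word, count in Counter(pred_words).items()
    )
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


print(edit_distance("milan", "milano"))           # 1
print(exact_match("Milano ", "milano"))           # 1.0
print(round(word_f1("leonardo", "leonardo dicaprio"), 4))  # 0.6667
```
      </preformat>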
      <p>Edit Distance. Edit Distance (also known as
Levenshtein Distance) measures the minimum number of
single-character edits (insertions, deletions, or
substitutions) required to change one sequence into another.
In this context, ED measures how close the generated
response is to the ground truth answer. A lower ED
indicates better performance, as it signifies that the predicted
text is more similar to the target text.</p>
      <sec id="sec-5-1">
        <title>3.5. Detailed data statistics</title>
        <p>Overall we collected 1,171 clue-answer pairs belonging
to 16 different categories. The distribution of answers
among categories is outlined in Figure 5. Most of the
examples belong to the Entertainment topic; indeed, the dataset
includes many actors, TV shows, movies and fictional characters.</p>
        <p>Prompt for the task with word-length hints (Figure 4):</p>
        <preformat>Sei un esperto di enigmistica. Devi risolvere definizioni di cruciverba.
Ti verrà data una definizione corredata da un suggerimento, una sequenza di numeri indicante di quanti caratteri è composta ciascuna parola della risposta.
Trova la risposta alla definizione.
Ritorna solo la risposta, nient'altro.</preformat>
        <p>Preliminary Results. We establish baseline results on
ECWCA, testing some of the models in the Llama family.</p>
        <p>In particular, we consider Llama3 8B and Llama3.1 8B
in both instructed and non-instructed versions, and
Llama3.1 70B-instruct, to observe how model size affects
the results. Table 2 reports the performance of the
LLMs on the two tasks (with and without word-length
hints), evaluated with the metrics defined above.
Llama3.1 8B consistently outperforms its
predecessor across all the metrics, both with and without
hints. The gap between the smaller LLMs and Llama3.1
70B-instruct is remarkable, which is consistent with larger
LLMs retaining much more knowledge.</p>
        <p>Word-length hints, instead, generally do not help
the models, and actually harm the performance of
non-instructed models. For example, the F1 score of Llama3.1
8B drops from 37.35 without hints to 27.51
with hints, and EM similarly decreases from 34.16 to
25.72. Instructed models are not harmed in this way:
for them, the hints lead to a small increase in
all the metrics. Only for Llama3.1 70B-instruct do we
observe a statistically significant improvement. This
may suggest that constraints are beneficial only for
models with stronger understanding capabilities.</p>
        <p>In Figure 6, we show how the performance of the Llama3.1
family models varies with the number of page
views. We group examples into intervals of page views and compute
the metrics on each interval. Edit distance shows no
significant trend. EM and F1 increase
on more visited pages for the 8B models, whereas
the behaviour of the 70B model seems uncorrelated
with the number of views. This suggests that the larger
number of weights in the 70B model stores broader and
deeper knowledge about world facts and entities,
covering less popular ones as well, whereas smaller LLMs
retain only the most popular factual knowledge seen
during training.</p>
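        <p>The grouping step behind this analysis can be sketched as follows: examples are assigned to page-view intervals and a per-example score (for instance Exact Match) is averaged within each interval. The interval boundaries, the function name and the toy records are hypothetical; the intervals actually used for Figure 6 are not stated in the text.</p>
        <preformat>
```python
import bisect
from collections import defaultdict


def mean_score_by_views(records, edges):
    """Average a per-example score within page-view intervals.

    records: iterable of (views, score) pairs.
    edges: ascending interval boundaries; interval 0 holds views below
    edges[0], and the last interval is open-ended.
    """
    buckets = defaultdict(list)
    for views, score in records:
        buckets[bisect.bisect_right(edges, views)].append(score)
    return {
        index: sum(scores) / len(scores)
        for index, scores in sorted(buckets.items())
    }


# Toy data: (page views, exact-match score), invented for illustration.
toy_records = [(500, 0.0), (800, 1.0), (1500, 1.0), (2500, 1.0), (50000, 1.0)]
toy_edges = [1000, 10000]  # hypothetical boundaries
print(mean_score_by_views(toy_records, toy_edges))
```
        </preformat>
        <p>Plotting the resulting averages against the interval index gives one curve per model and metric.</p>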
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Limitations</title>
      <p>Large Language Models have all been exposed to vast
amounts of data. The clues proposed in this dataset were
created from Wikipedia pages that were definitely seen by
the LLMs during training. The clues also adhere closely
to the page content, since they were created
from it. Indeed, one of the goals of the benchmark is to
assess the models' memorization of facts that are
likely to be well known to them. However, the proposed
dataset is new, hence it could not have been part of the
training set of such LLMs.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Data license and copyright issues</title>
      <p>The data is released under the Apache-2.0 license.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nickerson</surname>
          </string-name>
          ,
          <article-title>Crossword puzzles and lexical memory</article-title>
          , in:
          <source>Attention and performance VI</source>
          ,
          <publisher-name>Routledge</publisher-name>
          ,
          <year>1977</year>
          , pp.
          <fpage>699</fpage>
          -
          <lpage>718</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Yuriev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Capuano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Short</surname>
          </string-name>
          ,
          <article-title>Crossword puzzles for chemistry education: learning goals beyond vocabulary</article-title>
          ,
          <source>Chemistry Education Research and Practice</source>
          <volume>17</volume>
          (
          <year>2016</year>
          )
          <fpage>532</fpage>
          -
          <lpage>554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sandiuc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balagiu</surname>
          </string-name>
          ,
          <article-title>The use of crossword puzzles as a strategy to teach maritime english vocabulary</article-title>
          ,
          <source>Scientific Bulletin "Mircea cel Batran" Naval Academy</source>
          <volume>23</volume>
          (
          <year>2020</year>
          )
          <fpage>236A</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and eficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zugarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ernandes</surname>
          </string-name>
          ,
          <article-title>A multi-strategy approach to crossword clue answer retrieval and ranking</article-title>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ginsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <article-title>Automated crossword solving</article-title>
          ,
          <source>arXiv preprint arXiv:2205.09665</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zugarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rothenbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Klede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Eskofier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zanca</surname>
          </string-name>
          ,
          <article-title>Die Rätselrevolution: Automated German crossword solving</article-title>
          , in:
          <source>CLiC-it</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Angelini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iaquinta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stehlé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Simões</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zeinalipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zugarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <article-title>The webcrow french crossword solver</article-title>
          , in:
          <source>International Conference on Intelligent Technologies for Interactive Entertainment</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>193</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Garain</surname>
          </string-name>
          ,
          <article-title>Language models are crossword solvers</article-title>
          ,
          <source>arXiv preprint arXiv:2406.09043</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zeinalipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iaquinta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zanollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angelini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rigutini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maggini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <article-title>Italian crossword generator: Enhancing education through interactive word puzzles</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zugarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zeinalipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kadali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maggini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rigutini</surname>
          </string-name>
          ,
          <article-title>Clue-instruct: Text-based clue generation for educational crossword puzzles</article-title>
          , in:
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</source>
          , ELRA and ICCL, Torino, Italia,
          <year>2024</year>
          , pp.
          <fpage>3347</fpage>
          -
          <lpage>3356</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.297.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Musacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          , in:
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>