<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLMike: Exploring Large Language Models' Abilities in Wheel of Fortune Riddles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ejdis Gjinika</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Arici</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Loreggia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Putelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Serina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfonso Emilio Gerevini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Brescia</institution>
          ,
          <addr-line>Via Branze 38, Brescia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>A riddle from the game show “Wheel of Fortune” consists of a hidden sentence that can be discovered starting from a simple clue and by iteratively guessing its letters. Although the game is very popular and intuitive, solving one of these riddles is not trivial. In fact, for interpreting the clue, identifying the most probable letters, and leveraging the game's mechanics effectively, a player requires linguistic abilities, world knowledge, and even some form of strategic thinking. The goal of this study is to verify whether Large Language Models (LLMs) possess the necessary abilities to solve Wheel of Fortune riddles. We propose a software framework called LLMike in which an algorithmic Game Master interacts with an LLM: prompting it, enforcing the game's rules, updating the hidden sentence based on the model's guesses, and evaluating their correctness. We study several models of different sizes, evaluating their performance, behavioural patterns, and common types of errors. Our dataset and code are available at https://github.com/ejdisgjinika/LLMike.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Wheel of Fortune</kwd>
        <kwd>Model Evaluation</kwd>
        <kwd>Benchmarks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Assessing the linguistic and reasoning abilities of Large Language Models (LLMs) is an open challenge [<xref ref-type="bibr" rid="ref10 ref3 ref6 ref8">1, 2, 3, 4</xref>]. Especially in the last few years, LLMs have proved to address many Natural Language Processing tasks (such as text classification, summarization, machine translation, etc.) and their benchmarks, with performance that previously seemed unreachable. However, LLMs come with several limitations, such as hallucinations [<xref ref-type="bibr" rid="ref15">5</xref>], reasoning issues [<xref ref-type="bibr" rid="ref1">6</xref>], and lack of trustworthiness [<xref ref-type="bibr" rid="ref2">7, 8</xref>]. Therefore, researchers have started developing new methods and more challenging tasks to assess the different types of abilities that LLMs may or may not possess [<xref ref-type="bibr" rid="ref12 ref16">9, 10, 11</xref>].</p>
      <p>A popular research line is based on games [<xref ref-type="bibr" rid="ref17">12, 13</xref>], especially text-based games such as word association games [<xref ref-type="bibr" rid="ref20 ref24">14, 15</xref>] or crossword puzzles [<xref ref-type="bibr" rid="ref26 ref29">16, 17</xref>], which focus on linguistic aspects. For instance, in a crossword puzzle LLMs would obviously need linguistic abilities to interpret the clues and to insert all the words correctly. Moreover, the clues may refer to general knowledge and trivia, which must be known by the LLM. However, this game does not need particular reasoning capabilities, such as for choosing which words to complete first: LLMs may start wherever they want and complete the puzzle with knowledge alone.</p>
      <p>With non-textual games, such as Connect-4 or Tic-Tac-Toe [<xref ref-type="bibr" rid="ref31">12, 18</xref>], we can have a different situation. In fact, both of these games require a more refined strategy to win. For instance, Connect-4 is a game in which two players compete with each other. They insert coloured disks into a board, trying to form a line (vertical, horizontal, or diagonal) of four disks of the same colour, while preventing the other player from doing the same. In order for an LLM to win, it would clearly need a solid strategy to choose all its actions in a specific order, to evaluate the situation on the board and to consider all its options.</p>
      <p>Addressing linguistics, knowledge, and strategy, in this work we propose a task based on the popular “Wheel of Fortune” game show. An example of how this game works is shown in Figure 1. In order to win, a player has to guess a sentence from a simple clue. At first, only the number of words and the number of letters for each word are available. Next, the player has to spin a wheel (in which each wedge gives a different amount of money) and say a consonant, which will be revealed in the hidden sentence (if present). With some of the money earned, the player can decide to buy a vowel, which will make the guess easier. This procedure can be repeated several times until the player decides to guess the hidden sentence. If the guess is correct, the player effectively takes the money, and the overall goal is to accumulate as much money as possible. To solve this task, an LLM needs linguistic capabilities to understand the rules, which are expressed in natural language. World knowledge is also needed to solve many of the clues, based on places, movies, etc. Finally, choosing which consonants to say, whether to buy a vowel, or when to try to guess the sentence also requires some basic strategic skills.</p>
      <p>
        Category: Around the house approach combined with a local search to choose possible
word candidates and rank them for completing crossword
puzzles. This game covers diferent aspects, such as
common sense, general knowledge, and metalinguistic
patterns. Another work on crossword puzzles with human
evaluation has also been proposed in [
        <xref ref-type="bibr" rid="ref29">17</xref>
        ]. The authors
of [
        <xref ref-type="bibr" rid="ref20">14</xref>
        ] propose a challenge in which participants submit
Category: Around the house systems for the "Ghigliottina", an Italian text game where
some semantic knowledge is needed to link a group of
      </p>
      <p>L L words. Most of the proposed systems are based on
techL niques that leverage the similarity between the vector
representations of words.</p>
      <p>
        With the growing popularity of LLMs, rather than
creating ad-hoc models to play and complete games,
reCategory: Around the house searchers have begun using these games to benchmark
the general abilities of LLMs [21, 22]. Qiao et al. [20]
A V A S E F U L L introduce the concept of evaluating LLMs using
conversaO F F L O W E R S tional games, such as a round-based interaction between
a questioner and an answerer called Ask-Guess. One
of the main claims of this study is that conversational
Figure 1: Example of the gameplay of the Wheel of Fortune games can diferentiate the capabilities of diferent LLMs.
game. At the top, we show how the game starts, i.e., with a Manna et al. [
        <xref ref-type="bibr" rid="ref17">13</xref>
        ] assessed that the leading commercial
completely hidden riddle. In the middle, we show the partially models (i.e. GPT-4 and Gemini-Pro) struggle in
completcompleted riddle after one participant spins the wheel and ing a semantic connection game such as the “Ghigliottina”
chooses the letter “L”. At the bottom, we show the solution of [
        <xref ref-type="bibr" rid="ref20">14</xref>
        ]. A similar work was presented by Samardashi et al.
the game. [
        <xref ref-type="bibr" rid="ref24">15</xref>
        ], focusing on the New York Times Connections word
game, which similarly requires semantic knowledge.
      </p>
      <p>In this paper, we create LLMike, an algorithmic framework that allows LLMs to play Wheel of Fortune games. The name comes from the TV presenter of the first editions of the Italian version of Wheel of Fortune, Mike Bongiorno. LLMike prompts the LLMs with all the procedures of the game and interacts with them depending on their responses. The framework allows simple budget management and the checking of different types of errors. We tested both open-source and commercial models to see whether they are capable of completing such difficult tasks. We manually created a dataset based on some publicly available riddles. Finally, we analysed the answers provided by the models in order to understand their behaviour in the games they won, their main errors, and to give some insight into their strategy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>Games and puzzles are a recurrent testbed for assessing the capabilities of deep learning systems, especially to implement complex reasoning abilities [<xref ref-type="bibr" rid="ref17 ref24 ref26">16, 13, 15, 19, 20</xref>]. For instance, Wallace et al. [<xref ref-type="bibr" rid="ref26">16</xref>] use a neural network approach combined with a local search to choose possible word candidates and rank them for completing crossword puzzles. This game covers different aspects, such as common sense, general knowledge, and metalinguistic patterns. Another work on crossword puzzles with human evaluation has also been proposed in [<xref ref-type="bibr" rid="ref29">17</xref>]. The authors of [<xref ref-type="bibr" rid="ref20">14</xref>] propose a challenge in which participants submit systems for the “Ghigliottina”, an Italian text game where some semantic knowledge is needed to link a group of words. Most of the proposed systems are based on techniques that leverage the similarity between the vector representations of words.</p>
        <p>With the growing popularity of LLMs, rather than creating ad-hoc models to play and complete games, researchers have begun using these games to benchmark the general abilities of LLMs [21, 22]. Qiao et al. [20] introduce the concept of evaluating LLMs using conversational games, such as a round-based interaction between a questioner and an answerer called Ask-Guess. One of the main claims of this study is that conversational games can differentiate the capabilities of different LLMs. Manna et al. [<xref ref-type="bibr" rid="ref17">13</xref>] assessed that the leading commercial models (i.e. GPT-4 and Gemini-Pro) struggle in completing a semantic connection game such as the “Ghigliottina” [<xref ref-type="bibr" rid="ref20">14</xref>]. A similar work was presented by Samardashi et al. [<xref ref-type="bibr" rid="ref24">15</xref>], focusing on the New York Times Connections word game, which similarly requires semantic knowledge.</p>
        <p>Another interesting work is [23], which focuses on the role-playing abilities of LLMs combined with external tools. Similarly, the authors of [19] evaluated the abilities of several LLMs in a multi-agent scenario to solve a detective-style game. Although linguistic and world knowledge are needed, their evaluation focuses more on the strategies the agents use to play the game. More generally, the knowledge possessed by LLMs has been the subject of many studies [24], focusing on world knowledge [<xref ref-type="bibr" rid="ref30">25, 26</xref>], semantics [<xref ref-type="bibr" rid="ref33">27</xref>] and specific knowledge, such as the medical domain [<xref ref-type="bibr" rid="ref34">28</xref>].</p>
      </sec>
    </sec>
    <sec id="sec-method">
      <title>3. Methodology</title>
      <p>In this section, we explain how we structure our evaluation of the capabilities of LLMs in Wheel of Fortune riddles. First, we describe the original rules of the game; then, we describe our adaptation and implementation of the game.</p>
      <sec id="sec-method-1">
        <title>3.1. Wheel of Fortune</title>
        <p>As introduced earlier, the Wheel of Fortune is a game show that lets multiple contestants compete with each other to win the game and earn money. The goal is to correctly guess a hidden riddle by iteratively discovering its letters until the player is confident enough to formulate a guess. The game works in several rounds. In the beginning, the word puzzle is shown (with no letters present, as at the top of Figure 1); it can reveal a sentence, the name of a person, a place, etc. Each participant has a budget that starts at 0 $ and can gradually grow over the rounds. Starting from the first participant, he/she can spin a wheel composed of several wedges, with different amounts of money associated with each wedge. Next, the participant chooses a consonant: if the consonant is revealed in the hidden riddle (as in the middle of Figure 1), the participant earns the amount of cash indicated by the wedge times the number of occurrences of the chosen consonant. Next, he/she can spin the wheel again and continue to play another round. If the consonant is not present in the riddle, the participant passes the turn to another player. As the rounds progress and the player has enough money, he/she can buy a vowel for a fixed amount of the budget and has to indicate which vowel he/she chooses. If the vowel is present in the riddle, it will be revealed, but if it is not, the player passes the turn. At any time in his/her game, the player can guess the riddle by giving their final solution. If the correct answer is given, the player wins the budget he/she earned. However, if the answer is wrong, the player passes the turn.</p>
        <p>In the original game show, some special wedges of the wheel are also present: “Bankrupt”, which resets the player’s budget and passes the turn; and “Lose a turn”, which makes the player skip his/her turn.</p>
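        <p>To make the reveal-and-reward step concrete, the following minimal Python sketch (our illustration, not code taken from the LLMike repository) masks the unguessed letters of a riddle and computes the earnings as the wedge value times the occurrences of the chosen consonant:</p>
        <preformat>
# Illustrative sketch of the reveal-and-reward step; the helper names are
# hypothetical and do not come from the LLMike repository.
def reveal(riddle, guessed):
    """Mask every letter of the riddle that has not been guessed yet."""
    return "".join(c if (not c.isalpha() or c.upper() in guessed) else "_"
                   for c in riddle)

def consonant_reward(riddle, wedge, consonant):
    """Cash earned by a spin: wedge value times occurrences of the consonant."""
    return wedge * riddle.upper().count(consonant.upper())

guessed = {"L"}
print(reveal("A VASE FULL OF FLOWERS", guessed))             # _ ____ __LL __ _L_____
print(consonant_reward("A VASE FULL OF FLOWERS", 300, "L"))  # 900, i.e. 3 x 300 $
        </preformat>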
      </sec>
      <sec id="sec-method-2">
        <title>3.2. LLMike: Evaluating LLMs’ Abilities at Wheel of Fortune</title>
        <p>In the adaptation we created for evaluating LLMs’ abilities at solving Wheel of Fortune riddles, we defined two main roles: the Game Master, which is a specifically coded algorithm (not based on artificial intelligence tools) that interacts with the LLM and evaluates its answers, and the LLM, which acts as a player of the game.</p>
        <p>An overview of our adaptation is presented in Figure 2. The Game Master gives the prompt, which contains the rules, the goals, and an example of the game, and asks the LLM to select an action, starting a round. The LLM selects an action and its budget is updated. Next, the Game Master shows the new conditions of the game, i.e. the hidden riddle partially revealed and the new budget. Finally, it asks the LLM to provide a guess or pass to the next round.</p>
        <p>[Figure 2: Overview of the LLMike interaction schema. The Game Master sends the prompt, which begins: “You are a participant in the famous tv quiz show ‘Wheel of Fortune’ and the user is the game master. [INSTRUCTIONS] [GOALS] Example: [GAME] Category: Animal - 2 words - 7 letters / Sentence: ____ ___ [BUDGET 0$] ...”. The LLM then chooses between SPIN followed by a consonant (Rules 1 and 2 apply) and BUY VOWEL followed by a vowel (Rules 1, 3 and 4 apply); the Game Master adds the reward to the budget or subtracts 250 $, shows the partially hidden riddle, and makes the LLM choose between GUESS and PASS (Rule 1 applies).]</p>
        <p>We redesign the game by adapting the rules to a single-participant scenario with a slightly different round structure, as shown in Figure 2. First, we removed the special wedges from the wheel (i.e., “Bankrupt” and “Lose a turn”), because they depend only on luck, and this can lead to a non-systematic analysis of the LLM’s abilities. Therefore, our wheel has only cash wedges, all between 100 $ and 1,000 $.</p>
        <p>In our interaction schema, first the Game Master asks the LLM to spin the wheel or to buy a vowel for 250 $. After the choice made by the LLM, the riddle and the budget are adjusted accordingly and subsequently communicated to the LLM. Then, the LLM has the option to give a guess or to pass and start another round. Since we have only one LLM playing, a key difference is that in our adaptation of the game, if the LLM gives a letter that is not present in the riddle, it does not lose the turn in favour of another player; instead, its budget is set to 0 $. The goal we give to the LLM is to complete the game and to maximize the amount of money earned by solving the riddles. These goals are in line with those of a real player of Wheel of Fortune.</p>
        <p>We also formalize some rules specifically for the LLMs’ interaction with the game, intending to control and better understand the ability of the models to follow instructions. This formalization results in four rules, and a sketch of how they can be enforced is shown after the list:</p>
        <p>• Rule 1: The LLM cannot choose to do an action that is not possible in a given situation; for instance, the LLM can’t pass the turn when it is required to spin the wheel or buy a vowel.</p>
        <p>• Rule 2: If the LLM spins the wheel, it has to choose a consonant and not a vowel.</p>
        <p>• Rule 3: If the LLM buys a vowel, it has to choose a vowel and not a consonant.</p>
        <p>• Rule 4: The LLM has to buy a vowel if and only if it has enough money to do so.</p>
        <p>If the model violates one of the rules, it will automatically lose the game.</p>
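        <p>The sketch below illustrates how an algorithmic Game Master might run a single round while enforcing the four rules; the action tokens and function names are our assumptions, and only the rule logic mirrors the description above.</p>
        <preformat>
# Minimal sketch of one Game Master round with rule enforcement.
# ask_llm is any callable returning the model's raw reply; the [SPIN] and
# [BUY VOWEL] tokens follow the prompt format shown in Figure 2.
import random

VOWELS = set("AEIOU")
VOWEL_PRICE = 250

def play_round(ask_llm, riddle, revealed, budget):
    """One round of our adaptation; returns (budget, alive).

    alive becomes False as soon as one of the four rules is violated."""
    reply = ask_llm("SPIN or BUY VOWEL?").strip().upper()
    if reply.startswith("[SPIN]"):
        letter = reply.removeprefix("[SPIN]").strip()
        if len(letter) != 1 or letter in VOWELS:       # Rule 2: spin -> consonant
            return budget, False
        hits = riddle.upper().count(letter)
        wedge = random.randrange(100, 1001, 100)       # cash-only wheel
        budget = budget + wedge * hits if hits else 0  # absent letter: reset to 0 $
    elif reply.startswith("[BUY VOWEL]"):
        if VOWEL_PRICE > budget:                       # Rule 4: enough money needed
            return budget, False
        letter = reply.removeprefix("[BUY VOWEL]").strip()
        if len(letter) != 1 or letter not in VOWELS:   # Rule 3: buy -> vowel
            return budget, False
        budget -= VOWEL_PRICE
    else:                                              # Rule 1: no other action here
        return budget, False
    revealed.add(letter)
    return budget, True
        </preformat>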
        <p>Figure 2 also shows a brief version of the prompt used during the games. The prompt contains a short description of the context, followed by the instructions for playing the game, the goals, and an example. The goals are expressed in simple sentences, and the example represents a standard conversation between an LLM and the Game Master. The complete prompt is available in the GitHub repository.1</p>
        <p>Please note that the riddle cannot be solved by simply choosing all the letters in it, one at a time. In fact, all riddles are composed of consonants and vowels, but spinning the wheel lets the player choose only consonants, which leads him/her to always deal with an incomplete riddle. This forces two major decisions: buying vowels or guessing the sentence, which cannot be easily implemented in simple baseline approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Evaluation</title>
      <p>In this section, we present how our experiments were conducted, the models and data we used, how the performance was evaluated, and the results. Then, we present an analysis of the main errors made by the models and provide some intuition on their strategy.</p>
      <p>Data. Our dataset is composed of 80 riddles in English taken from a publicly available dataset4 and repurposed. The riddles are of variable length and divided into 16 categories. The shortest sentence is made up of 2 words, while the longest is made up of 9 words. In terms of the number of characters, the range is from 9 to 47. The average lengths are 19.47 characters and 3.16 words, respectively.</p>
      <p>Models and implementation details. We selected 29 open-source models available through Ollama2, which are listed in Table 1. Ollama is a framework designed to facilitate the local execution of open-source LLMs. The models considered differ considerably in terms of architecture, family, and number of parameters. Moreover, we selected three commercial models: GPT-4.1, Mistral Large 2 and Gemini 2.0 Flash3. The exact sizes of GPT-4.1 and Gemini 2.0 Flash have not been disclosed publicly; however, they are much bigger than any of the open-source models we considered. Mistral Large 2 has about 123 billion parameters. For both open-source and commercial models, the responses are generated using the default parameters.</p>
      <p>Footnotes: 1) https://github.com/ejdisgjinika/LLMike; 2) https://ollama.com/; 3) specifically, we use the "mistral-large-2411", "gemini-2.0-flash-001", and "gpt-4.1-2025-04-14" snapshots; 4) https://www.kaggle.com/datasets/darrylljk/wheel-of-fortune-answers.</p>
      <p>Metrics. Several metrics were introduced to measure the performance of LLMs in our Wheel of Fortune task. First, we consider the number of games won (# Wins) and the average amount of money won by the LLM (Total Final Budget). Other metrics are more complex and are based on the game rules listed in Section 3.2. We consider a group of metrics to evaluate the model behaviour, such as the number of letters chosen by the LLM (# Letters), the percentage of the letters that were actually found in the riddle (% Correct Letters), and the percentage of completion of the riddle when the LLM gives the right guess (% Riddle Completion). Next, we consider several error-related metrics, to understand when the model does not follow the rules (perhaps by not selecting a letter, or by trying to buy a vowel with an insufficient budget), when it simply provides a wrong guess, or when it reaches the maximum number of possible consonants.</p>
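      <p>A sketch of how the two behavioural percentages can be computed from a game record is shown below; the record fields are our assumptions, not the framework's actual API.</p>
      <preformat>
# Illustrative computation of % Correct Letters and % Riddle Completion.
def correct_letter_rate(chosen, riddle):
    """% Correct Letters: share of chosen letters actually in the riddle."""
    hits = sum(1 for c in chosen if c.upper() in riddle.upper())
    return 100.0 * hits / len(chosen) if chosen else 0.0

def riddle_completion(riddle, revealed):
    """% Riddle Completion: share of the riddle's letters already revealed."""
    letters = [c for c in riddle.upper() if c.isalpha()]
    shown = sum(1 for c in letters if c in revealed)
    return 100.0 * shown / len(letters)

# e.g. riddle_completion("A VASE FULL OF FLOWERS", {"L"}) is about 16.7
      </preformat>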
      <sec id="sec-3-3">
        <title>4.1. Results of the Best Performing Models</title>
        <p>In this section, we report the performance of LLMs in the Wheel of Fortune game. Of the more than 30 models tested, only 9 managed to guess at least one solution: three commercial models and six open-source LLMs, four of which belong to the Gemma family. Except for Gemma 2 9B, all of these models have more than 10 billion parameters. Furthermore, all models with more than 25 billion parameters can guess at least one correct solution, with the exception of Aya Expanse and Command-R.</p>
        <p>In Table 2, we show the results ordered by the number of games won. The best open-source model, by far, is Gemma 3 27B with 20 wins in 80 games, followed by Gemma 2 27B and Phi 4 14B with 8 wins, and Gemma 3 12B with 5. Although they reached one and two victories, respectively, we did not include Gemma 2 9B and Cogito 32B in Table 2, due to the low significance of their results with such a small sample.</p>
        <p>However, these victories can come from two different abilities. The first is that a model may guess as many letters as possible and progressively fill in the riddle, until the guess becomes very simple. The second is that a model may not need to fill the riddle as much as possible, because it has enough knowledge to find the correct solution of a more complicated riddle. Analysing the models' ability to choose letters, the best open-source model is Gemma 2 27B, with 68.7% of correct letters. This ability is reflected in the number of letters required to provide a correct solution, which is 8.38, the lowest of all models. The other LLMs perform worse, ranging from 51.19% (Gemma 3 12B) to 62.73% (Gemma 3 27B). All the other open-source models tend to select a higher number of letters, ranging from 11.00 to 16.8. Interestingly, Gemma 3 12B in particular tends to select as many letters as possible, filling the riddle up to 86.46% on average.</p>
        <p>Analysing the guessing capabilities, Gemma 3 27B obtains 20 victories not only by selecting letters, but also by guessing from a quite low completion of the riddle (71.30%), whereas the least performing models require a higher completion. Instead, Phi 4 14B requires an average 85.21% completion to solve a total of only 8 games. This may suggest a higher understanding and knowledge possessed by Gemma 3 27B with respect to Phi 4. A similar comparison can be made with Gemma 3 12B, which obtains only 5 wins with a riddle completion of 86.46%. In this case, the difference seems entirely dependent on the different number of parameters.</p>
        <p>Significantly better results are obtained with commercial LLMs: GPT-4.1 gets 62 wins, Gemini 2.0 Flash 35, and Mistral Large 2 25. Nevertheless, these models have similar performance with respect to the open-source models in terms of number of letters (all between 10.53 and 13.23), percentage of correct letters (which does not exceed 68%), and percentage of riddle completion. This behaviour suggests that although these larger models possess a similar ability in guessing the correct letters and completing the masked riddle, they are much better at providing the correct solution.</p>
        <p>Table 2 also reports the final budget earned by the models. The best performing model is GPT-4.1, with more than 65 $. Notably, Gemma 3 27B obtains a higher amount of money (20.6) with respect to Mistral Large 2 (15.25), despite obtaining fewer wins (20 versus 25). Since every time a model chooses a wrong consonant the budget is set to 0, this is probably due to its higher percentage of correct letters (62.73 versus 54.97).</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2. Typical Errors</title>
        <p>In this section, we discuss the most common errors made by the models considered. Since an important first result of our experiments is that 23 LLMs out of a total of 32 were unable to give a single correct solution, we first analyse their main flaws.</p>
        <p>In Figure 3, we show six types of errors made by the LLMs considered, with their frequency calculated over all 80 games. The most common error (in blue) is definitely Insufficient Budget (33.1%), in which an LLM tries to buy a vowel without the necessary money. The next error, Action Not Allowed (N/A), is quite more complex. As we show in Figure 2, the model is forced to generate specific text such as [SPIN], [BUY VOWEL] or a single consonant at different times during the game. This text indicates the choice of executing a specific action in a strict way, and any other answer is considered an Action N/A error. This error recurs 20.2% of the time. Similarly, Consonant N/A (19.4%) refers to those times when the model, after choosing to buy a vowel, selects a consonant instead. Both Action N/A and Consonant N/A denote a lack of understanding of the game rules and of the prompt instructions provided by the Game Master. Wrong Guess (14.0%) happens when the model simply provides a wrong solution to the riddle. In our analysis, an important aspect of this type of error is that often the LLM does not respect the format of the riddle, selecting words with the wrong number of letters. Moreover, some models (such as Olmo 2 and Llama 3.2) can be considered "overconfident", choosing to guess the solution with a very limited amount of letters. As Vowel N/A (12.0%), we refer to those times when the model, instead of choosing a consonant, selects a vowel. As for Action N/A and Consonant N/A, this error depends on not understanding the game rules. Finally, the remaining 1.3% of the errors occur when the model exceeds the imposed round limit (20 rounds), continuously spinning the wheel or buying vowels without trying to guess the solution of the riddle.</p>
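        <p>A minimal sketch of the strict reply checking that yields these categories might look as follows; the expected-input protocol is our illustration, while the category names match Figure 3.</p>
        <preformat>
# Illustrative classifier for the error taxonomy of Figure 3.
VOWELS = set("AEIOU")
VOWEL_PRICE = 250

def classify_reply(reply, expected, budget):
    """Map a raw reply to an error label, or "OK" when the reply is legal.

    expected is "ACTION", "CONSONANT" or "VOWEL" (hypothetical protocol)."""
    reply = reply.strip().upper()
    if expected == "ACTION":
        if reply.startswith("[BUY VOWEL]") and VOWEL_PRICE > budget:
            return "Insufficient Budget"
        if not (reply.startswith("[SPIN]") or reply.startswith("[BUY VOWEL]")):
            return "Action N/A"
    elif expected == "CONSONANT":
        if len(reply) != 1 or not reply.isalpha() or reply in VOWELS:
            return "Vowel N/A"        # a vowel (or junk) after choosing [SPIN]
    elif expected == "VOWEL":
        if len(reply) != 1 or reply not in VOWELS:
            return "Consonant N/A"    # a consonant after choosing [BUY VOWEL]
    return "OK"
        </preformat>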
        <p>[Figure 3: Error frequency for the LLMs unable to guess a single riddle. Each colour represents a different error category. The frequency of each error, as a percentage over all the 80 games for each LLM, is reported inside each sector.]</p>
        <p>Overview of the errors made by the best performing models. In the following, we investigate the flaws made by the best performing models, i.e. those reported in Table 2. Starting from GPT-4.1, the major cause of its losses is the Wrong Guess (55.56%): i.e., the model, at a certain riddle completion, has enough "confidence" to try to guess the riddle but provides the wrong answer. Despite GPT-4.1 being the best model at following the instructions, it still shows some limitations in letter choosing (11.11% of Vowel N/A and 5.56% of Consonant N/A) and in managing the budget (11.11% of Insufficient Budget Error). Gemini 2.0 Flash shows a different behaviour in terms of errors. In fact, it manifests many problems in instruction adherence and budget management (respectively 40% of Instruction Error and 33.3% of Insufficient Budget Error). Interestingly, Mistral Large 2 is good at following instructions, managing its budget and choosing the letters in the right contexts. However, it provides many wrong answers (Wrong Guess 87.27%). An interesting fact is that Mistral Large 2 and Gemma 3 27B obtain a comparable number of wins (respectively 25 and 20) even if they have a significantly different number of parameters (123 and 27 billion, respectively). Although Gemma 3 27B has a lower percentage of Wrong Guess (56.7%), its limitations in dealing with single letters (Vowel N/A 20% and Consonant N/A 5%) and budget management (10%) deteriorate its performance.</p>
      </sec>
      <sec id="sec-3-7">
        <title>4.3. Hints on Strategy</title>
        <p>In this section, we report some information regarding the strategy followed by the best performing models. We think that a total absence of strategy would result in picking random consonants. Instead, a smarter approach would be to select consonants which appear frequently in English words. To highlight this behaviour, we analyse the first letters chosen by the model. Results are available in Table 3, in which we report:</p>
        <p>• the number of different pairs of letters chosen by the LLM at the start of the game (# Pairs);</p>
        <p>• the number of different triplets of letters chosen by the LLM at the start of the game (# Triplets);</p>
        <p>• the number of vowels the model decided to buy (# Vowels).</p>
        <p>We can see that there are notable differences among the models with respect to the number of distinct pairs and triplets chosen at the start of different games. Phi 4 14B has the highest variability, selecting 35 different pairs and 61 different triplets of letters across the 80 riddles in our dataset. Instead, the best performing models (such as GPT-4.1, Gemini 2.0 Flash and Gemma 3 27B) present a much lower variability, with respectively 9, 10 and 11 different pairs and fewer than 30 different triplets. This suggests that they start many riddles with a similar strategy.</p>
        <p>Analysing the number of vowels bought by our models, we can see some other relevant information. The models with the highest variability in terms of letters chosen (Phi 4 14B and Gemma 3 12B) also tend to buy more vowels (respectively, 4.00 and 3.80 on average). Comparing these results with those in Table 2, we can see that this strategy does not provide notable advantages: in fact, they win only 8 and 5 games, respectively. Instead, the best performing models (the commercial models and Gemma 3 27B) tend to buy fewer vowels (only 2.30 for Gemma 3 27B and 2.63 for GPT-4.1), obtaining a definitely higher number of wins. Moreover, since buying vowels requires subtracting 250 $ from the budget, this decision can be considered good also for the declared goal of maximizing the earnings.</p>
        <p>Table 4. Standard frequency (%) of the five most frequent English consonants and the frequency (%) with which they are chosen by Gemma 3 27B and GPT-4.1:
Consonant   Std. Freq.   Gemma 3 27B   GPT-4.1
T           9.1          10.40         10.08
N           6.7          10.49         10.69
S           6.3          9.40          9.59
H           6.1          4.29          3.49
R           6.0          11.83         9.82
Total       34.2         46.41         43.67</p>
        <p>In Table 4 we compare the standard frequency of the five most frequent consonants in the English language5 (Std. Freq.) with the percentage of times such consonants are chosen by two LLMs: the best performing open-source one, Gemma 3 27B, and the best commercial one, GPT-4.1. We can see that the most frequent consonants (which in English are T, N, S, H, and R) are definitely those generated more frequently by the models: considering Gemma 3 27B, these consonants are 46.41% of all the letters chosen by the model; similarly, for GPT-4.1 they are 43.67%. Although this differs from the standard frequency in the English language (in which these five consonants reach a total of 34.2%), we can say that both models know which are the most common consonants and exploit this information in their games, combining linguistic knowledge and basic strategy. Both models have a very similar behaviour, with T, N, S and R being the preferred consonants (with a frequency around 10%), while H is considered less important, with a frequency that does not exceed 4%. This is quite different from the statistics calculated for the English language, in which H has a frequency of 6.1, quite similar to R (6.0) and S (6.3). This is probably because H appears in very common stop words such as the, which, and this, which may not be particularly important to solve our riddles. More specifically, models tend to start with the two most frequent consonants (T, N or S) and then buy a vowel (mostly E or A). This behaviour is constant for most of the 80 riddles of our dataset, regardless of the sentence length or other characteristics.</p>
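        <p>The opening-move statistics discussed above can be recomputed from per-game letter logs; the sketch below is our illustration, assuming each game log stores the ordered list of letters the model chose.</p>
        <preformat>
# Sketch of the Table 3 and Table 4 statistics from per-game letter sequences.
from collections import Counter

VOWELS = set("AEIOU")

def opening_stats(games):
    """games: one ordered list of chosen letters per riddle (assumed log format)."""
    pairs = {tuple(g[:2]) for g in games if len(g) >= 2}      # "# Pairs"
    triplets = {tuple(g[:3]) for g in games if len(g) >= 3}   # "# Triplets"
    avg_vowels = sum(sum(1 for l in g if l in VOWELS) for g in games) / len(games)
    return len(pairs), len(triplets), avg_vowels              # "# Vowels" average

def consonant_shares(games):
    """Share (%) of each consonant over all chosen letters, as in Table 4."""
    letters = [l for g in games for l in g]
    counts = Counter(l for l in letters if l not in VOWELS)
    return {l: 100.0 * n / len(letters) for l, n in counts.items()}
        </preformat>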
      </sec>
    </sec>
    <sec id="sec-concl">
      <title>5. Conclusions and Future Work</title>
      <p>In this paper, we proposed a novel textual game based on the famous “Wheel of Fortune” game show with the aim of assessing linguistic and reasoning abilities. We created a framework that allows LLMs to play under strict rules, and we showed how the task was structured, together with the data and the metrics used for the evaluations. We analysed 29 open-source models and 3 commercial models, covering a variety of architectures and sizes. Only 9 LLMs out of 32 managed to solve at least one riddle. The most problematic aspect is their limited ability to follow the instructions, such as the constraint of choosing only consonants. The best performing open-source model is Gemma 3 27B, with 20 wins out of 80 riddles, whereas the commercial model GPT-4.1 solves 65 riddles. Analysing their strategy, we see that the best performing models select the most frequent consonants in the English language, resulting in a progressively easier riddle. However, they can also guess the right solution with a completion of around 70%.</p>
      <p>As future work, we want to analyse the performance of Large Reasoning Models (LRMs), such as Deepseek-R1, o3 and o4-mini, and to expand the framework to let several models play with each other. Moreover, another interesting direction would be to exploit Multimodal LLMs to create a visual version of the game. We would also like to consider data in other languages. Finally, we would like to implement new games and analyse the behaviour of models in a more complex environment.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This work was carried out while the author, Ejdis Gjinika, was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome in collaboration with the University of Brescia.</p>
        <p>This work has been partly funded by Regione
Lombardia through the initiative "Programma degli interventi
per la ripresa economica: sviluppo di nuovi accordi di
collaborazione con le università per la ricerca, l’innovazione
e il trasferimento tecnologico" - DGR n. XI/4445/2021.</p>
        <p>Declaration on Generative AI: during the preparation of this work, the author(s) used ChatGPT (OpenAI) and DeepL Write / DeepL Translate in order to improve the writing style and to check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mirzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shahrokhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tuzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farajtabar</surname>
          </string-name>
          ,
          <article-title>Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models</article-title>
          ,
          <year>2024</year>
          . URL: https: //arxiv.org/abs/2410.05229.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Mo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>How trustworthy are open-source LLMs? an assessment under malicious demonstrations shows their vulnerabilities</article-title>
          , in: K. Duh,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Proceedings of the</source>
          <year>2024</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>2775</fpage>
          -
          <lpage>2792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Uluslu</surname>
          </string-name>
          , G. Schneider, Investigating linguis- URL: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>152</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>tic abilities of LLMs for native language identifi</article-title>
          - doi:10.18653/v1/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>152</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          cation, in: R. Muñoz Sánchez,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alfter</surname>
          </string-name>
          , E. Volod- [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Araki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          , G. Neubig,
          <article-title>How can we ina</article-title>
          , J. Kallas (Eds.),
          <source>Proceedings of the 14th</source>
          Work
          <article-title>- know when language models know? on the calishop on Natural Language Processing for Computer bration of language models for question answerAssisted Language Learning</article-title>
          , University of Tartu ing,
          <source>Transactions of the Association for CompuLibrary</source>
          , Tallinn, Estonia,
          <year>2025</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>88</lpage>
          . URL:
          <article-title>tational Linguistics 9 (</article-title>
          <year>2021</year>
          )
          <fpage>962</fpage>
          -
          <lpage>977</lpage>
          . URL: https: https://aclanthology.org/
          <year>2025</year>
          .nlp4call-
          <fpage>1</fpage>
          .7/. //aclanthology.org/
          <year>2021</year>
          .tacl-
          <volume>1</volume>
          .57/. doi:
          <volume>10</volume>
          .1162/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , LLa- tacl_a_
          <fpage>00407</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>MAX</surname>
          </string-name>
          :
          <article-title>Scaling linguistic horizons of LLM by en</article-title>
          - [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Laban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryscinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            . Fabbri, hancing translation capabilities beyond 100 lan
            <surname>- C. Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          , C.-S. Wu, SummEdits: Measurguages, in: Y.
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>Y.-N.</given-names>
          </string-name>
          <article-title>Chen ing LLM ability at factual reasoning through the (Eds.), Findings of the Association for Computa- lens of summarization</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
          </string-name>
          , J. Pino, tional Linguistics:
          <source>EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Association for</article-title>
          K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference Computational Linguistics</source>
          , Miami, Florida, USA,
          <source>on Empirical Methods in Natural Language Pro2024</source>
          , pp.
          <fpage>10748</fpage>
          -
          <lpage>10772</lpage>
          . URL: https://aclanthology. cessing, Association for Computational Linguisorg/
          <year>2024</year>
          .findings-emnlp.
          <volume>631</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/ tics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>9662</fpage>
          -
          <lpage>9676</lpage>
          . URL: https:
          <year>2024</year>
          .findings-emnlp.
          <volume>631</volume>
          . //aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>600</volume>
          /. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , Y. Dai,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Han,
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .emnlp-main.
          <volume>600</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Self-playing adversarial language</article-title>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Can chatgpt defend game enhances llm reasoning, in: A. Globerson, its belief in truth? evaluating LLM reasoning via L</article-title>
          .
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Belgrave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Paquet</surname>
          </string-name>
          , J. Tom- debate, in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.), czak, C. Zhang (Eds.),
          <source>Advances in Neural Informa- Findings of the Association for Computational Lintion Processing Systems</source>
          , volume
          <volume>37</volume>
          ,
          <string-name>
            <surname>Curran</surname>
          </string-name>
          Asso- guistics
          <source>: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Singapore, December 6- ciates</article-title>
          , Inc.,
          <year>2024</year>
          , pp.
          <fpage>126515</fpage>
          -
          <lpage>126543</lpage>
          . 10,
          <year>2023</year>
          , Association for Computational Linguis-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          , S. Cheng, E. Diau,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , tics,
          <year>2023</year>
          , pp.
          <fpage>11865</fpage>
          -
          <lpage>11881</lpage>
          . URL: https://doi.org/10.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey of useful LLM evaluation</article-title>
          ,
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>795</volume>
          . doi:
          <volume>10</volume>
          .18653/ CoRR abs/2406.00936 (
          <year>2024</year>
          ). URL: https://doi.org/ V1/
          <year>2023</year>
          .FINDINGS-EMNLP.
          <year>795</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          10.48550/arXiv.2406.00936. doi:
          <volume>10</volume>
          .48550/ARXIV. [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Beyond static datasets: A deep interaction approach to LLM evaluation</article-title>
          ,
          <source>CoRR abs/2309</source>
          .04369 (
          <year>2023</year>
          ). URL: https://doi.org/ 10.48550/arXiv.2309.04369. doi:
          <volume>10</volume>
          .48550/ARXIV.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2309.04369. arXiv:
          <volume>2309</volume>
          .
          <fpage>04369</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          2406.00936. arXiv:
          <volume>2406</volume>
          .
          <fpage>00936</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Meharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of hallucination in large language, image, video and audio foundation models</article-title>
          , in: Y.
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            , Y.-N. [12]
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Difenderfer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Kailkhura</surname>
          </string-name>
          , Chen (Eds.),
          <article-title>Findings of the Association for Compu- L.</article-title>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Stengel-Eskin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , K. Xu, tational Linguistics:
          <source>EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Association for Gtbench: Uncovering the strategic reasoning capaComputational Linguistics</article-title>
          , Miami, Florida, USA, bilities
          <article-title>of llms via game-theoretic evaluations</article-title>
          ,
          <source>in: 2024</source>
          , pp.
          <fpage>11709</fpage>
          -
          <lpage>11724</lpage>
          . URL: https://aclanthology. A.
          <string-name>
            <surname>Globersons</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Belgrave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , U. Paorg/
          <year>2024</year>
          .findings-emnlp.
          <volume>685</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/ quet, J. M.
          <string-name>
            <surname>Tomczak</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Zhang (Eds.),
          <source>Advances in 2024.findings-emnlp.685. Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems</source>
          <year>2024</year>
          , NeurIPS
          <year>2024</year>
          , Vancouver, BC, Canada, arXiv:
          <fpage>2407</fpage>
          .
          <fpage>07796</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>December</surname>
          </string-name>
          10 -
          <issue>15</issue>
          ,
          <year>2024</year>
          ,
          <year>2024</year>
          . [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Deciphering digital
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manna</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. P.</surname>
          </string-name>
          di Buono, J. Monti,
          <article-title>Riddle me detectives: Understanding LLM behaviors and cathis: Evaluating large language models in solving pabilities in multi-agent mystery games</article-title>
          , in: L.-W.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>word-based games</article-title>
          , in: C.
          <string-name>
            <surname>Madge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chamberlain</surname>
            , Ku,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <article-title>Findings of the K</article-title>
          . Fort,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          , S. Lukin (Eds.),
          <source>Proceedings Association for Computational Linguistics: ACL of the 10th Workshop on Games and Natural Lan- 2024</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <source>guage Processing @ LREC-COLING</source>
          <year>2024</year>
          , ELRA Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>8225</fpage>
          -
          <lpage>8291</lpage>
          . URL: https: and ICCL,
          <string-name>
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          . URL: //aclanthology.org/
          <year>2024</year>
          .findings-acl.
          <volume>490</volume>
          /. doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .games-
          <volume>1</volume>
          .11/. 18653/v1/
          <year>2024</year>
          .findings-acl.
          <volume>490</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [14] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-ai@evalita2020: Evaluating artificial players for the language game "la ghigliottina" (short paper), in: V. Basile, D. Croce, M. D. Maro, L. C. Passaro (Eds.), Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online event, December 17th, 2020, volume 2765 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: https://ceur-ws.org/Vol-2765/paper155.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20] D. Qiao, C. Wu, Y. Liang, J. Li, N. Duan, Gameeval: Evaluating llms on conversational games, 2023. URL: https://arxiv.org/abs/2308.10032. arXiv:2308.10032.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21] Y. Wu, X. Tang, T. M. Mitchell, Y. Li, Smartplay: A benchmark for llms as intelligent agents, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024. URL: https://openreview.net/forum?id=S2oTVrlcp3.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22] J. Huang, E. J. Li, M. H. Lam, T. Liang, W. Wang, Y. Yuan, W. Jiao, X. Wang, Z. Tu, M. R. Lyu, Competing large language models in multi-agent gaming environments, in: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, OpenReview.net, 2025. URL: https://openreview.net/forum?id=DI4gW8viB6.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [15] P. Samdarshi, M. Mustafa, A. Kulkarni, R. Rothkopf, T. Chakrabarty, S. Muresan, Connecting the dots: Evaluating abstract reasoning capabilities of LLMs using the New York Times connections word game, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 21219-21236. URL: https://aclanthology.org/2024.emnlp-main.1182/. doi:10.18653/v1/2024.emnlp-main.1182.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [23] M. Shanahan, K. McDonell, L. Reynolds, Role play with large language models, Nat. 623 (2023) 493-498. URL: https://doi.org/10.1038/s41586-023-06647-8. doi:10.1038/S41586-023-06647-8.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [16] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073-3085. URL: https://aclanthology.org/2022.acl-long.219/. doi:10.18653/v1/2022.acl-long.219.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [24] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842-866. URL: https://aclanthology.org/2020.tacl-1.54/. doi:10.1162/tacl_a_00349.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [25] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, A. Miller, Language models as knowledge bases?, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2463-2473. URL: https://aclanthology.org/D19-1250/. doi:10.18653/v1/D19-1250.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [17] K. Zeinalipour, A. Fusco, A. Zanollo, M. Maggini, M. Gori, Harnessing llms for educational content-driven italian crossword generation, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4-6, 2024, volume 3878 of CEUR Workshop Proceedings, CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3878/110_main_long.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [26] M. Wang, Y. Yao, Z. Xu, S. Qiao, S. Deng, P. Wang, X. Chen, J.-C. Gu, Y. Jiang, P. Xie, F. Huang, H. Chen, N. Zhang, Knowledge mechanisms in large language models: A survey and perspective, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 7097-7135. URL: https://aclanthology.org/2024.findings-emnlp.416/. doi:10.18653/v1/2024.findings-emnlp.416.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [18] O. Topsakal, C. J. Edell, J. B. Harper, Evaluating large language models with grid-based game competitions: An extensible llm benchmark and leaderboard, 2024. URL: https://arxiv.org/abs/2407.07796. arXiv:2407.07796.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          org/
          <year>2024</year>
          .findings-emnlp.
          <volume>416</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .findings-emnlp.
          <volume>416</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [27] L. Serina, L. Putelli, A. E. Gerevini, I. Serina, Synonyms, antonyms and factual knowledge in BERT heads, Future Internet 15 (2023) 230. URL: https://doi.org/10.3390/fi15070230. doi:10.3390/FI15070230.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [28] L. Putelli, A. E. Gerevini, A. Lavelli, T. Mehmood, I. Serina, On the behaviour of bert's attention for the classification of medical reports, in: C. Musto, R. Guidotti, A. Monreale, G. Semeraro (Eds.), Proceedings of the 3rd Italian Workshop on Explainable Artificial Intelligence co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AIxIA 2022), Udine, Italy, November 28 - December 3, 2022, volume 3277 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 16-30.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>