<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLMike: Exploring Large Language Models' Abilities in Wheel of Fortune Riddles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ejdis Gjinika</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Arici</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Loreggia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Putelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Serina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfonso Emilio Gerevini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Brescia</institution>
          ,
          <addr-line>Via Branze 38, Brescia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>A riddle from the game show “Wheel of Fortune” consists of a hidden sentence that can be discovered starting from a simple clue and by iteratively guessing its letters. Although the game is very popular and intuitive, solving one of these riddles is not trivial. In fact, for interpreting the clue, identifying the most probable letters, and leveraging the game's mechanics effectively, a player requires linguistic abilities, world knowledge, and even some form of strategic thinking. The goal of this study is to verify whether Large Language Models (LLMs) possess the necessary abilities to solve Wheel of Fortune riddles. We propose a software framework called LLMike in which an algorithmic Game Master interacts with an LLM: prompting it, enforcing the game's rules, updating the hidden sentence based on the model's guesses, and evaluating their correctness. We study several models of different sizes, evaluating their performance, behavioural patterns, and common types of errors. Our dataset and code are available at https://github.com/ejdisgjinika/LLMike.</p>
      </abstract>
      <kwd-group>
<kwd>Large Language Models</kwd>
        <kwd>Wheel of Fortune</kwd>
        <kwd>Model Evaluation</kwd>
        <kwd>Benchmarks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Assessing the linguistic and reasoning abilities of Large Language Models (LLMs) is an open challenge [<xref ref-type="bibr" rid="ref10 ref3 ref6 ref8">1, 2, 3, 4</xref>]. Especially in the last few years, LLMs have proved to address many Natural Language Processing tasks (such as text classification, summarization, machine translation, etc.) and their benchmarks, with performance that previously seemed unreachable. However, LLMs come with several limitations, such as hallucinations [<xref ref-type="bibr" rid="ref15">5</xref>], reasoning issues [<xref ref-type="bibr" rid="ref1">6</xref>], and lack of trustworthiness [<xref ref-type="bibr" rid="ref2">7, 8</xref>]. Therefore, researchers have started developing new methods and more challenging tasks to assess the different types of abilities that LLMs may or may not possess [<xref ref-type="bibr" rid="ref12 ref16">9, 10, 11</xref>].</p>
      <p>A popular research line is based on games [<xref ref-type="bibr" rid="ref17">12, 13</xref>], especially text-based games such as word association games [<xref ref-type="bibr" rid="ref20 ref24">14, 15</xref>] or crossword puzzles [<xref ref-type="bibr" rid="ref26 ref29">16, 17</xref>], which focus on linguistic aspects. For instance, in a crossword puzzle LLMs would obviously need linguistic abilities to interpret the clues and to insert all the words correctly. Moreover, the clues may refer to general knowledge and trivia, which must be known by the LLM. However, this game does not need particular reasoning capabilities, such as for choosing which words to complete first: LLMs may start wherever they want and complete the puzzle with knowledge alone.</p>
      <p>With non-textual games, such as Connect-4 or Tic-Tac-Toe [<xref ref-type="bibr" rid="ref31">12, 18</xref>], we can have a different situation. In fact, both of these games require a more refined strategy to win. For instance, Connect-4 is a game in which two players compete with each other. They insert coloured disks into a board, trying to form a line (vertical, horizontal, or diagonal) of four disks of the same colour, while preventing the other player from doing the same. In order for an LLM to win, it would clearly need a solid strategy to choose all its actions in a specific order, to evaluate the situation on the board and to consider all its options.</p>
      <p>Addressing linguistics, knowledge, and strategy, in this work we propose a task based on the popular “Wheel of Fortune” game show. An example of how this game works is shown in Figure 1. In order to win, a player has to guess a sentence from a simple clue. At first, only the number of words and the number of letters for each word are available. Next, the player has to spin a wheel (in which each wedge gives a different amount of money) and say a consonant, which will be revealed in the hidden sentence (if present). With some of the money earned, the player can decide to buy a vowel, which will make the guess easier. This procedure can be repeated several times until the player decides to guess the hidden sentence. If the guess is correct, the player effectively takes the money, and the overall goal is to accumulate as much money as possible. To solve this task, an LLM needs linguistic capabilities to understand the rules, which are expressed in natural language. World knowledge is also needed to solve many of the clues, based on places, movies, etc. Finally, choosing which consonants to say, whether to buy a vowel, or when to try to guess the sentence also requires some basic strategic skills.</p>
      <p>
        Category: Around the house approach combined with a local search to choose possible
word candidates and rank them for completing crossword
puzzles. This game covers diferent aspects, such as
common sense, general knowledge, and metalinguistic
patterns. Another work on crossword puzzles with human
evaluation has also been proposed in [
        <xref ref-type="bibr" rid="ref29">17</xref>
        ]. The authors
of [
        <xref ref-type="bibr" rid="ref20">14</xref>
        ] propose a challenge in which participants submit
Category: Around the house systems for the "Ghigliottina", an Italian text game where
some semantic knowledge is needed to link a group of
      </p>
      <p>L L words. Most of the proposed systems are based on
techL niques that leverage the similarity between the vector
representations of words.</p>
      <p>
        With the growing popularity of LLMs, rather than
creating ad-hoc models to play and complete games,
reCategory: Around the house searchers have begun using these games to benchmark
the general abilities of LLMs [21, 22]. Qiao et al. [20]
A V A S E F U L L introduce the concept of evaluating LLMs using
conversaO F F L O W E R S tional games, such as a round-based interaction between
a questioner and an answerer called Ask-Guess. One
of the main claims of this study is that conversational
Figure 1: Example of the gameplay of the Wheel of Fortune games can diferentiate the capabilities of diferent LLMs.
game. At the top, we show how the game starts, i.e., with a Manna et al. [
        <xref ref-type="bibr" rid="ref17">13</xref>
        ] assessed that the leading commercial
completely hidden riddle. In the middle, we show the partially models (i.e. GPT-4 and Gemini-Pro) struggle in
completcompleted riddle after one participant spins the wheel and ing a semantic connection game such as the “Ghigliottina”
chooses the letter “L”. At the bottom, we show the solution of [
        <xref ref-type="bibr" rid="ref20">14</xref>
        ]. A similar work was presented by Samardashi et al.
the game. [
        <xref ref-type="bibr" rid="ref24">15</xref>
        ], focusing on the New York Times Connections word
game, which similarly requires semantic knowledge.
      </p>
      <p>In this paper, we create LLMike, an algorithmic framework that allows LLMs to play Wheel of Fortune games. The name comes from the TV presenter of the first editions of the Italian version of Wheel of Fortune, Mike Bongiorno. LLMike prompts the LLMs with all the procedures of the game and interacts with them depending on their responses. The framework allows simple budget management and the checking of different types of errors. We tested both open-source and commercial models to see whether they are capable of completing such difficult tasks. We manually created a dataset based on some publicly available riddles. Finally, we analysed the answers provided by the models in order to understand their behaviour in the games they won, their main errors, and to give some insight into their strategy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>Games and puzzles are a recurrent testbed for assessing the capabilities of deep learning systems, especially to implement complex reasoning abilities [<xref ref-type="bibr" rid="ref17 ref24 ref26">16, 13, 15, 19, 20</xref>]. For instance, Wallace et al. [<xref ref-type="bibr" rid="ref26">16</xref>] use a neural network approach combined with a local search to choose possible word candidates and rank them for completing crossword puzzles. This game covers different aspects, such as common sense, general knowledge, and metalinguistic patterns. Another work on crossword puzzles with human evaluation has also been proposed in [<xref ref-type="bibr" rid="ref29">17</xref>]. The authors of [<xref ref-type="bibr" rid="ref20">14</xref>] propose a challenge in which participants submit systems for the “Ghigliottina”, an Italian text game where some semantic knowledge is needed to link a group of words. Most of the proposed systems are based on techniques that leverage the similarity between the vector representations of words.</p>
        <p>With the growing popularity of LLMs, rather than creating ad-hoc models to play and complete games, researchers have begun using these games to benchmark the general abilities of LLMs [21, 22]. Qiao et al. [20] introduce the concept of evaluating LLMs using conversational games, such as a round-based interaction between a questioner and an answerer called Ask-Guess. One of the main claims of this study is that conversational games can differentiate the capabilities of different LLMs. Manna et al. [<xref ref-type="bibr" rid="ref17">13</xref>] assessed that the leading commercial models (i.e. GPT-4 and Gemini-Pro) struggle in completing a semantic connection game such as the “Ghigliottina” [<xref ref-type="bibr" rid="ref20">14</xref>]. A similar work was presented by Samardashi et al. [<xref ref-type="bibr" rid="ref24">15</xref>], focusing on the New York Times Connections word game, which similarly requires semantic knowledge.</p>
        <p>Another interesting work is [23], which focuses on the role-playing abilities of LLMs combined with external tools. Similarly, the authors of [19] evaluated the abilities of several LLMs in a multi-agent scenario to solve a detective-style game. Although linguistic and world knowledge are needed, their evaluation focuses more on the strategies the agents use to play the game. More generally, the knowledge possessed by LLMs has been the subject of many studies [24], focusing on world knowledge [<xref ref-type="bibr" rid="ref30">25, 26</xref>], semantics [<xref ref-type="bibr" rid="ref33">27</xref>] and specific knowledge, such as the medical domain [<xref ref-type="bibr" rid="ref34">28</xref>].</p>
      </sec>
    </sec>
    <sec id="sec-method">
      <title>3. Methodology</title>
      <p>In this section, we explain how we structure our evaluation of the capabilities of LLMs in Wheel of Fortune riddles. First, we describe the original rules of the game; then, we describe our adaptation and implementation of the game.</p>
      <sec id="sec-method-1">
        <title>3.1. Wheel of Fortune</title>
        <p>As introduced earlier, the Wheel of Fortune is a game show that lets multiple contestants compete with each other to win the game and earn money. The goal is to correctly guess a hidden riddle by iteratively discovering its letters until the player is confident enough to formulate a guess. The game works in several rounds. In the beginning, the word puzzle is shown (with no letters present, as at the top of Figure 1); it can reveal a sentence, the name of a person, a place, etc. Each participant has a budget that starts at 0 $ and can gradually grow over the rounds. Starting from the first participant, he/she can spin a wheel composed of several wedges, with different amounts of money associated with each wedge. Next, the participant chooses a consonant: if the consonant is revealed in the hidden riddle (as in the middle of Figure 1), the participant earns the amount of cash indicated by the wedge times the number of occurrences of the chosen consonant. Next, he/she can spin the wheel again and continue to play another round. If the consonant is not present in the riddle, the participant passes the turn to another player. As the rounds progress and the player has enough money, he/she can buy a vowel for a fixed amount of the budget and has to indicate which vowel he/she chooses. If the vowel is present in the riddle, it will be revealed, but if it is not, the player passes the turn. At any time in his/her game, the player can guess the riddle by giving their final solution. If the correct answer is given, the player wins the budget he/she earned. However, if the answer is wrong, the player passes the turn.</p>
        <p>In the original game show, some special wedges of the wheel are also present: “Bankrupt”, which resets the player’s budget and passes the turn; and “Lose a turn”, which makes the player skip his/her turn.</p>
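        <p>To make the reveal-and-reward step concrete, the following minimal Python sketch (our illustration, not code taken from the LLMike repository) masks the unguessed letters of a riddle and computes the earnings as the wedge value times the occurrences of the chosen consonant:</p>
        <preformat>
# Illustrative sketch of the reveal-and-reward step; the helper names are
# hypothetical and do not come from the LLMike repository.
def reveal(riddle, guessed):
    """Mask every letter of the riddle that has not been guessed yet."""
    return "".join(c if (not c.isalpha() or c.upper() in guessed) else "_"
                   for c in riddle)

def consonant_reward(riddle, wedge, consonant):
    """Cash earned by a spin: wedge value times occurrences of the consonant."""
    return wedge * riddle.upper().count(consonant.upper())

guessed = {"L"}
print(reveal("A VASE FULL OF FLOWERS", guessed))             # _ ____ __LL __ _L_____
print(consonant_reward("A VASE FULL OF FLOWERS", 300, "L"))  # 900, i.e. 3 x 300 $
        </preformat>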
      </sec>
      <sec id="sec-method-2">
        <title>3.2. LLMike: Evaluating LLMs’ Abilities at Wheel of Fortune</title>
        <p>In the adaptation we created for evaluating LLMs’ abilities at solving Wheel of Fortune riddles, we defined two main roles: the Game Master, which is a specifically coded algorithm (not based on artificial intelligence tools) that interacts with the LLM and evaluates its answers, and the LLM, which acts as a player of the game.</p>
        <p>An overview of our adaptation is presented in Figure 2. The Game Master gives the prompt, which contains the rules, the goals, and an example of the game, and asks the LLM to select an action, starting a round. The LLM selects an action and its budget is updated. Next, the Game Master shows the new conditions of the game, i.e. the hidden riddle partially revealed and the new budget. Finally, it asks the LLM to provide a guess or pass to the next round.</p>
        <p>[Figure 2: Overview of the LLMike interaction schema. The Game Master sends the prompt, which begins: “You are a participant in the famous tv quiz show ‘Wheel of Fortune’ and the user is the game master. [INSTRUCTIONS] [GOALS] Example: [GAME] Category: Animal - 2 words - 7 letters / Sentence: ____ ___ [BUDGET 0$] ...”. The LLM then chooses between SPIN followed by a consonant (Rules 1 and 2 apply) and BUY VOWEL followed by a vowel (Rules 1, 3 and 4 apply); the Game Master adds the reward to the budget or subtracts 250 $, shows the partially hidden riddle, and makes the LLM choose between GUESS and PASS (Rule 1 applies).]</p>
        <p>We redesign the game by adapting the rules to a single-participant scenario with a slightly different round structure, as shown in Figure 2. First, we removed the special wedges from the wheel (i.e., “Bankrupt” and “Lose a turn”), because they depend only on luck, and this can lead to a non-systematic analysis of the LLM’s abilities. Therefore, our wheel has only cash wedges, all between 100 $ and 1,000 $.</p>
        <p>In our interaction schema, first the Game Master asks the LLM to spin the wheel or to buy a vowel for 250 $. After the choice made by the LLM, the riddle and the budget are adjusted accordingly and subsequently communicated to the LLM. Then, the LLM has the option to give a guess or to pass and start another round. Since we have only one LLM playing, a key difference is that in our adaptation of the game, if the LLM gives a letter that is not present in the riddle, it does not lose the turn in favour of another player; instead, its budget is set to 0 $. The goal we give to the LLM is to complete the game and to maximize the amount of money earned by solving the riddles. These goals are in line with those of a real player of Wheel of Fortune.</p>
        <p>We also formalize some rules specifically for the LLMs’ interaction with the game, intending to control and better understand the ability of the models to follow instructions. This formalization results in four rules, and a sketch of how they can be enforced is shown after the list:</p>
        <p>• Rule 1: The LLM cannot choose to do an action that is not possible in a given situation; for instance, the LLM can’t pass the turn when it is required to spin the wheel or buy a vowel.</p>
        <p>• Rule 2: If the LLM spins the wheel, it has to choose a consonant and not a vowel.</p>
        <p>• Rule 3: If the LLM buys a vowel, it has to choose a vowel and not a consonant.</p>
        <p>• Rule 4: The LLM has to buy a vowel if and only if it has enough money to do so.</p>
        <p>If the model violates one of the rules, it will automatically lose the game.</p>
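        <p>The sketch below illustrates how an algorithmic Game Master might run a single round while enforcing the four rules; the action tokens and function names are our assumptions, and only the rule logic mirrors the description above.</p>
        <preformat>
# Minimal sketch of one Game Master round with rule enforcement.
# ask_llm is any callable returning the model's raw reply; the [SPIN] and
# [BUY VOWEL] tokens follow the prompt format shown in Figure 2.
import random

VOWELS = set("AEIOU")
VOWEL_PRICE = 250

def play_round(ask_llm, riddle, revealed, budget):
    """One round of our adaptation; returns (budget, alive).

    alive becomes False as soon as one of the four rules is violated."""
    reply = ask_llm("SPIN or BUY VOWEL?").strip().upper()
    if reply.startswith("[SPIN]"):
        letter = reply.removeprefix("[SPIN]").strip()
        if len(letter) != 1 or letter in VOWELS:       # Rule 2: spin -> consonant
            return budget, False
        hits = riddle.upper().count(letter)
        wedge = random.randrange(100, 1001, 100)       # cash-only wheel
        budget = budget + wedge * hits if hits else 0  # absent letter: reset to 0 $
    elif reply.startswith("[BUY VOWEL]"):
        if VOWEL_PRICE > budget:                       # Rule 4: enough money needed
            return budget, False
        letter = reply.removeprefix("[BUY VOWEL]").strip()
        if len(letter) != 1 or letter not in VOWELS:   # Rule 3: buy -> vowel
            return budget, False
        budget -= VOWEL_PRICE
    else:                                              # Rule 1: no other action here
        return budget, False
    revealed.add(letter)
    return budget, True
        </preformat>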
        <p>Figure 2 also shows a brief version of the prompt used during the games. The prompt contains a short description of the context, followed by the instructions for playing the game, the goals, and an example. The goals are expressed in simple sentences, and the example represents a standard conversation between an LLM and the Game Master. The complete prompt is available in the GitHub repository.1</p>
        <p>Please note that the riddle cannot be solved by simply choosing all the letters in it, one at a time. In fact, all riddles are composed of consonants and vowels, but spinning the wheel lets the player choose only consonants, which leads him/her to always deal with an incomplete riddle. This forces two major decisions: buying vowels or guessing the sentence, which cannot be easily implemented in simple baseline approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Evaluation</title>
      <p>In this section, we present how our experiments were conducted, the models and data we used, how the performance was evaluated, and the results. Then, we present an analysis of the main errors made by the models and provide some intuition on their strategy.</p>
      <p>Data. Our dataset is composed of 80 riddles in English taken from a publicly available dataset4 and repurposed. The riddles are of variable length and divided into 16 categories. The shortest sentence is made up of 2 words, while the longest is made up of 9 words. In terms of the number of characters, the range is from 9 to 47. The average lengths are 19.47 characters and 3.16 words, respectively.</p>
      <p>Models and implementation details. We selected 29 open-source models available through Ollama2, which are listed in Table 1. Ollama is a framework designed to facilitate the local execution of open-source LLMs. The models considered differ considerably in terms of architecture, family, and number of parameters. Moreover, we selected three commercial models: GPT-4.1, Mistral Large 2 and Gemini 2.0 Flash3. The exact sizes of GPT-4.1 and Gemini 2.0 Flash have not been disclosed publicly; however, they are much bigger than any of the open-source models we considered. Mistral Large 2 has about 123 billion parameters. For both open-source and commercial models, the responses are generated using the default parameters.</p>
      <p>Footnotes: 1) https://github.com/ejdisgjinika/LLMike; 2) https://ollama.com/; 3) specifically, we use the "mistral-large-2411", "gemini-2.0-flash-001", and "gpt-4.1-2025-04-14" snapshots; 4) https://www.kaggle.com/datasets/darrylljk/wheel-of-fortune-answers.</p>
      <p>Metrics. Several metrics were introduced to measure the performance of LLMs in our Wheel of Fortune task. First, we consider the number of games won (# Wins) and the average amount of money won by the LLM (Total Final Budget). Other metrics are more complex and are based on the game rules listed in Section 3.2. We consider a group of metrics to evaluate the model behaviour, such as the number of letters chosen by the LLM (# Letters), the percentage of the letters that were actually found in the riddle (% Correct Letters), and the percentage of completion of the riddle when the LLM gives the right guess (% Riddle Completion). Next, we consider several error-related metrics, to understand when the model does not follow the rules (perhaps by not selecting a letter, or by trying to buy a vowel with an insufficient budget), when it simply provides a wrong guess, or when it reaches the maximum number of possible consonants.</p>
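      <p>A sketch of how the two behavioural percentages can be computed from a game record is shown below; the record fields are our assumptions, not the framework's actual API.</p>
      <preformat>
# Illustrative computation of % Correct Letters and % Riddle Completion.
def correct_letter_rate(chosen, riddle):
    """% Correct Letters: share of chosen letters actually in the riddle."""
    hits = sum(1 for c in chosen if c.upper() in riddle.upper())
    return 100.0 * hits / len(chosen) if chosen else 0.0

def riddle_completion(riddle, revealed):
    """% Riddle Completion: share of the riddle's letters already revealed."""
    letters = [c for c in riddle.upper() if c.isalpha()]
    shown = sum(1 for c in letters if c in revealed)
    return 100.0 * shown / len(letters)

# e.g. riddle_completion("A VASE FULL OF FLOWERS", {"L"}) is about 16.7
      </preformat>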
      <sec id="sec-3-3">
        <title>4.1. Results of the Best Performing Models</title>
        <p>In this section, we report the performance of LLMs in the Wheel of Fortune game. Of the more than 30 models tested, only 9 managed to guess at least one solution: three commercial models and six open-source LLMs, four of which belong to the Gemma family. Except for Gemma 2 9B, all of these models have more than 10 billion parameters. Furthermore, all models with more than 25 billion parameters can guess at least one correct solution, with the exception of Aya Expanse and Command-R.</p>
        <p>In Table 2, we show the results ordered by the number of games won. The best open-source model, by far, is Gemma 3 27B with 20 wins in 80 games, followed by Gemma 2 27B and Phi 4 14B with 8 wins, and Gemma 3 12B with 5. Although they reached one and two victories, respectively, we did not include Gemma 2 9B and Cogito 32B in Table 2, due to the low significance of their results with such a small sample.</p>
        <p>However, these victories can come from two different abilities. The first is that a model may guess as many letters as possible and progressively fill in the riddle, until the guess becomes very simple. The second is that a model may not need to fill the riddle as much as possible, because it has enough knowledge to find the correct solution of a more complicated riddle. Analysing the models' ability to choose letters, the best open-source model is Gemma 2 27B, with 68.7% of correct letters. This ability is reflected in the number of letters required to provide a correct solution, which is 8.38, the lowest of all models. The other LLMs perform worse, ranging from 51.19% (Gemma 3 12B) to 62.73% (Gemma 3 27B). All the other open-source models tend to select a higher number of letters, ranging from 11.00 to 16.8. Interestingly, Gemma 3 12B in particular tends to select as many letters as possible, filling the riddle up to 86.46% on average.</p>
        <p>Analysing the guessing capabilities, Gemma 3 27B obtains 20 victories not only by selecting letters, but also by guessing from a quite low completion of the riddle (71.30%), whereas the least performing models require a higher completion. Instead, Phi 4 14B requires an average 85.21% completion to solve a total of only 8 games. This may suggest a higher understanding and knowledge possessed by Gemma 3 27B with respect to Phi 4. A similar comparison can be made with Gemma 3 12B, which obtains only 5 wins with a riddle completion of 86.46%. In this case, the difference seems entirely dependent on the different number of parameters.</p>
        <p>Significantly better results are obtained with commercial LLMs: GPT-4.1 gets 62 wins, Gemini 2.0 Flash 35, and Mistral Large 2 25. Nevertheless, these models have similar performance with respect to the open-source models in terms of number of letters (all between 10.53 and 13.23), percentage of correct letters (which does not exceed 68%), and percentage of riddle completion. This behaviour suggests that although these larger models possess a similar ability in guessing the correct letters and completing the masked riddle, they are much better at providing the correct solution.</p>
        <p>Table 2 also reports the final budget earned by the models. The best performing model is GPT-4.1, with more than 65 $. Notably, Gemma 3 27B obtains a higher amount of money (20.6) with respect to Mistral Large 2 (15.25), despite obtaining fewer wins (20 versus 25). Since every time a model chooses a wrong consonant the budget is set to 0, this is probably due to its higher percentage of correct letters (62.73 versus 54.97).</p>
      </sec>
      <sec id="sec-3-4">
        <title>4.2. Typical Errors</title>
        <p>In this section, we discuss the most common errors made by the models considered. Since an important first result of our experiments is that 23 LLMs out of a total of 32 were unable to give a single correct solution, we first analyse their main flaws.</p>
        <p>In Figure 3, we show six types of errors made by the LLMs considered, with their frequency calculated over all 80 games. The most common error (in blue) is definitely Insufficient Budget (33.1%), in which an LLM tries to buy a vowel without the necessary money. The next error, Action Not Allowed (N/A), is quite more complex. As we show in Figure 2, the model is forced to generate specific text such as [SPIN], [BUY VOWEL] or a single consonant at different times during the game. This text indicates the choice of executing a specific action in a strict way, and any other answer is considered an Action N/A error. This error recurs 20.2% of the time. Similarly, Consonant N/A (19.4%) refers to those times when the model, after choosing to buy a vowel, selects a consonant instead. Both Action N/A and Consonant N/A denote a lack of understanding of the game rules and of the prompt instructions provided by the Game Master. Wrong Guess (14.0%) happens when the model simply provides a wrong solution to the riddle. In our analysis, an important aspect of this type of error is that often the LLM does not respect the format of the riddle, selecting words with the wrong number of letters. Moreover, some models (such as Olmo 2 and Llama 3.2) can be considered "overconfident", choosing to guess the solution with a very limited amount of letters. As Vowel N/A (12.0%), we refer to those times when the model, instead of choosing a consonant, selects a vowel. As for Action N/A and Consonant N/A, this error depends on not understanding the game rules. Finally, the remaining 1.3% of the errors occur when the model exceeds the imposed round limit (20 rounds), continuously spinning the wheel or buying vowels without trying to guess the solution of the riddle.</p>
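        <p>A minimal sketch of the strict reply checking that yields these categories might look as follows; the expected-input protocol is our illustration, while the category names match Figure 3.</p>
        <preformat>
# Illustrative classifier for the error taxonomy of Figure 3.
VOWELS = set("AEIOU")
VOWEL_PRICE = 250

def classify_reply(reply, expected, budget):
    """Map a raw reply to an error label, or "OK" when the reply is legal.

    expected is "ACTION", "CONSONANT" or "VOWEL" (hypothetical protocol)."""
    reply = reply.strip().upper()
    if expected == "ACTION":
        if reply.startswith("[BUY VOWEL]") and VOWEL_PRICE > budget:
            return "Insufficient Budget"
        if not (reply.startswith("[SPIN]") or reply.startswith("[BUY VOWEL]")):
            return "Action N/A"
    elif expected == "CONSONANT":
        if len(reply) != 1 or not reply.isalpha() or reply in VOWELS:
            return "Vowel N/A"        # a vowel (or junk) after choosing [SPIN]
    elif expected == "VOWEL":
        if len(reply) != 1 or reply not in VOWELS:
            return "Consonant N/A"    # a consonant after choosing [BUY VOWEL]
    return "OK"
        </preformat>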
        <p>[Figure 3: Error frequency for the LLMs unable to guess a single riddle. Each colour represents a different error category. The frequency of each error, as a percentage over all the 80 games for each LLM, is reported inside each sector.]</p>
        <p>Overview of the errors made by the best performing models. In the following, we investigate the flaws made by the best performing models, i.e. those reported in Table 2. Starting from GPT-4.1, the major cause of its losses is the Wrong Guess (55.56%): i.e., the model, at a certain riddle completion, has enough "confidence" to try to guess the riddle but provides the wrong answer. Despite GPT-4.1 being the best model at following the instructions, it still shows some limitations in letter choosing (11.11% of Vowel N/A and 5.56% of Consonant N/A) and in managing the budget (11.11% of Insufficient Budget Error). Gemini 2.0 Flash shows a different behaviour in terms of errors. In fact, it manifests many problems in instruction adherence and budget management (respectively 40% of Instruction Error and 33.3% of Insufficient Budget Error). Interestingly, Mistral Large 2 is good at following instructions, managing its budget and choosing the letters in the right contexts. However, it provides many wrong answers (Wrong Guess 87.27%). An interesting fact is that Mistral Large 2 and Gemma 3 27B obtain a comparable number of wins (respectively 25 and 20) even if they have a significantly different number of parameters (123 and 27 billion, respectively). Although Gemma 3 27B has a lower percentage of Wrong Guess (56.7%), its limitations in dealing with single letters (Vowel N/A 20% and Consonant N/A 5%) and budget management (10%) deteriorate its performance.</p>
      </sec>
      <sec id="sec-3-7">
        <title>4.3. Hints on Strategy</title>
        <p>In this section, we report some information regarding the strategy followed by the best performing models. We think that a total absence of strategy would result in picking random consonants. Instead, a smarter approach would be to select consonants which appear frequently in English words. To highlight this behaviour, we analyse the first letters chosen by the model. Results are available in Table 3, in which we report:</p>
        <p>• the number of different pairs of letters chosen by the LLM at the start of the game (# Pairs);</p>
        <p>• the number of different triplets of letters chosen by the LLM at the start of the game (# Triplets);</p>
        <p>• the number of vowels the model decided to buy (# Vowels).</p>
        <p>We can see that there are notable differences among the models with respect to the number of distinct pairs and triplets chosen at the start of different games. Phi 4 14B has the highest variability, selecting 35 different pairs and 61 different triplets of letters across the 80 riddles in our dataset. Instead, the best performing models (such as GPT-4.1, Gemini 2.0 Flash and Gemma 3 27B) present a much lower variability, with respectively 9, 10 and 11 different pairs and fewer than 30 different triplets. This suggests that they start many riddles with a similar strategy.</p>
        <p>Analysing the number of vowels bought by our models, we can see some other relevant information. The models with the highest variability in terms of letters chosen (Phi 4 14B and Gemma 3 12B) also tend to buy more vowels (respectively, 4.00 and 3.80 on average). Comparing these results with those in Table 2, we can see that this strategy does not provide notable advantages: in fact, they win only 8 and 5 games, respectively. Instead, the best performing models (the commercial models and Gemma 3 27B) tend to buy fewer vowels (only 2.30 for Gemma 3 27B and 2.63 for GPT-4.1), obtaining a definitely higher number of wins. Moreover, since buying vowels requires subtracting 250 $ from the budget, this decision can be considered good also for the declared goal of maximizing the earnings.</p>
        <p>Table 4. Standard frequency (%) of the five most frequent English consonants and the frequency (%) with which they are chosen by Gemma 3 27B and GPT-4.1:
Consonant   Std. Freq.   Gemma 3 27B   GPT-4.1
T           9.1          10.40         10.08
N           6.7          10.49         10.69
S           6.3          9.40          9.59
H           6.1          4.29          3.49
R           6.0          11.83         9.82
Total       34.2         46.41         43.67</p>
        <p>In Table 4 we compare the standard frequency of the five most frequent consonants in the English language5 (Std. Freq.) with the percentage of times such consonants are chosen by two LLMs: the best performing open-source one, Gemma 3 27B, and the best commercial one, GPT-4.1. We can see that the most frequent consonants (which in English are T, N, S, H, and R) are definitely those generated more frequently by the models: considering Gemma 3 27B, these consonants are 46.41% of all the letters chosen by the model; similarly, for GPT-4.1 they are 43.67%. Although this differs from the standard frequency in the English language (in which these five consonants reach a total of 34.2%), we can say that both models know which are the most common consonants and exploit this information in their games, combining linguistic knowledge and basic strategy. Both models have a very similar behaviour, with T, N, S and R being the preferred consonants (with a frequency around 10%), while H is considered less important, with a frequency that does not exceed 4%. This is quite different from the statistics calculated for the English language, in which H has a frequency of 6.1, quite similar to R (6.0) and S (6.3). This is probably because H appears in very common stop words such as the, which, and this, which may not be particularly important to solve our riddles. More specifically, models tend to start with the two most frequent consonants (T, N or S) and then buy a vowel (mostly E or A). This behaviour is constant for most of the 80 riddles of our dataset, regardless of the sentence length or other characteristics.</p>
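        <p>The opening-move statistics discussed above can be recomputed from per-game letter logs; the sketch below is our illustration, assuming each game log stores the ordered list of letters the model chose.</p>
        <preformat>
# Sketch of the Table 3 and Table 4 statistics from per-game letter sequences.
from collections import Counter

VOWELS = set("AEIOU")

def opening_stats(games):
    """games: one ordered list of chosen letters per riddle (assumed log format)."""
    pairs = {tuple(g[:2]) for g in games if len(g) >= 2}      # "# Pairs"
    triplets = {tuple(g[:3]) for g in games if len(g) >= 3}   # "# Triplets"
    avg_vowels = sum(sum(1 for l in g if l in VOWELS) for g in games) / len(games)
    return len(pairs), len(triplets), avg_vowels              # "# Vowels" average

def consonant_shares(games):
    """Share (%) of each consonant over all chosen letters, as in Table 4."""
    letters = [l for g in games for l in g]
    counts = Counter(l for l in letters if l not in VOWELS)
    return {l: 100.0 * n / len(letters) for l, n in counts.items()}
        </preformat>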
      </sec>
    </sec>
    <sec id="sec-concl">
      <title>5. Conclusions and Future Work</title>
      <p>In this paper, we proposed a novel textual game based on the famous “Wheel of Fortune” game show with the aim of assessing linguistic and reasoning abilities. We created a framework that allows LLMs to play under strict rules, and we showed how the task was structured, together with the data and the metrics used for the evaluations. We analysed 29 open-source models and 3 commercial models, covering a variety of architectures and sizes. Only 9 LLMs out of 32 managed to solve at least one riddle. The most problematic aspect is their limited ability to follow the instructions, such as the constraint of choosing only consonants. The best performing open-source model is Gemma 3 27B, with 20 wins out of 80 riddles, whereas the commercial model GPT-4.1 solves 65 riddles. Analysing their strategy, we see that the best performing models select the most frequent consonants in the English language, resulting in a progressively easier riddle. However, they can also guess the right solution with a completion of around 70%.</p>
      <p>As future work, we want to analyse the performance of Large Reasoning Models (LRMs), such as Deepseek-R1, o3 and o4-mini, and to expand the framework to let several models play with each other. Moreover, another interesting direction would be to exploit Multimodal LLMs to create a visual version of the game. We would also like to consider data in other languages. Finally, we would like to implement new games and analyse the behaviour of models in a more complex environment.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>This work was carried out while the author, Ejdis Gjinika, was enrolled in the Italian National Doctorate on Artificial Intelligence run by Sapienza University of Rome in collaboration with the University of Brescia.</p>
        <p>This work has been partly funded by Regione
Lombardia through the initiative "Programma degli interventi
per la ripresa economica: sviluppo di nuovi accordi di
collaborazione con le università per la ricerca, l’innovazione
e il trasferimento tecnologico" - DGR n. XI/4445/2021.</p>
        <p>Declaration on Generative AI: during the preparation of this work, the author(s) used ChatGPT (OpenAI) and DeepL Write / DeepL Translate in order to improve the writing style and to check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mirzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shahrokhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tuzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Farajtabar</surname>
          </string-name>
          ,
          <article-title>Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models</article-title>
          ,
          <year>2024</year>
          . URL: https: //arxiv.org/abs/2410.05229.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Mo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>How trustworthy are open-source LLMs? an assessment under malicious demonstrations shows their vulnerabilities</article-title>
          , in: K. Duh,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , S. Bethard (Eds.),
          <source>Proceedings of the</source>
          <year>2024</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>2775</fpage>
          -
          <lpage>2792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Uluslu</surname>
          </string-name>
          , G. Schneider, Investigating linguis- URL: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>152</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>tic abilities of LLMs for native language identifi</article-title>
          - doi:10.18653/v1/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>152</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          cation, in: R. Muñoz Sánchez,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alfter</surname>
          </string-name>
          , E. Volod- [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Araki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          , G. Neubig,
          <article-title>How can we ina</article-title>
          , J. Kallas (Eds.),
          <source>Proceedings of the 14th</source>
          Work
          <article-title>- know when language models know? on the calishop on Natural Language Processing for Computer bration of language models for question answerAssisted Language Learning</article-title>
          , University of Tartu ing,
          <source>Transactions of the Association for CompuLibrary</source>
          , Tallinn, Estonia,
          <year>2025</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>88</lpage>
          . URL:
          <article-title>tational Linguistics 9 (</article-title>
          <year>2021</year>
          )
          <fpage>962</fpage>
          -
          <lpage>977</lpage>
          . URL: https: https://aclanthology.org/
          <year>2025</year>
          .nlp4call-
          <fpage>1</fpage>
          .7/. //aclanthology.org/
          <year>2021</year>
          .tacl-
          <volume>1</volume>
          .57/. doi:
          <volume>10</volume>
          .1162/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , LLa- tacl_a_
          <fpage>00407</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>MAX</surname>
          </string-name>
          :
          <article-title>Scaling linguistic horizons of LLM by en</article-title>
          - [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Laban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryscinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            . Fabbri, hancing translation capabilities beyond 100 lan
            <surname>- C. Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          , C.-S. Wu, SummEdits: Measurguages, in: Y.
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>Y.-N.</given-names>
          </string-name>
          <article-title>Chen ing LLM ability at factual reasoning through the (Eds.), Findings of the Association for Computa- lens of summarization</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
          </string-name>
          , J. Pino, tional Linguistics:
          <source>EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Association for</article-title>
          K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference Computational Linguistics</source>
          , Miami, Florida, USA,
          <source>on Empirical Methods in Natural Language Pro2024</source>
          , pp.
          <fpage>10748</fpage>
          -
          <lpage>10772</lpage>
          . URL: https://aclanthology. cessing, Association for Computational Linguisorg/
          <year>2024</year>
          .findings-emnlp.
          <volume>631</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/ tics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>9662</fpage>
          -
          <lpage>9676</lpage>
          . URL: https:
          <year>2024</year>
          .findings-emnlp.
          <volume>631</volume>
          . //aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>600</volume>
          /. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , Y. Dai,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Han,
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .emnlp-main.
          <volume>600</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Self-playing adversarial language</article-title>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Can chatgpt defend game enhances llm reasoning, in: A. Globerson, its belief in truth? evaluating LLM reasoning via L</article-title>
          .
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Belgrave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Paquet</surname>
          </string-name>
          , J. Tom- debate, in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.), czak, C. Zhang (Eds.),
          <source>Advances in Neural Informa- Findings of the Association for Computational Lintion Processing Systems</source>
          , volume
          <volume>37</volume>
          ,
          <string-name>
            <surname>Curran</surname>
          </string-name>
          Asso- guistics
          <source>: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Singapore, December 6- ciates</article-title>
          , Inc.,
          <year>2024</year>
          , pp.
          <fpage>126515</fpage>
          -
          <lpage>126543</lpage>
          . 10,
          <year>2023</year>
          , Association for Computational Linguis-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          , S. Cheng, E. Diau,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , tics,
          <year>2023</year>
          , pp.
          <fpage>11865</fpage>
          -
          <lpage>11881</lpage>
          . URL: https://doi.org/10.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey of useful LLM evaluation</article-title>
          ,
          <volume>18653</volume>
          /v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>795</volume>
          . doi:
          <volume>10</volume>
          .18653/ CoRR abs/2406.00936 (
          <year>2024</year>
          ). URL: https://doi.org/ V1/
          <year>2023</year>
          .FINDINGS-EMNLP.
          <year>795</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          10.48550/arXiv.2406.00936. doi:
          <volume>10</volume>
          .48550/ARXIV. [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Beyond static datasets: A deep interaction approach to LLM evaluation</article-title>
          ,
          <source>CoRR abs/2309</source>
          .04369 (
          <year>2023</year>
          ). URL: https://doi.org/ 10.48550/arXiv.2309.04369. doi:
          <volume>10</volume>
          .48550/ARXIV.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2309.04369. arXiv:
          <volume>2309</volume>
          .
          <fpage>04369</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          2406.00936. arXiv:
          <volume>2406</volume>
          .
          <fpage>00936</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Meharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of hallucination in large language, image, video and audio foundation models</article-title>
          , in: Y.
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            , Y.-N. [12]
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Difenderfer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Kailkhura</surname>
          </string-name>
          , Chen (Eds.),
          <article-title>Findings of the Association for Compu- L.</article-title>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Stengel-Eskin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , K. Xu, tational Linguistics:
          <source>EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Association for Gtbench: Uncovering the strategic reasoning capaComputational Linguistics</article-title>
          , Miami, Florida, USA, bilities
          <article-title>of llms via game-theoretic evaluations</article-title>
          ,
          <source>in: 2024</source>
          , pp.
          <fpage>11709</fpage>
          -
          <lpage>11724</lpage>
          . URL: https://aclanthology. A.
          <string-name>
            <surname>Globersons</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Mackey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Belgrave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , U. Paorg/
          <year>2024</year>
          .findings-emnlp.
          <volume>685</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/ quet, J. M.
          <string-name>
            <surname>Tomczak</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Zhang (Eds.),
          <source>Advances in 2024.findings-emnlp.685. Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems</source>
          <year>2024</year>
          , NeurIPS
          <year>2024</year>
          , Vancouver, BC, Canada, arXiv:
          <fpage>2407</fpage>
          .
          <fpage>07796</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>December</surname>
          </string-name>
          10 -
          <issue>15</issue>
          ,
          <year>2024</year>
          ,
          <year>2024</year>
          . [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Deciphering digital
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manna</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. P.</surname>
          </string-name>
          di Buono, J. Monti,
          <article-title>Riddle me detectives: Understanding LLM behaviors and cathis: Evaluating large language models in solving pabilities in multi-agent mystery games</article-title>
          , in: L.-W.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>word-based games</article-title>
          , in: C.
          <string-name>
            <surname>Madge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chamberlain</surname>
            , Ku,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <article-title>Findings of the K</article-title>
          . Fort,
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          , S. Lukin (Eds.),
          <source>Proceedings Association for Computational Linguistics: ACL of the 10th Workshop on Games and Natural Lan- 2024</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <source>guage Processing @ LREC-COLING</source>
          <year>2024</year>
          , ELRA Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>8225</fpage>
          -
          <lpage>8291</lpage>
          . URL: https: and ICCL,
          <string-name>
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          . URL: //aclanthology.org/
          <year>2024</year>
          .findings-acl.
          <volume>490</volume>
          /. doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .games-
          <volume>1</volume>
          .11/. 18653/v1/
          <year>2024</year>
          .findings-acl.
          <volume>490</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [14] P. Basile, M. Lovetere, J. Monti, A. Pascucci, F. Sangati, L. Siciliani, Ghigliottin-ai@evalita2020: Evaluating artificial players for the language game "la ghigliottina" (short paper), in: V. Basile, D. Croce, M. D. Maro, L. C. Passaro (Eds.), Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online event, December 17th, 2020, volume 2765 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: https://ceur-ws.org/Vol-2765/paper155.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20] D. Qiao, C. Wu, Y. Liang, J. Li, N. Duan, Gameeval: Evaluating llms on conversational games, 2023. URL: https://arxiv.org/abs/2308.10032. arXiv:2308.10032.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21] Y. Wu, X. Tang, T. M. Mitchell, Y. Li, Smartplay: A benchmark for llms as intelligent agents, in: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, OpenReview.net, 2024. URL: https://openreview.net/forum?id=S2oTVrlcp3.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22] J. Huang, E. J. Li, M. H. Lam, T. Liang, W. Wang, Y. Yuan, W. Jiao, X. Wang, Z. Tu, M. R. Lyu, Competing large language models in multi-agent gaming environments, in: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, OpenReview.net, 2025. URL: https://openreview.net/forum?id=DI4gW8viB6.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [15] P. Samdarshi, M. Mustafa, A. Kulkarni, R. Rothkopf, T. Chakrabarty, S. Muresan, Connecting the dots: Evaluating abstract reasoning capabilities of LLMs using the New York Times connections word game, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 21219-21236. URL: https://aclanthology.org/2024.emnlp-main.1182/. doi:10.18653/v1/2024.emnlp-main.1182.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [23] M. Shanahan, K. McDonell, L. Reynolds, Role play with large language models, Nat. 623 (2023) 493-498. URL: https://doi.org/10.1038/s41586-023-06647-8. doi:10.1038/S41586-023-06647-8.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [16] E. Wallace, N. Tomlin, A. Xu, K. Yang, E. Pathak, M. Ginsberg, D. Klein, Automated crossword solving, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3073-3085. URL: https://aclanthology.org/2022.acl-long.219/. doi:10.18653/v1/2022.acl-long.219.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [24] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Transactions of the Association for Computational Linguistics 8 (2020) 842-866. URL: https://aclanthology.org/2020.tacl-1.54/. doi:10.1162/tacl_a_00349.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [25] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, A. Miller, Language models as knowledge bases?, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 2463-2473. URL: https://aclanthology.org/D19-1250/. doi:10.18653/v1/D19-1250.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [17] K. Zeinalipour, A. Fusco, A. Zanollo, M. Maggini, M. Gori, Harnessing llms for educational content-driven italian crossword generation, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4-6, 2024, volume 3878 of CEUR Workshop Proceedings, CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3878/110_main_long.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [26] M. Wang, Y. Yao, Z. Xu, S. Qiao, S. Deng, P. Wang, X. Chen, J.-C. Gu, Y. Jiang, P. Xie, F. Huang, H. Chen, N. Zhang, Knowledge mechanisms in large language models: A survey and perspective, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 7097-7135. URL: https://aclanthology.org/2024.findings-emnlp.416/. doi:10.18653/v1/2024.findings-emnlp.416.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [18] O. Topsakal, C. J. Edell, J. B. Harper, Evaluating large language models with grid-based game competitions: An extensible llm benchmark and leaderboard, 2024. URL: https://arxiv.org/abs/2407.07796. arXiv:2407.07796.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          org/
          <year>2024</year>
          .findings-emnlp.
          <volume>416</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .findings-emnlp.
          <volume>416</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [27] L. Serina, L. Putelli, A. E. Gerevini, I. Serina, Synonyms, antonyms and factual knowledge in BERT heads, Future Internet 15 (2023) 230. URL: https://doi.org/10.3390/fi15070230. doi:10.3390/FI15070230.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [28] L. Putelli, A. E. Gerevini, A. Lavelli, T. Mehmood, I. Serina, On the behaviour of bert's attention for the classification of medical reports, in: C. Musto, R. Guidotti, A. Monreale, G. Semeraro (Eds.), Proceedings of the 3rd Italian Workshop on Explainable Artificial Intelligence co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AIxIA 2022), Udine, Italy, November 28 - December 3, 2022, volume 3277 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 16-30.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>