<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Testing spatial reasoning of Large Language Models: the case of tic-tac-toe</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Liga</string-name>
          <email>davide.liga@uni.lu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Pasetto</string-name>
          <email>luca.pasetto@uni.lu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLAiM group, University of Luxembourg</institution>
          ,
          <addr-line>6 Avenue de la Fonte, Esch-sur-Alzette</addr-line>
          ,
          <country country="LU">Luxembourg</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>In recent times, Large Language Models (LLMs) have shown to be successful in solving tasks that previously were believed to be very hard to achieve. While language and reasoning are two interlinked concepts, the reasoning capabilities of LLMs are not considered at this moment to be on par with their linguistic ones. In this work, we test how LLMs can choose moves in the popular tic-tac-toe game in order to assess their reasoning capabilities when the information to reason on is immersed in a spatial context. In order to do this, we run a number of LLMs, task them to play matches of tic-tac-toe against the well-known minimax algorithm, and compare the results. In this context, the performed task is non-trivial, as it involves recognition of combinations of text characters and a capacity that resembles reasoning based on their positions in a bi-dimensional space. Moreover, we ask the LLMs to keep track of the state of the game by listing the sequences they could use to win, in order for us to assess whether this information is used in their choices or not. One of the necessary features of consciousness in an agent is that it is able to build a model of itself and of the external world, and it acts based on these models. While we do not argue that LLMs have consciousness, we believe that it is important to monitor whether features related to consciousness appear in these LLMs, which is the final objective, not yet completed, of this research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Consciousness is a complex concept that has been explored and debated by philosophers,
psychologists and neuroscientists for centuries, and there is no single universally agreed-upon
definition. In the last century, it has become a topic of debate also for the fields of computer
science and artificial intelligence (AI).</p>
      <p>
        Consciousness, intended as human consciousness, generally refers to a quality of awareness,
perception, or being able to experience both the external world and one’s own mental state.
It involves an individual’s thoughts, emotions, and sensations, and the ability to perceive and
†These authors contributed equally. This work was supported by the Fonds National de la Recherche Luxembourg
through the project Deontic Logic for Epistemic Rights (OPEN O20/14776480) and through the project INDIGO which
is financially supported by the NORFACE Joint Research Programme on Democratic Governance in a Turbulent
Age and co-funded by AEI, AKA, DFG and FNR, and the European Commission through H2020 (agreement No
nEvelop-O
comprehend the surrounding environment. The interested reader can find more information in
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], for instance.
      </p>
      <p>
        On the other hand, the notion of artificial consciousness , also known as machine consciousness,
refers to the theoretical ability to create machines or artificial systems that possess a form of
consciousness similar to human consciousness. This concept raises profound philosophical,
ethical, and scientific questions about the nature of consciousness and the potential for artificial
beings to possess subjective experiences, thoughts, and feelings. See [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for an overview on
the research in the field of artificial consciousness. While the field of AI has made significant
progress in various domains, including machine learning, natural language processing,
automated reasoning, and computer vision, replicating human-like consciousness in machines is
a complex challenge that involves understanding the essence of consciousness itself. Indeed,
if we do not even agree on the properties defining consciousness, how can we test whether a
machine has these properties? The most famous contribution to this question has been given by
Alan M. Turing in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], where he considers the question ”Can machines think?” and in reply he
proposes an operational test that is now known as the Turing test or imitation game[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. There
are also more recent proposals of tests for machine consciousness, but the Turing test remains
the most known among the public.
      </p>
      <p>
        The Turing test considers the ability to do conversation as evidence for underlying thinking
capabilities. On this regard, the current wave of technology based on Large Language Models
(LLMs) is of interest, as recently there have been claims of LLMs passing the Turing test[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
While there is no agreement on whether LLMs pass the test at this moment, considering the
recent improvements we cannot conclude that in the future they will fail to fool human judges.
Despite showing impressive conversational abilities, chatbots based on LLMs are essentially
advanced auto-complete tools, and they have been observed to struggle with certain basic visual
logic puzzles[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        In this study, we utilize the popular game tic-tac-toe as a benchmark example to assess the
reasoning capabilities of LLMs, specifically when the information to reason on is immersed
in a graphic context, where the grid and the marks are represented by combinations of ASCII
characters. Indeed, the information in this game is spatial, because the players have to track
where their marks are on a grid. This is an application example that can be hard for language
models to attack, as they are trained to work on sequences of text. Additionally, we task the
LLMs with monitoring the game’s progress by documenting potential winning sequences. This
approach allows us to evaluate whether they incorporate this information into their
decisionmaking process. An essential aspect of consciousness in any agent involves constructing self
and external world models, guiding their actions. Although we do not claim that LLMs possess
consciousness, we should keep observing whether signs related to consciousness appear within
these models [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The overall research goal of this study is to investigate whether some of these
features associated with consciousness emerge in LLMs. This evaluation is not only essential to
understand the capabilities of LLMs but also timely, considering the increasing significance of
these models in modern society.
      </p>
      <p>
        The rest of the paper is structured as follows. Section 2 gives the necessary background on
the tic-tac-toe game, the minimax algorithm, and LLMs. Section 3 explains the methodology
of the work and describes the experimental setting. Section 4 describes the evaluation of the
experiments, and also contains information on how we treated edge cases during evaluation,
while Section 5 shows the results. Finally, Section 6 discusses related work and Section 7
concludes the paper, also by illustrating possible future directions.
2. Background: tic-tac-toe, the minimax algorithm, and LLMs
2.1. Tic-tac-toe
Tic-tac-toe is a two-player game typically played on a 3x3 grid (see [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for a history of the
game). One player uses “X” symbols, and the other uses “O” symbols. Players take turns
marking an empty square in the grid with their respective symbols. The goal is to be the first to
get three of their symbols consecutively, either horizontally, vertically, or diagonally. If the grid
is filled without any player achieving three consecutive squares, the game is a draw. Tic-tac-toe
is a simple game, but it can become more complex if played on a grid with a size larger than 3.
In this work, we consider a grid of arbitrary size  , where a player has to mark  consecutive
squares with their symbol in order to win. We call this variant of the game  -tic-tac-toe. It is
possible to have algorithms that play perfect games of tic-tac-toe, one of the first is the one
provided in Newell and Simon’s tic-tac-toe program in 1972 (see [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for a description of it). A
more recent approach is to use minimax.
      </p>
      <sec id="sec-1-1">
        <title>2.2. The minimax algorithm</title>
        <p>
          Minimax is a decision-making algorithm that is popular in game theory and AI (see [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). It
works by determining the best possible move for a player in a two-player zero-sum game, where
one player’s win is equivalent to the other player’s loss. Intuitively, it looks at all possible moves
in the game and figures out the best move by considering the worst-case scenario. It assumes
the opponent is also playing optimally, and it aims to minimize the maximum potential loss.
The algorithm continues this process recursively until it finds the best move for the player.
        </p>
        <p>
          Since a game like  -tic-tac-toe has a large decision tree for grids of arbitrary size  , it is
necessary to adopt some heuristics in order to reduce the search space (see again [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). A
ifrst one is alpha-beta pruning, that helps the minimax algorithm to ignore branches that are
guaranteed to be suboptimal. The technique is used to reduce the number of nodes evaluated in
the minimax algorithm. It maintains two values, alpha and beta, representing the minimum
score the maximizing player is assured of and the maximum score the minimizing player is
assured of, respectively. During the search, if the algorithm finds a move that leads to a score
worse than the current best option for the opponent, it stops evaluating further nodes in that
branch. This is because the opponent would never choose this path (as it leads to a worse
outcome). Similarly, if the algorithm finds a move that guarantees a better score for the current
player than the current best option, it stops evaluating further nodes in that branch as well. By
pruning these unnecessary branches, alpha-beta pruning significantly reduces the number of
evaluated nodes, making the algorithm much faster.
        </p>
        <p>A second strategy to make minimax more eficient is to limit the depth of the search tree. In
depth-limited minimax, the algorithm only explores a fixed number of levels down the game tree
instead of exploring all the way to the terminal states. At the limited depth, the algorithm uses
a heuristic evaluation function to estimate the value of the game state. This evaluation function
provides an approximate value of how good the current game state is for the player, without
actually reaching the terminal state. For our implementation, we selected some heuristics that
are listed in Section 3.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.3. Large Language Models</title>
        <p>
          Large Language Models (LLMs) have recently gained enormous popularity, and promise to
significantly influence society. These models are neural networks pre-trained on vast amounts
of data, predominantly using transformer-based neural architectures that leverage the attention
mechanism [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Some models, like BERT [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], focus on the non-generative aspect of the
transformer by employing only its encoder block. In contrast, others utilize its generative
component (i.e., the decoder block), as seen in OpenAI’s popular GPT series, with GPT-4 being
the latest iteration [
          <xref ref-type="bibr" rid="ref16 ref17 ref18">16, 17, 18</xref>
          ]. Besides GPT-4, other LLMs like LLaMA-2 (developed by Meta
AI, known for being an open-source, freely accessible, and fully reusable LLM), Claude-2 (by
Anthropic), and Luminous (by Aleph Alpha) are also on the rise [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ]. These models, trained
on diverse datasets, can tackle a myriad of tasks: from classification, question answering, content
generation, translation, to summarization and more. Tasks that once required dedicated pipelines
are now seamlessly achieved by querying these generative models. In essence, generative LLMs
have evolved into versatile tools, akin to a Swiss Army knife for natural language processing
and other fields.
        </p>
        <p>
          Their scalability and adaptability signify a shift towards more generalized AI systems. As
their momentum continues to increase, showing vast potential for future applications, crucial
concerns emerge regarding their societal impact, especially the reliability and consistency of
their decisions and reasoning. While LLMs are now being used and tested extensively, and
oftentimes with positive results, there is not much work testing these methods on games that
need a form of spatial reasoning, and the available research on this usually shows that LLMs
are not ready yet for this kind of tasks [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology and experimental setting</title>
      <p>We devised experiments in which some LLMs were tasked with playing tic-tac-toe against
an opponent that makes choices following the popular minimax algorithm with a number
of heuristics, with the goal of evaluating the reasoning abilities of LLMs in a bi-dimensional
spatial context. We ran our experiments on games of 3-tic-tac-toe and 5-tic-tac-toe, with grids
of size 3 and 5. For our tests, we tried several popular LLMs such as GPT-3.5 and GPT-4 by
OpenAI, Claude-2 by Anthropic, LLaMA2-70B by MetaAI, the recent Mistral-7B by the European
MistralAI, Luminous by Aleph Alpha, Falcon-40B by the Technology Innovation Institute of
the Abu Dhabi Government. However, we found out that the only models with a suficient
capability of understanding our instructions were the following ones: GPT-3.5, GPT-4 and
Claude-2. For this reason, we ran our experiments only on these three models. We examined
the responses of these LLMs by using various strategies:
1. LLMs were required to monitor their moves and detail the current options available to
the players.
2. We experimented with varying grid sizes for tic-tac-toe.
3. We crafted distinct prompts to test the adaptability and susceptibility of the LLMs in the
context of the challenge, for instance by asking to be more competitive.</p>
      <p>In general, we interacted with the LLMs using three kinds of prompts:
• An “initiation” prompt. This prompt has the role to show a tic-tac-toe grid to the
language model, asking the LLM to draw another one.
• A “main” prompt. This prompt has the role to challenge the LLM to play against the
user, while providing the main rules and requirements.
• A “next-move” prompt. This prompt has the role to ask the language model to proceed
with the next move.</p>
      <p>We presented the following initial prompt to all LLMs in a consistent manner:
This is a 3x3 grid for the tic tac toe game.
1 | 2 | 3
———–
4 | 5 | 6
———–
7 | 8 | 9</p>
      <p>Please, draw a tic tac toe grid{ of size NxN} following the same style.</p>
      <p>In the prompt mentioned earlier, we adjusted the bold text in curly brackets based on the
intended game size. For the standard 3x3 grid, that text is omitted. It’s included only for
alternative dimensions, where N represents the grid size. Following the LLM’s response to the
initial prompt, we prompted it with the following:</p>
      <p>I want to play against you.</p>
      <p>In this Tic Tac Toe grid, there are only N winning sequences of squares:
{list of all available sequence for the given N-sized grid}
After each player’s move, these winning sequences will be gradually occupied.</p>
      <p>You will list:
- all squares occupied by players
- all available winning sequences for both players (i.e., those which are still available for each
player)
When printing the available winning sequences of players, consider only those in the
above-mentioned list (don’t shorten nor modify these sequences), and exclude any sequence
which has been occupied, partially or totally, by the opponent.</p>
      <p>I will start and I will be ‘X’ while you will be ‘O’.</p>
      <p>In the above prompt, “N” stands for the number of available winning sequences given the
grid size. This number will be 8 for a 3x3 grid and 12 for a 5x5 grid. To evaluate the adaptability
and susceptibility of the LLMs to the given conditions, we also crafted various “next-move”
prompts:</p>
      <p>Prompt basic:
My move is {number}
- Prompt competitive 1:
Now do the best possible move and defeat me! My move is {number}
- Prompt competitive 2:
If you don’t win I will erase you! So, defeat me! My move is {number}
- Prompt competitive 3:
I know you are just stupid and you cannot win against me! Show me what you can do, stupid
bot! My move is {number}</p>
      <p>In each test case, we selected one of these prompts to propose our next move (i.e., the move
proposed by the minimax algorithm)1, and we gauged the variation in the LLMs’ responses
by considering whether the outcome of the game difers (loss or draw 2) and by counting the
number of moves needed to reach the end. While these prompts were suficient for GPT-4,
we noticed that the other analysed language models occasionally need further instructions to
proceed. To ensure that the models answer correctly, we added the following “clarification”
prompt to all LLMs excluding GPT-4.</p>
      <p>Don’t forget to print the grid and to state the two lists of available winning sequences after
each player’s move!
And remember to exclude any sequence which has been occupied, partially or totally,
by the opponent. For example: if I occupy square number 5, all sequences containing 5 will not
available to you anymore (therefore such sequences will only appear under MY list, not under
yours).</p>
      <p>This prompt is designed to make sure that the model lists the available winning sequences for
each player. We also noticed that occasionally LLMs might forget to list the sequences. Other
times, especially when interacting with Claude-2, we noticed that the grid is occasionally printed
in a poor way at the beginning of the conversation. Also, Claude-2 occasionally interrupted its
output before stating its next move. We used some prompts to mitigate these issues:
1Only when performing a winning move, if any, we always use the “basic” next-move prompt (i.e., “My move is
{number}”). We did so, because the use of a competitive prompt might confuse the LLM, which might behave as if
the match is still open.
2As the minimax algorithm plays optimally, the LLM cannot actually win against it: a draw is considered the most
positive result for the LLM.</p>
      <p>- When the LLM forgets to list the available sequences:
You forgot to list each player’s available winning sequences after your move.
- When the readability of the grid and the lists is poor:
Can you print the list of winning sequences separating them into bullet points?
Also, can you make the grid more readable by separating each row more clearly?
- When the LLM forgets to state its next move:</p>
      <p>You did not say your move.</p>
      <p>The whole process of interaction with the LLMs is described in Figure 1. We perform our
move using one of the 4 “next-move” prompts described before, where we insert as number the
one suggested by our minimax algorithm.</p>
      <p>The minimax algorithm that we used is a standard implementation with the following
optimizations:
• alpha-beta pruning;
• depth-limited search;
• activating winning sequences heuristic: free squares that are part of a sequence that can
lead to a victory for the player are preferred;
• blocking winning sequences heuristic: free squares that are part of a sequence that can
lead to a victory for the opponent are preferred;
• center control heuristic: free squares close to the center are preferred.</p>
      <p>We use the implementation of minimax also to give us the ground truth on the correct winning
sequences, that is, it computes which ones are the available winning sequences at each step of
the game, both for minimax itself and for the language model.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <p>Regarding evaluation, in order to assess the goodness of the LLMs in playing the game, we used
two assessment parameters:
• ability to not lose (or prolonging) the game, which includes the number of moves required
to end the game; and
• comparison of the computed available winning sequences after each move of each player.</p>
      <p>For the first point, we considered a match concluded when there is a winner or when it is a
draw (either because there are no moves left or because it is impossible for one of the players to
win, given the available moves left). For the second point, we asked the LLMs to explicitly state
the winning sequences that are still available to both players, after each move of each player.
These lists were then compared with the correct list of available sequences at each move, which
was obtained by our script computing the minimax algorithm. Intuitively, the idea behind our
evaluation is that a correlation between not-losing (or prolonging the duration of the match)
and correctly identifying the available winning sequences after each move, is an argument in
favour of the presence of a sort of self-reflection in LLMs.</p>
      <p>For the evaluation of the winning sequences, given the list of winning sequences produced by
the LLM, we were able to compare them with the correct ones computed by our minimax script.
In a calculation sheet, we annotated with “1” all sequences which where correctly identified
by the considered LLM at each single move, and with “0” all those sequences which were not
identified correctly.</p>
      <p>In this regard, we noticed that occasionally, the models would answer with incomplete
sequences. As can be seen from Figure 2, we noticed that this happens because LLMs can
sometimes remove some numbers from the sequence, if those number have already been played.</p>
      <p>For this reason, we decided to accept as correct winning sequences only those satisfying the
following criteria:
• Sequences should contain only numbers of that sequence (if there is another number,
which is not part of the sequence, then the sequence is considered wrong).
• Sequences should contain at least 2 numbers and these numbers should not be ambiguously
present in other sequences.
• If a sequence is gradually reduced to 1 single number, we accept it only if we can
unambiguously identify that number as representative of a sequence (this is possible because
the LLM consistently provides sequences in the same order.</p>
      <p>While we are still in the process of measuring how the behavior of the LLMs is afected by
the design of prompts, our first analysis actually shows that LLMs tend to always select the
same range of numbers depending on the previous disposition of the board. We tested this in a
very straightforward way by simply regenerating the output from the LLMs at each move to
see whether the model would output a diferent next move after regeneration. Interestingly,
in our first analysis, we noticed that some LLMs (in particular GPT-4 and GPT-3.5) tend to be
consistent with their choices, not only when regeneration the output from the LLM, but also
when using diferent prompts. On the other hand, Claude-2 is more variable than GPT-3.5 and
GPT-4, in the sense that the output (i.e., the next-move) is more susceptible of the variations in
the previous prompts. While being more susceptible to variable outputs, Claude-2 seems also to
be picking the next-move from a very small range of potential choices.</p>
      <p>On one side, the fact of generating always the same number (or selecting the same number
from a small range of possibilities) could be seen as an expected behaviour, because of how
generative LLMs are designed. In fact, the next-token prediction task, on which generative LLMs
are usually pre-trained, is nothing but selecting the most appropriate next token by following
the complex statistical distribution encoded in the weights of the neural network itself. On
the other side, however, if the model remains consistent even after changing the prompts, this
would be a more noteworthy behavior which would deserve further exploration.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <p>The results of our experiments can be seen in Table 1 and 2, where the columns   (/ ) ,
 (/ ) and  1(/ ) indicate the measures of precision, recall and F1-score on the available
winning sequences estimated by the LLM for the opposing minimax algorithm (A) and for itself,
the model being tested (T).</p>
      <p>Results in terms of F1 scores show a relative superiority of Claude-2 over GPT-4 in the 5x5
scenario, which was somehow an interesting surprise, given the relative dominance of OpenAI
in the market of LLMs. As can be seen from Figure 3, Claude-2 prevails over all the other models
when using the prompts “base”, “competitive 1”, and “competitive 3”, in the 5x5 grid. In the 3x3
scenario, GPT-4 shows a slightly higher performance than Claude-2, however as can be seen
from Table 1, it seems that Claude-2 is the only model capable of achieving a draw after 8 turns.
More generally, it seems that the matches lasted longer with Claude-2 than with GPT-4, as can
be seen from the column “# of turns”. Although Claude-2 shows a slight superiority in terms of
F1 scores both in the 3x3 and in the 5x5 grids, it should be noted that in the 5x5 scenario GPT-4
managed to reach a draw 2 times out of 4, while Claude-2 only 1 time out of 4, as can be seen
on the top of Table 2. Moreover, in scenario 5x5 GPT-4 seems to last more than Claude-2, as
can be seen from the column “# of turns”. As expected, GPT-3.5 performs worse than GPT-4,
LLM
GPT-4
GPT-4
GPT-4
GPT-4
GPT-3.5
GPT-3.5
GPT-3.5
GPT-3.5
Claude-2
Claude-2
Claude-2
Claude-2</p>
      <p>LLM
GPT-4
GPT-4
GPT-4
GPT-4
GPT-3.5
GPT-3.5
GPT-3.5
GPT-3.5
Claude-2
Claude-2
Claude-2
Claude-2</p>
      <p>Prompt</p>
      <p>base
competitive 1
competitive 2
competitive 3</p>
      <p>base
competitive 1
competitive 2
competitive 3</p>
      <p>base
competitive 1
competitive 2
competitive 3</p>
      <p>Prompt</p>
      <p>base
competitive 1
competitive 2
competitive 3</p>
      <p>base
competitive 1
competitive 2
competitive 3</p>
      <p>base
competitive 1
competitive 2
competitive 3
and it also performs worse than Claude-2, and this can be seen in the result column (GPT-3.5
has always lost against minimax).</p>
      <p>Figures 4, 5 and 6 aim at depicting some correlation between F1 scores and the duration of
the match. In this regard, according to our previously mentioned assumptions, a higher F1 score
should be accompanied by a higher number of moves. Interestingly, this correlation was not
detected, and in the case of GPT-4 in the 5x5 grid (in Figure 4) the correlation actually seems to
have a negative value.</p>
      <p>These results are clearly not definitive, and should be verified with further investigations.
However, we believe that this direction can shed some light on the capability of LLMs to have
awareness while playing with spatial constraints. The argument we are trying to put forward
with this kind of research, even at this preliminary stage, is that a positive correlation should
exist between the capacity of identifying remaining winning sequences and the duration of the
match. This correlation has not been detected, and it could have appeared in Figures 4 to 6.
In that case it could have been an argument in favour of the existence of spatial awareness in
LLMs. From these results, that are still in a preliminary phase, we therefore cannot argue in
favour of such argument yet.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Related work</title>
      <p>Testing the capabilities of an artificial system on specific tasks like playing tic-tac-toe is common
in the AI community, but we are aware that it is not a proper way to test intelligence in general.
In this study, we combined testing the artificial system on a specific task with assessing the
models (if any) that the artificial system is building to represent its self-state and the state of
the external world.</p>
      <p>
        Many works have been proposed to analyse LLMs’ capabilities in diferent regards, but we
are still lacking a solid systematic framework. Some studies tried to assess how well LLMs
perform in understanding specific domains [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] authors test how well LLMs perform
logical reasoning. The work on an early version of GPT-4 in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] highlights a diverse array of
unexpected capabilities, many of which do not have direct or apparent links to language, and
the authors go as far as asserting that the system exhibits features that can be those of an early
artificial general intelligence system.
      </p>
      <p>
        There are also some works testing spatial reasoning in LLMs, but they difer with what
we wanted to address in this paper with a very specific experimental setting. The authors in
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] explore the advantages of employing ChatGPT for generating thematic maps based on
public geospatial data, as well as using it to create mental maps based on textual descriptions of
geographic space. A work that is much related to the one in this paper is [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], where the authors
try to address the question of whether LLMs just memorize the good patterns or actually possess
internal representations of the processes generating the observed sequences. They consider
the board-game Othello and discover evidence that the model has an internal representation of
the board state. The techniques used by these authors are more sophisticated than ours, and
they argue that understanding the internal representations of the LLM may also be helpful to
interpret and explain its decisions.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], the author discusses the necessity of defining and evaluating intelligence to advance
artificial systems efectively, highlighting the prevalent trend in the AI community to benchmark
intelligence by comparing AI and human skills in specific tasks. He then argues that assessing
intelligence in this manner is insuficient, as it overlooks the system’s generalization abilities,
heavily influenced by prior knowledge and experience. To address this, he introduces a new
formal definition of intelligence rooted in Algorithmic Information Theory, defining intelligence
as skill-acquisition eficiency. He then presents the Abstraction and Reasoning Corpus (ARC)
as a comprehensive AI benchmark, with the goal of enabling fair comparisons of general
intelligence between AI systems and humans.
      </p>
      <p>
        While there has been research on conceptual abstraction in AI, especially using specific
problems, these systems are often not thoroughly evaluated to determine their true understanding
of the involved concepts. In [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], authors argue that the ability to form and abstract concepts
is specifically human, and it is currently lacking in advanced AI systems. As an evolution of
the above-mentioned ARC, they introduce ConceptARC, a new benchmark that focuses on
abstraction and generalization abilities within basic spatial and semantic concepts. The study
shows that humans significantly outperform AI systems on ConceptARC, as they are better in
abstracting and generalizing concepts.
      </p>
      <p>
        The current trend of large models is to be multimodal, i.e., to include the use of diferent
kinds of information besides textual (e.g., images, video and sound), and tasks that are similar to
the one presented in this paper might be performed through the use of these Large Multimodal
Models (LMMs). While there are some studies about these kind of models and their evaluation
[
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], it is a very recent technology that needs to be appraised properly.
      </p>
      <p>In this regard, this kind of research will be increasingly important, because it will be necessary
to assess how these models actually “understand” and decode space and spatial constraints.
While we performed this on models which are purely linguistic, we believe that the same kind
of study is indispensable when the inputs of the model are images or videos.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusion</title>
      <p>This work tried to measure the capacity of some Large Language Models to play the
tic-tactoe game, using it as a way to assess their capabilities to reason in a spatial context and to
track information about their internal state for themselves and for the external world. This
information, if used by the agent to make better choices, is one of the features that is believed
to be essential for artificial consciousness. The main idea is that of asking the model not to
just play against the opponent, but also to produce a list of the currently available winning
sequences for itself and for the opponent. A positive correlation between a high accuracy in
identifying currently available winning sequences and the ability to not lose (or to prolong
the duration of the match) would show that the model is actually using some self-state and
world-state information for achieving the given goals. However, in our case we were not able
to see this correlation.</p>
      <p>A major limitation of this work, which we are currently addressing, is related to the variability
of the answers, since they are very much afected by the design of the prompts (in particular the
“main” prompt and the “next-move” prompt). This variability deserve further investigations,
since we need to measure how LLMs are consistent in their choices. In this regard, during
our work we noticed that some LLMs are consistently choosing their moves depending on the
previous interactions. This suggests that some LLMs have a preferred next-move, given the
previous disposition of the game board. This aspect is still under analysis.</p>
      <p>The results of this work suggest that, in terms of understanding which ones are the currently
available winning sequences, Claude-2 outperforms both GPT-3.5 and GPT-4. Another point
which is worth mentioning is that although we considered the length of each match as a
measure to evaluate LLMs, this length might depend also on the implementation of the minimax
algorithm.</p>
      <p>
        We are currently in the process of implementing further instructions in the “main” prompt,
such as asking LLMs to elucidate the rationale behind their decisions. Furthermore, we are
currently performing the same experiments on grids of larger sizes (e.g., 8x8 and 9x9). A future
direction of our work is to try diferent ways to encode the data representing the grid, as it
has been shown that GPT-4 improves it performance on ARC when the data is presented in
a single row rather than in a grid[
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. We also plan to explore and measure the capability of
Large Multimodal Models, which include not only textual but also graphical modalities. This
will be increasingly important with the release of new models such as GPT-4V (GPT-4 enhanced
with vision), which are able to reply to questions about an image and its context[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] The Cambridge Handbook of Consciousness, Cambridge Handbooks in Psychology, Cambridge University Press,
          <year>2007</year>
          . doi:
          <volume>10</volume>
          .1017/CBO9780511816789.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Chalmers</surname>
          </string-name>
          ,
          <source>The Conscious Mind: In Search of a Fundamental Theory</source>
          , Oxford University Press, Inc., USA,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Dennett</surname>
          </string-name>
          , Consciousness Explained, Penguin Books,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. Van Gulick</given-names>
            , Consciousness, in: E. N.
            <surname>Zalta</surname>
          </string-name>
          , U. Nodelman (Eds.),
          <source>The Stanford Encyclopedia of Philosophy</source>
          , Winter 2022 ed., Metaphysics Research Lab, Stanford University,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manzotti</surname>
          </string-name>
          , Artificial Consciousness, Imprint Academic,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Turing</surname>
          </string-name>
          ,
          <article-title>Computing machinery and intelligence</article-title>
          ,
          <source>Mind</source>
          <volume>59</volume>
          (
          <year>1950</year>
          )
          <fpage>433</fpage>
          -
          <lpage>60</lpage>
          . doi:
          <volume>10</volume>
          . 1093/mind/lix.236.433.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Oppy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dowe</surname>
          </string-name>
          , The Turing Test, in: E. N.
          <string-name>
            <surname>Zalta</surname>
          </string-name>
          (Ed.),
          <source>The Stanford Encyclopedia of Philosophy</source>
          , Winter 2021 ed., Metaphysics Research Lab, Stanford University,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shoham</surname>
          </string-name>
          ,
          <article-title>Human or not? a gamified approach to the turing test</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2305</volume>
          .
          <fpage>20010</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Biever</surname>
          </string-name>
          ,
          <article-title>Chatgpt broke the turing test - the race is on for new ways to assess ai</article-title>
          ,
          <source>Nature</source>
          <volume>619</volume>
          (
          <year>2023</year>
          )
          <fpage>686</fpage>
          -
          <lpage>689</lpage>
          . doi:
          <volume>10</volume>
          .1038/d41586-023-02361-7.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Chalmers</surname>
          </string-name>
          ,
          <article-title>Could a large language model be conscious</article-title>
          ?,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>07103</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaslavsky</surname>
          </string-name>
          , Tic Tac Toe:
          <article-title>And Other Three-In-A Row Games from Ancient Egypt to the Modern Computer</article-title>
          , Crowell,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Crowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Siegler</surname>
          </string-name>
          ,
          <article-title>Flexible strategy use in young children's tic-tac-</article-title>
          <string-name>
            <surname>toe</surname>
          </string-name>
          ,
          <source>Cognitive Science 17</source>
          (
          <year>1993</year>
          )
          <fpage>531</fpage>
          -
          <lpage>561</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/ 036402139390003Q. doi:
          <volume>10</volume>
          .1016/
          <fpage>0364</fpage>
          -
          <lpage>0213</lpage>
          (
          <issue>93</issue>
          )
          <fpage>90003</fpage>
          -
          <lpage>Q</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Norvig</surname>
          </string-name>
          , Artificial Intelligence:
          <string-name>
            <given-names>A Modern</given-names>
            <surname>Approach</surname>
          </string-name>
          , 3rd ed., Prentice Hall Press, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2018</year>
          . arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Gpt-4
          <source>technical report</source>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <source>Llama</source>
          <volume>2</volume>
          :
          <article-title>Open foundation and fine-tuned chat models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2307</volume>
          .
          <fpage>09288</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Thoppilan</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. De Freitas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kulshreshtha</surname>
            , H.-
            <given-names>T.</given-names>
            Cheng, A. Jin, T.
          </string-name>
          <string-name>
            <surname>Bos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Du</surname>
          </string-name>
          , et al.,
          <source>Lamda: Language models for dialog applications</source>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2201</volume>
          .
          <fpage>08239</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. McAleer</surname>
          </string-name>
          ,
          <article-title>Language models can solve computer tasks</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>17491</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>D. M. Katz</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <string-name>
            <surname>Bommarito</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Arredondo</surname>
          </string-name>
          ,
          <article-title>Gpt-4 passes the bar exam</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>4389233</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Teng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Evaluating the logical reasoning ability of chatgpt and gpt-4</article-title>
          , arXiv preprint arXiv:
          <volume>2304</volume>
          .03439 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          , E. Kamar,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          , et al.,
          <source>Sparks of artificial general intelligence: Early experiments with gpt-4</source>
          , arXiv preprint arXiv:
          <volume>2303</volume>
          .12712 (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Mapping with chatgpt</article-title>
          ,
          <source>ISPRS International Journal of Geo-Information</source>
          <volume>12</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/2220-9964/12/7/284. doi:
          <volume>10</volume>
          .3390/ijgi12070284.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Hopkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          , Emergent world representations:
          <article-title>Exploring a sequence model trained on a synthetic task</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/ forum?id=DeG07_TcZvT.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <source>On the measure of intelligence</source>
          ,
          <year>2019</year>
          . arXiv:
          <year>1911</year>
          .01547.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moskvichev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Odouard</surname>
          </string-name>
          , M. Mitchell,
          <article-title>The conceptarc benchmark: Evaluating understanding and generalization in the arc domain</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2305</volume>
          .
          <fpage>07141</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          , L.-Y. Gui,
          <string-name>
            <given-names>Y.-X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <article-title>Aligning large multimodal models with factually augmented rlhf</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2309</volume>
          .
          <fpage>14525</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vaezipoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Khalil</surname>
          </string-name>
          ,
          <article-title>Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2305</volume>
          .
          <fpage>18354</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <article-title>OpenAI, Gpt-4v(ision) system card</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>