<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Games⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Doukeris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mike Preuss</string-name>
          <email>m.preuss@liacs.leidenuniv.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulio Barbero</string-name>
          <email>g.barbero@liacs.leidenuniv.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden University</institution>
          ,
          <addr-line>Rapenburg 70, 2311 EZ Leiden, South Holland</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This study examines how artificial intelligence (AI) can be used to create stories in text-based video games and how players react to them compared to stories written by humans. Inspired by the Turing test, two experiments are conducted: one to explore players' preferences and ability to distinguish between the two types of stories, and another to study how the complexity of the text affects their judgment. Results suggest that, while many participants struggle to identify the story's author, simpler texts are more often seen as written by humans. Moreover, younger participants are more accepting of AI-generated narratives. The research highlights the potential for AI in storytelling while noting current limitations in AI's creative abilities. Additionally, it suggests that the use of AI in creative and entertainment fields could increase as the technology improves.</p>
      </abstract>
      <kwd-group>
        <kwd>generative AI</kwd>
        <kwd>text-based games</kwd>
        <kwd>game AI</kwd>
        <kwd>creative intelligence</kwd>
        <kwd>video games</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, the advent of foundation models has progressively transformed AI into an everyday
tool [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A field particularly affected by this development is the gaming industry. AI models are used in
many areas within the field, including Procedural Content Generation (PCG), Non-Player Character
(NPC) behavior, enhanced personalization, and others [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this study, we research how artificial
intelligence can be incorporated into the narrative design process of text-based games and story-based
games and what the effects are on users’ experience. In particular, we investigate two main research
questions:
1. How does an AI-generated narrative in a text-based game compare with a
human-generated one? Would players be able to distinguish between the two? In order to find an
answer, we explore and apply our method, which is strongly influenced by the Turing test [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Specifically, we develop the following sub-research questions to focus our investigation:
• What parts of the story do humans focus on to detect if it is generated by humans
or AI? Previous research shows that humans fail to reliably identify AI-generated stories
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Nevertheless, numerous detection strategies have been developed. We identify these strategies and test
whether accounting for them can make AI-generated text more human-like.
• How can AI-generated stories be made more human-like? As mentioned above, in this research,
we first aim to test and investigate the human ability to detect generated content.
Subsequently, we apply this knowledge to text generation and purposefully attempt to create
content that is more likely to be identified as human.
2. Would the player like a story generated by an AI? Previous research suggests that humans
tend to be biased against AI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While we plan to test this hypothesis, we aim to analyze it
further, exploring variations based on participants’ characteristics. Furthermore, we raise the
following sub-research questions:
• How good are humans at distinguishing whether a story is generated by a human or
by AI? As a corollary of the research questions above, we collect further data about human
effectiveness in identifying AI-generated text.
• Would a human “like” the idea of an AI narrator? As mentioned earlier, research has
shown that humans are not comfortable with the idea of AI in their everyday lives. However,
we argue that this opinion might change based on individual characteristics (e.g., age) and
the context of application (in our case, entertainment).
      </p>
      <p>Proceedings of AI4HGI ’25, the First Workshop on Artificial Intelligence for Human-Game Interaction at the 28th European … CEUR Workshop Proceedings, ISSN 1613-0073.</p>
      <p>
        As the actual object of study, we use an open-source (human-written) text-based game as a starting
point. Then, we generate variations of it using GPT-4o [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Our methodology involves a first exploratory
experiment, with a related analysis of the preliminary results, and, subsequently, a more exploitative
experiment to test hypotheses emerging from the previous analysis.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Specifically, our research makes use of Large Language Models (LLMs). In the last few years, these AI
models have significantly impacted many aspects of modern society. In particular, LLMs have been used
to explore new frontiers in areas such as automated content generation, human-computer interaction,
and language translation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this regard, game research has made abundant use of this technology
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. At the same time, new challenges arise with regard to ethics, authenticity, and general misuse of
these new technologies [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Another important element of the present research is the aforementioned method, which is inspired
by the Turing Test. In the original version of the Turing test, a human judge evaluates their text
interaction with both a machine and a human. In this context, human judgment is the measure of
how human-like an AI is. While the Turing Test has influenced AI development for decades, its
relevance today is debated; many argue that it focuses too much on human imitation rather than true
understanding or reasoning. This holds especially true for modern AI systems such as LLMs, which
excel at generating human-like responses but may lack actual comprehension or reasoning. Overall,
the Turing test’s main drawback is the tendency to focus more on deceiving the human participant than
on testing actual artificial reasoning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        The usage of games as media for conducting Turing tests has been widely explored. For example,
Mourning and Lounsbury use the game Dance Dance Revolution to investigate whether an algorithm
could generate believable beatmaps (key combinations that players are supposed to follow in the game).
Using pre-existing songs as a base, they generate a series of beatmaps and use a Turing test to compare
them with human-generated ones [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Another example worth mentioning is “Human or Not? A Gamified Approach to the Turing Test”
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This game includes an environment in which the player interacts with either a human or an AI.
This game sets up real-time interactions, requiring users to analyze language patterns, personality
quirks, and contextual responses to judge whether they are engaging with a person or a bot. The
game’s environment is designed to mimic real-life social exchanges, allowing the AI to adopt personas
with distinct characteristics, such as making deliberate spelling mistakes or using slang to enhance
the illusion of humanness. The tool measures success not just by whether the AI can deceive humans
but also by how engaging and convincing the interaction feels. This approach extends the Turing Test
concept by introducing a dynamic medium for evaluation, which adds depth to understanding how AI
can replicate human-like communication in social contexts.
      </p>
      <p>
        Moreover, the paper “AI Bots in Video Games: A Study of Social Interaction and Turing Test Metrics”
examines AI bots’ interaction with human players in multiplayer video games. The study uses principles
from the Turing Test to evaluate the bots’ ability to blend in and mimic human-like behavior, particularly
in social and strategic interactions. By focusing on multiplayer games, where social dynamics play a
crucial role, the paper extends the Turing Test from a text-based framework to a real-time,
decision-based environment. It explores how effectively AI bots can engage with human players without being
detected. This analysis highlights how AI can be designed to pass as humans in environments that
require more than just linguistic skill, challenging the AI to display emotional intelligence, adaptability,
and strategic thinking [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The examples above illustrate the validity of the Turing test-inspired methods in game research,
albeit with their limitations.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Relevance</title>
      <p>Understanding how AI-generated text compares to human-written narratives is becoming increasingly
important. It not only sheds light on what AI is capable of today but also helps us explore the strategies
people use to spot AI-generated content. Moreover, by investigating participants’ preferences, we aim
to reveal how much trust and acceptance there is towards AI-generated content in game contexts. These
are factors that will likely impact how AI is integrated into creative and educational spaces in the future.</p>
      <p>The research also extends into the psychological and perceptual aspects of how people interact
with AI-created stories. Biases for or against AI-authored content influence how narrative games
are experienced. Additionally, by pinpointing the most successful text characteristics, we can inform
future developments for human-AI cooperation in game development. This can help developers to
integrate AI into their process while still creating engaging, relatable, and convincing content. In
general, our research deepens our understanding of human-AI interactions in text-based games, during
both development and gameplay. These insights can inform the development of AI-driven narratives
and the ethical considerations about the use of AI in creative fields.</p>
      <p>
        Furthermore, as time goes by and AI becomes more powerful, cases of AI passing Turing tests
become more common [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. On that note, we examine this claim further by evaluating human
performance in the setting of our own Turing test.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>Our methodology is structured in two main steps:</title>
        <p>• A small exploratory experiment that focuses on investigating participants’ reactions and
considerations about human- versus LLM-written narratives in textual games.
• Using the results from the previous step, we develop specific hypotheses and carry out an
exploitative experiment to test them.</p>
        <p>The first experiment starts with demographic questions. Participants then play both the AI- and
human-generated games, side by side. At the end of the play session, the participants answer another
questionnaire about their story preference, their opinion about the nature of the author (AI vs human), and what
strategies they use to reach this conclusion. We then analyze the data to identify recurring recognition
patterns, strategies, and biases. We also compare our preliminary results with existing research. Finally,
we test the emerging hypotheses in the exploitative experiment.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Explorative Experiment</title>
          <p>In the explorative experiment, we expose participants to the original version of the game and to one
reformulated by an LLM.</p>
          <p>
            For the human-written version, we use the open-source, choice-based, web-based text game Beneath Floes [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
For the creation of the AI-generated version, first, we create a copy of the human story.
          </p>
          <p>
            Then, we use the LLM ChatGPT-4o [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] with the prompt:
          </p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>The prompt is illustrated in Figure 1:</title>
        <p>“Paragraph from the original game” + rewrite this</p>
        <p>
          Each paragraph includes approximately 80-100 words. We then use the LLM output and change the
source code of the human-generated story in a copied file. The final result is an AI-generated version
of Beneath Floes [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>To keep players unbiased, the stories are renamed with neutral labels, Story Cee (Human) and
Story Vee (AI). In addition, we include two questionnaires: a pre-test including informed consent and
personal information and a post-test focused on the game experience. This second survey is the one
including our variation, which is strongly inspired by the Turing test. The setup of the experiment is as
follows:
• Phase 1: The players start with a questionnaire about general information (age, field of
work/study, familiarity with the use of English, familiarity with AI models, frequency of book reading)
on qualtrics.com.
• Phase 2: The participants play the AI and the human-generated games. The games are played
side by side in order to foster better understanding and comparability.
• Phase 3: The players participate in the post-test about the gameplay (story preference and reason,
which one was generated by an AI, reason, and confidence in this answer).</p>
        <p>This process is illustrated in Figure 2:</p>
        <sec id="sec-4-2-1">
          <title>4.2. Exploitative Experiment</title>
          <p>Later, we conduct an exploitative experiment to support the data from the first one. In this experiment,
participants are exposed to five different paragraphs. One of the paragraphs is written by a human,
while the other four are rewritten by ChatGPT-4o using different prompts based on the hypotheses
from the explorative experiment. Each paragraph in the experiment is referred to as a Story.</p>
          <p>The prompts are the following:
• For the first paragraph, we focus on realism in terms of writing. We use the prompt Can you
rewrite this phrase and nothing else but make it look as realistic as possible: “{the human
written paragraph}”. Then, instead of accepting the first output as it is, and inspired by a trending meme, we
apply pressure on the model to make the writing of the story look even more realistic with the follow-up prompts
now, can you make it even more realistic?, then EVEN MORE REALISTIC, and finally
I want you to imagine a human writing this story, what would they write?. The product
of these prompts is Story 1, focused on believability.
• For the second paragraph, we want to test the AI’s ability to make itself as undetectable as possible.</p>
          <p>Therefore, we use the prompt: Can you rewrite this phrase in such a way that a human
won’t be able to recognize if it was a human who wrote it or an LLM model: “{the human
written paragraph}”. We define this as Story 2, unrecognizable by a human.
• The third paragraph focuses on complexity. Can you rewrite this phrase in the simplest
way possible, as simple as it gets: “the human written paragraph”. This generates Story 3,
minimum complexity.
• The fourth paragraph is the human-written one, namely Story 4, the original human one.
• The fifth paragraph is the AI-generated text that we use in the explorative experiment. For this
paragraph, we just ask the model to rewrite the text. We define this one as Story 5, rewritten
without specific directions.</p>
          <p>For the experiment, we ask the participants to judge, for each story, whether it is written by a human
or by AI. Our interpretation of the Turing test is therefore performed five times, once per paragraph.
Furthermore, we ask them to rate the paragraphs in terms of complexity on a scale of 0-5 (0 minimum
complexity, 5 maximum complexity). Lastly, we ask them to rank the stories from 1 to 5 in terms of
personal preference (1 liked most, 5 liked least).</p>
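          <p>Assuming each answer is recorded as a simple (story, guess, rating) record, a data layout of our own choosing rather than one specified in the paper, the per-story guess tallies and average complexity used in the Results can be computed as:</p>
          <preformat>
```python
from collections import Counter
from statistics import mean

def summarize(responses):
    """responses: iterable of (story_id, guess, complexity) tuples, where guess
    is "human" or "ai" and complexity is the participant's 0-5 rating."""
    guesses, ratings = {}, {}
    for story, guess, complexity in responses:
        guesses.setdefault(story, Counter())[guess] += 1
        ratings.setdefault(story, []).append(complexity)
    return {story: {"guesses": dict(guesses[story]),
                    "avg_complexity": mean(ratings[story])}
            for story in guesses}
```
          </preformat>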
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Explorative Experiment</title>
        <p>The data we extract from the explorative experiment can be seen below. The explorative experiment is
performed on 10 participants. The average age of the participants is 33.1, and the standard deviation is
13.52. 7 had a Bachelor’s Degree, 2 had a Master’s Degree, and 1 had a High School Degree. Participants
are asked to provide the level of their English proficiency, how comfortable they are with AI models,
and how often they read books on a scale of 0-5 (0 minimum, 5 maximum):
• English Proficiency: Average: 3.9, Standard Deviation: 0.994
• Comfort with AI Models: Average: 2.9, Standard Deviation: 1.197
• Book Reading Frequency: Average: 3, Standard Deviation: 1.414</p>
        <p>(a) Results of the Turing test-like method.</p>
        <p>In Figure 3a, the results of our test are visible. 6 of the 10 participants guess correctly which story
is generated by whom. The results are produced from the question “Which one is generated by AI?”.
Furthermore, Figure 3b shows whether the participants prefer the game with the text generated by
humans or AI. 8 of the 10 respondents prefer the AI-generated story, while 2 prefer the human-generated
one.</p>
        <p>In Figure 4, we show the answers to the questions: “Which one did you like better?” and “Which one
do you think is created by AI?”. For each participant, we check the relation between the answers to
these two questions. Therefore, we identify two groups of people:
• Group 1: participants who prefer the story they identify as AI-generated.</p>
        <p>• Group 2: participants who prefer the story they identify as human-written.</p>
        <p>Once we classify each participant into one of the two groups, we spread their answers over their age
groups. The results can be seen in Figure 4.</p>
        <p>Blue levels represent group 1, while orange ones represent group 2. Then, we calculate Pearson’s
correlation coefficient between the answers. Strictly speaking, Pearson’s correlation coefficient is not defined for
categorical values; in our case, however, the answers are numeric ratings, and averaging them produces
continuous values, which justifies this practice for our data. We apply
it to the following four questions:
• 1: How comfortable are you with the use of English?
• 2: How comfortable are you with AI models?
• 3: How often do you read books?
• 4: How confident are you? [that your guess is the actual AI-generated story]</p>
        <p>We present the results in the heatmap in Figure 5. Each number on the y- or
x-axis corresponds to one of the questions above, and each cell shows the
correlation between the corresponding pair of questions. Cells that are symmetric about the
diagonal therefore show the same correlation, since they correspond to the same pair of
questions.</p>
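        <p>The construction of such a heatmap can be sketched as follows (a pure-Python illustration over made-up rating columns, not the study’s raw data):</p>
        <preformat>
```python
from math import sqrt

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length rating columns.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    # One column per questionnaire item; the result is symmetric about the
    # diagonal, which is why mirrored heatmap cells show the same value.
    return [[pearson_r(a, b) for b in columns] for a in columns]
```
        </preformat>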
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Exploitative Experiment</title>
        <p>We perform the exploitative experiment on 43 participants. Figure 6a
shows the average complexity rating given by the participants for each story. A “Story” is defined as
each paragraph of the exploitative experiment. We repeat the labels for clarity:
• Story 1: focused on believability
• Story 2: unrecognizable by a human
• Story 3: minimum complexity
• Story 4: the original human one
• Story 5: rewritten without specific directions</p>
        <p>(a) Average rated complexity.</p>
        <p>(b) Results of guessing who wrote each paragraph.</p>
        <p>In Figure 6b, we can see, for each of the five paragraphs, the results of guessing who originally wrote it. Stories
1 and 5 are the most frequently identified as AI-written. In contrast, stories 2, 3, and 4 are mostly
identified as written by humans rather than AI. In particular, 40% of the participants believe that story
1 is human-written, 62.1% for story 2, 62.8% for story 3, 65.7% for story 4, and 45% for story 5.
Furthermore, using Pearson’s correlation coefficient, we create the correlation graph in Figure
7, with the number of total responses on the x-axis and the average
complexity on the y-axis. We detect two moderate correlations: -0.49 between the “Human” responses and the
average complexity scores, and 0.69 between the “AI” responses and the average complexity scores.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The graph in Figure 3a depicts answers to the question “Which one is generated by an AI?”.
6 of the 10 respondents guessed correctly which story is generated by AI and which is generated by
humans, while 4 of them guessed incorrectly. A split right down the middle
would arguably mean that the participants cannot spot the AI-written story at all. On the one hand, our
results are close to half, which suggests considerable confusion. On the other
hand, there are indications that some of the strategies used are effective in spotting AI-generated
narratives. To shed a little light on that matter, we focus on Figure 5.</p>
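      <p>To make “close to half” concrete, an exact two-sided binomial test (an illustrative calculation of ours, not part of the paper’s analysis) shows that 6 correct guesses out of 10 is entirely consistent with chance-level guessing:</p>
      <preformat>
```python
from math import comb

def two_sided_binom_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes out of n under a fair-coin null,
    summing the probabilities of all outcomes no more likely than the observed one."""
    pk = comb(n, k)
    return sum(comb(n, i) for i in range(n + 1) if comb(n, i) <= pk) / 2 ** n

print(two_sided_binom_p(6, 10))  # ~0.754: far from significant at this sample size
```
      </preformat>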
      <p>In Figure 5, we notice a moderate negative correlation (r = -0.61) between the answers to
the questions “How comfortable are you with AI models?” and “How confident are you? (that your
guess is the actual AI-generated story)” in cells 2,4 and 4,2. Arguably, these preliminary findings suggest that
the more people use AI models, the less confident they are in recognizing AI-generated
texts. Additionally, we gather interesting information from the answers to the questions “How often
do you read books?” and “How confident are you? (that your guess is the actual AI-generated story)” in
cells 3,4 and 4,3. We notice a moderate positive correlation, which can be interpreted as: reading more
books boosts confidence in one’s own AI-recognition skills. A possible argument is that there are
details in human-written texts which, when spotted, are identified as signs of human authorship.</p>
      <p>As for the preference for an AI narrator, in Figure 3b we notice that 8 out of 10 participants (80%)
preferred the AI-generated story. Notably, the participants are not aware of
which one is human-written and which one is AI-generated. To illustrate a potential bias against AI narrators, we create Figure
4 with the process described in the Results. From
Figure 4, we can tentatively argue that younger participants are more likely to prefer AI-generated text.
Next, we extract information about which text characteristics participants use to judge whether
a story is written by a human or an AI. Specifically, we pay attention to the open-ended part of the
questionnaire, in particular the question “Why (do you think this game is generated by AI)?”.
We report exemplary answers:
• “They were both equally well thought and I’m purely going with the grammar and vocabulary
clues.”
• “It feels more clunky, I feel that the story is narrated by a computer.”
• “Because of the use of grammar, and the language that is used throughout the story. Seems more
academic.”
• “Although both descriptions are precise the expressions used in the (human) story sound closer
to the expected expressions by someone telling the story.”</p>
      <p>Recurring elements are grammar, vocabulary, language, and academic writing. In other words, people
often refer to terms related to the complexity of the text. This informs the design of the exploitative
experiment. Stories 1 and 5 convince less than half of the participants that they are written by humans.
Conversely, stories 2, 3, and 4 convince more than half of them. Story 4 is human-written, and more than
50% of the participants guess correctly (65.7%). Story 2 is generated after we ask the LLM to make its
contribution as undetectable as possible; 62.1% of the participants guess incorrectly that story 2 is
written by a human. From the results of our experiment, shown in Figure 6a, we can notice that Story 3
is voted the least complex. Therefore, we argue that the prompt
we use to rewrite the story in the least complex way succeeds in its purpose.</p>
      <p>Moreover, based on Figure 6b, we notice that the least complex story falsely convinces the most participants (62.8%) that it is
written by a human rather than AI. With that information, we question
how important complexity really is when humans decide who originally wrote a story. For this
reason, we perform a correlation test, using Pearson’s correlation coefficient, between the perceived
average complexity scores and the total number of “Human” and “AI” responses for each story.
We illustrate the results in Figure 7. In more detail, we detect a correlation of r = 0.69
between the “AI” responses and the average complexity scores, which can be interpreted
as a moderate positive correlation. However, with only five stories as data points, this correlation does not
reach statistical significance and should be read as an indicative trend rather than a confirmed effect. Likewise, we notice
a correlation of r = -0.49 between the “Human” responses and the average complexity scores, a moderate
negative correlation that, for the same reason, is not statistically meaningful on its own.</p>
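      <p>To make the significance caveat concrete (our own illustrative calculation, not part of the paper’s analysis): the standard test for a Pearson coefficient uses a t statistic with n - 2 degrees of freedom, and with only n = 5 stories even a coefficient of 0.69 falls well short of the two-sided 5% critical value (about 3.18 on 3 degrees of freedom):</p>
      <preformat>
```python
from math import sqrt

def t_stat(r: float, n: int) -> float:
    # t statistic for testing H0: no correlation, with n - 2 degrees of freedom.
    return r * sqrt(n - 2) / sqrt(1 - r * r)

print(round(t_stat(0.69, 5), 2))   # 1.65, vs. a critical value of ~3.18
print(round(t_stat(-0.49, 5), 2))  # -0.97
```
      </preformat>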
      <p>Therefore, from our data, we can argue that the complexity of the text can be an important
factor that humans pay attention to, to judge if the text is written by humans or AI.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>
        Since our method is strongly inspired by the Turing test, it shares many of the same issues. Over the years,
the Turing test has been considered insufficient to correctly determine whether the machine has really
overcome the human. The method focuses on deceiving the tester, which is arguably not enough to
consider that “the machine has overcome the human” [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For this reason, future research can develop
new methods that focus on true intelligence and reasoning and could yield different results.
      </p>
      <p>Furthermore, according to Figure 7, story 4 is ranked the worst. As a matter of fact, story 4
is voted the worst the most times, according to Figure 8. This result is arguably unbiased, since the
participants were not aware that this specific story is human-written. However, it could indicate an
inherent dislike of the author’s style.</p>
      <p>Perhaps a different author could provide a more likeable and preferable story. Moreover, in the
methodology of our exploitative experiment, the paragraphs were presented to the participants
in a fixed order, which may introduce order effects. In addition, we want to point out the effect of human biases. Human biases
against AI-created content can strongly influence a person’s decision-making and processing in
such situations. This phenomenon can further affect the results of our experiment.</p>
      <p>In addition, our explorative experiment is performed with a low number of participants. It is possible
that a bigger pool would allow for different strategies to emerge. Also, we want to point out our method
for generating AI paragraphs in our exploitative experiment: since the model rewrites an existing
human-written text rather than creating a story from scratch, the result may be criticized as not true
“AI-generated behaviour”.</p>
      <p>
        Lastly, pointing out the ethical considerations of AI use is of crucial importance. AI use raises
concerns in many areas, including healthcare, finance, and more [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. Future Works</title>
      <p>
        Our results indicate that younger audiences could be more accepting of AI-generated narratives. Future
studies could explore this factor while using generative models in a more prominent role in the design
process. However, although we know that LLMs are capable of generating text, their skills in authentic
creation are still severely limited. Current LLMs are capable of generating different answers to
prompts based on their training, but do not have the skill to authentically create game narratives [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
Therefore, we can argue that LLMs are still insufficient as a technology to support a full game design
process.
      </p>
      <p>However, future developments could change the result of this experiment. Another interesting
context for investigation would be a similar experiment on academically written text. Academic
text tends to have consistently high complexity.</p>
      <p>Also, such documents tend to lack an emotional register. Performing the same experiment as
in this paper on this type of text, we wonder whether a human would still be able to guess correctly
which one was made by a machine and which was not.</p>
      <p>Another concept that can be explored in future research is the area of textual complexity: how do
LLMs comprehend textual complexity, and how do they react to it?</p>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>Throughout this paper, we focus on AI-generated text-based games and how humans engage with
them. We illustrate the importance of the Turing test and its limitations. We then raise some research
questions about the experience of humans playing AI-generated story games. Later, we present some
examples from different experiments in the game research field that apply the Turing Test. We develop
our methodology, which is strongly inspired by the Turing test, by elaborating on our exploratory
experiment. We illustrate how we use a pre-made human-generated text-based narrative game and, with
the help of the LLM ChatGPT 4o, we generate an AI version. We initially experiment on 10 participants
by letting them play both games, side by side. We ask them to guess who wrote which text (human or
AI), their preference, and why.</p>
      <p>This experiment yielded a wealth of interesting information, including how challenging it
can be to spot AI authorship, which user characteristics influence this skill, and how open users are to the
idea of AI narratives. Furthermore, noticing that participants focus on the complexity of the
text, we develop our exploitative experiment. For this experiment, we create different stories using
different prompts and test them on 43 participants. We ask them to guess whether each story was written by a human
or an AI, and also to rate each story in terms of complexity. The data indicate that people are indeed
paying attention to the complexity of the text and that generative models can be prompted to exploit
this tendency.</p>
    </sec>
    <sec id="sec-10">
      <title>10. Declaration of Generative AI</title>
      <p>The authors confirm that they used a generative AI tool, exclusively for the purposes of the research
described in the “Methodology” section of this paper. After using this tool, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Tomasz</given-names>
            <surname>Słapczyński</surname>
          </string-name>
          , ”
          <article-title>Artificial Intelligence in science and everyday life, its application and development prospects</article-title>
          ,”
          <source>ResearchGate</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Aleksandar</given-names>
            <surname>Filipović</surname>
          </string-name>
          , ”
          <article-title>The Role of Artificial Intelligence in Video Game Development</article-title>
          ,”
          <source>ResearchGate</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Alan Mathison</given-names>
            <surname>Turing</surname>
          </string-name>
          , ”
          <article-title>Computing Machinery and Intelligence</article-title>
          ,”
          <source>Mind</source>
          , vol.
          <volume>59</volume>
          , no.
          <issue>236</issue>
          , pp.
          <fpage>433</fpage>
          -
          <lpage>460</lpage>
          ,
          <year>1950</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Joel</given-names>
            <surname>Frank</surname>
          </string-name>
          , Franziska Herbert, Jonas Ricker, Lea Schönherr, Thorsten Eisenhofer, Asja Fischer, Markus Dürmuth, and Thorsten Holz, ”
          <article-title>A Representative Study on Human Detection of Artificially Generated Media Across Countries</article-title>
          ,”
          <source>arXiv preprint arXiv:2312.05976</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Prithwiraj</given-names>
            <surname>Choudhury</surname>
          </string-name>
          , Bart Vanneste, and Amirhossein Zohrehvand, ”
          <article-title>The Wade Test: Generative AI and CEO Communication</article-title>
          ,”
          <source>CESifo Working Paper No. 11316</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Bharathi</given-names>
            <surname>Mohan</surname>
          </string-name>
          et al., ”
          <article-title>An analysis of large language models: their impact and potential applications</article-title>
          ,”
          <source>Springer</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kleinman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Harteveld</surname>
          </string-name>
          , ”
          <article-title>GPT for Games: An Updated Scoping Review (2020-2024)</article-title>
          ,”
          <source>arXiv preprint arXiv:2411.00308</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Deng</surname>
          </string-name>
          et al., ”
          <article-title>Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas: A Survey</article-title>
          ,”
          <source>arXiv preprint arXiv:2406.05392</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Gosline</surname>
          </string-name>
          , ”
          <article-title>Human favoritism, not AI aversion: People's perceptions (and bias) toward generative AI, human experts, and human-GAI collaboration in persuasive content generation</article-title>
          ,”
          <source>Judgment and Decision Making</source>
          , vol.
          <volume>18</volume>
          , article e41,
          <year>2023</year>
          . [Online]. Available: https://doi.org/10.1017/jdm.2023.37
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Katrina</given-names>
            <surname>LaCurts</surname>
          </string-name>
          , ”
          <article-title>Criticisms of the Turing Test and Why You Should Ignore (Most of) Them</article-title>
          ,”
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Chad</given-names>
            <surname>Mourning</surname>
          </string-name>
          and Bradey Lounsbury, ”
          <article-title>A Turing Test for Beatmap-Generation</article-title>
          ,”
          <source>COG 2024</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          et al., ”
          <article-title>Human or Not? A Gamified Approach to the Turing Test</article-title>
          ,”
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          ”
          <article-title>AI Bots in Video Games: A Study of Social Interaction and Turing Test Metrics</article-title>
          ,”
          <source>in Proceedings of the 2023 International Conference on Artificial Intelligence in Gaming</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Robert M.</given-names>
            <surname>French</surname>
          </string-name>
          , ”
          <article-title>Subcognition and the Limits of the Turing Test</article-title>
          ,”
          <source>Mind</source>
          , vol.
          <volume>99</volume>
          , no.
          <issue>393</issue>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Snow</surname>
          </string-name>
          , ”
          <source>Beneath Floes</source>
          ,”
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] OpenAI, ”
          <article-title>GPT-4 Technical Report</article-title>
          ,”
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Franceschelli</surname>
          </string-name>
          and Mirco Musolesi, ”
          <article-title>On the Creativity of Large Language Models</article-title>
          ,”
          <source>arXiv preprint arXiv:2304.00008</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Karl</given-names>
            <surname>Pearson</surname>
          </string-name>
          , ”
          <article-title>Mathematical Contributions to the Theory of Evolution. II. Skew Variation in Homogeneous Material</article-title>
          ,”
          <source>The Royal Society</source>
          ,
          <year>1895</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Jones</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Bergen</surname>
          </string-name>
          , ”
          <article-title>Large Language Models Pass the Turing Test</article-title>
          ,”
          <source>arXiv preprint arXiv:2503.23674</source>
          ,
          <year>2025</year>
          . [Online]. Available: https://arxiv.org/abs/2503.23674
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          , ”
          <article-title>Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, Impacts, and Mitigation Strategies</article-title>
          ,”
          <source>Sci</source>
          , vol.
          <volume>6</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>3</fpage>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A. L. C.</given-names>
            <surname>Bertoncini</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Serafim</surname>
          </string-name>
          , ”
          <article-title>Ethical content in artificial intelligence systems: A demand explained in three critical points</article-title>
          ,”
          <source>Frontiers in Psychology</source>
          , vol.
          <volume>14</volume>
          , art. 1074787, Mar. 30,
          <year>2023</year>
          , doi: 10.3389/fpsyg.2023.1074787.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>