<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium-sized LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Julian Geheeb</string-name>
          <email>julian.geheeb@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Farhan Abid Ivan</string-name>
          <email>farhanabid.ivan@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Dyrda</string-name>
          <email>daniel.dyrda@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miriam Anschütz</string-name>
          <email>miriam.anschuetz@tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Groh</string-name>
          <email>grohg@cit.tum.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technical University of Munich</institution>
          ,
          <addr-line>Arcisstraße 21, 80333 Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent research has demonstrated that large language models (LLMs) can support experts across various domains, including game design. In this study, we examine the utility of medium-sized LLMs, i.e., models that operate on consumer-grade hardware typically available in small studios or home environments. We began by identifying ten key aspects that contribute to a strong game concept and used ChatGPT to generate thirty sample game ideas. Three medium-sized LLMs (LLaMA 3.1, Qwen 2.5, and DeepSeek-R1) were then prompted to evaluate these ideas according to the previously identified aspects. A qualitative assessment by two researchers compared the models' outputs, revealing that DeepSeek-R1 produced the most consistently useful feedback, despite some variability in quality. To explore real-world applicability, we ran a pilot study with ten students enrolled in a storytelling course for game development. At the early stages of their own projects, students used our prompt and DeepSeek-R1 to refine their game concepts. The results indicate a positive reception: most participants rated the output as high quality and expressed interest in using such tools in their workflows. These findings suggest that current medium-sized LLMs can provide valuable feedback in early game design, though further refinement of prompting methods could improve consistency and overall effectiveness.</p>
      </abstract>
      <kwd-group>
        <kwd>Game Design</kwd>
        <kwd>Conceptualization Phase</kwd>
        <kwd>Medium-sized LLMs</kwd>
        <kwd>Local Inference</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>AI-assisted Design</kwd>
        <kwd>LLM-as-a-judge</kwd>
        <kwd>Human Evaluation</kwd>
        <kwd>User Study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>At the beginning of any creative process, there is often a spark—a moment of imagination that captures
an idea and sets it on a path toward becoming a finished artifact. In our context, this artifact is a video
game. However, the initial concept is typically rough and underdeveloped, like an unpolished gem. It
requires refinement before it can serve as a foundation for development.</p>
      <p>
        This refinement begins in the pre-production stage of game development, particularly during the
conceptualization phase, where the core idea is expanded into a full-fledged game concept [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. To be
effective, such a concept must include a sufficient level of detail across various dimensions, enabling
smoother transitions into later stages of production. Because these concepts are often documented in
written formats—such as Game Design Documents (GDDs)—large language models (LLMs) present a
promising tool for evaluating whether this level of detail has been achieved.
      </p>
      <p>
        LLMs have demonstrated their ability to support experts across many fields [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], including game
design [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, most high-performance LLMs require significant computational resources and
are typically accessed via cloud-based platforms. This reliance on third-party providers introduces
concerns about privacy, intellectual property, and long-term accessibility—issues particularly relevant
to independent game developers and small studios.
      </p>
      <p>In this study, we explore whether medium-sized LLMs, which can be hosted on consumer-grade
hardware, can provide meaningful support during the conceptualization phase of game design.
Specifically, we investigate whether these models can deliver valuable feedback on early-stage game concepts
without the need for external cloud services.</p>
      <p>To address this question, our contributions are as follows:
• We identify ten key aspects that characterize a robust game concept (section 2).
• We conduct a human evaluation to compare three medium-sized models using a test dataset and
standardized hardware (section 4).
• We build a prototype, SPARC, and run a pilot study in which the best-performing model is
integrated into the workflow of students engaged in early-stage game development (section 5).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Conceptualization Framework</title>
      <p>
        To enable the models to meaningfully evaluate game concepts, we first defined a set of criteria grounded
in established game development practices. Drawing from a range of sources in game design, level design,
and production—such as Salen and Zimmerman [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Schell [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Galuzin [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Totten [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Fullerton [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and
Yang [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]—we identified ten key aspects that a well-developed game concept should address. While not
exhaustive, these aspects offer a solid foundation for evaluating early-stage design ideas and are well
suited to the aims of our study. In practical settings, they could also be adapted to the specific needs of
individual teams or projects. Each aspect was carefully defined in an extended description, which was
included in the prompt provided to the LLMs. A brief overview of the ten aspects is presented below.
      </p>
      <p>Player Experience This aspect describes what the player is supposed to experience. It is written
from the perspective of the player in the active voice, focusing on emotional experiences, and it should
include a high concept statement for the play idea.</p>
      <p>Theme This aspect defines the theme of the idea. The theme of a game concept is often divided into
a dominant unifying theme and multiple secondary themes.</p>
      <p>Gameplay This aspect describes the core gameplay. It includes finding 3–5 verbs that describe the
gameplay experience and it should include a 30 seconds of gameplay statement describing what the
player typically does.</p>
      <p>Place This aspect defines places in the game world where the space under construction can be set.
It includes the environment setting of the idea, which is similar to theme, but it describes an actual
location within the game world. This aspect should also provide a list of concrete locations the game
takes place in.</p>
      <p>Unique Features This aspect consists of a list of 3-5 features that are the defining elements of the
idea. It answers the question of how the idea will be unique by contrasting it with existing projects.</p>
      <p>Story and Narrative This aspect describes the rough story of the game and how the player
experiences it. It includes defining storytelling methods, such as environmental storytelling, gameplay,
cutscenes, narrators, dialogues, story context, and more.</p>
      <p>Goals, Challenges and Rewards This aspect defines goals, challenges and rewards for the idea.
Goals define objectives that the player has to complete. Challenges are obstacles the player has to
overcome in order to achieve one goal. The rewards describe how the player will be rewarded for
overcoming a set of obstacles to achieve one goal.</p>
      <p>Art Direction This aspect describes the general artistic vision. It should include an art style, color
palettes, and visually unique features.</p>
      <p>Purpose This aspect defines the purpose of the project. It includes formulating the purpose for all
involved stakeholders on why they want to work on the project.</p>
      <p>Opportunities and Risks This aspect describes opportunities and risks of the idea by providing a
list of each. For the opportunities, it includes planning on how to use them effectively. For the risks, it
includes how likely they are to happen and strategies to minimize the risks.</p>
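      <p>To illustrate how these definitions can be operationalized, the following minimal Python sketch (our illustration, not part of the original study materials) encodes the ten aspects as a dictionary whose rendered form could stand in for the &lt;Details on the aspects&gt; placeholder used in our prompts; the one-line summaries here are abridged.</p>
      <preformat>
# Illustrative sketch: the ten aspects as a dictionary that can be rendered
# into the "&lt;Details on the aspects&gt;" portion of an evaluation prompt.
# The summaries are abridged; the study used extended descriptions.
ASPECTS = {
    "Player Experience": "What the player should experience, written from "
                         "the player's perspective, with a high concept statement.",
    "Theme": "A dominant unifying theme plus secondary themes.",
    "Gameplay": "3-5 core verbs and a '30 seconds of gameplay' statement.",
    "Place": "Environment setting and a list of concrete locations.",
    "Unique Features": "3-5 defining features contrasted with existing projects.",
    "Story and Narrative": "Rough story and the storytelling methods used.",
    "Goals, Challenges and Rewards": "Objectives, obstacles, and rewards.",
    "Art Direction": "Art style, color palettes, visually unique features.",
    "Purpose": "Why each stakeholder wants to work on the project.",
    "Opportunities and Risks": "Lists of both, with plans and mitigations.",
}

def render_aspect_details() -> str:
    """Render the aspect catalog as a bullet list for prompt insertion."""
    return "\n".join(f"- {name}: {summary}" for name, summary in ASPECTS.items())
</preformat>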
    </sec>
    <sec id="sec-3">
      <title>3. Hardware Setup</title>
      <p>As outlined in section 1, our objective was to ensure that the proposed approach remains accessible to
small indie developers and hobbyists by relying on locally available consumer-grade hardware. To this
end, we selected a representative system configuration that served as the baseline for our study (see
the left side of Table 1). All model selections and experiment designs were made with this system’s
capabilities in mind, ensuring that the approach is technically feasible on such hardware. We chose to
execute the non–user-facing experiments in section 4 on a more powerful machine to reduce runtime
(see the right side of Table 1), but the system detailed on the left represents the minimum specifications
required to reproduce the methodology.</p>
      <sec id="sec-3-1">
        <title>Baseline System Configuration</title>
        <sec id="sec-3-1-1">
          <title>Operating System GPU</title>
        </sec>
        <sec id="sec-3-1-2">
          <title>Memory</title>
          <p>Ubuntu 22.04</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>NVIDIA GeForce RTX 3080 Ti 12 GB of GDDR6X</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Faster System Configuration</title>
        <sec id="sec-3-2-1">
          <title>Operating System GPU</title>
        </sec>
        <sec id="sec-3-2-2">
          <title>Memory</title>
          <p>Ubuntu 22.04</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>2× NVIDIA A40 48 GB of GDDR6</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Model Comparison and Qualitative Analysis</title>
      <p>
        This section outlines the methodology, results, and discussion of our first experiment, in which we
compared the outputs of three medium-sized LLMs: meta-llama/Llama-3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
Qwen/Qwen2.5-7B-Instruct (https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and deepseek-ai/DeepSeek-R1-Distill-Llama-8B (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] (LLaMA 3.1, Qwen 2.5, and
DeepSeek-R1 in the following). All three models were selected based on their compatibility with
the baseline system described in section 3, though the actual evaluation was conducted on a more
powerful machine to expedite processing. This section focuses on the comparative analysis itself;
hardware-specific execution details are discussed in their relevant context.
      </p>
      <sec id="sec-4-0">
        <title>4.1. Methodology</title>
        <p>The comparison was conducted through a qualitative human evaluation involving two researchers. To
enable this, we first created a custom dataset of game ideas and collected model outputs for each entry.
The following subsections describe both the data generation process and the evaluation procedure in
detail.</p>
      </sec>
      <p>Figure 1: The two prompts used for dataset creation with GPT-4o. Left prompt (game ideas): "Generate a game idea. (1) Provide information on &lt;N&gt; of the following categories: player experience, theme, gameplay, place, unique features, story and narrative, goals, challenges and rewards, art direction, purpose, opportunities and risks. (2) Here is more information about the categories: &lt;Details on the aspects&gt;". Right prompt (game summaries): "Summarize your game idea in an appropriate amount of paragraphs."</p>
      <sec id="sec-4-1">
        <title>4.1.1. Game Idea Dataset Creation</title>
        <p>
          To evaluate the capabilities of different language models and enable consistent comparisons, we first
created a dataset of game ideas with varying levels of descriptive detail. We used OpenAI’s GPT-4o [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ],
accessed through its official chat interface, to generate both the ideas and corresponding summaries.
The generation process followed these steps:
• Prompt GPT-4o to generate a game idea (left prompt in Figure 1), optionally specifying how many
aspects to cover (1) and whether to include detailed descriptions of those aspects (2).
• Save the generated game idea as a plain text file for further processing.
• Prompt GPT-4o to produce a summary of the same idea (right prompt in Figure 1).
• Save the summary in a separate text file for evaluation.
        </p>
        <p>
          Results Using the first prompt shown in Figure 1, we generated 15 distinct game ideas under varying
conditions to ensure diversity in content and coverage. Each idea included both a full version and a
summary, resulting in a total of 30 text files. This structured variety allowed us to test model performance
across a range of input detail levels while maintaining consistency in generation logic. The prompt
configurations were as follows:
• 4 game ideas generated without options (1) or (2),
• 5 game ideas using option (1) only, with randomly selected values for &lt;N&gt;: [3, 5, 7, 7, 8],
• 3 game ideas using both options (1) and (2), with randomly selected values for &lt;N&gt;: [6, 6, 8],
• 3 game ideas using both options (1) and (2), with &lt;N&gt; fixed at 10.
        </p>
        <p>Therefore, an example prompt of the second configuration would be as follows, where the selection
of categories was determined by the LLM:</p>
        <p>Generate a game idea. Provide information on three of the following categories: player
experience, theme, gameplay, place, unique features, story and narrative, goals, challenges
and rewards, art direction, purpose, opportunities and risks.</p>
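        <p>For illustration, the following Python sketch (ours, not part of the study pipeline, which used GPT-4o’s chat interface manually) shows how the four prompt configurations above could be assembled programmatically.</p>
        <preformat>
import random

# The ten aspect names from section 2, used as category labels in the prompt.
CATEGORIES = [
    "player experience", "theme", "gameplay", "place", "unique features",
    "story and narrative", "goals, challenges and rewards", "art direction",
    "purpose", "opportunities and risks",
]

def build_generation_prompt(n=None, include_details=False, details=""):
    """Assemble one of the four prompt configurations behind Figure 1.

    n=None, include_details=False  : base prompt without options (1) and (2)
    n set,  include_details=False  : option (1) only
    n set,  include_details=True   : options (1) and (2)
    """
    prompt = "Generate a game idea."
    if n is not None:
        prompt += (f" Provide information on {n} of the following categories: "
                   + ", ".join(CATEGORIES) + ".")
    if include_details:
        prompt += f" Here is more information about the categories: {details}"
    return prompt

# Example: option (1) only, with a randomly drawn number of categories.
print(build_generation_prompt(n=random.randint(3, 10)))
</preformat>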
      </sec>
      <sec id="sec-4-2">
        <title>4.1.2. Model Prompting</title>
        <p>For the model comparison, we used a standardized evaluation prompt (Figure 2) across all models
tested. This prompt instructed each model to assess whether the key aspects required for initiating
game development were present or inferable in each game idea:</p>
        <p>Figure 2: The evaluation prompt. "You are an expert game development consultant. Your task is to evaluate the following game
text as the foundation for a game development project. Check if the following aspects are
present or can be easily inferred from the game idea: player experience, theme, gameplay,
place, unique features, story and narrative, goals, challenges and rewards, art direction,
purpose, opportunities and risks. Expanded details about the aspects are as follows:
&lt;Details on the aspects&gt;
The objective is to check whether fields and aspects required to start development of a
game have been considered. Add suggestions at the end of evaluation along with 2-5 other
details that would make the text better suited to start game development with in addition
to including aspects that are not addressed in the game text. Do not take into account fiscal
or managerial requirements. Focus only on factors relevant for early stages of game design.
Avoid redundancy and limit your response to 1000 words."</p>
        <p>To generate outputs for all 30 game ideas from subsubsection 4.1.1, we employed the Hugging Face
Text Generation Inference Docker (https://huggingface.co/docs/text-generation-inference/en/index).
This environment streamlined inference across various open-source LLMs, including the three models
selected for our comparison. Each model was prompted once per game idea, resulting in a total of 90
output files. Although the models were chosen for their compatibility with the baseline system, this
phase of the experiment was executed on a more powerful system, as discussed in section 3 and shown
in Table 1.</p>
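        <p>As an illustration of this setup, the following Python sketch shows how the 90 outputs could be collected against a locally running Text Generation Inference server; the endpoint address, file layout, and token budget are our assumptions, not details reported by the study.</p>
        <preformat>
from pathlib import Path
from huggingface_hub import InferenceClient

# Assumption: a TGI container is already serving one of the three models,
# e.g. deepseek-ai/DeepSeek-R1-Distill-Llama-8B, at this address.
client = InferenceClient("http://localhost:8080")

EVALUATION_PROMPT = Path("evaluation_prompt.txt").read_text()  # Figure 2 text

out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True)

# Hypothetical layout: one plain-text file per game idea or summary.
for idea_file in sorted(Path("game_ideas").glob("*.txt")):
    idea = idea_file.read_text()
    response = client.text_generation(
        f"{EVALUATION_PROMPT}\n\nGame text:\n{idea}",
        max_new_tokens=1500,  # headroom for a ~1000-word response
    )
    (out_dir / f"{idea_file.stem}_eval.txt").write_text(response)
</preformat>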
      </sec>
      <sec id="sec-4-3">
        <title>4.1.3. Human Evaluation</title>
        <p>Following the generation of model outputs, we conducted a two-phase human evaluation to assess the
structure, relevance, and quality of the responses. The evaluation was conducted independently by two
researchers who were also closely involved in defining the ten aspects outlined in section 2.</p>
        <p>The first phase involved a high-level comparison of the 90 outputs (30 game ideas × 3 models) to
determine whether each model was capable of providing structured and usable feedback. This phase
aimed to answer the overarching question: Can this model provide structured and coherent feedback on
game concepts? The outputs were evaluated against the following general criteria:
• Format: Does the response follow the requested structure and formatting?
• Completeness: Does the model address all ten predefined aspects?
• Clarity and Coherence: Is the language clear, and does the feedback make logical sense overall?
The second phase focused on a closer qualitative review of the 30 outputs generated by the model
selected as most promising in the first phase. This detailed assessment combined open-ended analysis
with the following targeted criteria:
• Comprehension: Does the model correctly interpret the game idea and identify the relevant
aspects?
• Specificity: Is the feedback tailored to the individual game idea, or is it overly generic?
• Hallucination: To what extent does the model introduce unfounded or invented content?
• Feedback Quality: How valuable and well reasoned is the feedback from a game design
perspective?</p>
        <p>This two-phase process allowed us to first filter for viability and then examine depth and reliability
in greater detail.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Format</title>
      </sec>
      <sec id="sec-4-5">
        <title>Completeness</title>
      </sec>
      <sec id="sec-4-6">
        <title>Clarity</title>
      </sec>
      <sec id="sec-4-7">
        <title>LLaMA 3.1 Qwen 2.5 DeepSeek-R1</title>
        <p>4.2. Results - Phase 1: Comparative Evaluation Across Models
In terms of format, DeepSeek-R1 consistently outperformed the other two models (see Table 2). Outputs
from LLaMA 3.1 and Qwen 2.5 frequently entered infinite loops, repeating the last sentence, paragraph,
or entire structure until reaching the maximum token limit. These outputs were typically cut of
mid-sentence once the limit was reached, and they often failed to follow the structured format specified
in the prompt—namely, organizing feedback around the ten predefined aspects.</p>
        <p>In contrast, DeepSeek-R1 never exhibited looping behavior and provided structured feedback covering
all ten aspects in 26 out of 30 cases. However, a minor issue was observed in 3 out of 30 outputs, where
the model produced unexpected language artifacts, inserting Chinese characters mid-sentence. Despite
this, clarity and coherence were generally comparable across all models—aside from the looping and
formatting issues, no major qualitative differences were noted in this category.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.3. Results - Phase 2: In-Depth Analysis of DeepSeek Outputs</title>
        <p>Given its strong performance in Phase 1, DeepSeek-R1 was selected for a more detailed analysis in
Phase 2. We observed two distinct output structures across the model’s responses:
• Summary-first structure — the model begins by summarizing the original game idea according
to the ten aspects, followed by a set of suggestions and feedback.
• Integrated structure — feedback and suggestions are embedded directly within each aspect’s
analysis, creating a more intertwined and iterative review.</p>
        <p>The integrated structure typically focused on feedback, while the summary-first structure emphasized
summarization. This distinction made the two structures clearly recognizable in our observations. In
practice, many outputs exhibited variations or hybrid forms of these two patterns, but nearly all could
be classified within or between these structural types, which were approximately evenly distributed.</p>
        <p>The depth and detail of feedback varied significantly across different game ideas. In general, the
model tended to echo the aspects explicitly stated in the prompt rather than invent new ones—indicating
a low level of hallucination. The model adhered to the stated aspects more reliably for the full game
ideas than for their summaries. However, there were occasional instances where speculative ideas were
presented as factual. For example, the model sometimes introduced key locations not mentioned in the
original input. While these additions might be interpreted as hallucinations, they were consistently
contextually appropriate and logically consistent with the game’s setting. Since the prompt explicitly
requested additional suggestions, these cases could reflect either issues of expression or mild forms of
hallucination, making their classification less clear-cut.</p>
        <p>Another trend we observed was the model’s ability to adapt its feedback based on the completeness
of the input. Game ideas that lacked specific aspects received more focused and detailed suggestions
in those areas. Conversely, well-rounded ideas covering all ten aspects typically received shorter
summaries, along with a few targeted improvement hints. However, these observations were only
trends. Generally, the quality of the feedback varied considerably. Some responses were rich, specific,
and actionable, while others were brief and more generic. This variability was sometimes influenced by
the completeness and clarity of the input game idea, but not always.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.4. Discussion</title>
        <p>This section discusses the model comparison and qualitative evaluation, with a focus on informing
the implementation of the prototype system described in section 5. Broader implications and general
reflections are addressed separately in section 6.</p>
        <p>Our aim was to compare the performance of LLaMA 3.1, Qwen 2.5, and DeepSeek-R1 in providing
structured feedback on game concepts. In our experimental setup, both LLaMA 3.1 and Qwen 2.5
failed to consistently adhere to the required output format and completeness criteria. While alternative
prompting strategies or setups might potentially improve their performance, we chose to focus our
in-depth analysis on DeepSeek-R1, which showed the most promise in terms of structural consistency
and coverage.</p>
        <p>DeepSeek-R1 reliably produced outputs structured around the ten predefined aspects of a game
concept—an important requirement for our goal of enabling systematic feedback. Although the quality
of feedback varied across individual outputs, the model’s ability to maintain structural coherence and
generally relevant content led us to proceed with prototype development and a subsequent pilot study.
In short, the model demonstrated sufficient capability to warrant practical exploration.</p>
        <p>An additional consideration emerged regarding the dataset used during model evaluation. While
dataset design was not the primary focus of this phase, we observed a recurring bias in ChatGPT toward
large-scale or high-concept game ideas, which may not reflect the scope or constraints of smaller indie
studios. Many ideas shared recurring tropes—such as the presence of multiple coexisting dimensions in
space or time—which may reflect limitations of the generation prompt or the model’s training data.</p>
        <p>These biases, while not critical for the current phase, should be addressed in future
iterations—particularly in the context of the pilot study, where feedback will be applied to participants’ own
early-stage game concepts. This shift from synthetic to authentic data will allow for more targeted
evaluation of model utility in real-world design contexts.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. SPARC</title>
      <p>Following the selection of DeepSeek-R1, we developed a prototype tool named SPARC—System for
Prototyping And Refining Concepts—to support early-stage game design feedback in a practical setting.
The tool features a minimalistic user interface built using Streamlit (https://streamlit.io/) as the
frontend framework (see Figure 3). In this setup, DeepSeek-R1 was integrated directly using the
LangChain API (https://python.langchain.com/api_reference/), with a typical response time of
approximately 1–2 minutes per input on our baseline system (see Table 1). SPARC allows users to
upload a game concept as a plain text file to receive structured feedback directly on screen.
The tool was designed to simulate real-world conditions, where users—such as students, hobbyists, or
indie developers—may lack expertise in prompting or may be unfamiliar with relevant aspects like
those identified in section 2. The tool served as the central component in our pilot study, enabling us to
evaluate the model’s usefulness in a context that more closely resembles actual design workflows.</p>
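      <p>The following Python sketch illustrates what such a frontend can look like; it is a minimal reconstruction under our own assumptions (the endpoint address, parameter values, and file names are illustrative), not SPARC’s actual source code.</p>
      <preformat>
import streamlit as st
from langchain_huggingface import HuggingFaceEndpoint

# Assumption: a local Text Generation Inference server hosts the
# DeepSeek-R1 distill model, so no data leaves the machine.
llm = HuggingFaceEndpoint(
    endpoint_url="http://localhost:8080/",
    max_new_tokens=1500,
)

EVALUATION_PROMPT = open("evaluation_prompt.txt").read()  # Figure 2 text

st.title("SPARC")
uploaded = st.file_uploader("Upload your game concept", type="txt")
if uploaded is not None:
    game_text = uploaded.read().decode("utf-8")
    with st.spinner("Evaluating concept (roughly 1-2 minutes)..."):
        feedback = llm.invoke(f"{EVALUATION_PROMPT}\n\nGame text:\n{game_text}")
    st.markdown(feedback)
</preformat>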
      <sec id="sec-5-1">
        <title>5.1. Study Design and Procedure</title>
        <p>With the frontend in place, we conducted a pilot user study with N = 10 participants. The participants
were students enrolled in a narrative storytelling course jointly offered by the Technical University of
Munich (TUM) and the University of Television and Film Munich (HFF). The course was structured
around collaborative game development, with interdisciplinary teams of approximately four members
each. Students from TUM, enrolled in the Games Engineering program, primarily focused on game
programming, while HFF students came from film-related disciplines and were less directly involved in
games.</p>
      <p>The user study took place during the early phase of the course, when teams had just developed their
initial game concepts but had not yet started implementation. Participation was voluntary. In return
for their time, participants received formative feedback on their game ideas generated through SPARC,
which was relevant to their course project. In total, ten students across six teams took part.</p>
      <p>The procedure was as follows:
1. Participants reviewed and accepted an informed consent form.
2. SPARC was introduced, including a brief explanation of its purpose and user interface, with
emphasis on the fact that it was hosted locally.
3. Each team submitted its initial game concept as a text file. Because SPARC was hosted on a
private server at the time, we processed the files ourselves and demonstrated the results.
4. For each team, SPARC was run once using its submitted concept. All team members were present
during this process to ensure a shared understanding.
5. The resulting outputs were distributed to participants as text files for review.
6. Each participant then completed an individual online questionnaire, which included both
closed- and open-ended questions.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Participants</title>
        <p>All ten participants were students in the joint TUM–HFF course. Their ages ranged from 22 to 31 years
(M = 25.0, SD = 2.6). Reported gender was male for nine participants (n = 9), and one participant
chose not to disclose gender. Participants reported academic backgrounds in game design, narrative,
technical art, and computer science. Nine participants self-reported working in or studying game
development, while the remaining participant identified primarily as a player. This contextualizes their
perspective in relation to the feedback they provided.</p>
      </sec>
      <sec id="sec-5-1">
        <title>Frequency Frequency 6 4</title>
        <p>Maybe (80%)</p>
      </sec>
      <sec id="sec-5-2">
        <title>Rating Rating</title>
        <p>(a) How would you rate the quality of the response?
(b) How helpful do you think the response was?
5
3
3
4
1
5
5.3. Results
The results of the closed-ended questions are shown in Figure 4.</p>
        <p>In the open-ended responses, four students expressed a desire for more in-depth evaluations,
suggesting that the feedback could benefit from greater detail or elaboration. One participant specifically
noted that the response was "really good" and that the tool "gave some interesting perspectives /
recommendations." However, the same participant also observed a misinterpretation in the output,
where the model incorrectly identified certain elements—such as the art style—from the input. Three
students proposed an additional feature: the ability to focus on individual aspects of the game concept,
rather than receiving feedback on all ten at once. Another participant suggested that the tool be made
available to all students, highlighting its perceived value beyond the pilot setting. The remaining
comments were largely unrelated to the tool’s functionality—for example, non-substantive responses
such as "idk" or feedback on the naming scheme of output files.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Discussion</title>
        <p>As in previous sections, this discussion focuses specifically on the outcomes of the pilot study. Broader
implications are explored in section 6.</p>
        <p>The quantitative results presented in the accompanying figures are encouraging. Participants
generally rated the quality of the model’s feedback as above average, suggesting that medium-sized LLMs
like DeepSeek-R1 are capable of producing coherent and relevant responses that reflect a reasonable
understanding of game concepts. However, the helpfulness of the feedback was rated slightly lower
than its quality—while not negative, it indicates room for improvement in practical applicability.</p>
        <p>Notably, two participants (20%) indicated they would incorporate the model’s feedback into their
game concepts, a modest proportion but nevertheless an increase over a baseline of zero. Moreover, 80%
of participants expressed interest in using such a tool in the future, underscoring the potential value of
locally hosted systems that prioritize privacy and accessibility. This interest was further reflected in
participants’ willingness to tolerate longer response times, as well as in one explicit request to make
the tool available more broadly to all students.</p>
        <p>That said, limitations in functionality and output quality remain apparent. Some participants were
satisfied with the feedback, while others expressed a desire for more in-depth evaluations. A commonly
suggested improvement was the option to receive feedback on individual aspects rather than all ten at
once. While this feature could improve focus and perceived depth, it may also increase total runtime, as
it would require separate inference passes for each aspect.</p>
        <p>There are also potential limitations in the study design that may have introduced bias. For instance,
members of the same team received identical responses for their submission, influencing more than
one set of answers. Additionally, participants’ game concepts might have been at slightly different
stages of development and may have varied in their level of detail, which could have influenced how
specific or generic the model’s responses appeared. Since we did not formally evaluate or categorize the
participant-submitted ideas, we cannot control for this variable. However, based on our earlier human
evaluation, it is plausible that more complete submissions (those covering most or all of the ten aspects)
led to shorter or more generic feedback.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. General Discussion and Implications</title>
      <p>To summarize our findings, the use of medium-sized LLMs—specifically DeepSeek-R1—on
consumer-grade hardware shows considerable promise. At the upper end of the quality spectrum, the model
produced strong results in both the human evaluation and the pilot study. These outcomes demonstrate
the feasibility of leveraging locally hosted LLMs to support game designers without compromising
privacy or intellectual property concerns. This potential is further reflected in participants’ enthusiasm
for such tools, as evidenced by their willingness to use SPARC in future projects.</p>
      <p>However, the results also point to clear areas for improvement. Certain aspects were better understood
by the model than others, and the output continued to follow two distinct structural formats, which
may affect consistency and user expectations. A key opportunity for enhancing reliability lies in
prompt engineering. Targeting individual aspects through separate prompts—an idea also suggested by
participants—may improve both the specificity and depth of the feedback. Future studies could explore
this structured prompting strategy in more detail.</p>
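      <p>A minimal sketch of this aspect-specific strategy appears below; the function names and the llm client are hypothetical, and the approach trades one inference pass for ten shorter ones, as noted in section 5.4.</p>
      <preformat>
# Hypothetical sketch of aspect-specific prompting: one inference pass per
# aspect instead of a single monolithic evaluation. `llm` is any client with
# an invoke(str) -> str method (e.g., a LangChain LLM); `aspects` maps
# aspect names to their extended descriptions, as in section 2.
def evaluate_aspect(llm, game_text: str, aspect: str, description: str) -> str:
    prompt = (
        "You are an expert game development consultant. Evaluate only the "
        f"aspect '{aspect}' of the following game concept.\n"
        f"Aspect description: {description}\n\n"
        f"Game concept:\n{game_text}"
    )
    return llm.invoke(prompt)

def evaluate_all_aspects(llm, game_text: str, aspects: dict) -> dict:
    # Note: total runtime scales linearly with the number of aspects.
    return {name: evaluate_aspect(llm, game_text, name, desc)
            for name, desc in aspects.items()}
</preformat>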
      <p>
        Looking ahead, we envision an evolution of SPARC that leverages its current strength—categorizing
design feedback by aspect—but moves away from generating new ideas or full rewrites. Instead, the
system could help designers identify unclear or underdeveloped areas in their concepts. These areas
could then be paired with targeted, thought-provoking questions drawn from a curated catalog to guide
further development. This shift would support both conceptual clarity and creative iteration, aligning
well with established game design practices [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Such an approach would transform SPARC from a
general-purpose evaluator into a structured design support system, capable of helping designers not
only reflect on what is missing but also consider how to improve it.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Related Work</title>
      <p>
        The optimization of language models for consumer-grade hardware is an active area of research. Xu
et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] provide a comprehensive review of strategies for running LLMs on resource-constrained
systems across various domains. Their work highlights a range of applications—from text generation
for mobile messaging [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] to potential use in medical diagnostics [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. However, these applications
are often geared toward general-purpose users, whereas our study distinguishes itself by focusing on
supporting domain experts, specifically in game design.
      </p>
      <p>
        The emerging field of LLM-as-a-judge explores the use of language models to assess user input or
system performance [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ,
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In a games context, Tucek et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] propose the game prototype One Spell
Fits All, in which a player’s in-game decisions are judged by LLMs for creativity and appropriateness.
Their work also emphasizes running AI models locally, aligning with our interest in minimizing reliance
on cloud-based solutions. However, while both approaches involve evaluating human-generated content
with LLMs and prioritize local execution, our work differs in its focus: we aim to assist designers during
the conceptualization phase, rather than embed AI into the gameplay loop itself. Similarly, Hutson et
al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] use generative AI to support assessment and feedback in game design education, with the goal
of enhancing student engagement and learning outcomes. While our system also provides feedback on
game concepts, our focus is not on pedagogical evaluation, but rather on helping designers iteratively
refine early-stage ideas.
      </p>
      <p>More broadly, the use of LLMs to support game design processes has received increasing attention [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ,
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Begemann et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and Long et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] explore the use of generative AI during the early phases
of game development, emphasizing its utility in supporting creativity and concept generation. However,
their focus lies primarily in visual content creation—such as image or asset generation—whereas our
study focuses specifically on the textual structure and clarity of game concepts. Lee et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] investigate
AI-supported workflows for generating complete game design proposals, including concept art and
documentation, over a longitudinal study spanning four years. While we share a focus on the early
stages of game development, our work differs in both scope and scale: we concentrate specifically on
the written concept itself and explicitly emphasize deployment on consumer-grade hardware, making
our approach more accessible to indie developers and students.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Work</title>
      <p>In this study, we identified ten key aspects that contribute to a strong game concept and evaluated
three medium-sized language models—LLaMA 3.1, Qwen 2.5, and DeepSeek-R1—all of which can be run
on consumer-grade hardware. Through a structured human evaluation, we compared the outputs of
these models and selected DeepSeek-R1 for a more in-depth analysis based on its consistent formatting
and coverage of the ten aspects. Building on this, we developed SPARC (System for Prototyping and
Refining Concepts), a lightweight prototype tool that enables users to upload game concepts and
receive structured feedback. We then conducted a pilot study to assess SPARC’s practical effectiveness
in supporting early-stage game design. The results suggest that medium-sized LLMs are promising
tools for assisting designers during the conceptualization phase, offering a balance between usability,
performance, and local deployment. However, the current system still exhibits inconsistencies in output
quality, and the usefulness of the feedback varies depending on the input and aspect coverage.</p>
      <p>To address these limitations, future work will focus on improving prompt design and refining the
interaction model to support aspect-specific evaluations, allowing users to request focused feedback
on individual dimensions of their game concepts. Additionally, rather than offering direct
suggestions—which can vary in quality—we propose an alternative strategy: generating thought-provoking
questions targeting underdeveloped aspects. This approach aligns more closely with iterative design
practices and aims to better support designers in refining their ideas. By shifting from prescriptive
feedback to guided reflection, future iterations of SPARC can evolve into a more effective design support
system that empowers users to make meaningful improvements to their concepts.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>Thanks to the developers of ACM consolidated LaTeX styles https://github.com/borisveytsman/acmart
and to the developers of Elsevier updated LATEX templates https://www.ctan.org/tex-archive/macros/
latex/contrib/els-cas-templates.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors wrote a full draft of the paper and subsequently
used chatgpt.com with GPT-4o to improve the writing style and grammar. Further, the authors used
perplexity.ai to get an initial overview of related papers. After using these tools/services, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Kanode</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Haddad</surname>
          </string-name>
          ,
          <article-title>Software engineering challenges in game development</article-title>
          ,
          <source>in: 2009 Sixth International Conference on Information Technology: New Generations</source>
          , IEEE,
          <year>2009</year>
          , pp.
          <fpage>260</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Nazi</surname>
          </string-name>
          , W. Peng,
          <article-title>Large language models in healthcare and medical domain: A review</article-title>
          ,
          <source>in: Informatics</source>
          , volume
          <volume>11</volume>
          ,
          MDPI,
          <year>2024</year>
          , p.
          <fpage>57</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rechardt</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Nejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yáñez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Borghesani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pashkov</surname>
          </string-name>
          , et al.,
          <article-title>Large language models surpass human experts in predicting neuroscience results</article-title>
          ,
          <source>Nature human behaviour 9</source>
          (
          <year>2025</year>
          )
          <fpage>305</fpage>
          -
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Lanzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Loiacono</surname>
          </string-name>
          ,
          <article-title>Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design</article-title>
          ,
          <source>in: Proceedings of the Genetic and Evolutionary Computation Conference</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1383</fpage>
          -
          <lpage>1390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Tekinbas</surname>
          </string-name>
          , E. Zimmerman,
          <article-title>Rules of play: Game design fundamentals</article-title>
          , MIT press,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schell</surname>
          </string-name>
          ,
          <article-title>The Art of Game Design: A book of lenses</article-title>
          , CRC press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Galuzin</surname>
          </string-name>
          ,
          <article-title>Preproduction Blueprint: How to Plan Game Environments and Level Designs</article-title>
          ,
          <source>CreateSpace Independent Publishing Platform</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Totten</surname>
          </string-name>
          ,
          <article-title>Level design: Processes and experiences</article-title>
          , CRC Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fullerton</surname>
          </string-name>
          , Game design workshop:
          <article-title>a playcentric approach to creating innovative games</article-title>
          , AK Peters/CrC Press,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Level design book,
          <year>2020</year>
          . URL: https://www.leveldesignbook.com/, accessed: 2025-06-30.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>Meta AI</string-name>
          ,
          <source>Meta Llama 3.1: Advancing open foundation models</source>
          ,
          <year>2025</year>
          . URL: https://ai.meta.com/blog/meta-llama-3-1/, accessed: 2025-06-30.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>Qwen Team</string-name>
          ,
          <source>Qwen2 technical report, arXiv preprint arXiv:2407.10671</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Song,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , S. Ma,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          , et al.,
          <article-title>Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Goucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ostrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Welihinda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          , et al.,
          <article-title>Gpt-4o system card</article-title>
          ,
          <source>arXiv preprint arXiv:2410.21276</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <article-title>On-device language models: A comprehensive review</article-title>
          ,
          <source>arXiv preprint arXiv:2409.00088</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>Android Developers</string-name>
          , Gemini nano | android developers,
          <year>2024</year>
          . URL: https://developer.android.com/ai/gemini-nano#gboard-smart, accessed: 2025-06-30.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Labrak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bazoge</surname>
          </string-name>
          , E. Morin,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Gourraud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <article-title>Biomistral: A collection of open-source pretrained large language models for medical domains</article-title>
          ,
          <source>arXiv preprint arXiv:2402.10373</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          , H. Liu, et al.,
          <article-title>A survey on llm-as-a-judge</article-title>
          ,
          <source>arXiv preprint arXiv:2411.15594</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, et al., From generation to judgment: Opportunities and challenges of llm-as-a-judge, arXiv preprint arXiv:2411.16594 (2024).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] T. Tucek, K. Harshina, G. Samaritaki, D. Rajesh, One spell fits all: A generative ai game as a tool for research in ai creativity and sustainable design (2024).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Hutson, B. Fulcher, J. Ratican, Enhancing assessment and feedback in game design programs: Leveraging generative ai for efficient and meaningful evaluation, International Journal of Educational Research and Innovation (2024).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] R. Gallotta, G. Todd, M. Zammit, S. Earle, A. Liapis, J. Togelius, G. N. Yannakakis, Large language models and games: A survey and roadmap, IEEE Transactions on Games (2024).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. Sweetser, Large language models and video games: A preliminary scoping review, in: Proceedings of the 6th ACM Conference on Conversational User Interfaces, 2024, pp. 1-8.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Begemann, J. Hutson, Empirical insights into ai-assisted game development: A case study on the integration of generative ai tools in creative pipelines, Metaverse 5 (2024).</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] L. Long, C. Xinyi, W. Ruoyu, L. Toby Jia-Jun, L. Ray, Sketchar: Supporting character design and illustration prototyping using generative ai, Proceedings of the ACM on Human-Computer Interaction 8 (2024) 337.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] J. Lee, S.-Y. Eom, J. Lee, Empowering game designers with generative ai, IADIS International Journal on Computer Science &amp; Information Systems 18 (2023) 213-230.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>