<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Bologna, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Reinforcement Learning to Develop Agents for a Fighting Video Game</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Roberto Matera</string-name>
          <email>m.matera51@studenti.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>pierpaolo.basile@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bari Aldo Moro</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Video game</institution>
          ,
          <addr-line>Artificial Intelligence, Genetic Algorithm, Reinforcement Learning, Imitation Learning</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>2</volume>
      <fpage>5</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>This work focuses on the development of agents for fighting video games, presenting three distinct approaches. The first agent implements a hybrid strategy, structured hierarchically by applying the Genetic Algorithm and the Monte Carlo Tree Search. The second and third agents are based on Linear Q-Learning, but differ in their learning strategies: the second agent requires a training phase, whereas the third one learns online. Regarding the second agent, we investigate two training strategies: one based on Self-play and another based on a Genetic Algorithm, which evolves a population of Reinforcement Learning agents. The third agent is a customized variant of QDagger, a Policy-to-Value Reinforcement Learning method, which uses Monte Carlo Tree Search as its teacher policy. Our main interest is to propose alternative approaches to traditional AI enemy design, and to investigate how such methods are perceived by players. To this end, we conducted a user test in which participants played against the developed agents and evaluated their experience through validated questionnaires. Results reveal a generally positive outcome, with the third agent emerging as the most promising in terms of player engagement.</p>
      </abstract>
      <kwd-group>
        <kwd>Video game</kwd>
        <kwd>Artificial Intelligence</kwd>
        <kwd>Genetic Algorithm</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Imitation Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Fighting video games represent a domain in which racing against time, dealing with unpredictable
opponents, and adapting to new strategies are necessary for delivering a rewarding and engaging
gaming experience. In traditional game design, AI enemies have relied on rule-based systems, finite-state
machines, and predefined scripts [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Although effective in creating predictable and manageable
behaviours, this approach inherently limits the adaptability and complexity of AI opponents. As a result,
enemy actions are often repetitive and easily anticipated by players. Additionally, the static nature of
rule-based AI tends to produce imbalanced difficulty levels, with enemies appearing either trivially easy
or frustratingly challenging, thereby worsening the enjoyment of the gaming experience. These
limitations motivate the exploration of learning-based alternatives that can generate richer and more
varied opponent behaviours. In this work, we investigate how different approaches to agent design
affect the player’s gameplay experience. Hence, we develop three agents using Genetic Algorithms
(GA) [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ] and Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], each offering distinct mechanisms for
decision-making. Accordingly, this work investigates whether learning-based approaches constitute viable
alternatives to traditional AI enemy design in fighting video games and whether they generate a
rewarding and engaging gameplay experience for players. To address these research questions, we
assess the effectiveness of these approaches through user evaluations.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Research on agent development for fighting video games often relies on DareFightingICE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as the
experimental environment. DareFightingICE is a fighting game platform developed by the Intelligent
Computer Entertainment Laboratory at Ritsumeikan University in Japan. This Java-based platform
serves as both a research tool and an environment for competitions at several conferences, including
the IEEE Conference on Computational Intelligence in Games (IEEE CIG). Agents can access data such
as position, velocity, health points, and action states, with a response time of 16.66 ms per frame.
The prevailing trend in agent development is the adoption of the Monte Carlo Tree Search (MCTS)
algorithm [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. The use of MCTS for stochastic simulations to evaluate in-game decision making
was first demonstrated in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. MCTS is valued for its real-time decision-making capabilities, ease of
implementation, and consistent performance in competitive scenarios. It was also observed that
simulation accuracy improves when an opponent-modelling mechanism is incorporated. Building on this
insight, [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced an action-prediction module, in which the agent maintains and continuously
updates an action table reflecting the opponent’s play patterns, thereby achieving superior performance.
Nevertheless, a notable limitation affecting both approaches [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] is the strong dependence on the
initial states: in both cases, simulations are initialized with five random actions, which may result in
suboptimal action selection that fails to account for strategic game dynamics. To address this limitation,
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposed a hierarchical architecture combining GA with MCTS. While this hybrid approach reduces
response times and eliminates the need for domain-specific knowledge, it introduces computational
overhead and necessitates careful design of the fitness function. Despite these limitations, the method
adopted for implementing our first agent follows this hybrid approach.
      </p>
      <p>
        In the application of Reinforcement Learning (RL) to fighting agent development, there has been a shift
towards deep neural architectures. For example, [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] introduced a deep RL framework incorporating a
hybrid reward architecture (HRA), in which the overall reward function is decomposed into multiple
components and a separate value function is learned for each one; experimental results demonstrate
that HRA-based models outperform their non-HRA counterparts. Although this architecture exceeds
the complexity our experimental setup can accommodate, its reward decomposition scheme could still
help design simpler models, such as those employed for our second and third agents. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], a training
method combining Self-play and MCTS for deep RL agents is proposed. Given the promising results
reported, this methodology will be adopted for training our second agent. Self-play, in particular, offers
significant advantages, including the elimination of the need for external data and the ability to enable
continuous adaptation. However, it also presents challenges, including convergence to suboptimal
strategies and an imbalance between exploitation and exploration, often favouring the former.
While these approaches have demonstrated strong performance, they typically do not fully account for
the player’s gameplay experience. In this regard, [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] explores a method to implement an agent based
on a variant of MCTS designed to follow specific fighting styles. This study aimed to enhance the
experience for both players and spectators by generating personalized gameplay tailored to individual viewer
preferences. However, experimental results revealed that the agents struggled to accurately replicate
certain fighting styles, indicating the need for further improvements in the evaluation functions used to
control agent behavior. In [
        <xref ref-type="bibr" rid="ref19">19</xref>
         ], a Dynamic Difficulty Adjustment (DDA) system is introduced using
two machine learning agents. The first agent learns the player’s behavior through Imitation Learning
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], while the second one is trained via Reinforcement Learning to defeat the first. This combination
enables the generation of a personalized level of challenge. Although the study was conducted with a
small number of participants, it provided a useful reference that led us to evaluate Imitation Learning
as a design component for the third agent. Specifically, we selected DAgger (Dataset Aggregation)
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], as this algorithm leverages no-regret online learning principles to enable progressive strategy
optimization in response to opponent moves, delivering continuous performance improvement while
adapting to unpredictable adversarial behaviors.
      </p>
      <p>
        In addition to overlooking users’ gameplay experience, the aforementioned RL methods rely on
computationally expensive models, both in terms of training and inference. Given the simplicity of a 2D
fighting video game, such complexity seems unjustified outside of purely academic interest. Therefore,
we chose to adopt the Linear Q-Learning algorithm [
        <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
        ], which offers a favourable trade-off
between computational efficiency and generalization.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section reports details about all the methodologies exploited by our work for developing intelligent
agents in fighting video games.</p>
      <sec id="sec-3-1">
        <title>3.1. GAMCTS Agent: A Hierarchical Integration of Genetic Algorithm and MCTS</title>
        <p>
          The primary challenge in developing agents for fighting video games lies in ensuring response times
within 16.6 milliseconds to match the game’s frame rate, while efficiently and exhaustively exploring
the state space to learn effective strategies. A clear understanding of our approach requires a description
of the game environment’s action space. The environment defines a total of 56 possible actions, 40 of
which are actively selectable by the agent. These actions are partitioned into three categories based on
their execution context: 15 air actions, 25 ground actions, and a remaining set of special moves. Since
agents must predict both their own actions and opponent responses, the total number of possible action
pairs amounts to 40 × 40 = 1600 combinations. Given the 16ms time constraint, exhaustive simulation of
all combinations is computationally infeasible. For this reason, previous approaches [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ] attempted
to improve MCTS efficiency by assuming that selecting only 5 actions at random would suffice. Since
this small subset of actions has a significant impact on determining the actual action to be executed, we
now ask what would happen if, instead, more promising actions were considered from the beginning.
To investigate this, we adopt a hierarchical approach to action selection [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. At each decision point, the
process works as follows: we first frame the problem as an optimization task, where the GA identifies
the five most promising actions according to a given evaluation criterion; then, these actions are passed
to the MCTS to determine the actual action to be executed. This hierarchical structure ensures the GA
provides a good starting point for MCTS exploration at every step.
In Figure 1 we summarize the GAMCTS approach. As shown in Table 1, the selected configuration for
MCTS bounds the UCT budget, balancing performance and responsiveness. The UCB1 exploration
constant encourages sufficient exploration within the limited UCT iterations, while the node expansion
threshold prevents excessive tree growth. The hyperparameter configuration for the GA presented in
Table 2 reflects a strategy that prioritizes exploitation over exploration, considering the implications of
the Schema Theorem [
          <xref ref-type="bibr" rid="ref25 ref4">4, 25</xref>
          ]. Table 1 also reports the maximum UCT tree depth and the simulation duration (in frames). The reduced population size and limited number of generations represent
a necessary computational trade-off to ensure algorithm completion within the prescribed temporal
frames, though this constrains genetic diversity. The extremely low mutation probability and high
tournament selection probability heavily favor exploitation by minimizing random perturbations and
systematically selecting higher-fitness individuals. The primary exploratory component resides in the
modified two-point crossover configuration, which prevents gene duplication within chromosomes.
Despite introducing computational overhead, this constraint ensures that generated action sequences
maintain the diversity necessary for effective MCTS simulation. The non-duplicated chromosome
structure, which encodes actions of the same type, ensures consistency with the prediction performed
by MCTS, by restricting candidate actions to those compatible with the predicted action context.
Furthermore, we used tournament selection [
            <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
            ] due to its ease of implementation, noise resistance,
and direct control over the selection pressure. To evaluate the quality of candidate solutions, the
GA employs a specifically designed fitness function, which is computed from a short-term forward
simulation of the game state based on the action sequences encoded in each chromosome. Formally,
the function is defined as:
          </p>
          <p>
            Fitness(c) = 2 ⋅ ΔHP(c) − 1.5 ⋅ D(c) + 10 ⋅ H(c)   (1)
            where c denotes a chromosome, and:
          </p>
          <p>
            • ΔHP(c) is the HP difference between the agent and the opponent;
            • D(c) is a penalty based on the distance between the agent and the opponent;
            • H(c) is the counter of successful hits landed by the agent.
          </p>
          <p>
            This fitness function characterizes the agent with an offensive gameplay strategy, where chromosomes
exhibiting higher fitness values correspond to strategies that favor frequent close-range attacks. The
weight coefficients associated with each component were determined through empirical evaluation to
achieve the desired strategic emphasis. Furthermore, being linear in its formulation, this function is
well-suited for the maximization problem at hand, which requires a concave function formulation [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ].
          </p>
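          <p>
            As an illustration of Equation 1, the following Java sketch shows how the fitness of a chromosome could be computed from the outcome of the short-term forward simulation; the SimulationResult container and its field names are hypothetical and do not necessarily reflect the actual implementation.
          </p>
          <preformat><![CDATA[
/**
 * Illustrative sketch of the GA fitness in Equation 1.
 * SimulationResult is a hypothetical container for the outcome of the
 * short-term forward simulation of the action sequence encoded in a chromosome.
 */
final class SimulationResult {
    double agentHp;      // agent HP at the end of the simulation
    double opponentHp;   // opponent HP at the end of the simulation
    double distance;     // distance between the two characters
    int hitsLanded;      // successful hits landed by the agent
}

final class FitnessEvaluator {
    // Weights taken from Equation 1.
    private static final double W_HP = 2.0;
    private static final double W_DISTANCE = 1.5;
    private static final double W_HITS = 10.0;

    /** Fitness(c) = 2 * deltaHP(c) - 1.5 * D(c) + 10 * H(c). */
    double fitness(SimulationResult r) {
        double deltaHp = r.agentHp - r.opponentHp;  // HP difference
        double distancePenalty = r.distance;        // distance-based penalty
        return W_HP * deltaHp - W_DISTANCE * distancePenalty + W_HITS * r.hitsLanded;
    }
}
]]></preformat>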
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Reinforcement Learning Agents</title>
        <p>3.2.1. Environment Description
To introduce RL agents, it is necessary to define the environment in which they operate and the
reward mechanism that guides their learning process. The environment is defined by its observation space and action space. The observation space, schematically represented in Table 3 for both the agent and the opponent, includes features such as Energy, Position X, Position Y, Speed X, Speed Y, State, and Action, together with the information associated with the player and that related to the launched projectiles. Therefore, the observation space size is 144,
as the table represents the information for each entity. Since this size affects the agent’s performance
and several features are either uninformative or deducible, we applied a feature selection mechanism
to reduce the dimensionality of the observation space. The features selected for removal comprise 18 elements
distributed as follows, with the number of instances indicated in parentheses: air recovery actions for
the agent (2), recovery and position-deducible actions for the opponent (7), transition actions to DOWN
state for both entities (2), STAND state for both entities (2), AIR state for the agent only (1), Hit Area Y
coordinates for agent projectiles and opponent’s second projectile (3), and Hit Damage information for
the agent (1). In this fashion, we have reduced the size of the observation space from 144 to 126. As for
the action space, the same considerations apply as we discussed in Section 3.1.</p>
          <p>In addition, the environment provides the agent with a reward, which we have defined as:
R = (ΔHPopp − ΔHPmy) / N + B
(2)
where:
• ΔHPopp = HPopp^prev − HPopp^now is the decrease in the opponent’s health points.
• ΔHPmy = HPmy^prev − HPmy^now is the decrease in the agent’s health points.
• B is a bonus: B = +0.01 if the distance to the opponent has decreased, −0.01 if it has increased.
• The constant N = 10 is used for normalization.</p>
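          <p>
            The following Java sketch illustrates how the reward in Equation 2 could be computed from two consecutive observations; the method signature and variable names are illustrative assumptions, while the constants come from the definition above.
          </p>
          <preformat><![CDATA[
/**
 * Illustrative computation of the reward in Equation 2.
 * Parameter names are assumptions for this sketch; the formula, the
 * normalization constant N = 10, and the +/-0.01 bonus follow the paper.
 */
final class RewardFunction {
    private static final double N = 10.0;      // normalization constant
    private static final double BONUS = 0.01;  // proximity bonus/penalty

    double reward(double myHpPrev, double myHpNow,
                  double oppHpPrev, double oppHpNow,
                  double distPrev, double distNow) {
        double deltaHpOpp = oppHpPrev - oppHpNow;  // damage dealt to the opponent
        double deltaHpMy  = myHpPrev - myHpNow;    // damage suffered by the agent
        // Unchanged distance is treated as "not decreased" here (an assumption).
        double bonus = distNow < distPrev ? BONUS : -BONUS;
        return (deltaHpOpp - deltaHpMy) / N + bonus;
    }
}
]]></preformat>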
          <p>
            This function is inspired by the reward defined in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. ΔHPopp represents the reward component that
favors the offensive strategy, while ΔHPmy accounts for the defensive strategy. In [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], both components
are added up to reward both strategies. In our approach, however, we subtract the defensive component
from the offensive one rather than adding them, creating a reward that evaluates the net outcome of combat
exchanges. This formulation rewards the agent when its offensive gains exceed its defensive losses,
thus encouraging agents to seek favorable combat trade-offs. To prevent excessive reward values, we
normalize the difference in the numerator using a constant, denoted as N, determined through empirical
testing. Additionally, to encourage proactive engagement with the opponent, we include a bonus factor
B that rewards movement toward the opponent. This bonus component provides the primary incentive
for offensive positioning and active combat engagement.
3.2.2. Base Model: Q-Learning with Linear Function Approximation
The Reinforcement Learning problem is modeled as a Markov Decision Process (MDP) defined by
the tuple (S, A, P, ℛ, γ) [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ], where S is the state space, A the action space, P(s′|s, a) the transition
probabilities, ℛ(s, a) the reward function, and γ ∈ [0, 1) the discount factor. In our environment,
the state space S ⊆ ℝ^126 is a 126-dimensional continuous observation space, while the action space
A = {1, …, 56} consists of 56 discrete actions and the reward function is defined in Equation 2.
          </p>
          <p>
            The objective in this MDP framework is to learn the optimal action-value function Q*(s, a), which
represents the expected cumulative discounted reward obtained by taking action a in state s and
subsequently following the optimal policy. To handle the large dimensionality of the state-action space,
the RL agents employed in this study adopt Q-Learning with Linear Function Approximation as their
base model [
            <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
            ]. This approach enables generalization across similar states and actions by
representing the Q-function as Q(s, a; w) = w⊤φ(s, a), where w ∈ ℝ^d is a weight vector and φ(s, a) ∈ ℝ^d is
a sparse feature vector encoding state-action pairs, with d = 126 × 56 = 7056. The weight vector w is
updated using the standard Q-Learning update rule adapted for linear function approximation [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. An
ε-greedy policy is adopted for action selection [
            <xref ref-type="bibr" rid="ref30 ref7">30, 7</xref>
            ].
          </p>
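          <p>
            A minimal Java sketch of this base model is reported below. It assumes block-sparse features in which the 126-dimensional observation is placed in the block associated with the chosen action (consistent with d = 126 × 56 = 7056); the discount factor follows the value reported in Table 4, while the remaining hyperparameter values are placeholders rather than the ones used in our experiments.
          </p>
          <preformat><![CDATA[
import java.util.Random;

/** Sketch of Q-Learning with linear function approximation and an epsilon-greedy policy. */
final class LinearQAgent {
    private static final int OBS_DIM = 126;
    private static final int NUM_ACTIONS = 56;

    private final double[] w = new double[OBS_DIM * NUM_ACTIONS]; // weight vector, d = 7056
    private final Random rng = new Random();
    private double alpha = 0.01;   // learning rate (placeholder)
    private double gamma = 0.95;   // discount factor (Table 4)
    private double epsilon = 0.1;  // exploration rate (placeholder)

    /** Q(s, a; w) = w^T phi(s, a), with phi block-sparse in the action index. */
    double q(double[] obs, int action) {
        double sum = 0.0;
        int offset = action * OBS_DIM;
        for (int i = 0; i < OBS_DIM; i++) sum += w[offset + i] * obs[i];
        return sum;
    }

    /** Epsilon-greedy action selection. */
    int selectAction(double[] obs) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(NUM_ACTIONS);
        int best = 0;
        for (int a = 1; a < NUM_ACTIONS; a++) if (q(obs, a) > q(obs, best)) best = a;
        return best;
    }

    /** Standard Q-Learning update adapted to the linear approximation. */
    void update(double[] obs, int action, double reward, double[] nextObs, boolean terminal) {
        double maxNext = 0.0;
        if (!terminal) {
            maxNext = q(nextObs, 0);
            for (int a = 1; a < NUM_ACTIONS; a++) maxNext = Math.max(maxNext, q(nextObs, a));
        }
        double tdError = reward + gamma * maxNext - q(obs, action);
        int offset = action * OBS_DIM;
        for (int i = 0; i < OBS_DIM; i++) w[offset + i] += alpha * tdError * obs[i];
    }
}
]]></preformat>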
          <p>Table 4 reports the parameters of the Linear Q-Learning base model: the discount factor (0.95), the learning rate, and the exploration rate.</p>
          <p>
3.2.3. Two Stage Training with MCTS and Self-play
In the simplest case, the agent could be trained against an opponent that performs random actions.
However, this quickly becomes ineffective due to the opponent’s predictable behavior, as its actions are
discrete and independent. To address this limitation, we employ an MCTS-based opponent that provides
strategic behavior through systematic exploration of the state space. While this approach doesn’t
eliminate the risk of convergence to a local optimum, it offers a more robust foundation for an initial
training stage than the previous method. To further improve the agent, our methodology combines
MCTS and Self-play [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] training in a structured two-stage approach, drawing from the framework
presented in [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. This combination leverages the complementary strengths of both methods: MCTS
provides diverse strategic exposure while Self-play identifies and addresses strategic weaknesses.
          </p>
          <p>During the first stage, the agent is trained exclusively against an MCTS-based opponent over 500
episodes to establish a baseline level of competency. The MCTS algorithm’s broad exploration through
simulations exposes the agent to diverse scenarios, enabling effective action filtering through move
masking and the development of an accurate value function. After 300 episodes, the learning rate 
decreases from 0.03 to 0.01 to facilitate initial rapid learning followed by strategic refinement.
In the second stage, which spans an additional 2,000 episodes, the training alternates between the
MCTS opponent and Self-play copies in a 1:3 ratio (MCTS:Self-play). This ratio ensures generalization
through MCTS exposure while allowing Self-play to identify and eliminate strategic vulnerabilities.</p>
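          <p>
            The Java sketch below shows one way the two training stages and the 1:3 MCTS:Self-play ratio could be mapped onto an episode counter; the exact interleaving pattern and the helper names are assumptions for illustration, whereas the episode counts and learning-rate values follow the description above.
          </p>
          <preformat><![CDATA[
/**
 * Sketch of the two-stage training schedule: 500 episodes against the MCTS
 * opponent, then 2,000 episodes alternating MCTS and Self-play in a 1:3 ratio.
 * The concrete interleaving shown here is an assumption.
 */
final class TwoStageSchedule {
    enum Opponent { MCTS, SELF_PLAY }

    /** Learning rate: 0.03 for the first 300 episodes, then 0.01. */
    static double learningRate(int episode) {
        return episode < 300 ? 0.03 : 0.01;
    }

    /**
     * Stage one (episodes 0-499): always the MCTS opponent.
     * Stage two (episodes 500-2499): one MCTS match every four episodes (1:3 ratio).
     */
    static Opponent opponentFor(int episode) {
        if (episode < 500) return Opponent.MCTS;
        return (episode - 500) % 4 == 0 ? Opponent.MCTS : Opponent.SELF_PLAY;
    }
}
]]></preformat>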
          <p>
            Unlike the referenced approach, we expand the agent pool every 500 episodes regardless of win rate,
maintaining the fixed ratio due to hardware constraints that prevent parallel environment execution.
Each training episode corresponds to a complete match against the designated opponent type. The
total estimated training time is approximately 42 hours. To encourage goal-directed behavior, the agent
receives win/loss bonuses of ± 10, supplementing the standard reward function and addressing aspects
not explicitly captured by the environment’s reward architecture. Self-play agents in the Agent Pool
are updated only when the training agent’s average reward over a sliding window of five episodes
exceeds the previous best by at least 5%. A key limitation of our approach is the risk of convergence to
suboptimal solutions, due to the generalization introduced by the linear approximation of Q(s, a) [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ].
Unlike in tabular Q-Learning, where state values are independent and local optima are also global, the
linear function introduces dependencies between states, so improving performance in one may worsen
it in others. Its performance improves when there is a way to escape such local optima [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ].
3.2.4. An Evolutionary Approach to Training RL Agents
A common practice for escaping local optima involves stochastic search algorithms based on population
methods, where multiple agents are trained in parallel and the best performer is selected based on
cumulative reward. This leads to the adoption of Evolutionary Reinforcement Learning (EvoRL), which
integrates Evolutionary Computation with RL [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ].
          </p>
          <p>We illustrate a simple and general framework of EvoRL in Figure 3.</p>
          <p>
            It consists of two loops: the outer loop governs the EC process, while the inner loop represents
agent–environment interactions in RL. Initially, a population of candidate solutions is randomly
generated. Offspring candidate solutions are then generated from parents via variation. Each offspring is
evaluated by performing an RL task to obtain its fitness value. A new population is selected for the next
iteration by replacing the entire current population with the offspring generated through recombination
and mutation. While this is a basic example, EvoRL encompasses several research areas. Among them,
Policy Search aims to find policies that maximize cumulative reward. One technique that can be adopted
within this context is Neuroevolution [
            <xref ref-type="bibr" rid="ref35">35</xref>
            ], which evolves neural network weights and architectures
without relying on gradients. Research, such as [
            <xref ref-type="bibr" rid="ref36">36</xref>
            ], demonstrates this integration, utilizing GA to
evolve a population of neural networks, each represented by its weights.
          </p>
          <p>Inspired by this approach, we developed an alternative training methodology for our agent. By evolving
a population of RL agents, our goal is to discover the optimal weights for the linear approximation
of Q(s, a). In essence, we reformulate the training process as an optimization problem, where the
objective is to find a weight configuration that maximizes the cumulative reward. We adopt a GA where
each chromosome is encoded by the weights of a corresponding RL agent. Moreover, this approach
removes the need for an initial training phase to generate a baseline experience for the agent, as we
start from a population of agents, each initialized with a random configuration of weights. Due to
limited computational resources, the approach was not parallelized, and each agent in the population is
trained sequentially for a fixed number of episodes. Agents that achieve higher fitness, measured as the
cumulative reward over their available episodes, are considered better candidates and are therefore more
likely to survive and propagate their parameter configurations to the next generation. This evolutionary
mechanism enables the exploration of multiple weight configurations, allowing the algorithm to escape
suboptimal solutions.</p>
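          <p>
            The following Java sketch outlines the GARL idea: a GA evolves the weight vectors of Linear Q-Learning agents, with fitness measured as the cumulative reward collected over a fixed number of episodes. Tournament selection with size 3 follows the configuration discussed below; the crossover and mutation operators shown here are illustrative assumptions.
          </p>
          <preformat><![CDATA[
import java.util.Random;

/** Sketch of a GA evolving the weight vectors of Linear Q-Learning agents (GARL). */
final class GarlSketch {
    /** Runs the RL episodes for an agent with the given weights and returns its cumulative reward. */
    interface FitnessFn { double evaluate(double[] weights); }

    private final Random rng = new Random();

    double[] tournamentSelect(double[][] population, double[] fitness, int tournamentSize) {
        int best = rng.nextInt(population.length);
        for (int i = 1; i < tournamentSize; i++) {
            int challenger = rng.nextInt(population.length);
            if (fitness[challenger] > fitness[best]) best = challenger;
        }
        return population[best].clone();
    }

    double[] crossoverAndMutate(double[] parentA, double[] parentB, double mutationProb) {
        double[] child = new double[parentA.length];
        int cut = rng.nextInt(parentA.length);        // single-point crossover (assumption)
        for (int i = 0; i < child.length; i++) {
            child[i] = i < cut ? parentA[i] : parentB[i];
            if (rng.nextDouble() < mutationProb) child[i] += rng.nextGaussian() * 0.01;
        }
        return child;
    }

    /** Evolves the population and returns the best weight vector found. */
    double[] evolve(double[][] population, FitnessFn fn, int generations, double mutationProb) {
        double[] best = null;
        double bestFitness = Double.NEGATIVE_INFINITY;
        for (int g = 0; g < generations; g++) {
            double[] fitness = new double[population.length];
            for (int i = 0; i < population.length; i++) {
                fitness[i] = fn.evaluate(population[i]);
                if (fitness[i] > bestFitness) { bestFitness = fitness[i]; best = population[i].clone(); }
            }
            double[][] next = new double[population.length][];
            for (int i = 0; i < population.length; i++) {
                double[] a = tournamentSelect(population, fitness, 3);
                double[] b = tournamentSelect(population, fitness, 3);
                next[i] = crossoverAndMutate(a, b, mutationProb);
            }
            population = next;
        }
        return best;
    }
}
]]></preformat>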
          <p>Table 5 lists the GARL parameters: the individual encoding, selection method, crossover method, population size, tournament size, mutation probability, episodes per individual, episodes per generation, and number of generations.</p>
          <p>
            In Table 5, we report the parameters used for the GA employed in training, referred to as GARL. The
parameters were selected empirically. A tournament size of 3 promotes a moderate level of elitism,
avoiding excessive selection pressure. The number of episodes per individual was set to balance
learning speed and computational cost, limiting the total number of episodes per generation. The
training required 500 episodes in total, with an estimated time of approximately 8 hours. While
the implementation could benefit from parallelization, this training approach proved more efficient,
requiring only about 19% of the time needed by the previous method. Since the agent trained with
this methodology outperformed the one trained with the previously discussed approach in preliminary
tests, we selected it as our second agent for user evaluation and will refer to it as RLAgent.
3.2.5. Cutting Training Time with QDagger
Since training an RL agent from scratch is computationally expensive, we aim to find a method that
can accelerate this process. To address this, we adopt Policy-to-Value Reinforcement Learning (PVRL),
which transfers a suboptimal policy to a value-based agent, enabling efficient training regardless of
the original model. A proper PVRL algorithm must satisfy three key desiderata: teacher-agnosticism,
ensuring that the student does not depend on the teacher’s architecture or learning algorithm; weaning
support, involving a progressive reduction of the student’s reliance on the teacher policy as training
proceeds; and computational efficiency, meaning that this method must incur lower cost than training
from scratch. Hence, our third agent is inspired by the QDagger algorithm [
            <xref ref-type="bibr" rid="ref37">37</xref>
            ], which combines DAgger
[
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] with n-step Q-learning. Building on this approach, we propose a modified variant tailored to our
specific setting with linear function approximation. The proposed approach differs from [
            <xref ref-type="bibr" rid="ref37">37</xref>
            ] in two key
aspects: it employs Linear Q-Learning and uses a mean squared error loss (MSE) between the teacher
and student Q-value vectors for policy distillation [
            <xref ref-type="bibr" rid="ref38">38</xref>
            ]. Furthermore, given the absence of prior data in
our setting, we operate exclusively in online mode, bypassing the first training phase outlined in [
            <xref ref-type="bibr" rid="ref37">37</xref>
            ].
The loss function combines temporal difference learning with policy distillation:
ℒ_QDagger(D; w) = ℒ_TD(D; w) + λ_t ⋅ ℒ_MSE(D_Expert; w)
(3)
where the first term is the TD loss and the second the distillation loss, D represents a generic replay buffer, w are the student’s weights, and λ_t is the distillation coefficient at time step t.
          </p>
          <p>
            To collect expert data we employ a buffer defined as D_Expert = {(s_i, q_i)}, i = 1, …, B, where s_i is the observed state, q_i is the Q-value vector estimated by the expert policy for state s_i, and B = 3 is the buffer size, with FIFO replacement upon reaching capacity. At each update iteration, the agent applies a standard Q-learning update, inserts the current tuple into the buffer, and then updates the weights by minimizing the MSE loss over all buffer entries, while ensuring teacher-agnosticism and computational efficiency. To provide support for weaning, we introduce the distillation coefficient λ_t, which allows QDagger to deviate from the teacher’s suboptimal policy, rather than converging toward it. We selected MCTS as our teacher policy because it is applicable to MDPs [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. The absence of prior domain knowledge precluded training a classifier-based policy, while MCTS offers several advantages within fighting game environments that make it particularly suitable as a heuristic policy. As suggested in [
            <xref ref-type="bibr" rid="ref37">37</xref>
            ], the distillation coefficient can be decayed linearly over time to allow for a smooth transition from reliance on the expert policy to student autonomy. In our case, we instead opted for exponential decay to accelerate this transition, updating it at each time step t as λ_{t+1} = max(λ_t ⋅ λ_decay, 10⁻³) with λ_decay = 0.99. This choice is motivated by both the teacher’s reliability and the desire to minimize the risk of overfitting to it. To preserve a minimal influence from the teacher, we set a lower bound of 10⁻³, ensuring that λ_t does not fall below this threshold.
          </p>
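          <p>
            A compact Java sketch of this modified QDagger update is reported below; the StudentQ interface and the initial value of the distillation coefficient are assumptions, while the buffer size, the FIFO replacement, the combination of TD and MSE distillation steps, and the exponential decay follow the description above.
          </p>
          <preformat><![CDATA[
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of the modified QDagger update with an expert buffer of size 3. */
final class QDaggerSketch {
    /** Hypothetical interface exposed by the linear Q-Learning student. */
    interface StudentQ {
        void tdUpdate(double[] obs, int action, double reward, double[] nextObs, boolean terminal);
        void distillStep(double[] obs, double[] teacherQ, double lambda); // gradient step on the MSE loss
    }

    static final class ExpertSample {
        final double[] obs;
        final double[] teacherQ;   // Q-value vector estimated by the teacher (MCTS) for obs
        ExpertSample(double[] obs, double[] teacherQ) { this.obs = obs; this.teacherQ = teacherQ; }
    }

    private final Deque<ExpertSample> buffer = new ArrayDeque<>();
    private static final int BUFFER_SIZE = 3;           // B = 3, FIFO replacement
    private static final double LAMBDA_DECAY = 0.99;
    private static final double LAMBDA_MIN = 1e-3;
    private double lambda = 1.0;                         // initial value is an assumption

    void step(StudentQ student, double[] obs, int action, double reward,
              double[] nextObs, boolean terminal, double[] teacherQ) {
        // 1. Standard Q-Learning (TD) update.
        student.tdUpdate(obs, action, reward, nextObs, terminal);

        // 2. Insert the current expert sample with FIFO replacement.
        if (buffer.size() == BUFFER_SIZE) buffer.removeFirst();
        buffer.addLast(new ExpertSample(obs, teacherQ));

        // 3. Distillation: minimize the MSE between student and teacher Q-vectors over the buffer.
        for (ExpertSample sample : buffer) {
            student.distillStep(sample.obs, sample.teacherQ, lambda);
        }

        // 4. Exponential decay of the distillation coefficient, bounded below by 1e-3.
        lambda = Math.max(lambda * LAMBDA_DECAY, LAMBDA_MIN);
    }
}
]]></preformat>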
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Experimental Details</title>
        <p>All experiments were carried out in a consistent hardware configuration to ensure reliability of
performance in all testing scenarios. The experimental setup is based on an Intel Core i7-8650U processor
with 16.0 GB of RAM. All algorithms are implemented in Java from scratch. Table 6 reports the win
rates obtained from matches between the agents; although they adopt different reward/fitness functions
and observation spaces, these results provide a basis for comparing their
relative efficiency. The MCTS-based agent is used as a baseline.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption>
            <p>Win rates from matches between agents.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Agent A</th>
                <th>Agent B</th>
                <th>Win Rate A</th>
                <th>Win Rate B</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>GAMCTS</td><td>MCTSAI</td><td>90.9%</td><td>9.1%</td></tr>
              <tr><td>GAMCTS</td><td>RLAGENT</td><td>72.7%</td><td>27.3%</td></tr>
              <tr><td>GAMCTS</td><td>QDAGGER</td><td>81.8%</td><td>18.2%</td></tr>
              <tr><td>RLAGENT</td><td>MCTSAI</td><td>63.6%</td><td>36.4%</td></tr>
              <tr><td>RLAGENT</td><td>QDAGGER</td><td>54.5%</td><td>45.5%</td></tr>
              <tr><td>QDAGGER</td><td>MCTSAI</td><td>54.5%</td><td>45.5%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>To evaluate the three developed agents (GAMCTS, RLAgent, and QDagger), we conducted a user
test involving 20 participants. Each participant played against all three agents in randomized order,
which helped minimize potential bias and ensured that no agent was unintentionally advantaged or
disadvantaged. This procedure allowed us to obtain more reliable and objective assessments of both
the gameplay experience and the players’ perception of each agent. During the study, each participant
completed three game sessions, each consisting of three rounds against one of the randomly assigned
agents. After every session, participants were asked to complete a questionnaire evaluating their
experience. Prior to testing, a short tutorial was provided to familiarize them with the game controls
and mechanics.</p>
      <sec id="sec-4-1">
        <title>4.1. Questionnaire</title>
        <p>4.1.1. Personal information
The questionnaire was structured in two sections: the first collected personal information about
participants, while the second focused on evaluating their gameplay experience.</p>
        <p>
          In the first section, participants were asked to provide their full name optionally, while age and gender
were mandatory fields. A summary of this data is presented in Figure 4. A more balanced distribution
of participants across age groups and genders would have been preferable for these tests, but this was
not feasible due to the difficulty in recruiting volunteers. It is worth noting that, in general, most of the
testers were not familiar with this video game genre. This does not undermine the validity of our study;
on the contrary, it aligns with one of our primary objectives: to develop agents that can dynamically
adapt to the player’s skill level. Such adaptability ensures a challenging and engaging experience for
both players with prior experience in fighting video games and those new to this genre.
4.1.2. Game Experience Questionnaire
The second part of the questionnaire focuses on assessing the participants’ gameplay experience. To this
end, we employed the Game Experience Questionnaire (GEQ) [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], a validated instrument commonly
used in academic research to measure players’ experience both during and after gameplay.
The GEQ consists of three modules: Core Module, Social Presence Module, and Post-Game Module.
In our study, only the Core Module and Post-Game Module were employed. In both modules, participants
were required to evaluate a set of items reflecting their emotional states on the questionnaire’s rating scale.
The Core Module is designed to assess how the participant felt during gameplay by averaging the scores
assigned to items associated with the following seven components: Immersion, Flow, Competence,
Positive and Negative Affect, Tension, and Challenge.
        </p>
        <p>The Post-Game Module provides an assessment of how players felt after finishing the game, by averaging
the scores assigned to items related to the following four components: Positive Experience, Negative
Experience, Tiredness, and Returning to reality.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Discussion</title>
        <p>Tables 7 and 8 report the scores obtained for the Core Module and Post-Game Module, respectively, where
we use the following abbreviated notation: Mean (Standard Deviation). We report the highest mean
and standard deviation values in bold. To assess statistical significance between agents’ evaluations, we
conducted paired t-tests for each GEQ Core Module and Post-Game Module component across all agent
pairs. Results appear in Tables 9 and 10, showing t-statistics with 19 degrees of freedom and p-values
in brackets. To account for multiple comparisons, p-values were adjusted using the Holm-Bonferroni
method. Statistically significant p-values (p &lt; 0.05) are shown in bold.</p>
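        <p>
          For reference, the following Java sketch shows the Holm-Bonferroni step-down adjustment applied to a set of p-values; the input values are illustrative, and the routine reproduces only the adjustment step, not the paired t-tests themselves.
        </p>
        <preformat><![CDATA[
import java.util.Arrays;
import java.util.Comparator;

/** Sketch of the Holm-Bonferroni step-down adjustment of p-values. */
final class HolmBonferroni {
    /** Returns the adjusted p-values in the original order. */
    static double[] adjust(double[] pValues) {
        int m = pValues.length;
        Integer[] order = new Integer[m];
        for (int i = 0; i < m; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> pValues[i])); // ascending p-values

        double[] adjusted = new double[m];
        double running = 0.0;
        for (int rank = 0; rank < m; rank++) {
            int idx = order[rank];
            double candidate = Math.min((m - rank) * pValues[idx], 1.0);
            running = Math.max(running, candidate);   // enforce monotonicity of adjusted values
            adjusted[idx] = running;
        }
        return adjusted;
    }
}
]]></preformat>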
        <p>The reported scores reveal distinct trends across the three agents, with QDagger receiving higher ratings
in terms of user engagement, GAMCTS being perceived as more difficult, and RLAgent scoring between
the two extremes. We now turn our attention to statistically significant differences observed between
agents. In the Core Module, QDagger significantly outperformed GAMCTS in Competence, indicating
that participants felt more skilled when playing against it. In contrast, no significant differences were
found between RLAgent and the other two agents. Tension scores across all agents were close to 1,
suggesting that tension was perceived as slight rather than moderate, which represents a favorable
outcome for all tested agents. QDagger led to significantly lower Negative Affect compared to GAMCTS,
indicating that the gameplay experience was not associated with negative emotions. It is worth noting
that all Negative Affect scores remain below 1, with QDagger’s score approaching zero. For other
components, no statistically significant diferences were observed. Turning to the Post-Game Module,
QDagger achieved significantly higher scores in Positive Experience compared to GAMCTS, in line
with the trend suggested by the Core Module scores. Differences in Negative Experience, Tiredness, and
Returning to Reality were not statistically significant; however, the consistently low scores reported
across all agents indicate that participants generally did not experience strong negative feelings after
gameplay, such as guilt or a sense of wasted time, nor did they report difficulty transitioning back to
reality.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Works</title>
      <p>We present several strategies for developing AI Agents for a fighting video game, considering a balance
between performance and responsiveness. The outcomes offer valuable insights into how participants
experienced the gameplay with each agent, which was generally perceived as engaging.
In order to answer our research questions more conclusively, it is important to acknowledge that the
absence of a traditional AI baseline limits the strength of our claims. A rule-based agent could serve
as such a baseline, although defining effective rules is a nontrivial challenge that initially motivated
our focus on learning-based alternatives. Including a rule-based baseline in future work would help
better evaluate the benefits of learning-based methods. The proposed methods provide directions for
future research on adaptive game AI. For example, GAMCTS could achieve adaptability through the
implementation of a dificulty balancing mechanism that strategically selects suboptimal solutions rather
than choosing the best ones within the GA. Similarly, RL-based agents could be developed to modify
their behavioral patterns dynamically through reward systems that reflect real-time player interactions.
In addition, several other directions are being considered. We aim to explore the development of parallel
solutions to enhance the performance of GAs employed in our study. Regarding RLAgent, we are
interested in investigating its performance if equipped with an online learning mechanism similar to
that of QDagger. For QDagger, the idea is to start with a pre-trained policy in a phase preceding the
fighting against this agent. In other words, sufficient gameplay data would be collected, for example, by
having the user play against other agents. Based on this data, a policy would be trained, which would
then be exploited during actual gameplay. These research directions collectively address the need for
more sophisticated adaptive AI systems in fighting video games while maintaining the essential balance
between computational performance and real-time responsiveness that defines engaging gameplay
experiences.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6
Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly and Claude Sonnet 4 to improve
the writing style. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      <p>
• Fighting Game AI repo: https://github.com/matera02/Fighting-Game-AI
• DareFightingICE website: https://www.ice.ci.ritsumei.ac.jp/~ftgaic/index.htm
• DareFightingICE repo: https://github.com/TeamFightingICE/FightingICE</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bourg</surname>
          </string-name>
          , G. Seemann,
          <article-title>AI for Game Developers</article-title>
          ,
          <string-name>
            <given-names>O</given-names>
            <surname>'Reilly Media</surname>
          </string-name>
          , Inc.,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <source>Ai Game Engine Programming (Game Development Series)</source>
          ,
          <source>Charles River Media</source>
          , Inc., USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Millington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Funge</surname>
          </string-name>
          , Artificial Intelligence for Games,
          <source>Second Edition</source>
          , 2nd ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Holland</surname>
          </string-name>
          ,
          <source>Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence</source>
          , MIT Press, Cambridge, MA, USA,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , An Introduction to Genetic Algorithms, MIT Press, Cambridge, MA, USA,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bäck</surname>
          </string-name>
          , Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming,
          <source>Genetic Algorithms</source>
          , Oxford University Press,
          <year>1996</year>
          . URL: https://doi.org/10.1093/ oso/9780195099713.001.0001. doi:
          <volume>10</volume>
          .1093/oso/9780195099713.001.0001.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          , Reinforcement Learning: An Introduction, second ed., The MIT Press,
          <year>2018</year>
          . URL: http://incompleteideas.net/book/the-book-2nd.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvari</surname>
          </string-name>
          , Algorithms for Reinforcement Learning, Morgan and Claypool Publishers,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Nomura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizuno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thawonmas</surname>
          </string-name>
          ,
          <article-title>Fighting game artificial intelligence competition platform</article-title>
          ,
          <source>in: 2013 IEEE 2nd Global Conference on Consumer Electronics (GCCE)</source>
          , IEEE,
          <year>2013</year>
          , pp.
          <fpage>320</fpage>
          -
          <lpage>323</lpage>
          . doi:
          <volume>10</volume>
          .1109/GCCE.
          <year>2013</year>
          .
          <volume>6664844</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kocsis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvári</surname>
          </string-name>
          ,
          <article-title>Bandit based monte-carlo planning</article-title>
          , in: J.
          <string-name>
            <surname>Fürnkranz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Schefer</surname>
          </string-name>
          , M. Spiliopoulou (Eds.),
          <source>Machine Learning: ECML 2006</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2006</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>293</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>C. B. Browne</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Powley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Whitehouse</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          <string-name>
            <surname>Lucas</surname>
            ,
            <given-names>P. I.</given-names>
          </string-name>
          <string-name>
            <surname>Cowling</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rohlfshagen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Tavener</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Samothrakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Colton</surname>
          </string-name>
          ,
          <article-title>A survey of monte carlo tree search methods</article-title>
          ,
          <source>IEEE Transactions on Computational Intelligence and AI in Games</source>
          <volume>4</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          . doi:
          <volume>10</volume>
          .1109/TCIAIG.
          <year>2012</year>
          .
          <volume>2186810</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Swiechowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Godlewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sawicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mandziuk</surname>
          </string-name>
          ,
          <article-title>Monte carlo tree search: A review of recent modifications and applications</article-title>
          ,
          <source>CoRR abs/2103</source>
          .04931 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2103.04931. arXiv:
          <volume>2103</volume>
          .
          <fpage>04931</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoshida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ishihara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miyazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thawonmas</surname>
          </string-name>
          ,
          <article-title>Application of monte-carlo tree search in a fighting game ai</article-title>
          ,
          <source>in: 2016 IEEE 5th Global Conference on Consumer Electronics</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          . doi:
          <volume>10</volume>
          .1109/GCCE.
          <year>2016</year>
          .
          <volume>7800536</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>M.-J. Kim</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-J. Kim</surname>
          </string-name>
          ,
          <article-title>Opponent modeling based on action table for mcts-based fighting game ai</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computational Intelligence and Games (CIG)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>180</lpage>
          . doi:
          <volume>10</volume>
          .1109/CIG.
          <year>2017</year>
          .
          <volume>8080432</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>M.-J. Kim</surname>
            ,
            <given-names>C. W.</given-names>
          </string-name>
          <string-name>
            <surname>Ahn</surname>
          </string-name>
          ,
<article-title>Hybrid fighting game AI using a genetic algorithm and Monte Carlo tree search</article-title>
          ,
          <source>in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '18</source>
          ,
Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , p.
          <fpage>129</fpage>
          -
          <lpage>130</lpage>
. URL: https://doi.org/10.1145/3205651.3205695. doi:10.1145/3205651.3205695.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thawonmas</surname>
          </string-name>
          ,
<article-title>Applying hybrid reward architecture to a fighting game AI</article-title>
          ,
          <source>in: 2018 IEEE Conference on Computational Intelligence and Games (CIG)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
. doi:10.1109/CIG.2018.8490437.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
<string-name>
  <given-names>D.-W.</given-names>
  <surname>Kim</surname>
</string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Park</surname>
</string-name>
,
<string-name>
  <given-names>S.-i.</given-names>
  <surname>Yang</surname>
</string-name>
,
          <article-title>Mastering fighting game using deep reinforcement learning with self-play</article-title>
          ,
          <source>in: 2020 IEEE Conference on Games (CoG)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>576</fpage>
          -
          <lpage>583</lpage>
. doi:10.1109/CoG47356.2020.9231639.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ishihara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thawonmas</surname>
          </string-name>
          ,
<article-title>Monte-Carlo tree search implementation of fighting game AIs having personas</article-title>
          ,
          <source>in: 2018 IEEE Conference on Computational Intelligence and Games (CIG)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
. doi:10.1109/CIG.2018.8490367.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gieseke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dockhorn</surname>
          </string-name>
          ,
<article-title>Personalized dynamic difficulty adjustment - imitation learning meets reinforcement learning</article-title>
          ,
          <year>2024</year>
. URL: https://arxiv.org/abs/2408.06818. arXiv:2408.06818.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hussein</surname>
          </string-name>
          ,
<string-name>
  <given-names>M. M.</given-names>
  <surname>Gaber</surname>
</string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Elyan</surname>
</string-name>
,
<string-name>
  <given-names>C.</given-names>
  <surname>Jayne</surname>
</string-name>
          ,
          <article-title>Imitation learning: A survey of learning methods</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>50</volume>
          (
          <year>2017</year>
). URL: https://doi.org/10.1145/3054912. doi:10.1145/3054912.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bagnell</surname>
          </string-name>
          ,
<article-title>A reduction of imitation learning and structured prediction to no-regret online learning</article-title>
          ,
<source>CoRR abs/1011.0686</source>
(<year>2010</year>). URL: http://arxiv.org/abs/1011.0686. arXiv:1011.0686.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
<string-name>
  <given-names>C. J. C. H.</given-names>
  <surname>Watkins</surname>
</string-name>
          , et al.,
          <source>Learning from delayed rewards</source>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dayan</surname>
          </string-name>
,
<article-title>Technical note: Q-learning</article-title>
,
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <year>1992</year>
          )
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
. doi:10.1007/BF00992698.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <article-title>Residual algorithms: reinforcement learning with function approximation</article-title>
          ,
          <source>in: Proceedings of the Twelfth International Conference on International Conference on Machine Learning</source>
, ICML'95
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>1995</year>
          , p.
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Eiben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schippers</surname>
          </string-name>
          ,
          <article-title>On evolutionary exploration and exploitation</article-title>
          ,
          <source>Fundam. Inform</source>
          .
          <volume>35</volume>
          (
          <year>1998</year>
          )
          <fpage>35</fpage>
          -
          <lpage>50</lpage>
. doi:10.3233/FI-1998-35123403.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Deb</surname>
          </string-name>
          ,
          <article-title>A comparative analysis of selection schemes used in genetic algorithms</article-title>
          ,
          <source>in: Foundations of Genetic Algorithms</source>
          ,
          <year>1990</year>
          . URL: https://api.semanticscholar.org/CorpusID:938257.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
<article-title>Genetic algorithms, tournament selection, and the effects of noise</article-title>
,
<source>Complex Systems</source>
<volume>9</volume>
(
<year>1995</year>
)
<fpage>193</fpage>
-
<lpage>212</lpage>
. URL: https://www.complex-systems.com/abstracts/v09_i03_a02/.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lenhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Workman</surname>
          </string-name>
,
<source>Optimal Control Applied to Biological Models</source>
, 1st ed., Chapman and Hall/CRC,
          <year>2007</year>
. doi:10.1201/9781420011418.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
<string-name>
  <given-names>M. L.</given-names>
  <surname>Puterman</surname>
</string-name>
,
<source>Markov Decision Processes: Discrete Stochastic Dynamic Programming</source>
, 1st ed., John Wiley &amp; Sons, Inc., USA,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lattimore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvári</surname>
          </string-name>
,
<source>Bandit Algorithms</source>
, Cambridge University Press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
,
<source>Mastering Reinforcement Learning</source>
, The University of Queensland,
<year>2024</year>
. URL: https://gibberblot.github.io/rl-notes/index.html. doi:10.14264/4bf1412.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Samuel</surname>
          </string-name>
          ,
<article-title>Some studies in machine learning using the game of checkers. II - Recent progress</article-title>
          ,
          <source>IBM Journal of Research and Development</source>
          <volume>3</volume>
          (
          <year>2000</year>
          )
          <fpage>206</fpage>
          -
          <lpage>226</lpage>
. doi:10.1147/rd.441.0206.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Mackworth</surname>
          </string-name>
          ,
          <source>Artificial Intelligence: Foundations of Computational Agents</source>
, 3rd ed., Cambridge University Press,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bai</surname>
          </string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Cheng</surname>
</string-name>
,
<string-name>
  <given-names>Y.</given-names>
  <surname>Jin</surname>
</string-name>
,
<article-title>Evolutionary reinforcement learning: A survey</article-title>
,
<source>Intelligent Computing</source>
<volume>2</volume>
(
<year>2023</year>
). URL: http://dx.doi.org/10.34133/icomputing.0025. doi:10.34133/icomputing.0025.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
<string-name>
  <given-names>K. O.</given-names>
  <surname>Stanley</surname>
</string-name>
,
<string-name>
  <given-names>R.</given-names>
  <surname>Miikkulainen</surname>
</string-name>
          ,
          <article-title>Evolving neural networks through augmenting topologies</article-title>
          ,
          <source>Evolutionary Computation</source>
          <volume>10</volume>
          (
          <year>2002</year>
          )
          <fpage>99</fpage>
          -
          <lpage>127</lpage>
. doi:10.1162/106365602320169811.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>F. P.</given-names>
            <surname>Such</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
,
<article-title>Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning</article-title>
,
<source>CoRR abs/1712.06567</source>
(<year>2017</year>). URL: http://arxiv.org/abs/1712.06567. arXiv:1712.06567.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schwarzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
<article-title>Reincarnating reinforcement learning: Reusing prior computation to accelerate progress</article-title>
,
<year>2022</year>
. URL: https://arxiv.org/abs/2206.01626. arXiv:2206.01626.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Colmenarejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Desjardins</surname>
</string-name>
,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kirkpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pascanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
,
<article-title>Policy distillation</article-title>
,
<year>2016</year>
. URL: https://arxiv.org/abs/1511.06295. arXiv:1511.06295.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>W.</given-names>
            <surname>IJsselsteijn</surname>
          </string-name>
,
<string-name>
  <given-names>Y.</given-names>
  <surname>de Kort</surname>
</string-name>
,
          <string-name>
            <given-names>K.</given-names>
            <surname>Poels</surname>
          </string-name>
          ,
<source>The Game Experience Questionnaire</source>
, Technische Universiteit Eindhoven,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>