NTBOE: a new algorithm for efficiently playing
multi-action adversarial games
Diego Villabrille1, Raul Montoliu1,*
1 Institute of New Imaging Technologies, Jaume I University, Castellón, Spain


Abstract
This paper presents N-Tuple Bandit Online Evolution (NTBOE), a novel algorithm for efficiently playing multi-action adversarial games. Its main advantage is the use of a structure composed of a set of multi-armed bandits to store information about the best combinations of actions that can be played in the player’s turn. Accessing this structure to estimate how good it is to be in a particular game state is several times faster than querying the forward model of the game. Thanks to that, the method can explore more game states than state-of-the-art algorithms, since it is able to go deeper into the search space. The performance of the proposed method has been assessed using the game ASMACAG as a testbed. The results obtained show that the proposed method outperforms state-of-the-art methods such as Monte Carlo Tree Search and Online Evolution.

Keywords
Artificial Intelligence for videogames, Multi-action games, Evolutionary algorithms




1. Introduction
In turn-based multi-action adversarial games, each player turn consists of several actions, and the order in which the agent plays those actions has a significant influence on the game. Evolutionary algorithms are the current state of the art for this kind of game, outperforming other popular methods such as Monte Carlo Tree Search [1].
   Recently, the N-Tuple Bandit Evolutionary Algorithm (NTBEA) [2] was presented as an effective method for parameter tuning. It is very useful when the evaluation function of the problem is noisy and fairly expensive in CPU time, as is usually the case in multi-action adversarial games. It explores the space of possibilities by choosing only the most promising options and is therefore capable of reaching good, though not necessarily optimal, solutions in a very short time. In the field of artificial intelligence applied to games, this method has been used mainly to optimize the heuristic function of a multi-action card game [3], to optimize the hyperparameters of an agent capable of solving a set of games [4], and to model the player experience [5].
   This paper presents a novel evolutionary algorithm which uses a set of multi-armed bandits, as in the NTBEA algorithm, to store information about the best action combinations to be played in the player’s turn. This new method to play multi-action adversarial games is called N-Tuple Bandit Online Evolution (NTBOE).
I Congreso Español de Videojuegos, December 1–2, 2022, Madrid, Spain
* Corresponding author.
Email: diego.villabrille@uji.es (D. Villabrille); montoliu@uji.es (R. Montoliu)
ORCID: 0000-0001-9176-2868 (D. Villabrille); 0000-0002-8467-391X (R. Montoliu)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
To assess the performance of the proposed method, the game A Simple Multi-Action CArd Game (ASMACAG) [6] has been used. It is a card game specifically developed to be used as a testbed for game artificial intelligence agents.
  The main contributions of this work are as follows:

    • We present a complete re-implementation of ASMACAG using Python 3.8, following a set of good practices to provide code that is easily readable and understandable.
    • We present N-Tuple Bandit Online Evolution (NTBOE), a new method to play multi-action adversarial games.
    • We assess the performance of NTBOE against two baseline methods (Random and Greedy One-Step Lookahead) and two state-of-the-art ones (Monte Carlo Tree Search and Online Evolution).

   As far as we know, this is the first paper using a method such as NTBEA to obtain the best
combination of actions to be played in the player’s turn.
   The rest of the paper is organized as follows: Section 2 presents a summary of the most relevant related methods. Section 3 defines the problem that this paper addresses and Section 4 explains the ASMACAG game. Section 5 briefly explains the four methods implemented to be compared with the proposed one, which is presented in Section 6. The experiments carried out to assess the performance of the proposed method and the results obtained are shown in Section 7. Finally, Section 8 summarizes the main conclusions arising from this work.


2. Related Work
One of the problems of previous multi-action game implementations is that, due to complex game rules, implementing a bot to play them can be quite involved. This is one of the reasons for developing ASMACAG: it provides a context simple enough for bots to be tested, understood and debugged step by step, while being complex enough for the results it yields to be statistically relevant. For instance, ASMACAG is very convenient for artificial intelligence courses, since the way to implement a new bot is very straightforward.
   One of the most relevant works on playing multi-action games is [7], where the authors first presented the Online Evolution (OE) method, which outperforms an MCTS implementation [1]. It is based on the concept of applying genetic algorithms by evolving actions during a game (online) instead of using evolution to train an agent that will play the game later (offline). This concept was first introduced, though applied to a different type of game, by the Rolling Horizon Evolution algorithm proposed in [8]. The Online Evolution algorithm applies this idea by running an evolutionary algorithm on the set of actions composing a turn. As stated by Niels Justesen in [7], "it can be seen as a single iteration of Rolling Horizon Evolution with a very short horizon (one turn)". The algorithm tries to find the best combination of actions for a turn by evolving an ordered set of actions. A more recent work in this area is Evolutionary MCTS for multi-action adversarial games [9], which combines the tree search of MCTS with the sequence-based optimization of evolutionary algorithms, obtaining better results than OE under some particular conditions.
   The main difference of the proposed NTBOE method with respect to the state-of-the-art ones is that NTBOE stores information about the action combinations, which can be used for selecting new turn candidates to be analyzed. In addition, evaluating the candidates using the internal structure composed of multi-armed bandits is several times faster than evaluating them using the forward model of the game. Therefore, using NTBOE, more game states can be explored in the same time period, increasing the probability of finding the optimal action combination.


3. Definitions, Notation and Problem Formulation
This section introduces the terminology and the notation that will be used throughout this paper
and formulates the problem to be solved. The notation follows the one used by P. Cowling et al. in [10]. More detail on the game theory concepts can be found in standard textbooks on
this subject, e.g. [11].
   A game is defined as a directed graph (𝑆, Λ), where 𝑆 are the nodes and Λ the edges of the graph. The nodes 𝑆 are called states of the game. The leaf nodes are called terminal states and the other nodes non-terminal states. In general, a game has a positive number 𝑘 of players. Some
games also have an environment player (player 0). Each state 𝑠 is associated with a number
𝜌(𝑠) ∈ {0, . . . , 𝑘}, that represents the player about to act. Each terminal state 𝑠𝑇 is associated
with a vector 𝜇(𝑠𝑇 ) ∈ R𝑘 , which represents the reward vector. In some games, the non-terminal
states can also be associated with an immediate reward vector that gives an idea of how well
the players are doing. A heuristic function 𝜆(𝑠) can be used to estimate the reward vector 𝜇(𝑠),
given the state (terminal or not).
   The game starts at time 𝑡 = 0 in the initial state 𝑠0 . At time 𝑡 = 0, 1, 2, . . . , if state 𝑠𝑡 is non-terminal, player 𝜌(𝑠𝑡 ) chooses an edge (𝑠𝑡 , 𝑠𝑡+1 ) ∈ Λ and the game transitions through that
edge to the state 𝑠𝑡+1 . This continues until a terminal state is reached at time 𝑡 = 𝑇 . Then, each
player receives a reward equal to their corresponding entry in the vector 𝜇(𝑠𝑇 ) and the game
ends. If the game allows immediate rewards, players receive the immediate reward 𝜇(𝑠𝑡+1 )
after reaching the state 𝑠𝑡+1 .
   Players typically do not choose edges directly, but choose actions. In the simplest case, each
outgoing edge of a state 𝑠 corresponds to an action that player 𝜌(𝑠) can play. The set of actions
from a state 𝑠 is denoted 𝐴(𝑠). Note that, given a state 𝑠, the player 𝜌(𝑠) can play just one
action (i.e. choose an edge (𝑠𝑡 , 𝑠𝑡+1 )) from the ones included in 𝐴(𝑠).
   In multi-action games a player can play several consecutive actions in a turn. A turn, given a state 𝑠𝑡 at time 𝑡, is defined as a list 𝜏 (𝑠𝑡 ) = [𝑎1 , 𝑎2 , . . . , 𝑎𝑞 ], where 𝑞 is the number of consecutive actions that the player must play in its turn, 𝑎1 ∈ 𝐴(𝑠𝑡 ), 𝑎2 ∈ 𝐴(𝑠𝑡+1 ), . . . , 𝑎𝑞 ∈ 𝐴(𝑠𝑡+𝑞−1 ). The order in which actions are played within the turn can be very important in some multi-action games, since playing one action before another can modify the immediate reward obtained after playing the last action of the turn, 𝑎𝑞 .
   Therefore, the problem that the agents have to solve is, given a game state 𝑠𝑡 , to find the list 𝜏 (𝑠𝑡 ) that leads to the highest expected immediate reward 𝜇(𝑠𝑡+𝑞 ) when the turn finishes (i.e. after playing the last action 𝑎𝑞 of the turn).
4. A Simple Multi-Action CArd Game (ASMACAG)
ASMACAG is a simple card game proposed as a tool to test, develop and debug bots that implement artificial intelligence algorithms in the context of multi-action games. It was first presented in [6] using the Java programming language. For this work, the game has been completely reimplemented in Python 3.8, following several best practices such as the SOLID principles, PEP8 conventions, type hinting, and the production of valuable documentation using docstrings. The aim is to have code that is easily readable and understandable and that allows the user of ASMACAG to quickly understand anything they could need to. The documentation of the game can be accessed at https://dvs99.github.io/ASMACAG/index.html.

4.1. Game Rules
The rules for ASMACAG are simple by design, since the game has been conceived as a tool whose aim is to be as straightforward as possible. This design philosophy allows the bots to be debugged step by step and their decisions to be understood easily, while the game remains complex enough for the results to be relevant and representative of their performance. The rules in the base version are as follows (note that each player of the game is a bot following a specific algorithm for decision-making):

    • It is an adversarial game with 2 players, each of whom is dealt 9 cards.
    • There is a board with 20 cards dealt on it, which must contain only numbered cards.
    • The cards include the numbers from 1 to 6 and the special values x2 and %2. The deck the cards are dealt from contains 8 cards of each of the numbers from 2 to 5, 5 cards of each of the numbers 1 and 6, and 6 instances of each of the special cards.
    • Each player must play 3 cards per turn, with no possibility to skip. Playing a card consists of taking it from your hand and placing it on any card available on the board, if it is numbered, or just using it directly if it has a special value. After playing a card, both the played card and the used card from the board, if applicable, are discarded. Turns alternate until there are no cards left in either player’s hand.
    • Each time a numbered card is played, (𝑃 − 𝐵) * 𝐹 points are awarded to the player that used it, where 𝑃 is the value of the card played, 𝐵 is the value of the card on the board and 𝐹 is a factor that defaults to 1. Special cards affect the value of 𝐹 when computing the score of the next numbered card played by any of the players: the x2 special card doubles 𝐹 , while the %2 card halves it.

  Some examples are as follows (a small scoring sketch is given after the list):

    • The player plays a 6 card on a 2 card. 𝐹 = 1. The player gets (6 − 2) * 1 = 4 points.
    • The player plays a 6 card on a 2 card. 𝐹 = 2. The player gets (6 − 2) * 2 = 8 points.
    • The player plays a 4 card on a 6 card. 𝐹 = 2. The player gets (4 − 6) * 2 = −4 points.
    • The player plays a 5 card on a 2 card. 𝐹 = 0.25 since, e.g., the opponent played two consecutive %2 cards in the previous turn. The player gets (5 − 2) * 0.25 = 0.75 points.
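   The following minimal Python sketch (ours, not taken from the ASMACAG codebase) makes the scoring rule concrete; it assumes, as one possible reading of the rules, that 𝐹 returns to its default value of 1 once a numbered card has been scored.

def play_card(card, board_card=None, factor=1.0):
    """Return (points_awarded, new_factor) after playing one card.

    card: an int from 1 to 6 for a numbered card, or the string "x2" / "%2".
    board_card: the numbered board card it is placed on (numbered cards only).
    factor: the current value of F (defaults to 1).
    """
    if card == "x2":                        # special card: doubles F, awards no points
        return 0.0, factor * 2
    if card == "%2":                        # special card: halves F, awards no points
        return 0.0, factor / 2
    points = (card - board_card) * factor   # (P - B) * F
    return points, 1.0                      # assumption: F is reset once consumed

# Last example above: a 5 played on a 2 with F = 0.25 gives 0.75 points.
print(play_card(5, 2, 0.25))  # (0.75, 1.0)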
   An aspect taken into account in the design of the ASMACAG software is the need for flexibility in the test environment, especially regarding the rules. All the rules need to be easy to modify as needed during the testing process, simply by coding a class that implements a forward model compatible with the game. Also, the parameters of the rules need to be easily accessible and changeable. These parameters include the number of cards dealt each time, the amount of existing cards of each type, the range of numbers on the numbered cards, the number of actions to play per turn, etc.
   According to the terminology introduced in Section 3, in ASMACAG the number of players is 𝑘 = 2, there is no environment player and 𝑞 = 3, i.e. the turn 𝜏 (𝑠𝑡 ) is composed of three actions. Thanks to the existence of the special cards x2 and %2, the order in which the actions are played can modify the immediate reward 𝜇(𝑠𝑡+3 ) obtained after playing the actions contained in the turn 𝜏 (𝑠𝑡 ).

4.2. How to implement a new agent
Implementing a new bot for playing ASMACAG is very straightforward. A new bot must inherit from the Player class and implement three methods: the constructor, one that returns a string with the name of the bot, and the think function, which, given the current state of the game and a time budget, must return the action to be played. As an example, the implementation of a bot that always returns a random action from those available in the current game state can be seen at https://github.com/dvs99/ASMACAG/blob/master/Players/RandomPlayer.py. A minimal sketch of such a bot is shown below.
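   The sketch assumes, rather than copies, the interface of the repository: the Player base class here is a stand-in for the real ASMACAG class, and get_available_actions is a hypothetical accessor for the legal actions.

import random
from abc import ABC, abstractmethod


class Player(ABC):
    """Stand-in for the ASMACAG Player base class described above."""

    @abstractmethod
    def think(self, observation, budget_seconds):
        """Return the action to play, given the current game state and a time budget."""


class MyRandomPlayer(Player):
    """Bot that plays a uniformly random action among those currently available."""

    def think(self, observation, budget_seconds):
        # 'get_available_actions' is a hypothetical accessor for the legal actions.
        return random.choice(observation.get_available_actions())

    def __str__(self):
        # Name of the bot, as required by the interface.
        return "MyRandomPlayer"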


5. Baseline and state-of-the-art methods
5.1. Random
This algorithm simply chooses one of the available valid actions at random when required to decide, without looking ahead to the following actions at all. It is included as a minimal baseline, to check that each of the other algorithms is able to beat it in a statistically relevant and consistent way.

5.2. Greedy One-Step Lookahead
This algorithm performs a greedy search over the available valid actions at one moment of the turn, playing the one that returns the best value. Once again, it does not look ahead to the next actions that could be played during that turn. It is useful as a baseline heuristic algorithm, testing the capability of each of the other algorithms to win against it when applying the same heuristic.

5.3. Monte Carlo Tree Search Algorithm
This algorithm is probably the most researched of the ones tested and it has been selected for its relevance and importance in the game AI field during the last decades. First coined in 2006 [12], it has been proposed almost since its conception as a useful algorithm for game AI [13]. It iteratively builds a tree that estimates the long-term value of each action until the time budget is reached; it then returns the action estimated to perform best at that point. The nodes of the tree represent the state space of the game, while the edges represent the actions. Therefore, a child node of a given node represents the game state reached after playing one of the actions available from that state. Each node 𝑎 stores two values: 𝑠𝑎 , the average of the rewards obtained in the past by choosing this node (i.e. playing the action), and 𝑛𝑎 , the number of times that this node has been chosen.
   At each iteration, the method has to select which child node to explore, treating the choice as a multi-armed bandit problem. This is done using the UCB equation, which balances exploiting the node with the best 𝑠𝑎 and exploring nodes with fewer visits 𝑛𝑎 . The 𝑈𝐶𝐵 value for a particular node 𝑎 is expressed as follows:
                                 𝑈𝐶𝐵(𝑎) = 𝑠𝑎 + 𝐶 √(ln 𝑁 / (𝑛𝑎 + 𝜖))                            (1)
where 𝑠𝑎 is the average of the rewards obtained in the past by choosing the action 𝑎, 𝑁 is the total number of times an action has had to be chosen, 𝑛𝑎 is the number of times the action 𝑎 has been chosen, and 𝐶 is a parameter which balances the exploitation and exploration parts of the UCB function; greater values favour exploration over exploitation. The 𝜖 value is a small number (e.g. 𝜖 = 0.05) used to avoid a division by zero in the equation. In those cases, i.e. when an action has never been chosen before, the value obtained is very high, giving the opportunity to explore all possible actions at least once.
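   As an illustration, Equation 1 can be written as the following small Python helper (the variable names are ours):

import math


def ucb(mean_reward, visits, total_visits, c=1.414, eps=0.05):
    """UCB value of an action/node (Equation 1).

    mean_reward: average reward obtained so far by choosing the action (s_a).
    visits: number of times the action has been chosen (n_a).
    total_visits: number of times any action had to be chosen at this node (N).
    c: exploration constant C; eps avoids a division by zero for unvisited actions.
    """
    exploration = math.sqrt(math.log(max(total_visits, 1)) / (visits + eps))
    return mean_reward + c * exploration


# An unvisited action (visits = 0) gets a large bonus, so it is explored at least once.
print(ucb(0.0, 0, 100))   # ~13.6
print(ucb(0.5, 20, 100))  # ~1.18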
   In our particular implementation of the MCTS algorithm, the tree only explores the possible action combinations within a player’s turn.

5.4. Online Evolution Algorithm
This algorithm has been selected for its relevance to the problem of multi-action adversarial games. It focuses on the issue that is most specific to multi-action games, namely finding the right combination of actions to compose a turn, and therefore should be quite efficient when trying to win at ASMACAG. The algorithm has the following steps (a minimal sketch is given after the list):

   1. A set of 𝑁𝑝 individuals is randomly generated. Each individual (or genome) is a combination of actions that can be played in a turn. In ASMACAG each individual has three components (or genes).
   2. The individuals are evaluated using the forward model and a score is obtained using a heuristic function.
   3. The worst individuals, according to the obtained score, are removed from the set. This process is controlled by the survivor rate parameter 𝛼.
   4. New individuals are generated through crossover operations.
   5. Each individual is mutated according to a mutation rate 𝛽.
   6. The process is repeated until convergence or until the time budget is reached.
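   The following hedged Python sketch illustrates this loop. The two callables it receives, for sampling a random legal turn and for scoring a turn with the forward model and the heuristic, are assumptions; legality checks after crossover and mutation are omitted, and population_size, survivor_rate and mutation_rate play the roles of 𝑁𝑝 , 𝛼 and 𝛽.

import random
import time


def online_evolution(random_turn_fn, evaluate_turn_fn, population_size=125,
                     survivor_rate=0.15, mutation_rate=0.15, budget_seconds=1.0):
    """Minimal OE sketch: evolve turns (lists of q actions) for the current state.

    random_turn_fn() -> a random legal turn.
    evaluate_turn_fn(turn) -> heuristic score after simulating the turn with the forward model.
    """
    deadline = time.time() + budget_seconds
    population = [random_turn_fn() for _ in range(population_size)]
    while time.time() < deadline:
        # Step 3: keep the best individuals according to the survivor rate.
        population.sort(key=evaluate_turn_fn, reverse=True)
        survivors = population[:max(2, int(survivor_rate * population_size))]
        # Steps 4 and 5: refill the population with crossover children and mutate them.
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = random.sample(survivors, 2)
            cut = random.randrange(1, len(parent_a))
            child = parent_a[:cut] + parent_b[cut:]        # one-point crossover
            if random.random() < mutation_rate:
                gene = random.randrange(len(child))
                child[gene] = random_turn_fn()[gene]       # replace one gene at random
            children.append(child)
        population = survivors + children
    return max(population, key=evaluate_turn_fn)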
6. N-Tuple Bandit Online Evolution
As a novel approach to solving multi-action adversarial games, N-Tuple Bandit Online Evolution (NTBOE) is proposed. It builds on the N-Tuple Bandit Evolutionary Algorithm (NTBEA) [2], which was first designed for the offline optimization of a heuristic function [3] and for tuning the hyperparameters of an agent that solves a set of games [4]. The idea suggested here is to apply it to evolving in-game actions during the turn calculation, using an approach similar to the one OE uses to adapt traditional evolutionary algorithms.
   The N-Tuple model in the algorithm directly models the statistics, approximating the fitness and the number of evaluations of each modelled combination of parameters [2]. This way, the algorithm places the focus on the combinations of the set of parameters being optimized. Applied to multi-action decision making, the aim is to focus on finding the right combinations of actions. Since this is the main problem of multi-action games, as discussed before, this approach could give interesting results for this specific kind of game.
   The advantage it can have over other approaches, and specifically over other evolutionary algorithms, is that the N-Tuple based model in the algorithm is faster to access than the game itself. For each turn that is tested with the actual game, the algorithm tests several turns against the model, which approximates the reward based on the data received up to that point. This allows the NTBOE algorithm to test many more possible turns in the same amount of time.
   Algorithm 1 shows the procedure of the method. The first step is to create the N-Tuple bandits model Γ, where the scores obtained after playing the actions of a turn, computed with the forward model, will be stored. In this work, a 1D multi-armed bandit per action (i.e. three) is used together with three 2D multi-armed bandits for the action combinations (1st action + 2nd action, 1st action + 3rd action, and 2nd action + 3rd action). In total, the model consists of 6 multi-armed bandits.
   The second step of the method is an initialization step, detailed in Algorithm 2. A population Π of 𝑁𝑝 random individuals is obtained and each one is evaluated using the forward model of the game (line 6). The model Γ is updated with the scores obtained in these evaluations, and the best individual (and its score) is returned as the current one. This process is very important to fill the model with valuable information. Then, the iterative process starts: at each iteration, 𝑁𝑛 new individuals are created by mutating the current one (line 5). To mutate an individual, one action is randomly selected (using 𝛽) and it is randomly changed for another action that is possible given the state of the game at that moment. We call these new individuals neighbors. Each neighbor is evaluated using the model Γ (line 10) and a score is obtained. Note that at this point the individual is evaluated using the model Γ instead of the forward model, by means of Equation 2, which is the sum of the UCB values (see Equation 1) of each of the bandits belonging to the model. Evaluating an individual using Γ is several times faster than evaluating it using the forward model.
                              𝑇𝑈𝐶𝐵 (𝜏 (𝑠𝑡 )) = ∑𝑏∈Γ 𝑈𝐶𝐵(𝑏, 𝜏 (𝑠𝑡 ))                           (2)
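   The sketch below (ours, not the reference implementation) illustrates one possible layout of the model Γ for 𝑞 = 3: reward statistics are kept per action value for each of the three positions and per pair of action values for each pair of positions, and Equation 2 is computed as the sum of the per-bandit UCB values. Actions are assumed to be hashable identifiers.

import math
from collections import defaultdict


class NTupleBanditModel:
    """Sketch of the six-bandit model: one 1D bandit per action position and one
    2D bandit per pair of positions, for turns of q = 3 actions."""

    def __init__(self, q=3, c=1.414, eps=0.05):
        self.tuples = [(i,) for i in range(q)] + \
                      [(i, j) for i in range(q) for j in range(i + 1, q)]
        # For each tuple of positions, map the action values seen there to [reward sum, visits].
        self.stats = {t: defaultdict(lambda: [0.0, 0]) for t in self.tuples}
        self.c, self.eps = c, eps
        self.total_updates = 0

    def update(self, turn, score):
        """Store a forward-model score in every bandit the turn touches."""
        for t in self.tuples:
            entry = self.stats[t][tuple(turn[i] for i in t)]
            entry[0] += score
            entry[1] += 1
        self.total_updates += 1

    def evaluate(self, turn):
        """Equation 2: sum of the UCB values (Equation 1) over all bandits."""
        total = 0.0
        for t in self.tuples:
            reward_sum, visits = self.stats[t][tuple(turn[i] for i in t)]
            mean = reward_sum / visits if visits else 0.0
            bonus = math.sqrt(math.log(max(self.total_updates, 1)) / (visits + self.eps))
            total += mean + self.c * bonus
        return total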

    The previous process returns a candidate that may be a better solution than the current one. It is evaluated using the forward model and the model Γ is updated with the score obtained. Then, if its score is better than the one obtained by the current individual, the candidate becomes the new current individual. If not, the current individual remains the same.
Algorithm 1 Proposed NTBOE algorithm
Require: 𝑁𝑝 ∈ ℕ+ : Number of initial evaluations
Require: 𝑁𝑛 ∈ ℕ+ : Number of neighbors
Require: 𝑁𝑖 ∈ ℕ+ : Number of iterations
Require: 𝛽 ∈ [0, 1]: Mutation probability
Ensure: 𝑎𝑐𝑡𝑖𝑜𝑛𝑠: best action combination obtained
 1: Γ ← CreateModel()
 2: 𝑐𝑢𝑟𝑟𝑒𝑛𝑡, 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒 ← Initialization(𝑁𝑝 , Γ)
 3: 𝑖 ← 0
 4: while 𝑖 < 𝑁𝑖 do
 5:     Π ← GetNeighbors(𝑐𝑢𝑟𝑟𝑒𝑛𝑡, 𝑁𝑛 , 𝛽)
 6:     𝑗 ← 0
 7:     𝑏𝑒𝑠𝑡_𝑠𝑐𝑜𝑟𝑒 ← −∞
 8:     𝑏𝑒𝑠𝑡_𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ← None
 9:     while 𝑗 < 𝑁𝑛 do
10:         𝑠𝑐𝑜𝑟𝑒 ← EvaluateWithModel(Π(𝑗), Γ)
11:         if 𝑠𝑐𝑜𝑟𝑒 > 𝑏𝑒𝑠𝑡_𝑠𝑐𝑜𝑟𝑒 then
12:             𝑏𝑒𝑠𝑡_𝑠𝑐𝑜𝑟𝑒 ← 𝑠𝑐𝑜𝑟𝑒
13:             𝑏𝑒𝑠𝑡_𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 ← Π(𝑗)
14:         end if
15:         𝑗 ← 𝑗 + 1
16:     end while
17:     𝑠𝑐𝑜𝑟𝑒 ← EvaluateWithForwardModel(𝑏𝑒𝑠𝑡_𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟)
18:     UpdateModel(𝑏𝑒𝑠𝑡_𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟, 𝑠𝑐𝑜𝑟𝑒, Γ)
19:     if 𝑠𝑐𝑜𝑟𝑒 > 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒 then
20:         𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒 ← 𝑠𝑐𝑜𝑟𝑒
21:         𝑐𝑢𝑟𝑟𝑒𝑛𝑡 ← 𝑏𝑒𝑠𝑡_𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟
22:     end if
23:     𝑖 ← 𝑖 + 1
24: end while
25: 𝑎𝑐𝑡𝑖𝑜𝑛𝑠 ← 𝑐𝑢𝑟𝑟𝑒𝑛𝑡

  In the end, the current individual is the action combination (or turn) 𝜏 (𝑠𝑡 ) returned as the solution.

7. Experiments and Results
7.1. Experimental set-up
The performance of the proposed NTBOE method is assessed by playing games against the two baselines and the two state-of-the-art methods presented in Section 5. In all cases, 1000 games have been played between each pair of methods. Each algorithm is the first player for half of the games played against any other algorithm and the second player for the other half.
Algorithm 2 Initialization of the NTBOE algorithm
Require: 𝑁𝑝 ∈ ℕ+ : Number of initial evaluations
Require: Γ: N-Tuple Bandits Model
Ensure: 𝑐𝑢𝑟𝑟𝑒𝑛𝑡: best action combination evaluated
Ensure: 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒: score of the best action combination
 1: Π ← RandomPopulation(𝑁𝑝 )
 2: 𝑗 ← 0
 3: 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒 ← −∞
 4: 𝑐𝑢𝑟𝑟𝑒𝑛𝑡 ← None
 5: while 𝑗 < 𝑁𝑝 do
 6:    𝑠𝑐𝑜𝑟𝑒 ← EvaluateWithForwardModel(Π(𝑗))
 7:    UpdateModel(Π(𝑗), 𝑠𝑐𝑜𝑟𝑒, Γ)
 8:    if 𝑠𝑐𝑜𝑟𝑒 > 𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒 then
 9:        𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑠𝑐𝑜𝑟𝑒 ← 𝑠𝑐𝑜𝑟𝑒
10:        𝑐𝑢𝑟𝑟𝑒𝑛𝑡 ← Π(𝑗)
11:    end if
12:    𝑗 ← 𝑗 + 1
13: end while


This reduces the bias introduced by the fact that the game is easier to win as the first player. All algorithms have a budget to make decisions, defined as a time limit of 1 second per turn. Regarding the evaluation of game states where a heuristic 𝜆(𝑠𝑡 ) is needed, all bots use the same one: the score difference between the players. Finally, the number of wins is used to evaluate the performance of each bot.
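   As a reference, the shared heuristic can be sketched as follows (get_score is a hypothetical accessor on the game state, not necessarily the name used in the ASMACAG code):

def score_difference_heuristic(state, player_id, opponent_id):
    """Heuristic lambda(s) used by all bots: own score minus the opponent's score."""
    # 'get_score' is a hypothetical accessor on the ASMACAG game state.
    return state.get_score(player_id) - state.get_score(opponent_id)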
   Two experiments have been performed:

   1. The best parameter combination is obtained for the MCTS, OE and NTBOE bots. For each of these bots, a set of possible values per parameter is chosen and the other parameters are fixed to particular values. Then, 1000 games are played against the OSLA player for each possible value of the parameter, keeping the rest of the parameters constant. Table 1 shows the configuration of these experiments.
   2. An all-versus-all tournament has been performed among the tuned versions of the bots, with 1000 games played between each pair of methods (the pairing protocol is sketched below).
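   The pairwise evaluation protocol can be sketched as follows; play_game is a hypothetical helper that runs one ASMACAG match and returns the winning bot (or None for a tie):

def tournament(bot_a, bot_b, play_game, n_games=1000):
    """Play n_games between two bots, alternating who moves first to reduce the
    first-player bias, and count the wins and ties."""
    results = {"bot_a": 0, "bot_b": 0, "ties": 0}
    for i in range(n_games):
        first, second = (bot_a, bot_b) if i % 2 == 0 else (bot_b, bot_a)
        winner = play_game(first, second)
        if winner is bot_a:
            results["bot_a"] += 1
        elif winner is bot_b:
            results["bot_b"] += 1
        else:
            results["ties"] += 1
    return results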


7.2. Results
Table 2 shows the results of the experiment performed to tune the parameter 𝐶 in MCTS. The optimum value is 𝐶 = 8. Table 3 shows the results of the experiments done for tuning the parameters of the OE algorithm. According to the results obtained, the parameters of this method have been set to 𝑁𝑝 = 125, 𝛼 = 0.15, 𝛽 = 0.15. Table 4 shows the results of the experiments done for tuning the parameters of the proposed NTBOE algorithm. According to
Table 1
Bot to be tuned, parameter to be tuned, possible values and how other parameters have been fixed.
              Bot     Parameter                       Values                    Other parameters
             MCTS          𝐶            {0.5, 1.414, 3, 8, 14, 20, 30, 45}             -
              OE          𝑁𝑝                  {25, 75, 125, 175}               𝛼 = 0.15, 𝛽 = 0.35
              OE          𝛽                {0.05, 0.15, 0.25, 0.35}            𝑁𝑝 = 75, 𝛼 = 0.35
              OE          𝛼              {0.05, 0.15, 0.35, 0.55, 0.75}        𝑁𝑝 = 75, 𝛽 = 0.15
            NTBOE         𝑁𝑛                 {5, 10, 20, 50, 120}              𝛽 = 0.3, 𝑁𝑝 = 500
             NTBOE         𝛽               {0.05, 0.15, 0.3, 0.55, 0.8}         𝑁𝑛 = 20, 𝑁𝑝 = 500
            NTBOE         𝑁𝑝              {50, 100, 500, 1000, 3000}           𝑁𝑛 = 20, 𝛽 = 0.3


Table 2
Results table from the experiment for adjusting MCTS Player’s C value. Best result in bold.
                               C value      MCTS wins      OSLA wins         Ties
                                 0.5            502            485           13
                                1.414           509            473           18
                                  3             523            458           19
                                  8             545            436           19
                                  14            520            470           10
                                  20            507            478           15
                                  30            477            507           16
                                  45            458            530           12


the results obtained, the parameters of this method have been set to 𝑁𝑛 = 5, 𝛽 = 0.55, 𝑁𝑝 = 1000.
   Figure 1 shows the results of the second experiment. It can be seen that the OSLA player is the worst-performing bot, which is the expected result given that it makes no attempt to find a good combination of actions for the whole turn. The MCTS player follows relatively closely, showing that it can find combinations of actions, but not as efficiently as the evolutionary algorithms. Both the OE player and the NTBOE player are well ahead, because they use their evolutionary aspect to find good combinations of actions very efficiently. Between them, NTBOE shows an advantage that, while not extremely large, is clearly significant; in the particular match against OE, the proposed NTBOE method was able to win 60% of the games. This shows that the novel approach to multi-action game playing proposed in this work can yield better results than the current state-of-the-art algorithms.
   The experiments presented in this work can be easily reproduced by running the Python script https://github.com/dvs99/ASMACAG/blob/master/reproduce_experiments.py.

8. Conclusions
This paper has presented a novel method called N-Tuple Bandit Online Evolution (NTBOE) to play multi-action adversarial games. The results obtained have shown that the proposed NTBOE algorithm
Table 3
Results table from the experiment for adjusting OE Player’s 𝑁𝑝 , 𝛽 and 𝛼 parameters. Best result in bold.

                           𝑁𝑝          𝛽         𝛼    OE wins     OSLA wins     Ties
                           25        0.15    0.35       669         314         17
                           75        0.15    0.35       695         292         13
                          125        0.15    0.35       704         286         10
                          175        0.15    0.35       692         292         16
                           75        0.05    0.35       680         302         18
                          75         0.15    0.35       695         292         13
                           75        0.25    0.35       678         303         19
                           75        0.35    0.35       675         308         17
                           75        0.15    0.05       688         302         10
                          75         0.15    0.15       700         283         17
                           75        0.15    0.35       695         292         13
                           75        0.15    0.55       671         306         23
                           75        0.15    0.75       675         311         14


Table 4
Results table from the experiment for adjusting NTBOE 𝑁𝑛 , 𝛽 and 𝑁𝑝 parameters. Best result in bold.

                        𝑁𝑛       𝛽          𝑁𝑝       NTBOE wins     OSLA wins     Ties
                         5       0.3        500         733            250           17
                        10       0.3        500         733            254           13
                        20       0.3        500         721            268           11
                        50       0.3        500         668            309           23
                        120      0.3        500         569            410           21
                        20      0.05        500         730            248           22
                        20      0.15        500         731            257           12
                        20       0.3        500         721            268           11
                        20      0.55        500         772            216           12
                        20       0.8        500         727            254           19
                        20       0.3         50         736            245           19
                        20       0.3        100         733            249           18
                        20       0.3        500         721            268           11
                        20       0.3       1 000        744            244           12
                        20       0.3       3 000        720            266           14


outperforms state-of-the-art methods when playing ASMACAG. Future work will focus on assessing the performance of NTBOE in other multi-action games.


References
 [1] N. Justesen, T. Mahlmann, S. Risi, J. Togelius, Playing multiaction adversarial games:
     Online evolutionary planning versus tree search, IEEE Trans. on Games 10 (2018) 281–291.
Figure 1: Results chart from the second experiment (games against the Random player and ties excluded).


 [2] S. M. Lucas, J. Liu, D. Perez-Liebana, The n-tuple bandit evolutionary algorithm for game
     agent optimisation, in: Proc. of IEEE Congress on Evolutionary Computation (CEC’18),
     2018.
 [3] R. Montoliu, R. D. Gaina, D. Perez-Liebana, D. Delgado, S. M. Lucas, Efficient heuristic policy
     optimisation for a challenging strategic card game, in: Inter. Conf. on the Applications of
     Evolutionary Computation (EvoStar’20), 2020.
 [4] S. M. Lucas, J. Liu, I. Bravi, R. D. Gaina, J. Woodward, V. Volz, D. Perez-Liebana, Efficient
     evolutionary methods for game agent optimisation: Model-based is best, in: Game
     Simulations Workshop (AAAI’19), 2019.
 [5] K. Kunanusont, S. M. Lucas, D. Perez-Liebana, Modeling player experience with the
     n-tuple bandit evolutionary algorithm, in: 14th AAAI Conf. on Artificial Intelligence and
     Interactive Digital Entertainment, 2018.
 [6] D. Villabrille-Seca, R. Montoliu, Asmacag: Un nuevo juego de cartas multi acción para
     facilitar el estudio de técnicas de inteligencia artificial, in: 1er Congreso Internacional de
     DiGRA España 2021 (DIGRAES21), 2021.
 [7] N. Justesen, T. Mahlmann, J. Togelius, Online evolution for multi-action adversarial games,
     in: EvoApplications, 2016.
 [8] D. Pérez, S. Samothrakis, S. Lucas, P. Rohlfshagen, Rolling horizon evolution versus tree
     search for navigation in single-player real-time games, in: Proc. of the 15th Annual Conf.
     on Genetic and Evolutionary Computation, 2013.
 [9] H. Baier, P. I. Cowling, Evolutionary mcts for multi-action adversarial games, 2018 IEEE
     Conf. on Computational Intelligence and Games (CIG’18) (2018) 1–8.
[10] P. Cowling, E. Powley, D. Whitehouse, Information set monte carlo tree search, IEEE
     Trans. on Computational Intelligence and AI in Games 4 (2012) 120–143.
[11] R. Myerson, Game Theory: Analysis of Conflict, Harvard University Press, 1997.
[12] R. Coulom, Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, in:
     5th Int. Conf. on Computer and Games, 2006.
[13] G. Chaslot, S. Bakkes, I. Szita, P. Spronck, Monte-carlo tree search: A new framework for
     game ai, in: Proc. of the 4th AAAI Conf. on AI and Interactive Digital Entertainment, 2008.