     Training Pac-Man bots using Reinforcement
        Learning and Case-based Reasoning⋆

    Fernando Domínguez-Estévez, Antonio A. Sánchez-Ruiz, and Pedro Pablo
                                Gómez-Martín

                 Dep. Ingeniería del Software e Inteligencia Artificial
                    Universidad Complutense de Madrid (Spain)
              fedomi01@ucm.es, antsanch@ucm.es, pedrop@fdi.ucm.es

⋆ Supported by the Spanish Ministry of Economy, Industry and Competitiveness under
  grant TIN2014-55006-R



        Abstract. Video games are an interesting field of study for many ar-
        tificial intelligence researchers, since many different AI methods can be
        studied and tested with them, and later those investigations can be ap-
        plied to many other situations. In this paper we use case-based reasoning
        and reinforcement learning principles to train bots to play the Ms. Pac-
        Man vs. Ghosts game. In particular, we use the well-known Q-learning
        algorithm but we replace the Q-table with a case base. The use of cases
        allows us to deal with a rich game state representation and to inject domain
        knowledge in both the retrieval and the adaptation stages. Our initial
        experiments show that we can train bots either to reach high scores or
        to survive for a long time. However, the combination of both goals seems
        to be a more challenging problem.

        Keywords: Reinforcement Learning, Case-based Reasoning, Pac-Man,
        video games


1     Introduction

Video games are an interesting field of study for many Artificial Intelligence
(AI) researchers, since they provide complex but controlled environments in
which many AI techniques can be studied and tested. Video games are also very
interesting to compare different AI techniques and learn their strengths and
weaknesses.
    In this work we use one of the most popular video games of all time: Ms.
Pac-Man vs. Ghosts [14]. Although Pac-Man may seem simple compared to
other kinds of games, its special features make it a perfect target for AI
researchers. On the one hand, the game is simple enough to be quickly understood
and it does not require a very powerful machine to run. On the other hand, in
spite of its simplicity, lots of different strategies can be applied while playing,
some of them more intelligent than others, so it is the task of AI techniques to
discover the best strategies. In fact, Ms. Pac-Man vs. Ghosts has been used in
several AI competitions during the last years [5] and it is going to be used again
this year at the CIG conference [13].
    There are two main approaches to train automatic bots to play videogames
without human intervention: Genetic Algorithms (GA) [6] and Reinforcement
Learning (RL) [11]. Although both approaches learn how to play by playing
thousands of games and trying different behaviors, they require different infor-
mation from the game simulation and they explore the space of possible solutions
in very different ways. GAs work with populations of solutions that are combined
and mutated to produce new generations of better solutions. The probability that
each individual survives and is selected for the next generation is proportional
to its fitness, i.e., how well it plays the game. RL algorithms, on the other hand,
learn policies that associate “good” actions with different game states based on
rewards obtained during the game simulation.
    In this paper we use case-based reasoning and reinforcement learning princi-
ples to train bots to play the Ms. Pac-Man vs. Ghosts game. In particular, we
use the well-known Q-learning algorithm [12] but we replace the Q-table with a
case base. The use of cases allows us to deal with a rich game state representation
and to inject domain knowledge in both the retrieval and the adaptation stages.
    The rest of the paper is organized as follows. Section 2 briefly describes
the video game used in our experiments. Section 3 describes the foundations
of the techniques we use for learning: Case-based Reasoning and Reinforcement
Learning. Section 4 explains the decisions we made to implement our agent in
the particular context of a Pac-Man game. Section 5 explains the experiments
performed and the limitations that we found. Finally, the paper closes with
related work, conclusions and future work.


2     Ms. Pac-Man vs. Ghosts AI

Ms. Pac-Man is an arcade game released in 1982 as an unauthorized modification of
the original Pac-Man game. Such was its popularity that Namco decided to make
an official version of the game. It is very similar to the original, since the
goal is still to survive in a maze, escaping from four ghosts while picking up
pills and fruits that give points to the player.
    The version of the game that we use, Ms. Pac-Man vs. Ghosts AI1 (Figure 1),
is a Java implementation designed to test different AI techniques. The be-
havior of both Pac-Man and the ghosts can be specified by implementing controller
classes that decide the next move (left, right, up or down) depend-
ing on the game state. This framework has been used in different Pac-Man AI
competitions and provides some example bots implementing simple behaviors.
    In our experiments we use the default controller for the ghosts that tries to
reproduce the behavior of the ghosts in the original game. It is interesting to
note that each one of the 4 ghosts uses a different heuristic to decide how to
move.
1
    https://github.com/kefik/MsPacMan-vs-Ghosts-AI
       Fig. 1: A screen shot of the Ms. Pac-Man vs. Ghosts video game.


    The goal of the player is to maximize the score, and he obtains points each
time he eats a pill (10), a power pill (50), or an edible ghost (200). In addition,
the player obtains extra points if he eats more than one ghost in a row (a score
multiplier is applied). A level or maze is completed when Pac-Man eats all the
pills and power pills or he survives for 3000 turns. A turn is a tick of the game
in which all elements perform an action and the game is updated; there are 25
ticks per second.
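
    To make these scoring rules concrete, the following minimal Java sketch (our
own illustration, not code from the framework) shows how a per-turn reward signal
could be derived from the change in game score; the death penalty value is a
hypothetical choice.

/** Illustrative per-turn reward derived from the scoring rules above:
 *  10 points per pill, 50 per power pill, 200 (and more, with the multiplier)
 *  per edible ghost. The death penalty of -100 is a hypothetical value. */
public final class ScoreReward {

    public static double reward(int previousScore, int currentScore, boolean pacManDied) {
        double pointsGained = currentScore - previousScore;
        return pacManDied ? pointsGained - 100.0 : pointsGained;
    }
}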


3     Case-Based Reinforcement Learning
3.1   Case-based Reasoning
Case-based reasoning (CBR) [1] is a learning and problem solving technique
based on the intuition that similar problems usually have similar solutions. In-
stead of trying to solve each problem from scratch, problems are solved by looking
for similar past experiences and adapting their solution to the current context.
In some domains, this adaptation process can be more effective than computing
a new solution from scratch because it can take advantage of available expert
domain knowledge. CBR thus suggests a model of reasoning that brings together
problem solving, understanding and learning from the past.
    A case represents one of these experiences and it is usually made up of a de-
scription of the problem and its context, a description of the solution adopted
and some measure of how successful that solution was at solving the problem. A
case base is a collection of cases or past experiences.
             Fig. 2: CBR cycle according to Aamodt and Plaza [1].

    Figure 2 shows the typical cycle of a CBR system. When the system faces a
new problem, it looks for past similar problems in the case base (retrieval). The
definition of similarity is usually domain dependent and can take advantage of
the available expert knowledge. Once the most similar case has been retrieved,
its solution is adapted to work in the current context since the retrieved case
is usually not identical to the current problem (reuse). The adaptation stage is
also domain dependent. The new adapted solution is used to solve the current
problem with more or less success (revise) and can be checked by a domain
expert. Finally, a new case is created to describe the current problem and its
solution, and the case is incorporated into the case base (retain).
    The quality of a CBR system depends, therefore, on the ability to understand
new situations and detect similar past experiences, the ability to adapt those past
experiences to the current context, and the ability to integrate new experiences
into its memory.


3.2   Reinforcement Learning

Reinforcement Learning (RL) [11] tries to solve the problem of finding sequences
of actions an agent should take in order to maximize some numerical reward.
The agent is not told what actions to take, but instead it has to discover by
trial and error which ones return the biggest reward in the long term.
    An RL agent interacts with the environment in discrete time steps, in each
of which it receives a representation of the environment known as the state. The
agent must then choose the best action to execute in the current state. That
action will lead it to a new state, and the agent will receive a reward. The goal
of the agent is to maximize the total reward obtained during a simulation. Note
that the optimal action might not be the one that maximizes the immediate reward
(a greedy approach), but another action that allows it to obtain a higher total
reward during the simulation.
 Algorithm 1: Q-learning algorithm
 1 Initialize Q(s,a) arbitrarily;
 2 repeat
 3     Initialize s;
 4     repeat
 5          Choose a from s using policy derived from Q;
 6          Take action a, observe r, s';
 7          Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)];
 8          s ← s';
 9     until s is terminal;
10 until no more episodes;




    One of the most important challenges in RL is how to balance exploitation
and exploration. To get the best reward, the agent might be tempted to select
actions that worked in the past. But to discover new sequences of actions it has
to take decisions that it has not taken before. So the agent must try as many
actions as possible, each of them a sufficient number of times, and progressively
favor those that appear to be the best.
    One can now identify some key elements in reinforcement learning:

 – State: A representation of the environment at some point of the learning
   process.
 – Policy: Defines the behavior of the agent at a given time. It maps each
   state to the action the agent will take from there.
 – Reward function: It defines which actions are good or bad for the learner,
   giving the reward obtained for performing an action in a given state.
 – Value function: Specifies what is good in the long run for the agent. It
   assigns values to states depending on the expected reward the learner can
   accumulate from them in the future, and thus indicates which states are more
   desirable for the agent to be in.

    One of the most important breakthroughs in reinforcement learning was the
development of the Q-learning algorithm [12] (see Algorithm 1). It works by main-
taining a table (called the Q-table) that maps each state to all the actions the
agent can perform in that state, and assigns each of them a numerical value
depending on how good it is for the agent in the long term. These are the values
the algorithm updates in each simulation cycle. The following equation states how
the value of an action a taken in a state s is modified each time the agent
performs it:

      Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where α and γ are the learning rate and the discount factor, respectively.
    With this update, the learned Q-function converges directly to its optimal
values, independently of the policy being followed. This enormously simplifies the
analysis of the algorithm and its convergence proofs. The only requirement on the
policy to guarantee convergence is that all state-action pairs continue to be
updated.
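
    As an illustration, the following self-contained Java sketch (Java to match the
framework used later in the paper) implements the tabular update of Algorithm 1;
the string encoding of states and the ε-greedy selection are our own simplifications.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Minimal tabular Q-learning sketch following Algorithm 1 (illustrative only). */
public class TabularQLearning {

    private final Map<String, double[]> qTable = new HashMap<>(); // state -> one Q-value per action
    private final int numActions;
    private final double alpha;  // learning rate
    private final double gamma;  // discount factor
    private final Random random = new Random();

    public TabularQLearning(int numActions, double alpha, double gamma) {
        this.numActions = numActions;
        this.alpha = alpha;
        this.gamma = gamma;
    }

    private double[] q(String state) {
        return qTable.computeIfAbsent(state, s -> new double[numActions]);
    }

    /** Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] */
    public void update(String s, int a, double r, String sNext) {
        double best = Double.NEGATIVE_INFINITY;
        for (double v : q(sNext)) best = Math.max(best, v);
        double[] qs = q(s);
        qs[a] += alpha * (r + gamma * best - qs[a]);
    }

    /** Epsilon-greedy policy derived from the current Q-values. */
    public int chooseAction(String s, double epsilon) {
        if (random.nextDouble() < epsilon) return random.nextInt(numActions);
        double[] qs = q(s);
        int best = 0;
        for (int a = 1; a < numActions; a++) if (qs[a] > qs[best]) best = a;
        return best;
    }
}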
 Algorithm 2: Case-based Q-learning algorithm
 1 Initialize case base arbitrarily;
 2 repeat
 3     Initialize case c;
 4     repeat
 5          Search for the most similar case c' in the case base, get similarity s
              between c and c' (0 ≤ s ≤ 1);
 6          Choose a from c';
 7          Take action a, observe r, c'';
 8          Q(c, a) ← Q(c, a) + α [s · r + γ max_a' Q(c'', a') − Q(c, a)];
 9          if c and c' are distinct enough then
10              add c to the case base;
11          end
12          c ← c'';
13     until no more steps in the episode;
14 until no more episodes;




3.3   Using cases to approximate the Q function


In the general case, the number of possible states and actions leads to such a
combinatorial explosion of pairs that it is not possible to store the Q-table in
memory. Even if we could store it, we would require a huge number of simulations
to update its values and converge to the optimal policy, since we might not find
the same states in different simulations.
    The Q-table can be generalized to a function Q : S × A → R that maps
pairs of states and actions to their expected future reward during the simula-
tion. There are different approaches to approximate the Q function using neural
networks [7], decision trees [9] and many other machine learning techniques. In
this work we propose to use a case base.
    The use of cases has some interesting advantages, such as the possibility of
injecting expert domain knowledge into the similarity measure used to retrieve
similar cases and into the adaptation stage. It also has some disadvantages com-
pared to other techniques that build domain models, because CBR is memory
intensive and usually slow when the number of cases is large.
    Algorithm 2 shows the modified version of Q-learning adapted to work with
a case base. The first difference is that now we retrieve the most similar case to
the current state. Since the retrieved case will hardly be an exact match to the
current state, its solution must be adapted. We also use the similarity value to
weight the contribution of the reward during the update of the Q-value, so that
the influence is greater for higher similarities. Finally, we only store the
new state as a new case in the case base if it is different enough from the cases
already known. That way, we can limit the number of cases and the memory
requirements of the algorithm and tune the reusability of the cases.
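
    The following Java sketch illustrates one learning step of this case-based
variant. It is a simplified reading of Algorithm 2 under our own assumptions:
cases store discretized feature vectors, the placeholder similarity is the fraction
of matching features, the Q update is applied to the retrieved case (the stored
approximation of the current state), and the "distinct enough" test retains the
current state when its dissimilarity to the retrieved case exceeds a threshold.

import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of one step of case-based Q-learning (Algorithm 2). */
public class CaseBasedQLearning {

    /** A case: a discretized game-state description plus one Q-value per action. */
    public static class Case {
        final int[] features;
        final double[] qValues;
        int timesRetrieved = 0;

        Case(int[] features, int numActions) {
            this.features = features;
            this.qValues = new double[numActions];
        }
    }

    private final List<Case> caseBase = new ArrayList<>();
    private final int numActions;
    private final double alpha, gamma, distinctThreshold;

    public CaseBasedQLearning(int numActions, double alpha, double gamma, double distinctThreshold) {
        this.numActions = numActions;
        this.alpha = alpha;
        this.gamma = gamma;
        this.distinctThreshold = distinctThreshold;
    }

    /** Placeholder similarity in [0,1]: fraction of matching (discretized) features. */
    private double similarity(int[] a, int[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) matches++;
        return (double) matches / a.length;
    }

    /** Most similar stored case to the given state, or null if the case base is empty. */
    public Case mostSimilar(int[] state) {
        Case best = null;
        double bestSim = -1.0;
        for (Case c : caseBase) {
            double s = similarity(state, c.features);
            if (s > bestSim) { bestSim = s; best = c; }
        }
        return best;
    }

    /** One learning step: c is the current state, retrieved its most similar case,
     *  action the move executed, reward the observed reward and next the new state. */
    public void step(int[] c, Case retrieved, int action, double reward, int[] next) {
        double s = (retrieved == null) ? 0.0 : similarity(c, retrieved.features);

        // Expected future value, taken from the case most similar to the new state.
        double futureBest = 0.0;
        Case nextCase = mostSimilar(next);
        if (nextCase != null) {
            futureBest = Double.NEGATIVE_INFINITY;
            for (double v : nextCase.qValues) futureBest = Math.max(futureBest, v);
        }

        // Q update on the retrieved case, weighting the reward by the similarity s.
        if (retrieved != null) {
            retrieved.timesRetrieved++;
            double old = retrieved.qValues[action];
            retrieved.qValues[action] = old + alpha * (s * reward + gamma * futureBest - old);
        }

        // Retain the current state as a new case when it is distinct enough (our
        // reading: dissimilarity to the retrieved case above distinctThreshold).
        if (retrieved == null || (1.0 - s) > distinctThreshold) {
            caseBase.add(new Case(c, numActions));
        }
    }
}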
4     A Case-Based Reinforcement Learning bot for
      Pac-Man

In this section we describe the different decisions we made to implement our Pac-
Man bot. These decisions involve the features chosen to represent the game state,
the granularity of the actions learned, the similarity function used to compare
similar situations, the adaptation strategy to reuse the retrieved solution in the
current context, the case base management policy and the configuration of the
meta-parameters for the Q-learning algorithm. These meta-parameters are a
learning rate α = 0.2, a discount factor γ = 0.8 and an ε-greedy exploration
strategy with an initial value ε = 1 that decreases linearly with the number
of episodes.
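
    A linear decay over the training episodes could look like the following sketch;
the final floor value of 0.05 and the use of 10000 episodes in the example are our
own assumptions, since the paper only states that ε decreases linearly from 1.

/** Linearly decaying epsilon for the epsilon-greedy exploration described above.
 *  The floor value is an assumption; the paper only states a linear decrease from 1. */
public final class EpsilonSchedule {

    public static double epsilon(int episode, int totalEpisodes, double start, double floor) {
        double fraction = Math.min(1.0, (double) episode / totalEpisodes);
        return Math.max(floor, start - (start - floor) * fraction);
    }

    public static void main(String[] args) {
        // Example with 10000 training episodes, decaying from 1.0 towards 0.05.
        System.out.println(epsilon(0, 10000, 1.0, 0.05));      // 1.0
        System.out.println(epsilon(5000, 10000, 1.0, 0.05));   // 0.525
        System.out.println(epsilon(10000, 10000, 1.0, 0.05));  // 0.05
    }
}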


4.1   Cases

A case represents the values of the Q-function for one particular game state.
Remember that Q : S × A → R is a function that maps states and actions
to expected reward values in the long term. Our cases contain, therefore, the
description of a particular game state and a set of (action, Q-value) pairs rep-
resenting the expected rewards in the long term if those actions were chosen in
the current state.
    A game state is represented by the state of the gameboard: the position of
the walls, pills, power pills, ghosts and Pac-Man. The number of possible states
is too high so we need to work with a more abstract representation. In order to
choose a set of features, we identified the most relevant information for a human
player:

 – Distances to the closest pill in each direction (d1 ).
 – Distances to the closest power pill in each direction (d2 ).
 – Distances to the closest non edible ghost in each direction (d3 ).
 – Distances to the closest edible ghost in each direction (d4 ).

    The maze is represented as a graph in the game engine, so these are the
distances in number of nodes from one node to another. Distances in every
direction are calculated using breadth-first search. Since there are 4 directions
(up, down, left, right), each di is a vector of 4 components and the game state
is described with 16 parameters. In order to speed up the learning process, we
precompute the distances between all the positions in the map and sort them by
distance so that the search for the closest entity can be done in linear time.
    Even with this abstract representation of the game state, the number of
possible states is huge. In order to reduce the search space, we discretized the
distances into 5 possible values: WALL (when there is a wall in the path, so Pac-
Man can't go in that direction), VERY CLOSE (less than 4 units, that is, one
tile), CLOSE (between 4 and 48 units, 1-12 tiles), MEDIUM (between 49 and
224 units, 13-56 tiles) and FAR (225 units, 57 tiles, or more). This way, the
number of possible states is “only” 5^16 = 152,587,890,625. As a reference, the
Pac-Man labyrinth size is 28 × 26 tiles.
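
    A direct Java transcription of this discretization could look as follows; the
negative-distance convention for blocked directions is our own.

/** Discretization of maze distances into the five buckets described above.
 *  Distances are measured in maze units, with 4 units per tile. A negative
 *  distance is used here as a convention for "a wall blocks that direction". */
public enum DistanceBucket {
    WALL, VERY_CLOSE, CLOSE, MEDIUM, FAR;

    public static DistanceBucket of(int distance) {
        if (distance < 0) return WALL;
        if (distance < 4) return VERY_CLOSE;   // less than one tile
        if (distance <= 48) return CLOSE;      // 1-12 tiles
        if (distance <= 224) return MEDIUM;    // 13-56 tiles
        return FAR;                            // 57 tiles or more
    }
}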
    Regarding the actions, Pac-Man can move in any of the four main directions:
UP, DOWN, RIGHT and LEFT, as long as there is not a wall blocking its path.
Staying still is not allowed, so there are at most four possible actions in each
state. It is important to note that decisions are only made when Pac-Man reaches
an intersection (to avoid hesitation) or when there is a ghost close to Pac-Man.

4.2   Similarity function
We need a way of comparing cases, so that when the agent is in some situation it
can retrieve the most suitable case from the case base. This involves searching
for the most similar case and then adapting its solution to the current context.
There are many ways in which cases can be compared, using different kinds of
linear and non-linear functions.
    In this work we use a simple linear combination of the features described
in the previous section, so we can use different weights to give more or less
importance to each feature depending on the learning goal:


                     sim(c, c') = 1 − dist(c, c')

                     dist(c, c') = Σ_{i=1..4} w_i · dist_v(d_i, d'_i)

                     dist_v(d_i, d'_i) = Σ_{j=1..4} δ_j ,
                        where δ_j = 1 if d_{ij} = d'_{ij} and δ_j = 0 if d_{ij} ≠ d'_{ij}


    The role of the similarity function is to increase the reusability of each case
so the Q-values can be used not only in that game state but also in similar ones.
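
    A minimal sketch of this similarity computation is shown below. Since the
measure must yield values in [0, 1] (the range assumed by Algorithm 2), we assume
here that dist_v is the fraction of the four direction components that take
different discretized values and that the feature weights sum to 1; types and
names are illustrative.

/** Weighted similarity between two case descriptions (Section 4.2). Each feature
 *  d[i] is a vector of 4 discretized distances, one per direction. We assume
 *  dist_v is the fraction of differing components, so that sim stays in [0,1]
 *  when the feature weights sum to 1. */
public final class CaseSimilarity {

    public static double similarity(int[][] d, int[][] dPrime, double[] weights) {
        double distance = 0.0;
        for (int i = 0; i < d.length; i++) {
            distance += weights[i] * featureDistance(d[i], dPrime[i]);
        }
        return 1.0 - distance;
    }

    /** Fraction of the 4 direction components with different discrete values. */
    private static double featureDistance(int[] di, int[] diPrime) {
        int mismatches = 0;
        for (int j = 0; j < di.length; j++) {
            if (di[j] != diPrime[j]) mismatches++;
        }
        return (double) mismatches / di.length;
    }
}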

4.3   Adaptation
In Pac-Man the adaptation step is very straightforward, since actions are the
same for all cases. The only situation in which the best action cannot be applied
is when there is a wall blocking the path, so that the best action would mean
trying to go in a direction that is not available to the agent. When this occurs,
the selected case is discarded and the second most similar one is used instead.
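
    The adaptation step can therefore be sketched in a few lines; the caller is
expected to fall back to the next most similar case when −1 is returned (the
types and names are illustrative).

/** Adaptation step of Section 4.3: reuse the retrieved case's best action, and
 *  discard the case when a wall blocks that direction so that the caller can
 *  retry with the next most similar case. */
public final class Adaptation {

    /** Index of the action with the highest Q-value in the retrieved case. */
    public static int bestAction(double[] qValues) {
        int best = 0;
        for (int a = 1; a < qValues.length; a++) {
            if (qValues[a] > qValues[best]) best = a;
        }
        return best;
    }

    /** The case's best action if it is legal in the current state, or -1 to signal
     *  that this case must be discarded in favour of the next most similar one. */
    public static int adapt(double[] qValues, boolean[] legalMoves) {
        int action = bestAction(qValues);
        return legalMoves[action] ? action : -1;
    }
}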

4.4   Case base management
Case base management becomes a key task in the CBR process in order to main-
tain good performance. We use two different strategies to maintain a reasonable
number of cases in the case base.
    First, we only add a new case to the case base if it is different enough from
the existing ones. The intuition is that two very similar cases will probably have
similar Q-values associated with the actions, so we do not need to remember both
of them. This is done by looking at the result of the similarity function that
compares both cases during the retrieval stage. The threshold to learn a new
case is a similarity value that depends on the experiment parameters.
    Second, when the number of learned cases reaches 100000, we remove 10%
of them. In particular, we look for pairs of similar cases in the case base with
the same best action (according to the Q-values) and we “forget” the one that
has been retrieved less often. The intuition is that both cases represent similar
situations in which the agent should make the same decision so erasing one of
them should not lead to an important loss of information.
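
    The two maintenance strategies can be sketched as follows. The Case fields
mirror the earlier sketches, the similarity measure is passed in as a function,
and the greedy pairing used for forgetting is our own simplification of "look for
pairs of similar cases with the same best action".

import java.util.List;
import java.util.function.BiFunction;

/** Sketch of the two case base maintenance strategies described in this section. */
public final class CaseBaseMaintenance {

    public static class Case {
        int[] features;
        double[] qValues;
        int timesRetrieved;
    }

    /** Strategy 1: retain a new case only if it is different enough from its
     *  nearest neighbour (our reading of the similarity threshold). */
    public static boolean shouldRetain(double nearestSimilarity, double threshold) {
        return (1.0 - nearestSimilarity) > threshold;
    }

    /** Strategy 2: when the case base reaches maxCases, forget about 10% of the
     *  cases, preferring the less retrieved member of similar pairs that share
     *  the same best action. */
    public static void prune(List<Case> caseBase, BiFunction<Case, Case, Double> similarity,
                             double pairSimilarity, int maxCases) {
        if (caseBase.size() < maxCases) return;
        int toRemove = caseBase.size() / 10;
        for (int i = 0; i < caseBase.size() && toRemove > 0; i++) {
            for (int j = caseBase.size() - 1; j > i && toRemove > 0; j--) {
                Case a = caseBase.get(i);
                Case b = caseBase.get(j);
                if (similarity.apply(a, b) >= pairSimilarity && bestAction(a) == bestAction(b)) {
                    if (a.timesRetrieved <= b.timesRetrieved) {
                        caseBase.set(i, b);   // keep b in a's slot, forget a
                    }
                    caseBase.remove(j);       // slot j is now redundant
                    toRemove--;
                }
            }
        }
    }

    private static int bestAction(Case c) {
        int best = 0;
        for (int a = 1; a < c.qValues.length; a++) {
            if (c.qValues[a] > c.qValues[best]) best = a;
        }
        return best;
    }
}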


5   Experimental evaluation

We performed three different experiments with different goals and different
weights in the similarity function. The main aim is to make the bot develop
different behaviors depending on what information is more important and
what type of reward the agent gets during the simulation. All the experiments
consisted of playing 10000 episodes or games.


Experiment 1 In this first experiment the goal is to make the bot get as high a
score as possible. The weights used in the similarity function were: pill (0.9),
power pill (0), non-edible ghost (0.05) and edible ghost (0.05). The similarity
threshold to learn new cases was 0.001. The reward matches the game score.

Fig. 3: Scores of the RL agent and a random agent vs. number of learning
episodes.

    Figure 3 shows the evolution of the score of our RL agent (blue) and a random
agent (orange) that takes a random decision when it reaches an intersection. Since
the most important features are the distances to the pills, the RL agent tries to
eat them and only changes its route when there is a ghost close by. The final
agent is able to reach an average of 3510 points, and hits a maximum of 6150 in
some of the episodes at the end of the simulation, when the bot is just playing
and not learning any more.2

Experiment 2 In the second experiment the goal is to avoid the ghosts, in order
to survive for as long as possible. The weights used in the similarity function
were: pill (0), power pill (0), non-edible ghost (0.95) and edible ghost (0.05). The
similarity threshold to learn new cases was 0.01. The agent obtains a reward
for every turn it stays alive.

Fig. 4: Turns alive of the RL agent and a random agent vs. number of learning
episodes.

    Figure 4 shows the time the agents stay alive vs. the number of learning
episodes. The agent learns to stay alive from only 2 of the state features: the
distances to the edible and non-edible ghosts. It is interesting that Pac-Man
completes a level if he survives for 3000 turns, even if there are still pills
on the board, so the final RL agent is able to complete 3 or 4 levels in some
of the episodes.3

Experiment 3 The goal of the final experiment is to train an agent to obtain
both a high score and a long survival time. The weights used in the similarity
function were: pill (0.2), power pill (0.05), non-edible ghost (0.7) and edible
ghost (0.05). The similarity threshold to learn new cases was 0.025. The agent
obtains a reward for every turn it stays alive and every time it gets game points.
    Figure 5 shows the evolution of the score and the number of turns the agents
stay alive.4
    Table 1 summarizes the performance of the agent trained in each experiment
when playing 100 episodes without learning against the Legacy ghosts, a different
and harder ghost team than the default one used during learning.
2
  There is a video of this experiment in the following link: https://youtu.be/
  -fD8t7tRjG0
3
  https://youtu.be/4phEjPAGrbk
4
  https://youtu.be/fftrYjzTLPg
Fig. 5: Score and turns alive of the RL agent and a random agent vs. number of
learning episodes in experiment 3.

           Bot            Max Score Max Time Avg Score Avg Time
           Experiment 1      6150        1491        3510        1078
           Experiment 2      6320        5653        2022        1866
           Experiment 3      4310        4047        1918        1523
                   Table 1: Agent results for each experiment.



6   Related work and conclusions

Although the techniques described in this paper are not yet widely used in
commercial games, there is previous research work that has explored similar
ideas. For example, Gallagher and Ledwich [3], and later Bom and Henken [2],
studied how to train Pac-Man bots using neural networks, with some interesting
results, paving the way towards mastering this game with AI methods.
    Reinforcement learning has also been used in other domains such as first
person shooters. Lample and Chaplot [4] carried out very successful research in
this area using the FPS game Doom.
    There is also interesting CBR research with video games. Sharma et al. [10]
combined case-based reasoning and reinforcement learning to train bots to play
real-time strategy games.
    In this paper, we have also combined CBR and RL to study their applicability
to video games in general, and to investigate the feasibility of building a
learning agent that can develop different behaviors using these methods to beat
the Ms. Pac-Man game in particular.
    We have shown that both techniques are well suited for learning basic, inde-
pendent strategies, reaching an encouraging performance. Unfortunately, when
different, and in some situations opposite, strategies must be taken into account,
the results are not so promising. We plan to explore the causes of this and to
look for ways to balance the conflicting strategies, and to compare the resulting
bots with related work such as [8].
    The complete code of this research can be found in the following repository:
https://github.com/fedomi/PacMan-vs-Ghosts.git


References
 1. Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological
    variations, and system approaches. AI Commun. 7(1), 39–59 (Mar 1994)
 2. Bom, L., Henken, R., Wiering, M.: Reinforcement learning to train ms. pac-man
    using higher-order action-relative inputs. In: ADPRL. pp. 156–163. IEEE (2013),
    http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6596003
 3. Gallagher, M., Ledwich, M.: Evolving pac-man players: Can we learn from raw
    input? In: Proceedings of the 2007 IEEE Symposium on Computational Intelligence
    and Games, CIG 2007, Honolulu, Hawaii, USA, 1-5 April, 2007. pp. 282–287 (2007)
 4. Lample, G., Chaplot, D.S.: Playing FPS games with deep reinforcement learning.
    In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence,
    February 4-9, 2017, San Francisco, California, USA. pp. 2140–2146 (2017)
 5. Lucas, S.M.: Ms. Pac-Man competition (2007-2011). http://dces.essex.ac.uk/
    staff/sml/pacman/PacManContest.html, accessed: 2017-06-25
 6. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA,
    USA (1998)
 7. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G.,
    Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C.,
    Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis,
    D.: Human-level control through deep reinforcement learning. Nature 518(7540),
    529–533 (02 2015)
 8. Pepels, T., Winands, M.H.M., Lanctot, M.: Real-time monte carlo tree search in
    Ms Pac-Man. IEEE Transactions on Computational Intelligence and AI in Games
    6(3), 245–257 (Sep 2014)
 9. Pyeatt, L.D., Howe, A.E.: Decision tree function approximation in reinforcement
    learning. Tech. rep., In Proceedings of the Third International Symposium on
    Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models
    (1998)
10. Sharma, M., Holmes, M.P., Santamaría, J.C., Irani, A., Isbell Jr., C.L., Ram, A.: Trans-
    fer learning in real-time strategy games using hybrid CBR/RL. In: Veloso, M.M.
    (ed.) IJCAI 2007, Proceedings of the 20th International Joint Conference on Arti-
    ficial Intelligence, Hyderabad, India, January 6-12, 2007. pp. 1041–1046 (2007)
11. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning. MIT Press,
    Cambridge, MA, USA, 1st edn. (1998)
12. Watkins, C.J.C.H.: Learning from Delayed Rewards. Ph.D. thesis, King’s College,
    Cambridge, UK (May 1989)
13. Williams, P.R.: Ms. Pac-Man vs Ghosts AI. http://www.pacmanvghosts.co.uk,
    accessed: 2017-06-25
14. Williams, P.R., Liebana, D.P., Lucas, S.M.: Ms. Pac-Man versus Ghost team CIG
    2016 competition. In: IEEE Conference on Computational Intelligence and Games,
    CIG 2016, Santorini, Greece, September 20-23, 2016. pp. 1–8 (2016)