Using a Genetic Algorithm to Evolve a D* Search Heuristic

Andrew Giese and Jennifer Seitzer
University of Dayton
gieseanw@gmail.com, seitzer@udayton.edu

Abstract

Evolutionary computation (EC) is the sub-discipline of artificial intelligence that iteratively derives solutions using techniques from genetics. In this work, we present a genetic algorithm that evolves a heuristic static evaluation function (SEF) to be used in the real-time search navigation scheme of an autonomous agent. This coupling of algorithmic techniques (GAs with real-time search by autonomous agents) makes for interesting formalistic and implementation challenges. Genetic evolution implies the need for a fitness function to guide convergence in the solution being created. Thus, as part of this work, we present a fitness function that dictates the efficacy of a generated static evaluation function. Additionally, we use the simulator that will eventually benefit from this optimized SEF to provide feedback in the evolution process. That is, the simulator serves as the fitness function for the evolving SEF. In this work, we present algorithmic and formalistic designs, implementation details, and performance results of this multi-layered software endeavor.

This work is novel in that it combines techniques of evolutionary computation using genetic algorithms with the use and refinement of a heuristic for the D* algorithm. There are many applications of genetic algorithms in diverse domains such as bioinformatics (Hill 2005), gaming (Lucas 2006), music composition (Weale 2004), and circuit optimization (Zhang 2006). Additionally, work in D* has been studied and developed in theory (Ramalingam 1996) as well as in specific applications such as robotics (Koenig 2005). We use the all-inclusive examination that genetic algorithms afford us to find the perfect (or near-perfect) heuristic function for a derivative of the very traditional AI search, A*.
Introduction

In this work, the evolution of a static evaluation function using a genetic algorithm is applied to an autonomous agent operating in an environment provided by Infinite Mario Bros., an open-source, faithful recreation of Nintendo's Super Mario World. The agent (Mario) uses a real-time execution of an A* search, called D*, to direct its movement through the environment to ultimately reach the goal (the end of the level). Mario may use information about what is currently visible onscreen, but beyond that nothing is known, making calculation of an actual path to the goal impossible. Therefore, the SEF of the D* must direct Mario towards states that are on the path to the goal.

Mario has a total of thirteen distinct action combinations that allow him to negotiate the environment. These are move left, move right, duck, and two others, jump and speed, that can be used in combination with the other actions and with each other. Jump allows movement along the y-axis and can be used in combination with right, left, and duck. Speed allows for faster movement left or right, and higher jumps. This means that from any state there could be up to thirteen child nodes. Since the agent must operate at 24 frames per second, it is allotted approximately 40 milliseconds to perceive its current state, decide what to do next, and return a chosen action. With up to thirteen child nodes from any node in the search tree, any algorithm that decides what Mario is going to do next must do so quickly and efficiently. A "brute force" approach that analyzes all possible children was infeasible given the available computing machinery, and therefore a dynamic D* search is more appropriate.

The A* and D* Algorithms

The A* algorithm is a greedy best-first search algorithm used to calculate an optimal path between two points, or nodes, on a graph (Hart et al. 1968). The algorithm can be adapted to run in real time by restarting execution as new environment information becomes available; this adaptation is called a Dynamic A* (D*) search. The A* search uses a static evaluation function (SEF) that uses heuristics to find a path of state transformations from a start state to a goal state. The SEF assesses the merit of each child state as it is generated by assigning it a numeric value based on information about that state. The score allows the A* to direct its search by prioritizing the expansion of child nodes that could potentially expand into a goal state while neglecting child nodes that are less likely to lead to a goal.

In a real-time environment, information about the actual goal state is unavailable and unobtainable for any given iteration of the search (because it is out of range of the agent's sensors). Therefore, the SEF must direct the search to the most appropriate state that anticipates system information as it becomes available. In our work, the SEF seeks to maximize some aspects of an agent's state while minimizing others. By evolving a weight on each "aspect-variable", we are able to create offline a highly effective SEF that can predict obstacles and challenges that occur in real time during the execution of D*. In this paper we present the offline pursuit of using a genetic algorithm as a mechanism to evolve an optimal SEF to be used in the real-time execution of A*.

The SEF of a D* search uses information about a state to direct the search efficiently. For Mario, much information is immediately available from each percept provided by the environment. This information includes the position of Mario, the amount of damage Mario has taken, the positions and types of enemies onscreen, and the positions and types of obstacles onscreen. A static evaluation function provides weights on Mario's X position, X velocity, Y position, Y velocity, Mario's damage taken, whether Mario is carrying a shell, Mario's X position multiplied by X velocity, and Mario's Y position multiplied by Y velocity. These weights are values between -1 and 1.
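The weighted-sum evaluation just described can be sketched as follows. The state-field names and the Boolean encoding of the shell flag are illustrative assumptions, not identifiers from the actual system; the paper specifies only the eight weighted quantities and the weight range [-1, 1].

```python
# Illustrative sketch of the weighted-sum SEF described above.
# The dictionary keys are assumptions; the eight variables and
# their pairing with evolved weights come from the text.

def sef_score(weights, state):
    """Sum of products of the eight evolved weights and state variables."""
    variables = [
        state["x_pos"],
        state["x_vel"],
        state["y_pos"],
        state["y_vel"],
        state["damage_taken"],
        1.0 if state["carrying_shell"] else 0.0,   # shell flag as 0/1
        state["x_pos"] * state["x_vel"],
        state["y_pos"] * state["y_vel"],
    ]
    return sum(w * v for w, v in zip(weights, variables))
```

A weight near zero effectively removes its variable from consideration, while the sign of a weight decides whether the variable raises or lowers a node's score.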
After multiplying the weights by their associated state variables, the sum of products forms the final SEF score for that node. Other information can be tracked over time, like number of kills, X velocity, Y velocity, coins collected, time remaining, etc. The task of our system was to discover what effects, if any, the values of these variables should have on the valuation performed by the SEF of a node in the search graph. A high value for a variable might proportionally increase the cost to transition to that state, or conversely could proportionally decrease the transition cost.

This SEF score is an estimation of the amount of work required to reach a goal state from the current node, and as such, nodes with lower SEF scores are preferable. As an A* algorithm dictates, the level of the search tree at which the node was discovered is also added to the score. This is the "greedy" part of an A* search, where not just a solution is desired but the best solution. For Mario, the cost to transition from one state to another is uniform; all neighboring states have the same arc cost. Adding the sum of arc costs into the SEF score for a node represents the "work" required to reach that node in the graph, so that if the same node is reached by two separate paths, the shorter path is favored. Since the D* search operates in a partially observable world, an admissible heuristic is difficult to discern, hence the motivation for a genetic algorithm to search for the optimal weights to apply in the SEF.

The System

In 2009 and 2010, Julian Togelius of the ICE-GIC held a competition for entrants to create an autonomous agent (bot) that would best play Markus Persson's Infinite Mario Bros. "Best" in this sense means the distance a bot could travel within a given level and time limit. If two bots finished a level they were awarded equal scores, but if neither finished, the bot that travelled furthest was deemed better. In both iterations of the competition the same bot, written by Robin Baumgarten, was victorious (Baumgarten). Robin's bot used a D* search coupled with an accurate method for expanding child nodes, and a human-generated static evaluation function for the D*. Our system is a heavily modified version of Robin's, with the majority of the A* rewritten for legibility and efficiency while the means to produce child nodes was mostly preserved.

Figure 1

Each frame (24 frames per second), the environment provides the agent with a percept that includes the locations and types of all sprites on the screen, including Mario. The agent must return an action to the environment, which the environment then effects upon Mario. For each percept received, the agent runs an A* search for 39 ms or until it has planned out to 15 levels of the search tree. The agent keeps an internal representation of the world and tracks Mario's x and y velocities, among other things not provided by each percept.

After ensuring that its internal representation is consistent with the environment-provided one, the agent begins an A* search from Mario's current position and velocity. During the search, child nodes are generated from the current state of the agent and placed on an open list sorted from lowest to highest score. The child with the lowest score is taken from that list, and its children are generated. This process repeats until the agent has searched for 39 ms or has searched 14 states (an empirical number), at which point it returns the action that leads to the best path for the currently available information. Children are generated by considering which actions are available to Mario at any node.
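The budgeted best-first loop just described (lowest SEF score expanded first, stopping at the time or state limit) can be sketched as follows. The `expand` callback, the node representation, and returning a node rather than an action are simplifying assumptions; the 39 ms and 14-state limits come from the text.

```python
import heapq
import time

# Hedged sketch of the budgeted best-first loop described above.
# `expand(node)` stands in for the system's child-generation code and
# `sef(node)` for its static evaluation function.

def budgeted_search(start, expand, sef, time_budget=0.039, max_states=14):
    counter = 0                                   # tie-breaker for the heap
    open_list = [(sef(start), counter, start)]
    best = start
    deadline = time.monotonic() + time_budget
    expanded = 0
    while open_list and expanded < max_states and time.monotonic() < deadline:
        _, _, node = heapq.heappop(open_list)     # lowest SEF score first
        best = node
        expanded += 1
        for child in expand(node):
            counter += 1
            heapq.heappush(open_list, (sef(child), counter, child))
    return best   # most promising node found within the budget
```

In the real system the agent would trace back from this node to recover the first action along the path; the sketch stops at node selection.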
That is, a child state reflects where Mario would be and how fast he would be moving if he performed action A from node M (Figure 1). A child state also informs the search of whether Mario would take damage, die, kill an enemy, collect a coin, etc. upon performing action A from node M.

The values for the weights used in the agent's SEF mentioned above are deemed "unknown" to the system and are provided as parameters supplied by an external entity, in this case a genetic algorithm. The genetic algorithm is implemented as defined in (Russell and Norvig 2003). The chromosome being evolved is an array of 8 floating-point values, each between -1.0 and 1.0. The mutation rate was 1%.

The Experiment

The experiment was conducted across two iterations of the Genetic Algorithm. For the initial one, a starting population of 10 chromosomes instantiated with random values was created. A total of 800 generations were iterated over, with five children produced per generation. The test level had a time limit of 36 in-game seconds (~26 seconds in real time) and a length of 300 blocks (~4800 pixels). The "seed" used by the level-generation engine was 4, and the difficulty was set to 15. The program execution lasted over 20 hours.

Each generation of chromosomes was tested for fitness by running a simulation on a training level where the agent used the chromosome's genes as the weights on the state variables evaluated by the SEF in a D* search. The fitness of a chromosome was the sum of Mario's distance travelled and, if he completed the level, the time he had remaining. A higher fitness score indicates a better, or more fit, chromosome. This is in contrast to the static evaluation function, where a lower score indicates a more ideal state.

After this initial iteration completed, five of the top-scoring agents were used as the starting population for the second iteration of the Genetic Algorithm.
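One generation of the GA just described can be sketched as follows. The chromosome shape (8 floats in [-1.0, 1.0]), the 1% mutation rate, and the five children per generation come from the text; tournament selection and single-point crossover are assumptions in the spirit of the textbook GA in (Russell and Norvig 2003).

```python
import random

GENES = 8               # chromosome length (from the text)
MUTATION_RATE = 0.01    # 1% per gene (from the text)
CHILDREN_PER_GEN = 5    # five children per generation (from the text)

def random_chromosome(rng):
    # A chromosome is 8 floats, each in [-1.0, 1.0].
    return [rng.uniform(-1.0, 1.0) for _ in range(GENES)]

def select(population, fitness, rng):
    # Binary tournament selection; the selection scheme is an assumption.
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(a, b, rng):
    # Single-point crossover (an assumption).
    cut = rng.randrange(1, GENES)
    return a[:cut] + b[cut:]

def mutate(chrom, rng):
    # Each gene is replaced by a fresh random value with 1% probability.
    return [rng.uniform(-1.0, 1.0) if rng.random() < MUTATION_RATE else g
            for g in chrom]

def next_generation(population, fitness, rng):
    # Produce the five offspring of one generation.
    return [mutate(crossover(select(population, fitness, rng),
                             select(population, fitness, rng), rng), rng)
            for _ in range(CHILDREN_PER_GEN)]
```

In the actual system, `fitness` is the level simulation itself: the distance travelled plus, on completion, the remaining time.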
The level length was increased three-fold to 900 blocks (~14400 pixels), the time limit was set to 100 in-game seconds, the "seed" to 65, and the difficulty retained at 15. 320 generations were evaluated, again with five children produced per generation. As this test level's length and time limit were much larger than in the first iteration of the GA, the execution time grew to about 30 hours.

The test level that each bot was scored on had a variety of characteristics. The most important of these is that the level was short. As each chromosome needed to be used in an actual bot, a single fitness test could last upwards of a minute even if the bot finished the level successfully. A short level guaranteed that if a bot was going to finish a level, it could do so without much time spent. The second characteristic of the level was an imposed time limit, which places an upper bound on the possible time a bot could spend in a level. Slow bots, bots that stood still, or bots that got stuck therefore all required a maximum of N seconds to evaluate.

An ideal level must also contain the challenges and obstacles that a full level will have at its maximum difficulty. These challenges include portions with a high volume of enemies, some of which cannot be destroyed by landing on their heads; portions with Bullet Bill towers of varying heights; gaps of varying width; pipes with Piranha Plants leaping out of them; and portions with mixtures of these scenarios.

Results

The technique of using a GA to evolve the SEF of a D* search allowed the system to generate an effective SEF in the absence of a priori knowledge about what makes one agent state more desirable than another. The results of this experiment demonstrate little to no direct correlation between individual weights and bot scores, implying that a trial-and-error search by a human would be difficult and time-consuming.

For the initial iteration of the GA, over three thousand unique bots were evaluated over the course of 800 generations.
285 of those tied for the top score of 3955. An interesting note is that the first bot to score this amount was produced during the 11th generation of the GA.

The weights in the bot's SEF that the GA iterated on varied greatly. Figures 2 and 3 show typical scatter plots for the values of weights over the course of the GA's execution.

Figure 2

An Optimization to D* and GA

The D* search still performed sub-optimally given the computing hardware, so the search tree needed to be pared down. Paring the tree followed a simple rule: if two child nodes received the same score from the SEF, the first child node was kept and the other discarded. In a further endeavor to pare the tree, the maximum degree of a node was reduced from 13 to 11 by discounting nodes reachable through the agent's ducking action.

In an effort to avoid bias in the reproduction phase of the genetic algorithm, a generated and tested chromosome was only added to the population if either its fitness score was unique or, failing that, its genes were unique among the chromosomes with the same fitness. If this precaution were not taken, a glut of identical chromosomes with the same score could skew the parental selection process unfairly.

The scores that bots received likewise reached a local maximum early (generation 11) and were unable to improve thereafter (Figure 5).

Figure 5

Upon examining the possible correlation between weight values and bot scores, a similar quandary is encountered: bots received top scores regardless of the weight values (Figures 6 and 7), save for the case of the weight on X Position multiplied by X Velocity (Figure 8).
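The two de-duplication rules above (discarding children that tie on SEF score, and admitting only fitness-unique or gene-unique chromosomes) can be sketched together. The data shapes, a list of children and a list of (chromosome, score) pairs, are assumptions.

```python
# Hedged sketch of the two pruning rules described above.

def pare_children(children, sef):
    """Keep only the first child seen for each distinct SEF score."""
    seen = set()
    kept = []
    for child in children:
        score = sef(child)
        if score not in seen:
            seen.add(score)
            kept.append(child)
    return kept

def admit(chromosome, score, population):
    """Admit a chromosome only if its fitness score is unique in the
    population or, failing that, its genes differ from every chromosome
    that shares that score."""
    same_score = [c for c, s in population if s == score]
    return not same_score or all(c != chromosome for c in same_score)
```

Keeping the first of several score-tied children is arbitrary but cheap; any tie-broken survivor would lead the search to a node the SEF considers equally promising.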
In the case of the weight on the agent's X Position multiplied by the agent's X Velocity, a negative weight positively correlates with a higher score, and every positive weight has a score of 0.0 or less (a negative score indicates the agent travelled backwards).

Figure 3

The weights on state variables may appear to quickly converge to a few values and remain there. However, over time the amount of variance for any weight does not decline linearly. Figure 4 shows a plot of the amount of variation for each weight, grouped by 50 generations. No clear decline in the standard deviation among populations of weight values is present.

Figure 4

Figure 6

Figure 7

Figure 8

Figures 10 and 11 almost perfectly mirror Figures 6 and 7 in their distribution of scores for weight values on X Position and Y Position. Figure 12 likewise mirrors the data in Figure 8, which indicates that negative weights on the agent's X Position multiplied by the agent's X Velocity correlate with higher scores.

Figure 10

Figure 11

Figure 12

For the second iteration of the GA, results similar to the first iteration were obtained. However, only 3 bots out of a population of over a thousand shared the top score. Figure 9 presents the distribution of scores that bots received during the course of execution. Since the initial population of this iteration comprised top-scoring bots of the first iteration, it is understandable that so many bots scored so well so early; however, a clear ceiling to the scores is visible, indicating the algorithm likely could not escape a local maximum.

Figure 9

Although the weight values produced by the Genetic Algorithm for the top bots were distributed across the gamut of possible values, the end result was in fact bots that performed roughly as well as Robin Baumgarten's bot that won the ICE-GIC competition two years in a row.
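The grouping behind the Figure 4 analysis can be reproduced with a short computation. The layout of the recorded weight history (a mapping from generation number to the values observed for one weight) is an assumption.

```python
import statistics

# Hedged sketch of the Figure 4 analysis: pool one weight's observed
# values into 50-generation bins and report the population standard
# deviation per bin, to check whether variance declines over time.

def stddev_by_bin(history, bin_size=50):
    bins = {}
    for gen, values in history.items():
        bins.setdefault(gen // bin_size, []).extend(values)
    return {b: statistics.pstdev(vals) for b, vals in sorted(bins.items())}
```

A flat sequence of per-bin standard deviations is exactly the "no clear decline" pattern the text reports.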
Table 1 displays a comparison of bot scores between Robin's bot (AStarAgent) and our bot (AStarAgentEvolved) for a variety of levels whose difficulties were set to 15.

Table 1

Conclusions

In this work, we presented the novel technique of using a genetic algorithm as an offline meta-search for an optimal static evaluation function to be used by the D* search of a real-time autonomous agent. The end results were static evaluation function parameters that, when used in the SEF of a real-time agent, enabled the agent to perform as well as the current best in its environment.

The fact that our agent performed as well as the current best is significant because we made very few assumptions about the valuation of agent states in a static evaluation function. That is, the algorithms presented in this paper automated this task. The implication is that similar techniques could be employed for autonomous agents in other, possibly real-world, environments with high confidence in the end result.

Future Work

The work presented here has much potential for expansion. Future work should include utilizing parallel computing clusters such as Beowulf to take advantage of the natural independence between the analyses of members of a population by the GA's fitness function, as well as between the evaluations of nodes in the open list of the D* algorithm by the algorithm's SEF. This sort of capability will allow not only a deeper D* search, but also shorter generations in the Genetic Algorithm and therefore the ability to run the algorithm for more generations in the same amount of time.

Potential future work could also include employing pattern-matching techniques to identify a discrete set of distinct scenarios an agent would encounter. An agent could then utilize a separate SEF for each scenario.

Under the notion of pattern matching, even further future research would focus on generating probability tables for the likelihood of scenarios occurring after one another. Knowing the probability of a scenario occurring next would allow an agent to make an accurate prediction of an optimal path before receiving its next percept.

References

Baumgarten, Robin. Infinite Super Mario AI. 9 September 2009. Accessed 8 February 2011.

Hart, Peter E., Nils J. Nilsson, and Bertram Raphael. "A Formal Basis for the Heuristic Determination of Minimum Cost Paths." IEEE Transactions on Systems Science and Cybernetics SSC-4, no. 2 (1968): 100–107.

Hill, T., A. Lundgren, R. Fredriksson, and H. B. Schiöth. "Genetic algorithm for large-scale maximum parsimony phylogenetic analysis of proteins." Biochimica et Biophysica Acta 1725, no. 1 (2005): 19–29.

Koenig, S., and M. Likhachev. "Fast Replanning for Navigation in Unknown Terrain." IEEE Transactions on Robotics 21, no. 3 (2005): 354–363.

Lucas, S., and G. Kendall. "Evolutionary computation and games." IEEE Computational Intelligence Magazine (2006): 10–18.

Mitchell, M. An Introduction to Genetic Algorithms. MIT Press, 1998.

Ramalingam, G., and T. Reps. "An incremental algorithm for a generalization of the shortest-path problem." Journal of Algorithms 21 (1996): 267–305.

Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Upper Saddle River: Prentice Hall/Pearson Education, 2003.

Zhang, J., W. L. Lo, and H. Chung. "Pseudocoevolutionary Genetic Algorithms for Power Electronic Circuits Optimization." IEEE Transactions on Systems, Man, and Cybernetics, Part C 36, no. 4 (2006): 590–598.