Using a Genetic Algorithm to Evolve a D* Search Heuristic

Andrew Giese and Jennifer Seitzer
University of Dayton
gieseanw@gmail.com, seitzer@udayton.edu

Abstract

Evolutionary computation (EC) is the sub-discipline of artificial intelligence that iteratively derives solutions using techniques from genetics. In this work, we present a genetic algorithm that evolves a heuristic static evaluation function (SEF) to be used in the real-time search navigation scheme of an autonomous agent. This coupling of algorithmic techniques (GAs with real-time search by autonomous agents) makes for interesting formalistic and implementation challenges. Genetic evolution implies the need for a fitness function to guide convergence in the solution being created. Thus, as part of this work, we present a fitness function that dictates the efficacy of a generated static evaluation function. Additionally, we use the simulator that will eventually benefit from this optimized SEF to provide feedback in the evolution process. That is, the simulator serves as the fitness function for the evolving SEF. In this work, we present algorithmic and formalistic designs, implementation details, and performance results of this multi-layered software endeavor.

This work is novel in that it combines techniques of evolutionary computation using genetic algorithms with the use and refinement of a heuristic for the D* algorithm. There are many applications of genetic algorithms in diverse domains such as bioinformatics (Hill 2005), gaming (Lucas 2006), music composition (Weale 2004), and circuit optimization (Zhang 2006). Additionally, work in D* has been studied and developed in theory (Ramalingam 1996) as well as in specific applications such as robotics (Koenig 2005). We use the all-inclusive examination that genetic algorithms afford us to find the perfect (or near-perfect) heuristic function for a derivative of the very traditional AI search, A*.
Introduction

In this work, the evolution of a static evaluation function using a genetic algorithm is applied to an autonomous agent operating in an environment provided by Infinite Mario Bros., an open-source, faithful recreation of Nintendo's Super Mario World. The agent (Mario) uses a real-time execution of an A* search, called D*, to direct its movement through the environment to ultimately reach the goal (the end of the level). Mario may use information about what is currently visible onscreen, but beyond that nothing is known, making calculation of an actual path to the goal impossible. Therefore, the SEF of the D* must direct Mario towards states that are on the path to the goal.

Mario has a total of thirteen distinct action combinations that allow him to negotiate the environment. These are move left, move right, duck, and two others, jump and speed, that can be used in combination with the other actions and with each other. Jump allows movement along the y-axis and can be used in combination with right, left, and duck. Speed allows for faster movement left or right, and higher jumps. This means that from any state there could be up to thirteen child nodes. Since the agent must operate at 24 frames per second, it is allotted approximately 40 milliseconds to perceive its current state, decide what to do next, and return a chosen action. With up to thirteen child nodes from any node in the search tree, any algorithm that decides what Mario is going to do next must do so quickly and efficiently. A "brute force" approach that analyzes all possible children was infeasible given the available computing machinery, and therefore a dynamic D* search is more appropriate.

The A* and D* Algorithms

The A* algorithm is a greedy best-first search algorithm used to calculate an optimal path between two points, or nodes, on a graph (Hart et al. 1968). The algorithm can be adapted to run in real time by restarting execution as new environment information becomes available; this adaptation is called a Dynamic A* (D*) search. The A* search uses a static evaluation function (SEF) that uses heuristics to find a path of state transformations from a start state to a goal state. The SEF assesses the merit of each child state as it is generated by assigning it a numeric value based on information about that state. The score allows the A* to direct its search by prioritizing the expansion of child nodes that could potentially expand into a goal state while neglecting child nodes that are less likely to lead to a goal.

In a real-time environment, information about the actual goal state is unavailable and unobtainable for any given iteration of the search (because it is out of range of the agent's sensors). Therefore, the SEF must direct the search to the most appropriate state that anticipates system information as it becomes available. In our work, the SEF seeks to maximize some aspects of an agent's state while minimizing others. By evolving a weight on each "aspect-variable", we are able to create offline a highly effective SEF that can predict obstacles and challenges that occur in real time during the execution of D*. In this paper we present the offline pursuit of using a genetic algorithm as a mechanism to evolve an optimal SEF to be used in the real-time execution of A*.

The SEF of a D* search uses information about a state to direct the search efficiently. For Mario, much information is immediately available from each percept provided by the environment. This information includes the position of Mario, the amount of damage Mario has taken, the positions and types of enemies onscreen, and the positions and types of obstacles onscreen. A static evaluation function provides weights on Mario's X position, X velocity, Y position, Y velocity, Mario's damage taken, whether Mario is carrying a shell, Mario's X position multiplied by X velocity, and Mario's Y position multiplied by Y velocity. These weights are values between -1 and 1.
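The weighted-sum evaluation just described can be sketched as follows. The state-field names and the Boolean encoding of the shell flag are illustrative assumptions, not identifiers from the actual system; the paper specifies only the eight weighted quantities and the weight range [-1, 1].

```python
# Illustrative sketch of the weighted-sum SEF described above.
# The dictionary keys are assumptions; the eight variables and
# their pairing with evolved weights come from the text.

def sef_score(weights, state):
    """Sum of products of the eight evolved weights and state variables."""
    variables = [
        state["x_pos"],
        state["x_vel"],
        state["y_pos"],
        state["y_vel"],
        state["damage_taken"],
        1.0 if state["carrying_shell"] else 0.0,   # shell flag as 0/1
        state["x_pos"] * state["x_vel"],
        state["y_pos"] * state["y_vel"],
    ]
    return sum(w * v for w, v in zip(weights, variables))
```

A weight near zero effectively removes its variable from consideration, while the sign of a weight decides whether the variable raises or lowers a node's score.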
After multiplying the weights by their associated state variables, the sum of products forms the final SEF score for that node. Other information can be tracked over time, like number of kills, X velocity, Y velocity, coins collected, time remaining, etc. The task of our system was to discover what effects, if any, the values of these variables should have on the valuation performed by the SEF of a node in the search graph. A high value for a variable might proportionally increase the cost to transition to that state, or conversely could proportionally decrease the transition cost.

This SEF score is an estimation of the amount of work required to reach a goal state from the current node, and as such, nodes with lower SEF scores are preferable. As an A* algorithm dictates, the level of the search tree at which the node was discovered is also added to the score. This is the "greedy" part of an A* search, where not just a solution is desired but the best solution. For Mario, the cost to transition from one state to another is uniform; all neighboring states have the same arc cost. Adding the sum of arc costs into the SEF score for a node represents the "work" required to reach that node in the graph, so that if the same node is reached by two separate paths, the shorter path is favored. Since the D* search operates in a partially observable world, an admissible heuristic is difficult to discern, hence the motivation for a genetic algorithm to search for the optimal weights to apply in the SEF.

The System

In 2009 and 2010, Julian Togelius of the ICE-GIC held a competition for entrants to create an autonomous agent (bot) that would best play Markus Persson's Infinite Mario Bros. "Best" in this sense means the distance a bot could travel within a given level and time limit. If two bots finished a level they were awarded equal scores, but if neither finished, the bot that travelled furthest was deemed better. In both iterations of the competition the same bot, written by Robin Baumgarten, was victorious (Baumgarten). Robin's bot used a D* search coupled with an accurate method for expanding child nodes, and a human-generated static evaluation function for the D*. Our system is a heavily modified version of Robin's, with the majority of the A* rewritten for legibility and efficiency while the means to produce child nodes was mostly preserved.

Figure 1

Each frame (24 frames per second), the environment provides the agent with a percept that includes the locations and types of all sprites on the screen, including Mario. The agent must return an action to the environment, which the environment then effects upon Mario. For each percept received, the agent runs an A* search for 39 ms or until it has planned out to 15 levels of the search tree. The agent keeps an internal representation of the world and tracks Mario's x and y velocities, among other things not provided by each percept.

After ensuring that its internal representation is consistent with the environment-provided one, the agent begins an A* search from Mario's current position and velocity. During the search, child nodes are generated from the current state of the agent and placed on an open list sorted from lowest to highest score. The child with the lowest score is taken from that list, and its children are generated. This process repeats until the agent has searched for 39 ms or has searched 14 states (an empirical number), at which point it returns the action that leads to the best path for the currently available information. Children are generated by considering which actions are available to Mario at any node.
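The budgeted best-first loop just described (lowest SEF score expanded first, stopping at the time or state limit) can be sketched as follows. The `expand` callback, the node representation, and returning a node rather than an action are simplifying assumptions; the 39 ms and 14-state limits come from the text.

```python
import heapq
import time

# Hedged sketch of the budgeted best-first loop described above.
# `expand(node)` stands in for the system's child-generation code and
# `sef(node)` for its static evaluation function.

def budgeted_search(start, expand, sef, time_budget=0.039, max_states=14):
    counter = 0                                   # tie-breaker for the heap
    open_list = [(sef(start), counter, start)]
    best = start
    deadline = time.monotonic() + time_budget
    expanded = 0
    while open_list and expanded < max_states and time.monotonic() < deadline:
        _, _, node = heapq.heappop(open_list)     # lowest SEF score first
        best = node
        expanded += 1
        for child in expand(node):
            counter += 1
            heapq.heappush(open_list, (sef(child), counter, child))
    return best   # most promising node found within the budget
```

In the real system the agent would trace back from this node to recover the first action along the path; the sketch stops at node selection.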
That is, a child state reflects where Mario would be and how fast he would be moving if he performed action A from node M (Figure 1). A child state also informs the search of whether Mario would take damage, die, kill an enemy, collect a coin, etc. upon performing action A from node M.

The values for the weights used in the agent's SEF mentioned above are deemed "unknown" to the system and are provided as parameters supplied by an external entity, in this case a genetic algorithm. The genetic algorithm is implemented as defined in (Russell and Norvig 2003). The chromosome being evolved is an array of 8 floating-point values, each between -1.0 and 1.0. The mutation rate was 1%.

The Experiment

The experiment was conducted across two iterations of the Genetic Algorithm. For the initial one, a starting population of 10 chromosomes instantiated with random values was created. A total of 800 generations were iterated over, with five children produced per generation. The test level had a time limit of 36 in-game seconds (~26 seconds in real time) and a length of 300 blocks (~4800 pixels). The "seed" used by the level-generation engine was 4, and the difficulty was set to 15. The program execution lasted over 20 hours.

Each generation of chromosomes was tested for fitness by running a simulation on a training level where the agent used the chromosome's genes as the weights on the state variables evaluated by the SEF in a D* search. The fitness of a chromosome was the sum of Mario's distance travelled and, if he completed the level, the time he had remaining. A higher fitness score indicates a better, or more fit, chromosome. This is in contrast to the static evaluation function, where a lower score indicates a more ideal state.

After this initial iteration completed, five of the top-scoring agents were used as the starting population for the second iteration of the Genetic Algorithm.
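One generation of the GA just described can be sketched as follows. The chromosome shape (8 floats in [-1.0, 1.0]), the 1% mutation rate, and the five children per generation come from the text; tournament selection and single-point crossover are assumptions in the spirit of the textbook GA in (Russell and Norvig 2003).

```python
import random

GENES = 8               # chromosome length (from the text)
MUTATION_RATE = 0.01    # 1% per gene (from the text)
CHILDREN_PER_GEN = 5    # five children per generation (from the text)

def random_chromosome(rng):
    # A chromosome is 8 floats, each in [-1.0, 1.0].
    return [rng.uniform(-1.0, 1.0) for _ in range(GENES)]

def select(population, fitness, rng):
    # Binary tournament selection; the selection scheme is an assumption.
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(a, b, rng):
    # Single-point crossover (an assumption).
    cut = rng.randrange(1, GENES)
    return a[:cut] + b[cut:]

def mutate(chrom, rng):
    # Each gene is replaced by a fresh random value with 1% probability.
    return [rng.uniform(-1.0, 1.0) if rng.random() < MUTATION_RATE else g
            for g in chrom]

def next_generation(population, fitness, rng):
    # Produce the five offspring of one generation.
    return [mutate(crossover(select(population, fitness, rng),
                             select(population, fitness, rng), rng), rng)
            for _ in range(CHILDREN_PER_GEN)]
```

In the actual system, `fitness` is the level simulation itself: the distance travelled plus, on completion, the remaining time.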
The level length was increased three-fold to 900 blocks (~14400 pixels), the time limit was set to 100 in-game seconds, the "seed" to 65, and the difficulty retained at 15. 320 generations were evaluated, again with five children produced per generation. As this test level's length and time limit were much larger than in the first iteration of the GA, the execution time grew to about 30 hours.

The test level that each bot was scored on had a variety of characteristics. The most important of these is that the level was short. As each chromosome needed to be used in an actual bot, a single fitness test could last upwards of a minute even if the bot finished the level successfully. A short level guaranteed that if a bot was going to finish a level, it could do so without much time spent. The second characteristic of the level was an imposed time limit, which places an upper bound on the possible time a bot could spend in a level. Slow bots, bots that stood still, or bots that got stuck therefore all required a maximum of N seconds to evaluate.

An ideal level must also contain the challenges and obstacles that a full level will have at its maximum difficulty. These challenges include portions with a high volume of enemies, some of which cannot be destroyed by landing on their heads; portions with Bullet Bill towers of varying heights; gaps of varying width; pipes with Piranha Plants leaping out of them; and portions with mixtures of these scenarios.

Results

The technique of using a GA to evolve the SEF of a D* search allowed the system to generate an effective SEF in the absence of a priori knowledge about what makes one agent state more desirable than another. The results of this experiment demonstrate little to no direct correlation between individual weights and bot scores, implying that a trial-and-error search by a human would be difficult and time-consuming.

For the initial iteration of the GA, over three thousand unique bots were evaluated over the course of 800 generations.
285 of those tied for the top score of 3955. An interesting note is that the first bot to score this amount was produced during the 11th generation of the GA.

The weights in the bot's SEF that the GA iterated on varied greatly. Figures 2 and 3 show typical scatter plots for the values of weights over the course of the GA's execution.

Figure 2

An Optimization to D* and GA

The D* search still performed sub-optimally given the computing hardware, so the search tree needed to be pared down. Paring the tree followed a simple rule: if two child nodes received the same score from the SEF, the first child node was kept and the other discarded. In a further endeavor to pare the tree, the maximum degree of a node was reduced from 13 to 11 by discounting nodes reachable through the agent's ducking action.

In an effort to avoid bias in the reproduction phase of the genetic algorithm, a generated and tested chromosome was only added to the population if either its fitness score was unique or, failing that, its genes were unique among the chromosomes with the same fitness. If this precaution were not taken, a glut of identical chromosomes with the same score could skew the parental selection process unfairly.

The scores that bots received likewise reached a local maximum early (generation 11) and were unable to improve thereafter (Figure 5).

Figure 5

Upon examining the possible correlation between weight values and bot scores, a similar quandary is encountered: bots received top scores regardless of the weight values (Figures 6 and 7), save for the case of the weight on X Position multiplied by X Velocity (Figure 8).
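The two de-duplication rules above (discarding children that tie on SEF score, and admitting only fitness-unique or gene-unique chromosomes) can be sketched together. The data shapes, a list of children and a list of (chromosome, score) pairs, are assumptions.

```python
# Hedged sketch of the two pruning rules described above.

def pare_children(children, sef):
    """Keep only the first child seen for each distinct SEF score."""
    seen = set()
    kept = []
    for child in children:
        score = sef(child)
        if score not in seen:
            seen.add(score)
            kept.append(child)
    return kept

def admit(chromosome, score, population):
    """Admit a chromosome only if its fitness score is unique in the
    population or, failing that, its genes differ from every chromosome
    that shares that score."""
    same_score = [c for c, s in population if s == score]
    return not same_score or all(c != chromosome for c in same_score)
```

Keeping the first of several score-tied children is arbitrary but cheap; any tie-broken survivor would lead the search to a node the SEF considers equally promising.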
In the case of the weight on the agent's X Position multiplied by the agent's X Velocity, a negative weight positively correlates with a higher score, and every positive weight has a score of 0.0 or less (a negative score indicates the agent travelled backwards).

Figure 3

The weights on state variables may appear to quickly converge to a few values and remain there. However, over time the amount of variance for any weight does not decline linearly. Figure 4 shows a plot of the amount of variation for each weight, grouped by 50 generations. No clear decline in the standard deviation among populations of weight values is present.

Figure 4

Figure 6

Figure 7

Figure 8

Figures 10 and 11 almost perfectly mirror Figures 6 and 7 in their distribution of scores for weight values on X Position and Y Position. Figure 12 likewise mirrors the data in Figure 8, which indicates that negative weights on the agent's X Position multiplied by the agent's X Velocity correlate with higher scores.

Figure 10

Figure 11

Figure 12

For the second iteration of the GA, results similar to the first iteration were obtained. However, only 3 bots out of a population of over a thousand shared the top score. Figure 9 presents the distribution of scores that bots received during the course of execution. Since the initial population of this iteration comprised top-scoring bots of the first iteration, it is understandable that so many bots scored so well so early; however, a clear ceiling to the scores is visible, indicating the algorithm likely could not escape a local maximum.

Figure 9

Although the weight values produced by the Genetic Algorithm for the top bots were distributed across the gamut of possible values, the end result was in fact bots that performed roughly as well as Robin Baumgarten's bot that won the ICE-GIC competition two years in a row.
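The grouping behind the Figure 4 analysis can be reproduced with a short computation. The layout of the recorded weight history (a mapping from generation number to the values observed for one weight) is an assumption.

```python
import statistics

# Hedged sketch of the Figure 4 analysis: pool one weight's observed
# values into 50-generation bins and report the population standard
# deviation per bin, to check whether variance declines over time.

def stddev_by_bin(history, bin_size=50):
    bins = {}
    for gen, values in history.items():
        bins.setdefault(gen // bin_size, []).extend(values)
    return {b: statistics.pstdev(vals) for b, vals in sorted(bins.items())}
```

A flat sequence of per-bin standard deviations is exactly the "no clear decline" pattern the text reports.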
Table 1 displays a comparison of bot scores between Robin's bot (AStarAgent) and our bot (AStarAgentEvolved) for a variety of levels whose difficulties were set to 15.

Table 1

Conclusions

In this work, we presented the novel technique of using a genetic algorithm as an offline meta-search for an optimal static evaluation function to be used by the D* search of a real-time autonomous agent. The end results were static evaluation function parameters that, when used in the SEF of a real-time agent, enabled the agent to perform as well as the current best in its environment.

The fact that our agent performed as well as the current best is significant because we made very few assumptions about the valuation of agent states in a static evaluation function. That is, the algorithms presented in this paper automated this task. The implication is that similar techniques could be employed for autonomous agents in other, possibly real-world, environments with high confidence in the end result.

Future Work

The work presented here has much potential for expansion. Future work should include utilizing parallel computing clusters such as Beowulf to take advantage of the natural independence between the analyses of members of a population by the GA's fitness function, as well as between the evaluations of nodes in the open list of the D* algorithm by the algorithm's SEF. This sort of capability will allow not only a deeper D* search, but also shorter generations in the Genetic Algorithm and therefore the ability to run the algorithm for more generations in the same amount of time.

Potential future work could also include employing pattern-matching techniques to identify a discrete set of distinct scenarios an agent would encounter. An agent could then utilize a separate SEF for each scenario.

Under the notion of pattern matching, even further future research would focus on generating probability tables for the likelihood of scenarios occurring after one another. Knowing the probability of a scenario occurring next would allow an agent to make an accurate prediction of an optimal path before receiving its next percept.

References

Baumgarten, Robin. Infinite Super Mario AI. 9 September 2009. Accessed 8 February 2011.

Hart, Peter E., Nils J. Nilsson, and Bertram Raphael. "A Formal Basis for the Heuristic Determination of Minimum Cost Paths." IEEE Transactions on Systems Science and Cybernetics SSC-4, no. 2 (1968): 100–107.

Hill, T., A. Lundgren, R. Fredriksson, and H. B. Schiöth. "Genetic algorithm for large-scale maximum parsimony phylogenetic analysis of proteins." Biochimica et Biophysica Acta 1725, no. 1 (2005): 19–29.

Koenig, S., and M. Likhachev. "Fast Replanning for Navigation in Unknown Terrain." IEEE Transactions on Robotics 21, no. 3 (2005): 354–363.

Lucas, S., and G. Kendall. "Evolutionary computation and games." IEEE Computational Intelligence Magazine (2006): 10–18.

Mitchell, M. An Introduction to Genetic Algorithms. MIT Press, 1998.

Ramalingam, G., and T. Reps. "An incremental algorithm for a generalization of the shortest-path problem." Journal of Algorithms 21 (1996): 267–305.

Russell, Stuart J., and Peter Norvig. Artificial Intelligence: A Modern Approach. Upper Saddle River: Prentice Hall/Pearson Education, 2003.

Zhang, J., W. L. Lo, and H. Chung. "Pseudocoevolutionary Genetic Algorithms for Power Electronic Circuits Optimization." IEEE Transactions on Systems, Man, and Cybernetics, Part C 36, no. 4 (2006): 590–598.