Generating Inspiration for Multi-Agent Simulation Design by Q-Learning

Robert Junges and Franziska Klügl
Modeling and Simulation Research Center, Örebro University, Sweden
Email: robert.junges@oru.se, franziska.klugl@oru.se



Abstract—One major challenge in developing multi-agent simulations is to find an appropriate agent design that is able to generate the intended overall phenomenon or dynamics, but does not contain unnecessary details. In this paper we suggest using agent learning to support the development of an agent model: the modeler defines the environmental model and the agent interfaces. Using rewards that capture the intended agent behavior, Reinforcement Learning techniques can be used to learn the rules that optimally govern the agent behavior. However, to be really useful in a modeling and simulation context, a human modeler must be able to review and understand the outcome of the learning. We propose to use additional forms of learning as a post-processing step for supporting the analysis of the learnt model. We test our ideas using a simple evacuation simulation scenario.

I. MOTIVATION

Methodological questions are more and more in the focus of research on agent-based simulation, as the challenges in developing a good multi-agent simulation model are numerous. The central issue concerns what behaviors the agents should exhibit so that the intended outcome is generated. What particular detail must be included, and what part of the modeled behavior is not necessary? How should the parameters involved be set? If it is not fully clear from the beginning what this local behavior should be - even if the original agents' behavior can be easily observed - the development may result in a painful trial and error procedure. The modeler may add or remove behavioral elements, try different parameter values, and test the overall outcome again and again. Such a procedure might be feasible for an experienced modeler who knows the critical starting points for modifications and is capable of using complex calibration tools for multi-agent simulation such as described in [1], but this cannot be assumed for less experienced modelers.

In this contribution we suggest solving this search for the appropriate agent-level behavior by using agent learning. The vision is the following procedure: the modeler starts by developing an environmental model as a part of the overall model, then determines what the agent might be able to perceive and to manipulate, and finally describes the intended outcome by a reward function that evaluates the agents' performance. The agents then use a learning mechanism for determining a behavior program that generates the intended overall outcome in the given environment. This strategy might also be described as a variant of an environment-driven strategy for developing multiagent simulations [2].

A major issue in this overall procedure is the selection of the particular learning agent architecture. An initial analysis of different learning techniques applicable to this problem has already been described in [3]. There, Learning Classifier Systems (LCS), Feed Forward Neural Networks (FFNN) and Reinforcement Learning (Q-Learning) were evaluated with regard to learning performance and resulting behavior representation, using the same evacuation scenario as in the following. In this contribution we further investigate Reinforcement Learning for its suitability in such a learning-driven model development process, focusing on the interpretability of the state-action mapping produced. We do not focus on mere optimization performance, but on softer factors that define the usability of Q-Learning in the model development setting: the completeness, the complexity, and the generalization capabilities of the behavior learnt.

In the next section we review existing approaches for learning agent architectures in simulation models. This is followed by a more detailed treatment of the learning-driven methodology and a presentation of the reinforcement learning architecture. In sections IV and V we describe the testbed used and the experiments conducted with it, and discuss the results. The paper ends with a conclusion and an outlook on future work.

II. LEARNING AGENTS AND SIMULATION

Adaptive agents and multi-agent learning have been among the major focuses within distributed artificial intelligence since its very beginning [4]. Many different forms of learning have been shown to be successful when working with agents and multiagent systems. Obviously, we cannot cover all techniques for agent learning in this paper; the following paragraph gives a few general pointers and then a short glance at directly related work on agent learning in simulation settings. In general, our contribution is special concerning the objective of our comparison: not mere learning performance, but suitability for usage in a modeling support context.

Reinforcement learning [5], learning automata [6], and evolutionary and neural forms of learning are recurrent examples of learning techniques applied in multi-agent scenarios. Besides that, techniques inspired by biological evolution have been applied for agents in the area of Artificial Life [7], [8], where evolutionary elements can be found together with multiagent approaches.
An example of a simulation of a concrete scenario is [9], in which simulated ant agents were controlled by a neural network that was designed by a genetic algorithm. Another experiment, with an approach similar to a Learning Classifier System (LCS), can be found in [10], where a rule set was used and modified by a genetic algorithm.

Although there is a wealth of publications dealing with the performance of particular learning techniques, especially reinforcement learning approaches, there are not many works focusing on the usability of the resulting behavioral model. An early example can be found in [11], where an evolutionary algorithm is applied to behavior learning of an individual agent among multi-agent robots. Another example, from [12], describes a general approach for automatically programming a behavior-based robot: using the Q-Learning algorithm, new behaviors are learned by trial and error based on a performance feedback function as reinforcement. In [13], also using reinforcement learning, agents share their experiences, and the most frequently simulated behaviors are adopted as a group behavior strategy. [14] compares reinforcement learning and neural networks as learning techniques in an exploration scenario for mobile robots. The authors conclude that both learning techniques are able to learn the individual behaviors, sometimes outperforming a hand-coded program, and that behavior-based architectures speed up reinforcement learning.

III. AGENT LEARNING ARCHITECTURES FOR MODEL DESIGN

The basic idea behind a learning-driven design methodology consists in the transfer of the agent behavior design and test activity from the human modeler to the simulation system. Especially in complex models, a high number of details can be manipulated. This can make a manual modeling, debugging and tuning process cumbersome, especially when knowledge about the original system, or the experience needed to implicitly bridge the micro-macro gap, is missing. Using agents that learn at least parts or initial versions of their behavior might be a good way of supporting the modeler in finding an appropriate low-level behavior model. Such a learning-based approach can also be part of the adoption of a Living Design [15] like methodology for multi-agent simulation models. Nevertheless, the first question on the way to such a learning-driven methodology concerns the selection of the appropriate learning technique – for this form of application, for a particular domain, or maybe just for a particular model. In this paper we focus on the suitability of a well-known learning technique, Q-Learning, for such a modeling approach. Before we focus on this particular learning architecture, we discuss what we have identified as requirements for the applicability of a learning technique to our problem.

A. Requirements for Learning Agent Architectures

Not all agent learning architectures are equally apt for usage in the modeling support context. There are a number of properties that an appropriate learning technique should exhibit for a successful application:
1) Feasibility: The learning mechanism should be able to cope with the level of complexity that is required for a valid environmental model. Thus, it should not be necessary to simplify or even reformulate the problem just to be able to apply the learning mechanism. That means the theoretical prerequisites for applying the learning technology must be known and fulfilled by the environmental model in combination with the reward function. The learning architecture must be able to find a good-enough solution;
2) Interpretability and Model Accessibility: The mechanism should produce behavior models that can be understood and interpreted by a human modeler. The architecture shall not be a black box with a behavior that the human modeler has to trust, but must be accessible for detailed analysis of the processes involved in the overall agent system;
3) Plausibility: The mechanism in the learning architecture should be well-established and well-understood. The motivation is that its usage shall not impose additional complexity on the modeler, for example in setting a number of configuration parameters. How the learning architecture works shall be explainable to and by the modeler.

There is a variety of possible learning agent architectures that might be suitable for the aim presented here and the requirements identified – as discussed in section II. We selected Q-Learning, a Reinforcement Learning technique, which we describe in the next paragraph.

1) Q-Learning: Q-Learning [16] is a well-known reinforcement learning technique. It works by developing an action-value function that gives the expected utility of taking a specific action in a specific state. The agents keep track of the experienced situation-action pairs by managing the so-called Q-table, which consists of situation descriptions, the actions taken, and the corresponding reward prediction, called the Q-value.

Q-Learning is able to compare the expected utility of the available actions without requiring a model of the environment. Nevertheless, the use of the Q-Learning algorithm is constrained to a finite number of possible states and actions. As a reinforcement learning algorithm, it is also based on modeling the overall problem as a Markov Decision Process. Thus, it needs sufficient information about the current state of the agent to be able to assign a discriminating reward. Although there are a number of extensions that improve the convergence speed of Q-Learning [5], we include the standard Q-Learning algorithm in our experiments due to its simplicity.

We suppose that Q-Learning meets the requirements for the application by providing both sufficient performance and adaptability, and also interpretability of the result. This interpretability is achieved by its rule-based structure (represented by the state-action mapping) with a clear evaluation of those rules by means of the Q-value. The processing of this mapping, weighted by the provided utility value, could be used as a bias for the interpretation of the rules and as an input for the behavior modeling.
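For reference, the standard one-step Q-Learning update from [16], which underlies this architecture, can be written as

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where α is the learning rate, γ the discount factor, and r_{t+1} the reward received after executing action a_t in state s_t.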
IV. TESTBED

The scenario we use for evaluating the learning architecture approach is the same as in [17], where we already describe the integration of XCS-based agents into the agent-based modeling and simulation platform SeSAm. This pedestrian evacuation scenario is a typical application domain for multiagent simulation (see [18] for a real-world application). Although the employed scenario may be oversimplified, we expected that the relative simplicity of the scenario would enable us to evaluate the potential of the learning technique as well as to identify the challenges involved.

A. Environmental Model

The main objective of the simulation concerned the emergence of collision-free exiting behavior. Therefore, the reward and the interfaces to the environment were mainly shaped to support this. In contrast to [17], we did not test a large variety of configurations, as the goal of this research was not to find an optimal one, but a more modeling-oriented evaluation of the architecture.

The basic scenario consists of a room (40x60m) surrounded by walls, with one exit and a varying number of column-type obstacles (with a diameter of 3.5m). From this room a number of pedestrians have to leave as fast as possible without hurting themselves in collisions. We assume that each pedestrian agent is represented by a circle of 50cm diameter and moves with a speed of 1.5m/sec. One time-step in the discrete simulation corresponds to 0.5sec. Space is continuous. We tested this scenario using 1, 5, 10 and 20 agents; the number of obstacles was set to 10. At the beginning of a test-run, all agents were located at random positions in the upper half of the room.

All experiments alternated between explore and exploit phases. During the explore phases, the agents execute a random action. In exploitation trials, the best action was selected in each step. Every trial consists of 100 iteration steps. Every experiment took 1000 explore-exploit cycles.

Reward was given to an agent a immediately after executing an action at time-step t. It was computed in the following way:

reward(a, t) = reward_exit(a, t) + reward_dist(a, t) + feedback_collision(a, t) + feedback_damage(a, t)

with reward_exit(a, t) = 1000 if agent a has reached the exit at time t, and 0 otherwise; reward_dist(a, t) = β × (d_{t−1}(exit, a) − d_t(exit, a)) with β = 5, so that reducing the distance to the exit is rewarded; feedback_collision(a, t) set to 100 if a collision-free actual movement had been made, to 0 if no movement happened, and to −100 if a collision occurred; and feedback_damage(a, t) set to −1000 if a collision with a column obstacle occurred, and to 0 otherwise. Together, the different components of the feedback function stress goal-directed, collision-free movements. It is goal-directed because the agents are positively rewarded every time an action results in reaching the exit or getting to a state closer to the exit. Complementarily, it is collision-free oriented because the agents are positively rewarded for moving without collisions and negatively rewarded every time an action results in a collision.
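A minimal sketch of this reward computation; the function and flag names are ours (not from the paper), with dist_before and dist_now standing for d_{t−1}(exit, a) and d_t(exit, a):

```python
BETA = 5  # weight of the distance-progress term (beta above)

def reward(reached_exit: bool, dist_before: float, dist_now: float,
           moved: bool, collided: bool, hit_column: bool) -> float:
    """Immediate reward for one executed action, combining the four
    components described above: exit bonus, distance progress,
    collision feedback, and damage feedback."""
    r_exit = 1000 if reached_exit else 0
    r_dist = BETA * (dist_before - dist_now)  # positive when closer to the exit
    if not moved:
        f_collision = 0
    elif collided:
        f_collision = -100
    else:
        f_collision = 100                     # collision-free movement
    f_damage = -1000 if hit_column else 0     # hitting a column obstacle
    return r_exit + r_dist + f_collision + f_damage
```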
B. Agent Interfaces

As agent interfaces, the perceived situation and the set of possible actions have to be defined. Similar to [17], the perception of the agents is based on the basic orientation of the agent, respectively its movement direction. The overall perceivable area is divided into 5 sectors, with a distinction between areas at two different distances, as depicted in figure 1. For every area, two binary perception categories were used: the first encoded whether the exit was perceivable in this area, and the second whether an obstacle was present - where an obstacle can be anything with which a collision should be avoided: walls, columns or other pedestrians.

Fig. 1. Agent perception sectors.

The action set is shaped to support the collision-avoidance behavior. We assume that the agents are by default oriented towards the exit. Thus, the action set consists of A = {MoveLeft, MoveSlightlyLeft, MoveStraight, MoveSlightlyRight, MoveRight, Noop, Stepback}. For any of these actions, the agent turns by the given direction (e.g. +36 degrees for MoveSlightlyRight), makes an atomic step, and orients itself towards the exit again. The combination of this action set and the perceptions of the agents represents an intentional simplification of the problem: as we implicitly represent the orientation task in the actions, we obtain an MDP. This simplification allows concentrating the learning on collision avoidance, facilitating the learning process.
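The following sketch illustrates one possible encoding of these interfaces; the data layout and all turn angles except the +36 degrees quoted above are our own illustration, not prescribed by the paper:

```python
ACTIONS = ["MoveLeft", "MoveSlightlyLeft", "MoveStraight",
           "MoveSlightlyRight", "MoveRight", "Noop", "Stepback"]

# Assumed turn angles: only +36 degrees for MoveSlightlyRight is stated
# in the text; the other values are symmetric guesses.
TURN_DEGREES = {"MoveLeft": -72, "MoveSlightlyLeft": -36, "MoveStraight": 0,
                "MoveSlightlyRight": 36, "MoveRight": 72}

N_AREAS = 5 * 2  # 5 sectors, each split into a near and a far band

def make_state(exit_visible, obstacle_present):
    """Encode a perceived situation as one (exit, obstacle) flag pair per
    perceivable area - a 20-bit state, keeping the state space finite
    as Q-Learning requires."""
    assert len(exit_visible) == len(obstacle_present) == N_AREAS
    return tuple(zip(exit_visible, obstacle_present))
```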
C. Architecture Configuration

The testbed was implemented in the visual modeling and simulation platform SeSAm (www.simsesam.de). The Q-Learning could be implemented by means of the standard high-level behavior language in SeSAm.

It was not our objective to find the optimal configuration for the tested architecture in the given scenario, so we will not give a discussion of the effects of different parameter settings on the learning outcome. Naturally, we tested a number of configurations in order to find a reasonable one. The same holds for the appropriate overall setup, including different numbers of obstacles, sizes of scenarios, and the particular numbers in the reward function.
In the context of this paper, we assume an initial Q-value of 0 for all untested state-action pairs. We set the learning rate to 0.5 and the discount factor to 0. This means that the agents' actions are selected based on recent experiences, without taking future rewards into consideration (only the best action for the current state). This is another intentional simplification of the problem, as the agents do not need to maximize future rewards.
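With these values, the standard update quoted in section III loses its look-ahead term and reduces to a running average of the immediate reward:

Q(s_t, a_t) ← Q(s_t, a_t) + 0.5 [ r_{t+1} − Q(s_t, a_t) ] = ( Q(s_t, a_t) + r_{t+1} ) / 2

so each Q-value simply tracks the recently observed immediate rewards of its state-action pair.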
V. EXPERIMENTS AND RESULTS

In this section we analyze the results of the simulations, first with respect to learning performance, showing that the learning technique is actually applicable to the test scenario; then we focus on the analysis of what the agents actually learnt.

A. Performance Evaluation

The metric used for evaluating learning performance is the number of collisions. The time to reach the exit does not vary significantly, as a collision does not influence the behavior directly, but only indirectly via the reward the agent receives. Collisions, whether with other pedestrians or with obstacles, do not impose any effect on future movement; they only count as negative rewards. Obviously, in the early stages the agents do not have enough experience to learn from, and therefore a higher number of collisions is expected.

Table I presents the mean number of collisions for each tested situation. The values are aggregated only after the first 50 explore-exploit cycles, to avoid including any warm-up data. The mean and deviation over the results of the different exploit cycles are given. Despite having repeated the runs, we do not give means and standard deviations over different runs, as the number of repetitions is currently too low. Clearly the number of collisions increases with the number of agents and obstacles.

TABLE I. MEAN NUMBER OF COLLISIONS PER RUN - ROWS REPRESENT THE NUMBER OF AGENTS AND COLUMNS THE NUMBER OF OBSTACLES.

               10 obstacles
  1 agent      0.01 ± 0.23
  5 agents     1.39 ± 1.78
  10 agents    6.66 ± 3.88
  20 agents   25.17 ± 8.77

Figure 2 illustrates the adaptation speed by depicting the number of collisions over time for an exemplary run with 5 agents and 10 obstacles. We can see that the number of collisions decreases fast in the beginning, and the behavioral knowledge then converges quite quickly: after 50 cycles, there is no further improvement.

Fig. 2. Development of the number of collisions for an exemplary run with 5 agents and 10 obstacles.

To give a better illustration of the learning process, we show in figure 3 the trajectories of the agents in exploit phases after a) 10, b) 100, c) 500 and d) 1000 exploit trials. In this figure we consider the situation with 5 agents and 10 obstacles. We can see the progress of adaptation, with more and more collision-free and goal-directed movement. Experience hereby does not just mean positive reinforcement: even if the agents do not know what the best action is, they know which ones to avoid by checking the negatively rewarded actions.

Fig. 3. Exemplary trajectories during exploit trials, for 5 agents and 10 obstacles.

Alternating between explore and exploit trials plays an important role in the performance outcome. The agents must explore the possible action set in order to maximize their experience in terms of the route to be chosen. At the end we can see the emergence of a collision-avoidance behavior.
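For illustration, the whole training protocol can be condensed into a few lines, reusing ACTIONS from the interface sketch in section IV-B; the env object with its reset/step methods is a hypothetical placeholder for the SeSAm model:

```python
import random
from collections import defaultdict

ALPHA = 0.5              # learning rate, as configured in section IV-C
Q = defaultdict(float)   # untested state-action pairs implicitly start at 0

def run_trial(env, explore, steps=100):
    state = env.reset()  # agents restart at random positions
    for _ in range(steps):
        if explore:      # explore trial: random action
            action = random.choice(ACTIONS)
        else:            # exploit trial: best action for the current state
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, r = env.step(action)
        # discount factor 0: update from the immediate reward only
        Q[(state, action)] += ALPHA * (r - Q[(state, action)])
        state = next_state

def train(env, cycles=1000):   # 1000 explore-exploit cycles per experiment
    for _ in range(cycles):
        run_trial(env, explore=True)
        run_trial(env, explore=False)
```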
B. Behavior Learning Outcome

In this section we are interested in analyzing the rules learned by the Q-Learning process, in terms of the complexity of the resulting rule structure and their potential use as a source of inspiration in a modeling process.

In the following analysis we examine two simulation scenarios: 1 agent and 10 obstacles, and 5 agents and 10 obstacles. In both cases we consider the outcome of one agent from an exemplary simulation.
1) Raw Q-Learning Rules: The rules generated by the learning process can be determined by taking, for every situation, the action with the highest Q-value, as is done in the exploit phases. Depending on the situation, there might be no action with a positive Q-value. Rules with a Q-value of zero represent situation-action pairs that have not been tested during the simulation. Figure 4 depicts two of the 12 rules with the highest Q-value in the 1-agent scenario.
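A minimal sketch of this extraction step, operating on a Q-table laid out as in our earlier sketches (a dictionary from (state, action) pairs to Q-values):

```python
def extract_rules(Q, actions):
    """Keep, for every visited situation, the action with the highest
    Q-value; situations whose best Q-value is not positive are dropped
    (a Q-value of zero marks an untested state-action pair)."""
    states = {s for (s, _) in Q}
    rules = {}
    for s in states:
        best = max(actions, key=lambda a: Q[(s, a)])
        if Q[(s, best)] > 0:
            rules[s] = (best, Q[(s, best)])
    return rules
```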




Fig. 4. Two of the 12 rules with the highest Q-value for the agent in the 1-agent scenario.

Fig. 5. Q-Learning value distribution for an exemplary agent from a simulation with 1 agent and 10 obstacles.

Fig. 6. Q-Learning value distribution for an exemplary agent from a simulation with 5 agents and 10 obstacles.
Figures 5 and 6 show the distribution of the reward prediction, i.e. the Q-value, over the complete rule set, for the single agent and for a randomly selected exemplary agent from a simulation with 5 agents, respectively. One can see that there are only a few rules with a high Q-value.

It is obvious that the Q-value alone cannot be a selection criterion for the rules forming a behavior model, as the rules with the highest Q-value naturally contain situations where the agent directly perceives the exit. It is also possible to see that the agent in this case has a majority of rules with a Q-value of 0, which means that a lot of state-action mappings have not been tested. This is not the case for the simulation with 1 agent and 10 obstacles, as seen in figure 5, where the majority of rules have been tested. That agent has explored more, resulting in a more elaborate representation of the behavior. This difference is caused by the fact that the simulation with only 1 agent presents a smaller set of possible states to be tackled, due to the simplicity of the interactions with only static obstacles.

Another important aspect of the agents' experience is that, since the agents are randomly positioned in the scenario at the beginning of each trial, the rules are not biased by a fixed position, so the rule set is more elaborate than it would be if the agents had to know only one best way to get to the exit.

The agent from the simulation with only one agent has a positive rule set – consisting of rules with positive, non-zero Q-values – of 229 rules, while the agent from the simulation with 5 agents has 1507 true positive rules. This can be seen as an effect of the interaction with other agents, which generates different situations to be visited: especially closer to the exit, the situation becomes more dense and the agents must avoid collisions while still getting to the exit.

Figures 7 and 8 show the distribution of these final rules over the possible actions, for the cases with 1 and 5 agents respectively. We can see the effect of the initial random positioning in each trial: we have a balanced distribution of the rules determining going to the left or to the right, which makes sense, since the agent must learn to find its way out of the scenario no matter where it started. The majority of the rules indicate the MoveStraight action. This comes from the fact that the agent is reoriented towards the exit after the execution of any action: unless the agent needs to avoid a collision, MoveStraight is the best action to choose.

We can identify the collision-avoidance behavior by focusing on an exemplary element of the agent's perceptions (in the 1-agent scenario in this case). Considering the action MoveRight and the perception ObstacleImmediatelyRight, we see that among all rules with the MoveRight action there is a larger number of rules in which this perception is false, see figure 9.
Fig. 7. Rules distribution over the actions for an exemplary agent from a simulation with 1 agent and 10 obstacles.

Fig. 8. Rules distribution over the actions for an exemplary agent from a simulation with 5 agents and 10 obstacles.

Fig. 9. Frequency of rules with perception ObstacleImmediatelyRight false (left bar) and true (right bar) for action MoveRight.
2) Processing the Rules: As the set of rules with a truly positive Q-value is, in all scenarios, far too large to be transparently presented to a human expert, we suggest a post-processing step for improving the analysis of the rule set on a detailed level, as there are a number of candidate techniques that may be suitable for generalizing the rule set in a way that captures all learnt rules in a compact form.

For this aim, we tested three different machine learning algorithms – mainly classification learners – using all rules with a non-zero, positive Q-value: K-Nearest Neighbors (KNN) [19], CART Decision Trees [20] and the CN2 rule inductor [21]. K-Nearest Neighbors is arguably one of the simplest machine learning algorithms, while Decision Trees and CN2 are of particular interest to this work because of the interpretability provided by their resulting representation of the knowledge captured in the training set. We used KNN with a K value of 5 for the experiments. The Decision Tree is a simple CART with Gini's index of impurity for node splitting. The CN2 algorithm uses the Laplace method for rule quality estimation.

As mentioned above, the results of this post-processing step have to be evaluated against two criteria: how well they capture the given rule set, and how well they are able to generalize it. The first can be measured in terms of classification accuracy; the second concerns the generalization capability and compactness of the resulting behavior description.
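The following sketch shows this post-processing step with scikit-learn stand-ins for two of the three learners; CN2 is not part of scikit-learn (an implementation is available, for example, in the Orange library), and the 20-feature layout follows our earlier encoding sketch:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def to_dataset(rules):
    """Flatten the rule set from extract_rules (state -> (action, Q-value))
    into a matrix of 20 binary perception features and an action label."""
    X = np.array([np.ravel(list(s)) for s in rules], dtype=int)
    y = np.array([rules[s][0] for s in rules])
    return X, y

def classification_accuracy(rules):
    X, y = to_dataset(rules)
    for clf in (KNeighborsClassifier(n_neighbors=5),        # KNN with K = 5
                DecisionTreeClassifier(criterion="gini")):  # CART-style tree
        scores = cross_val_score(clf, X, y, cv=10)          # 10-fold CV
        print(type(clf).__name__, round(scores.mean(), 4))
```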
a) Classification Accuracy: Table II shows the classification accuracy for the above-mentioned algorithms, for both the 1-agent and the 5-agent experiments, using 10-fold cross-validation on the training set. Table III shows the average classification accuracy when the model built from one agent's experience is tested with another agent's experience. We can see that the classification accuracy for the case with 1 agent outperforms the case with 5 agents. This is clearly an effect of the exploit-explore tradeoff: the agent from the 1-agent simulation has a smaller number of states to visit during the simulation, and this reflects on the accuracy of the rules, as they are tested more often and converge faster to the optimal solution (state-action mapping). The agents from the 5-agent scenario have a larger set of states that may potentially occur, reflected also in the number of rules; this requires more cycles to converge to an optimal solution.

TABLE II. CLASSIFICATION ACCURACY - 10-FOLD CROSS-VALIDATION

              KNN      Decision Tree   CN2
  1 agent     0.6593   0.6375          0.6334
  5 agents    0.2907   0.2654          0.2980

TABLE III. CLASSIFICATION ACCURACY - VALIDATION AMONG DIFFERENT AGENTS

              KNN      Decision Tree   CN2
  1 agent     0.6724   0.6983          0.6897
  5 agents    0.3098   0.3230          0.3316

While they are all good models – in the sense of providing a solution to the problem (as seen in section V-A) – they cannot be generalized to other good solutions (other agents' experiences). The convergence of the solution, which determines its generalization to the problem, is therefore a function of the configuration of the learning and, more importantly, of the explore-exploit distribution, the number of agents, and the set of perceptions and actions that determine the size of the state-action mapping.

Figure 10 shows the confusion matrix for the decision tree learnt from the simulation with 1 agent, tested with cross-validation: rows represent the expected class (action) from the classification model, as present in the Q-Learning mapping, and columns represent the classification determined by the decision tree. We highlight the number of correctly classified instances. The majority of misclassified instances
falls on cases where different actions could result in similarly good rewards. For instance, there is a common misclassification among the actions MoveStraight, MoveSlightlyRight and MoveSlightlyLeft. This comes from the fact that when the agent is facing the exit, all three of these actions will maximize the reward (represented by reaching the exit).

Fig. 10. Confusion matrix for the decision tree in the simulation with 1 agent and 10 obstacles.
b) Compactness and Readability of the Learnt Behavior Representation: The second dimension to be analyzed concerns improving the representation of the behavior for a human modeler. We assumed that the best result would be produced by the decision tree learner. However, the CART decision tree learner was not able to produce an understandable, compact model for this problem. In the case of 1 agent the tree has 117 nodes and 59 leaves; for the case with 5 agents the tree has 1637 nodes and 819 leaves. For illustration, figure 11 outlines a part of the tree generated from the experience of the agent in the case of 1 agent and 10 obstacles. In this figure, the codes in the rules stand for different agent perceptions; for instance, EIA means Exit Immediately Ahead.

Fig. 11. A branch of the decision tree for the case with 1 agent and 10 obstacles.
The post-processing result provided by the CN2 algorithm is better than that of the decision tree learner. For example, figure 12 shows the best 3 (of a total of 29) rules created by the CN2 algorithm from the training set of non-zero, positive rules in the case of 1 agent. CN2 was able to reduce the rule representation from 229 to 29 rules. The rules can be evaluated by their quality, size and coverage. Here, as for the decision trees, the perceptions are represented by codes. The rules are clear and concise. Because of that, CN2 can be seen as a step further towards interpretability.

Fig. 12. CN2 best three rules for the simulation with 1 agent and 10 obstacles.

In principle, a set of rules for the agents producing a solution to the evacuation problem could be learnt – using a technique that results in human-readable rules. However, from the rules found by the Q-Learning, we could neither construct a behavior representation that fully resembles the knowledge coded in the rule set, nor derive a representation of the rules that a human modeler could easily oversee. On the other hand, the scenario is so simple that it is possible to directly program a set of about 10 rules exhibiting almost optimal behavior.

VI. CONCLUSION AND FUTURE WORK

In this paper we presented our investigation towards a learning-driven methodology by evaluating Reinforcement Learning as an agent learning architecture. The main motivation for this work is to investigate the possibilities of creating a learning-based methodology for the design of a multiagent simulation model, avoiding a time-consuming trial and error process when determining the details of the agent behavior.

In a small evacuation scenario, we showed that the employed learning technique can produce plausible behavior in an agent-based simulation. However, the interface between the learning technique and the agent environment is by no means trivial. The environmental model, feedback function, and perception and action sets are critical. There are also ideas on the analysis of the different architectures that may improve the usability of the learned behavior model.

Using a learning technique transfers the basic problem from direct behavior modeling to designing the agent interface and the environment's reward computation. To do so successfully, a general understanding of the scenario's difficulties and the available machine learning techniques is necessary. An example is the fundamental requirement of the Markov property in reinforcement-based approaches [5] – in our case Q-Learning. The provided perceptions need to contain sufficient information to be able to learn the expectation of immediate and future possible reward accurately.

The standard implementation of Q-Learning, used in this paper, offers us only the estimated reward for each possible condition-action pair. For a more intelligent interpretation of the rule set – which in its raw state lacks any form of generalization – we decided to use three different machine
learning algorithms: K-Nearest Neighbors, Decision Trees and the CN2 rule inductor. The resulting full behavior model for the Q-Learning is only partially helpful as guidance for modeling in this case. Generalization still needs to be improved, as a part of the learning process or as a post-processing step. This could be achieved by using more flexible classification techniques, such as multi-label classification, since in this process we have to deal with multiple good solutions. Another important aspect to be considered here is the tradeoff between explore and exploit, and how this scales with the complexity of the problem, in terms of the number of agents and the size of the state-action mapping in a multiagent simulation. This relation is yet to be analyzed in detail.

There are admittedly many application scenarios more challenging than an evacuation scenario in which all agents have the same goal, the behavior repertoire is quite restricted, and there is no direct communication between agents. In such advanced environments, the learning and environment design will certainly pose additional challenges.

Our next steps include testing other learning techniques to investigate their performance, outcome and appropriateness for this methodology. A short analysis of Learning Classifier Systems and Neural Networks can be found in [3]. We also plan to test approaches such as evolutionary programming, support vector machines, and other forms of reinforcement learning, respectively learning automata. An alternative for the post-processing step worth testing could be multi-label classification [22], where we could gather the experience from different agents and find different best actions for a given situation, increasing generalization.

Besides that, we will pursue further self-modeling agent experiments. We are considering the application of the learning technique in other, more complex scenarios, such as the evacuation of a train with about 500 agents, complex geometry with exit signs, and time pressure. We are also interested in a scenario where cooperation / collaboration is required, in order to investigate the possible emergence of cooperation in the agent model through the learning process. This experimentation should consider situations with and without direct communication between the agents.

REFERENCES

[1] M. Fehler, F. Klügl, and F. Puppe, "Approaches for resolving the dilemma between model structure refinement and parameter calibration in agent-based simulations," in AAMAS '06: Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems. ACM Press, 2006, pp. 120–122.
[2] F. Klügl, "Multiagent simulation model design strategies," in MAS&S Workshop at MALLOW 2009, Turin, Italy, Sept. 2009, ser. CEUR Workshop Proceedings, vol. 494. CEUR-WS.org, 2009.
[3] R. Junges and F. Klügl, "Agent architectures for a learning-driven modeling methodology in multiagent simulation," in MATES 2010: Proceedings of the 8th German Conference on Multiagent System Technologies (to appear), 2010.
[4] G. Weiß, "Adaptation and learning in multi-agent systems: Some remarks and a bibliography," in IJCAI '95: Proceedings of the Workshop on Adaption and Learning in Multi-Agent Systems. London, UK: Springer-Verlag, 1996, pp. 1–21.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[6] A. Nowé, K. Verbeeck, and M. Peeters, "Learning automata as a basis for multi agent reinforcement learning," pp. 71–85, 2006.
[7] C. Adami, Introduction to Artificial Life. New York, NY, USA: Springer-Verlag, 1998.
[8] J. Grefenstette, "The evolution of strategies for multi-agent environments," Adaptive Behavior, vol. 1, pp. 65–90, 1987.
[9] R. J. Collins and D. R. Jefferson, "AntFarm: Towards simulated evolution," in Artificial Life II. Addison-Wesley, 1991, pp. 579–601.
[10] J. Denzinger and M. Fuchs, "Experiments in learning prototypical situations for variants of the pursuit game," in Proceedings of the International Conference on Multi-Agent Systems (ICMAS-1996). MIT Press, 1995, pp. 48–55.
[11] Y. Maeda, "Simulation for behavior learning of multi-agent robot," Journal of Intelligent and Fuzzy Systems, pp. 53–64, 1998.
[12] S. Mahadevan and J. Connell, "Automatic programming of behavior-based robots using reinforcement learning," Artificial Intelligence, vol. 55, no. 2-3, pp. 311–365, 1992.
[13] M. R. Lee and E.-K. Kang, "Learning enabled cooperative agent behavior in an evolutionary and competitive environment," Neural Computing & Applications, vol. 15, pp. 124–135, 2006.
[14] R. Neruda, S. Slusny, and P. Vidnerova, "Performance comparison of relational reinforcement learning and RBF neural networks for small mobile robots," in FGCNS '08: Proceedings of the 2008 Second International Conference on Future Generation Communication and Networking Symposia. Washington, DC, USA: IEEE Computer Society, 2008, pp. 29–32.
[15] J.-P. Georgé, G. Picard, M.-P. Gleizes, and P. Glize, "Living design for open computational systems," in International Workshop on Theory And Practice of Open Computational Systems (TAPOCS) at 12th IEEE International Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE'03), M. Fredriksson, A. Ricci, R. Gustavsson, and A. Omicini, Eds. Linz, Austria: IEEE Computer Society, June 2003, pp. 389–394.
[16] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[17] F. Klügl, R. Hatko, and M. V. Butz, "Agent learning instead of behavior implementation for simulations - a case study using classifier systems," in MATES 2008: Proceedings of the 6th German Conference on Multiagent System Technologies. Springer Berlin / Heidelberg, 2008, pp. 111–122.
[18] F. Klügl, G. Klubertanz, and G. Rindsfüser, "Agent-based pedestrian simulation of train evacuation integrating environmental data," in KI 2009: Advances in Artificial Intelligence, 32nd Annual German Conference on AI, Paderborn, Germany, September 15-18, 2009, Proceedings, ser. Lecture Notes in Computer Science, vol. 5803. Springer, 2009, pp. 631–638.
[19] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[20] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, 1st ed. Chapman and Hall/CRC, January 1984.
[21] P. Clark and T. Niblett, "The CN2 induction algorithm," Machine Learning, vol. 3, no. 4, pp. 261–283, 1989.
[22] G. Tsoumakas and I. Katakis, "Multi-label classification: An overview," International Journal of Data Warehousing and Mining, vol. 3, no. 3, pp. 1–13, 2007.