<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Virtual Agents for Decision-Making in Business Simulators</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier García and Fernando Fernández</string-name>
          <email>fjgpolo@inf.uc3m.es</email>
          <email>ffernand@inf.uc3m.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Borrajo</string-name>
          <email>fernando.borrajo@uam.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Autónoma de Madrid, Crta. de Colmenar Viejo</institution>
          ,
          <addr-line>km. 14, 28049, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Carlos III de Madrid, Avenida de la Universidad</institution>
          ,
          <addr-line>30, 28911, Leganés, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe SIMBA, a simulator for business administration, as a Multi-Agent platform for the design, implementation and evaluation of virtual agents. SIMBA creates a complex competitive environment in which intelligent agents play the role of business decision makers. An important feature of the SIMBA architecture is that humans can interact with virtual agents. Decision making in SIMBA is a challenge, since it requires handling large and continuous state and action spaces. In this paper, we propose to tackle this problem using Reinforcement Learning (RL) and K-Nearest Neighbors (KNN) approaches. RL requires the use of generalization techniques to be applied in large state and action spaces. We present different combinations in the choice of the generalization method based on Vector Quantization (VQ) and CMAC. We demonstrate that learning agents are very competitive, and they can outperform human expert decision strategies from business literature.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Business simulators are a promising tool for research. The
main characteristic of SIMBA (SIMulator for Business
Administration) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is that it emulates business reality. It can be used
from a competitive point of view, since different companies
compete among themselves to improve their results. In this
paper, SIMBA is considered as a multi-agent framework where
the different agents manage their companies in different ways.
SIMBA can include several autonomous agents to play the role
of competing teams and, based on the research on decision
making patterns of human teams, further research is made to
improve the complexity and effectiveness of such intelligent
agents.
      </p>
      <p>Decision making in SIMBA requires handling more than
100 continuous state variables, and more than 10 continuous
decision variables, which makes the problem hard even for
business administration experts. The motivation of this paper
is the design, implementation and evaluation of virtual agents
in SIMBA using different machine learning (ML) approaches.
The goal is that the developed agents can outperform
human-like behavior when competing not only against hand-coded and
random virtual agents, but also against expert human players.</p>
      <p>
        Human players have experienced the consequences of their
decisions in competition with the developed virtual agents.
But, given that the agents try to “win” in all cases, they
make the game too hard for novice players. So “pedagogical”
objectives for human players competing with our virtual
agents are not directly included in the goals of this paper.
Designing virtual agents whose behavior challenges human
players adequately is a key issue in computer games
development [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Games are boring when they are too easy and
frustrating when they are too hard [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The difficulty of the game
is critically important for its “pedagogical” worth. The game
difficulty must be such that it is “just barely too difficult” for
the subject. If the game is too easy or too hard, its “pedagogical”
worth diminishes. So most games allow human
players to adjust a basic difficulty level (easy, medium, hard).
      </p>
      <p>
        However, agents developed to outperform human-like
behavior under narrow circumstances can do quite well [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
(e.g., chess and Deep Blue, or Othello and Logistello). Deep Blue
defeated World Chess Champion Garry Kasparov in an
exhibition match. Campbell and Hsu describe the architecture and
implementation of their chess machine in the paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A few
months after this chess success, Othello became the new game
to fall to computers when Michael Buro’s program Logistello
defeated the World Othello Champion Takeshi Murakami. In
the paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Buro discusses the learning algorithms used in
his program. Thus, the goal of this paper is the development
of virtual “business” agents able to beat not only hand-coded
and random virtual agents, but also human business experts.
      </p>
      <p>To do so, we use two different learning approaches. The
first one is Instance Based Learning (IBL). In this paper we
propose the Adaptive KNN algorithm, a variation of KNN,
where experience tuples are stored and selected automatically
to generate new behaviors.</p>
      <p>
        However, decision making for business administration is an
episodic task where decisions are taken sequentially. Therefore,
we also propose to use Reinforcement Learning (RL). The
RL agents developed need to apply generalization techniques
to perform the learning process, given that both the state
and action spaces are continuous. In this paper, we propose
two different generalization methods in order to tackle the
large state and action spaces. The first one, Extended Vector
Quantization for Q-Learning, uses Vector Quantization (VQ)
to discretize both the state and action spaces, extending
previous works where VQ was used only to discretize the state
space [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Some tasks have been solved by coarsely discretizing
the action variables [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], but, to the best of our knowledge, this is
the first time that VQ has been used to discretize the action space.
The second generalization approach, CMAC-VQQL, is based
on the combination of VQ to discretize the action space and
CMAC (Cerebellar Model Articulation Controller) [1], which
is motivated by CMAC’s demonstrated capability to generalize
the state space.
      </p>
      <p>Section II describes SIMBA. Section III introduces the
learning approaches proposed, while Section IV shows how
these approaches have been used to learn the virtual agents
for decision making in SIMBA. Section V presents comparative
results of the virtual agents, both when competing among
themselves and when competing against expert human players.
Section VI summarizes the related work. Finally, Section VII
concludes.</p>
      <sec id="sec-1-1">
        <title>B. Business Human Strategies</title>
        <p>Different business strategies appear in the business
literature, and all of them could be followed to manage the companies
in SIMBA, as will be shown in Section V. We describe some
classical ones:</p>
        <p>1. Incremental decisions. This type of business strategy
is based on incremental decisions for all decision variables,
which typically range from 10% to 20%. This business
strategy is considered a conservative behavior.</p>
        <p>2. Risk decisions. This strategy is based on strong changes in business
decisions. It has strong impacts on market reactions, and is
useful to detect gaps and market opportunities.</p>
        <p>
          3. Reactive. An organization with this type of strategy
attempts to locate and maintain a secure niche in a relatively
stable product or service area [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          4. Low cost strategy. With this strategy, managers try to
gain a competitive advantage by focusing the energy of all the
departments on driving the organization’s costs down below
the costs of its rivals [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          5. Differentiation and specialization. A differentiation
strategy is seen when a company offers a service or product
that is perceived as unique or distinctive from its
competitors [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>Which strategy is chosen at each moment
depends on the organization’s strengths and its competitors’
weaknesses.</p>
      </sec>
      <sec id="sec-1-2">
        <title>II. SIMBA</title>
        <p>In this section, the SIMBA simulator is described in detail.</p>
        <sec id="sec-1-2-1">
          <title>A. SIMBA’s Architecture</title>
          <p>Figure 1 shows the architecture of the business simulator
from a Multi-Agent perspective. The architecture designed
enables multiple players to interact with the simulator,
including both software agents and human players. The main
components of the system are:</p>
          <p>• Simulation Server: Once all decisions are taken for the
current round, it computes the values of the variables
in the marketplace for every player. Finally, it sends the
results computed to each player. The player (software or
human) uses these results to choose the best decisions in
the next round of the simulation.</p>
          <p>• Simulation Control: It manages the software agents
and their decisions. It receives the decisions taken by
the software agents and sends them to the Simulation
Server. The simulation server sends the results computed to
the simulation control. The simulation control sends the
results to the corresponding software agent.</p>
          <p>• Software Agents: They represent an alternative to human
players. In every step, the software agents receive the
results computed by the Simulation Server. The software
agents use this information to take the decisions for the
next round of the simulation.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>C. Autonomous Decision Making in Simba</title>
          <p>The goal of this section is to describe how a SIMBA software
agent can be implemented. To do this, we describe the state
and action spaces, the transition function to transit between
states, and the variable to maximize.</p>
          <p>State Space. The state computed in every round or
simulation step is composed of 174 continuous variables. Table I
shows some of the features that compose the state space.</p>
          <p>Action Space. The players (software or human) must
approach the decisions on the different functional areas of their
companies. Each market in the competition requires the use
of 25 variables. This is an indicator of SIMBA’s capacity to
approach the complexity of managerial decision-making. In
our experiments, we consider a subspace of the total action
space and we use only the ten variables shown in Table I. This
reduction was suggested by the experts, because the discarded
variables are not very significant. All the actions that the agents
can perform are constrained by the semantics of the business
model. For instance, a company cannot sell its product if it
does not have stock.</p>
          <p>Transition function. The different players participate in a
simulation in a step-by-step round mode. Each simulation step
is called a period, which is equivalent to three real months.
When a round ends, the time machine is run. By doing this, the
simulator integrates the previous period’s situation, the teams’
decisions, and the parameters of the general economic
environment together with those of each geographic market, and
orders the Simulation Server to generate output information
for the new period.</p>
          <p>Variable to maximize. The agents try to maximize the result
of the exercise (profit). From a RL point of view, the objective
is to maximize the total reward received. In this case, we
define the immediate reward as the result of the exercise in
a period or step. Therefore, there is no delayed reward and,
like in other classical domains like Keepaway [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ], immediate
rewards received in every simulation step are relevant.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>III. PROPOSED ALGORITHMS FOR LEARNING VIRTUAL AGENTS</title>
      <p>In this section we describe the new learning algorithms proposed, based on KNN and RL.</p>
      <sec id="sec-2-1">
        <title>A. Adaptive KNN</title>
        <p>In this paper, we propose a variant of KNN called Adaptive
KNN (Table II). In this variant, we can distinguish two phases.
In the first one, a data set C is obtained during an interaction
between the agent and the environment. This data set C is
composed of tuples of the form &lt; s, a, r &gt; where s ∈ S,
a ∈ A and r ∈ ℜ is the immediate reward. In the second one,
the set C obtained in the previous phase is improved during
a new interaction between the agent and the environment. In
each step of this second phase, the simulator returns the current
state s of the agent. The algorithm selects the K nearest
neighbors to the state s in C. Among these K neighbors, it
selects the tuple with the best reward obtained in phase
one. It then slightly perturbs the actions of this tuple and
executes them. If the new reward obtained is better than the worst
reward among the K neighbors, the worst tuple is replaced with the
new experience generated. Thus, the algorithm adapts the initial
set C obtained in phase one to get increasingly better results in
the second phase.</p>
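        <p>As an illustrative sketch of the second phase above (the distance metric, perturbation scheme, and the execute interface below are our own assumptions, not the exact implementation of Table II), one adaptation step might look like:</p>

```python
import random

def adaptive_knn_step(C, s, k, execute, noise=0.05):
    """One step of the second phase of Adaptive KNN (illustrative sketch).

    C: list of (state, action, reward) tuples gathered in phase one.
    s: current state returned by the simulator.
    k: number of nearest neighbors considered.
    execute: callable that applies an action in the environment and
             returns the immediate reward (an assumed interface).
    """
    dist = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
    # Select the K tuples whose stored state is nearest to s.
    neighbors = sorted(C, key=lambda t: dist(t[0], s))[:k]
    # Among these K neighbors, pick the tuple with the best reward.
    best = max(neighbors, key=lambda t: t[2])
    # Slightly perturb its action vector and execute it.
    action = [a + random.uniform(-noise, noise) * (abs(a) + 1.0) for a in best[1]]
    reward = execute(action)
    # If the new experience beats the worst neighbor, replace that
    # tuple in C, so the set adapts toward better results.
    worst = min(neighbors, key=lambda t: t[2])
    if reward > worst[2]:
        C[C.index(worst)] = (s, action, reward)
    return action, reward
```

        <p>Here, execute stands for the simulator call that runs the chosen action for one period and returns the result of the exercise.</p>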
      </sec>
      <sec id="sec-2-2">
        <title>B. RL Approaches</title>
        <p>
          Among many different RL algorithms, Q-learning has been
widely used in the literature [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. In Q-Learning, the update
is performed following Equation 1, where α is the
learning rate, and γ is a discount factor that reduces
the relevance of future decisions.
        </p>
        <p>Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)] (1)</p>
        <p>Except in very small environments, it is impossible to
enumerate the state and action spaces. In this section we explain
two new approaches for the state and action space generalization
problem.</p>
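        <p>The tabular form of the update in Equation 1 can be sketched as follows (the state and action encodings and parameter values are illustrative assumptions):</p>

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply the Q-Learning update of Equation 1 to the table Q."""
    # max_a Q(s_{t+1}, a): value of the best action in the next state.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s_t, a_t) toward the TD target r + gamma * best_next.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # Q-table initialized to zero
q_update(Q, "s0", "a0", 1.0, "s1", ["a0", "a1"])
```
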
      </sec>
      <sec id="sec-2-3">
        <title>1) Extended VQQL for state and action space generalization</title>
        <p>
          Applying VQ techniques permits finding a more compact
representation of the state and action spaces [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A vector
quantizer Q of dimension K and size N is a mapping from a
vector (state or action) in the K-dimensional Euclidean space,
R^K, into a finite set C containing N states, Q : R^K → C, where
C = {y1, y2, ..., yN }, yi ∈ R^K. In this way, given C and a
state x ∈ R^K, VQ(x) assigns x to the closest state from C,
VQ(x) = arg min_{y∈C} dist(x, y).
        </p>
        <p>To design the vector quantizer we use the Generalized Lloyd
Algorithm (GLA). The Extended VQQL algorithm is shown
in Table III.</p>
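        <p>A simplified, k-means-style sketch of the GLA iteration follows (the initialization and stopping criteria here are simplifications of the full algorithm):</p>

```python
import random

def design_quantizer(data, n_centroids, iters=20, seed=0):
    """Design a vector quantizer: find n_centroids prototypes for data
    by alternating nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    cent = rng.sample(data, n_centroids)
    dist = lambda u, v: sum((x - y) ** 2 for x, y in zip(u, v))
    for _ in range(iters):
        groups = [[] for _ in cent]
        for v in data:                       # assignment step: VQ(v)
            i = min(range(len(cent)), key=lambda j: dist(v, cent[j]))
            groups[i].append(v)
        for i, g in enumerate(groups):       # update step: mean of each cell
            if g:
                cent[i] = [sum(c) / len(g) for c in zip(*g)]
    return cent

def vq(x, centroids):
    """Map x to the index of its nearest centroid (the quantized symbol)."""
    dist = lambda u, v: sum((p - q) ** 2 for p, q in zip(u, v))
    return min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
```
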
        <p>It uses VQ to generalize the state and action spaces. In
Extended VQQL algorithm, two vector quantizers are designed
for each agent. The first one is used to generalize the state
space and the second one is used to generalize the action
space. The vector quantizers are designed from the input data
C obtained during an interaction between the agent and the
environment. The data set C is composed of tuples of the
form &lt; s1, a, s2, r &gt; where s1 and s2 are in the state space S,
a is in the action space A, and r is the immediate reward. In
many problems, s is composed of a large number of features.
In these cases, we suggest applying feature selection to reduce
the number of features in the state space. Feature selection is a
technique for selecting a subset of relevant features. So feature
selection is used to select the relevant
features of S to obtain a subset S′. This feature selection
process is defined as Γ : S → S′. The set of states s′ ∈ S′,
Cs′, is used as input for the Generalized Lloyd Algorithm to
obtain the first vector quantizer. The vector quantizer VQS′
is a mapping from a vector s′ ∈ S′ into a vector s′ ∈ Ds′,
where Ds′ is the state space discretization Ds′ = {s′1, s′2, ..., s′n}
for s′i ∈ S′. The set of actions a ∈ A, Ca, is used as input
for the GLA to obtain the second vector quantizer.</p>
        <p>The vector quantizer VQA is a mapping from a vector
a ∈ A into a vector a ∈ Da, where Da is the action space
discretization Da = {a1, a2, ..., am} for ai ∈ A. In the last
part of the algorithm, the Q-table is learned from the obtained
discretizations using the set C′ of experience tuples. To obtain
the set C′ from C, each tuple in C is mapped to the new
representation. Therefore, every state in C is first projected
to the space S′ and then discretized, i.e. VQS′(Γ(s)); every
action a ∈ A in C is also discretized, VQA(a).</p>
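        <p>The mapping of experience tuples described above can be sketched as follows (Γ is represented here as simple index selection; all names are illustrative assumptions):</p>

```python
def nearest(x, centroids):
    """Index of the centroid closest to x (the VQ mapping)."""
    d = lambda u, v: sum((p - q) ** 2 for p, q in zip(u, v))
    return min(range(len(centroids)), key=lambda j: d(x, centroids[j]))

def project(s, selected):
    """Feature selection: keep only the chosen feature indices of s."""
    return [s[i] for i in selected]

def map_tuples(C, selected, state_centroids, action_centroids):
    """Map each experience tuple (s1, a, s2, r) to the discretized
    representation: states go through feature selection and the state
    quantizer, actions through the action quantizer."""
    mapped = []
    for s1, a, s2, r in C:
        d1 = nearest(project(s1, selected), state_centroids)
        d2 = nearest(project(s2, selected), state_centroids)
        da = nearest(a, action_centroids)
        mapped.append((d1, da, d2, r))
    return mapped
```
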
      </sec>
      <sec id="sec-2-4">
        <title>2) CMAC-VQQL for state and action space generalization:</title>
        <p>
          CMAC is a form of coarse coding [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. In CMAC the features
are grouped into partitions of the input state space. Each such
partition is called a tiling and each element of a partition is
called a tile. Each tile is a binary feature. The tilings are
overlaid, each offset from the others. In each tiling, the state falls
in exactly one tile. The approximate value function, Qa, is represented
not as a table, but in a parameterized form with parameter
vector θt. This means that the approximate value function Qa
depends entirely on θt. In CMAC, each tile has an associated
weight. The set of all these weights is what makes up the
vector θ. The approximate value function Qa(s) is calculated
as in Equation 2.
        </p>
        <p>Qa(s) = θ^T φ = Σ_{i=0}^{n} θ(i)φ(i) (2)</p>
        <p>The CMAC-VQQL algorithm, described in Table IV,
combines two generalization techniques. It uses CMAC to
generalize the state space and VQ to generalize the action space.
In this case, a data set C is obtained during an interaction
between the agent and the environment. This data set C is
composed of tuples of the form &lt; s1, a, s2, r &gt; where s1 and
s2 are in the state space S, a is in the action space A, and r is
the immediate reward. As before, s is
composed of a large number of features. Feature selection is
used to select a subset S′ of the relevant features of S.</p>
        <p>The set of actions a ∈ A, Ca, is used as input for
the GLA to obtain the second vector quantizer. The vector
quantizer VQA is a mapping from a vector a ∈ A into a
vector a ∈ Da, where Da is the action space discretization
Da = {a1, a2, ..., am} for ai ∈ A. Later, the CMAC is built
from Cs′, taking into account the obtained action space Da.
For each state variable x′i in s′ ∈ S′, the tile width and the
number of tiles per tiling are selected taking into account their
ranges. In our work, a separate value function for each of the
discrete actions is used. In CMAC, each tile has an associated
weight. The set of these weights is what makes up the vector θ.
In the last part, the Q function is approximated by Equation
2.</p>
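        <p>For a single state variable, the tile-coding computation behind Equation 2 can be sketched as follows (the tile widths, offsets, and flat weight layout are illustrative assumptions):</p>

```python
def active_tiles(x, n_tilings, tile_width, tiles_per_tiling):
    """Return the index of the single active tile in each tiling.
    Tilings are overlaid, each offset from the others by
    tile_width / n_tilings."""
    idx = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings
        tile = int((x + offset) // tile_width)
        tile = max(0, min(tiles_per_tiling - 1, tile))   # clip to range
        idx.append(t * tiles_per_tiling + tile)          # flat index
    return idx

def q_value(theta, tiles):
    """Equation 2: Qa(s) = theta^T phi, where phi is 1 on active tiles
    and 0 elsewhere, so the sum runs over the active tiles only."""
    return sum(theta[i] for i in tiles)
```
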
      </sec>
    </sec>
    <sec id="sec-3">
      <title>IV. VIRTUAL AGENTS IN SIMBA</title>
      <p>In the following evaluation performed, we assume that 6
companies are controlled by agents of different types. These
agents are: Random Agents, that assign to each decision
variable a random value following a uniform distribution;
Hand-Coded Agents, that modify their decision variables by
increasing their values using the Consumer Price Index (CPI);
RL Agents, using the Extended VQQL and CMAC-VQQL
algorithms described in Section III-B; and Adaptive KNN
Agents, using the algorithm described in section III-A.</p>
      <sec id="sec-3-1">
        <title>3) Executing the Extended VQQL Algorithm</title>
        <p>Executing the Extended VQQL algorithm to learn the VQ Agents requires
performing the 5 steps of the algorithm:</p>
        <p>Step 1: Gather experience tuples. To gather experience, we
perform an exploration in the domain by using hand-coded
agents. Specifically, we obtain the experiences generated by
a hand-coded agent managing company 1 against five
hand-coded agents managing companies 2, 3, 4, 5 and 6, respectively.</p>
        <p>
          Step 2: Reduce the dimension of the state space. The goal
of this step is to select, from among all features in the state
space, those features most related to the reward (the result
of the exercise). To perform this phase, we use the
data-mining tool WEKA [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] with the attribute selection method
CfsSubsetEval. This method evaluates the worth of a subset
of attributes by considering the individual predictive ability
of each feature along with the degree of redundancy between
them. The resulting description of the state space after the
attribute selection process is shown in Table I.
        </p>
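        <p>As a rough single-criterion illustration of this idea (ranking features by their correlation with the reward; WEKA's CfsSubsetEval additionally penalizes redundancy between features, which this sketch omits), one might write:</p>

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_features(states, rewards, k):
    """Keep the k state features most correlated (in absolute value)
    with the reward signal."""
    n_feat = len(states[0])
    scores = [abs(pearson([s[i] for s in states], rewards))
              for i in range(n_feat)]
    ranked = sorted(range(n_feat), key=lambda i: -scores[i])
    return sorted(ranked[:k])
```
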
        <p>Step 3: State space discretization. Now, we use the GLA to
discretize the state space.</p>
        <p>Step 4: Discretize the action space. Again, we use GLA to
discretize the action space. The action space is composed of
the features shown in Table I.</p>
        <p>Step 5: Learn the Q table. Once both the state and action
spaces are discretized, the Q function is learned using the
mapped experience tuples and the Q-Learning update function.
The Q table is generated, composed of n rows (where n is the
number of discretized states) and m columns (where m is the
number of discretized actions).</p>
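        <p>Step 5 can be sketched as a sweep of the Q-Learning update over the mapped experience tuples (dimensions, parameters, and the number of sweeps below are illustrative assumptions):</p>

```python
def learn_q_table(mapped, n_states, m_actions, alpha=0.1, gamma=0.9, sweeps=50):
    """Learn an n x m Q-table from tuples (s1, a, s2, r) of discrete
    indices, applying the update of Equation 1 repeatedly."""
    Q = [[0.0] * m_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s1, a, s2, r in mapped:
            target = r + gamma * max(Q[s2])      # TD target
            Q[s1][a] += alpha * (target - Q[s1][a])
    return Q
```
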
      </sec>
      <sec id="sec-3-2">
        <title>4) Executing the CMAC-VQQL Algorithm</title>
        <p>Executing the CMAC-VQQL algorithm to learn the CMAC Agents requires
performing the 5 steps of the algorithm as described in
Table IV. Steps 1 and 2 of CMAC-VQQL are the same as
steps 1 and 2 of Extended VQQL (gather experience and
the reduction of the dimension of the state space). Step 3 of
CMAC-VQQL (action space discretization) is also the same
as step 4 of Extended-VQQL. Step 4 is the design of the
CMAC function approximator. In our experiments we use
single-dimensional tilings. For each state variable, 32 tilings
were overlaid, each offset from the others by 1/32 of the tile
width. For each state variable, we specified the width of the
tiles based on the width of the generalization that we desired.</p>
        <p>In the experiments we use three different configurations. The
size of the primary vector θ in Configuration #1 is 754,272
(x1_tiles + x2_tiles + ... + x12_tiles), in Configuration #2 it is 1,364,320,
and in Configuration #3 it is 2,440,704. In our work, we use a
separate value function for each of the generalized actions.</p>
        <p>Last, step 5 of the algorithm, learning the Q approximations,
can be performed.</p>
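        <p>With the CMAC approximator, each update moves the weights of the active tiles toward the TD target; a sketch under the same illustrative assumptions as the earlier examples:</p>

```python
def cmac_q_update(theta, tiles, target, q_current, alpha=0.1):
    """Move the weights of the active tiles toward the TD target.
    The step is divided among the tiles so that the value estimate
    moves by alpha * (target - q_current) in total."""
    step = alpha * (target - q_current) / len(tiles)
    for i in tiles:
        theta[i] += step
    return theta
```
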
        <p>In the first set of experiments we use the Extended VQQL
algorithm to learn an agent that manages company 1 and plays
against five hand-coded agents that manage companies 2, 3, 4,
5 and 6 respectively. The results for different discretizations
size of the state (rows) and action (columns) spaces are shown
in Table V.</p>
        <p>The best result is obtained when we use a vector quantizer
of 64 centroids (or states) to generalize the state space and a
vector quantizer of 32 centroids (or actions) to generalize the
action space.</p>
        <p>In the second set of experiments we use the CMAC-VQQL
algorithm. The results for the different CMAC configurations
described in section IV-4 (rows) combined with the different
sizes of the action space obtained by VQ (columns) are shown
in Table VI.</p>
        <p>The best result is obtained when we use Configuration
#1 of CMAC to generalize the state space and a vector
quantizer of 8 centroids to generalize the action space. This
value is smaller than the one obtained with Extended VQQL but,
again, all the configurations obtain better results than the
hand-coded agent.</p>
        <p>A. Adaptive KNN in SIMBA. To apply the Adaptive KNN
algorithm to create a SIMBA software agent, we use the same
state space, action space, and transition and reward functions
as for the RL agent. We also use the same experience tuples
as for the RL agent, although in the learning process, the set
is updated following step 6 of the algorithm (as described in
Table II).</p>
        <p>In the next set of experiments we use the KNN algorithm
to build an agent. The results for the different KNN
configurations are shown in Table VII.</p>
        <p>In the experiments, the learning agent always manages
the first company of the six involved in the simulations.</p>
        <p>Each experiment consists of 10 simulations or episodes with
20 rounds and we obtain the mean value and the standard
deviation for the result of the exercise during the 20 periods. In
this situation, a hand-coded agent that manages the company
1 against five hand-coded agents that manage companies 2,
3, 4, 5 and 6 respectively obtains a mean value of the result
of the exercise of 2,901,002.13 euros. A random agent in the
same situation obtains -2,787,382.78 euros.</p>
        <p>In the experiments with human experts, simulations have 8
rounds.</p>
      </sec>
      <sec id="sec-3-3">
        <title>A. RL and KNN Results</title>
        <p>The columns of Table VII show different results for different
values of K (5, 10 and 15 respectively). The first row presents
the results of the Adaptive KNN algorithm, as it was described
in Table II. The second row shows the results of a classical
KNN approach, without the adaptation of the training set,
i.e. without executing steps five and six of the Adaptive
KNN algorithm. The best results are obtained with the adaptive
version, for K=10 and K=15. In these cases, we obtain a mean
value for the result of the exercise of 9.8 million euros,
which is higher than the ones obtained with RL.</p>
        <p>In previous experiments, the learning agent always learned
to manage the first company of the six involved in the
simulations. However, the behavior of each company depends
on its initial state and on historical data (periods -1,
-2, etc.). Therefore, learning performance may vary from one
company to another. To evaluate this issue, we repeat the learning
process for the best learning configurations, for each of the six
companies. Each experiment consists of 10 simulations with
20 rounds and we obtain the mean value and the standard
deviation for the result of the exercise during the 20 periods.
The results shown in Figure 2 demonstrate that the Extended
VQQL agent and the Adaptive KNN agent obtain similar results,
and both obtain better results than the hand-coded agent.</p>
        <p>Now, we compare the behavior of the best RL agent with
the behavior of the best Adaptive KNN agent obtained in
previous experiments. In this experiment, all the companies
have the same initial state and historical data, so the result
is independent of the company managed. This experiment
consists of 10 simulations with 20 rounds and we obtain the
mean value and the standard deviation for the result of the
exercise during the 20 periods. Figure 3 shows the mean value
and the standard deviation for each kind of agent.</p>
        <p>For the Adaptive KNN agent, the average value grows from
the first period, and rises up to 16 million euros. However,
we see that the standard deviation is very high, so the behavior of
the agent managing different companies is very different. The
results for the Extended VQQL agent show two well-differentiated
behaviors: before period 8, and after period 8. In the first
part, the result of the exercise always grows, and dominates
the result of the Adaptive KNN agent. However, from period
8, the result of the exercise for the Extended VQQL agent
stabilizes at a value of around 10 million, and it is dominated
by the other agent from period 10. Interestingly, we have
revised all the simulations performed, and this behavior always
appears. We believe that the RL agent is affected by the CPI
and the evolution of the market and, with time, the actions
obtained by the VQ algorithm become old-fashioned (note
that 8 periods are equivalent to two years). Therefore, if we
focus on the early periods, typically the RL agents behave
better than the KNN ones.</p>
      </sec>
      <sec id="sec-3-4">
        <title>B. RL and KNN Agents vs. Human Experts</title>
        <p>In this section, we present experiments where software
agents play against a human expert during 8 periods. The
human expert is a full-time associate professor in
Strategic and Business Organization at Universidad Autónoma
de Madrid (UAM), where he is Director of the Master of Business
Administration (Executive) and Director of the Doctorate Program
in Financial Economics.</p>
        <p>In all the experiments, we use the best RL and Adaptive
KNN agents obtained in the previous section. In the first
experiment, the human expert uses the incremental decision
strategy, described in section II. The results are shown in
table VIII.</p>
        <p>In this case, the Extended VQQL agent obtains the best
results. Furthermore, given that only 8 episodes are run, the
RL agent performs much better than the Adaptive KNN agent.
The human expert obtains the worst results (independently of
the increment used).</p>
        <p>In the second experiment, two different simulations with 8
rounds each are performed. The human expert combines the
use of the different business strategies described in section II.
The results are shown in table IX.</p>
        <p>In all the experiments, the software agents obtain better
results than the human expert. From a qualitative point of
view, the virtual agents usually compete in the same market
scope. They are very effective and efficient, being almost
impossible to beat under the parameter settings used in
these simulations. The best strategies usually make decisions
in different market scopes, using high or low strategies (for
instance, low cost or differentiation and specialization). This
means that, by using more competitive strategies, the gap between
the performance of the virtual agents and the human experts
could be reduced.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>VI. BACKGROUND</title>
      <p>
        Business gaming usage has grown globally and has a long
and varied history [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The first modern business simulation
game can be dated back to 1932 in Europe and 1955 in
North America. In 1932, Mary Birshstein, while teaching at
the Leningrad Institute, got the idea to adapt the concept of
war games to the business environment. In North America
the first business simulator dates back to 1955, when RAND
Corporation developed a simulation exercise that focused on
the U.S. Air Force logistics system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, the first
known use of a business simulator for pedagogical purposes
in a university course was at the University of Washington,
in a business policy course in 1957 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        From this point, the number of business simulation games in
use grew rapidly. A 2004 e-mail survey of university business
school professors in North America reported that 30.6% of
1,085 survey respondents were current business simulation
users, while another 17.1% of the respondents were former
business game users [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Over the years Artificial Intelligence (AI) and simulation
have grown closer together. AI is used increasingly in complex
simulation, and simulation is contributing to the development
of AI [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The need for an increased level of realism and fidelity
in domain-specific games calls for the use of methods that
bring realism and intelligence to actors and scenarios (including
business simulators). Intelligent software agents, called
“autonomous” avatars or virtual players, are now being embodied
in business games. Software agents can interact with each
other and their environment producing new states, business
information and events. In addition, these agents not only
provide information but also may affect the environment and
direction of the simulation [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Machine learning techniques (decision trees, reinforcement
learning, etc.) have been widely used to develop software
agents [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] for example, the software agents use
decision trees to learn different behaviors. In this case, virtual
players could take on the role of an executive or
salesperson from a supplier firm, a union leader, or any other
role relevant to the simulation exercise. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the learning
agents use a typical genetic-based learning classifier system,
XCS (eXtended Classifier System). In that work, RL
techniques are used, allowing decision-making agents to learn
from the reward obtained from executed actions and, in this
way, to find an optimal behavior policy. In stochastic business
games, the players take actions in order to maximize their
benefits. While the game evolves, the players learn more about
the best strategy to follow. With this, RL can be used to
improve the behavior of the players in a stochastic business
game [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, in all these cases, the business simulator
games used did not involve the huge state and action spaces
that SIMBA involves.
      </p>
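      <p>The reward-driven learning that such agents perform can be sketched as a minimal tabular Q-learning loop. The sketch below is a generic illustration in Python, not the implementation used in SIMBA or in the cited works; the two-state, two-action toy market environment (toy_market) and all parameter values are hypothetical.</p>
      <preformat>
```python
import random

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: learn action values from observed rewards."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(10):  # fixed-length episode
            # epsilon-greedy action selection
            if epsilon > rng.random():
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward = env_step(state, action, rng)
            # move Q(s, a) toward reward plus discounted best next value
            Q[state][action] += alpha * (
                reward + gamma * max(Q[next_state]) - Q[state][action])
            state = next_state
    return Q

# Hypothetical two-state "market" in which action 1 (say, adjusting the
# price) pays a stochastically higher reward than action 0 in every state.
def toy_market(state, action, rng):
    reward = (1.0 if action == 1 else 0.2) + rng.uniform(-0.1, 0.1)
    return (state + 1) % 2, reward

Q = q_learning(toy_market, n_states=2, n_actions=2)
best = max(range(2), key=lambda a: Q[0][a])  # greedy action in state 0
```
      </preformat>
      <p>After enough episodes, the greedy policy selects the higher-reward action. In a domain like SIMBA, however, the table must be replaced by a generalized representation, since the state and action spaces are large and continuous.</p>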
      <p>
        In complex domains with large state and action spaces,
it is necessary to apply generalization techniques such as VQ
or CMAC. VQ has been used successfully in many other
domains [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In addition, CMAC [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is extensively used
to generalize the state space, but the research on problems
where the actions are chosen from a continuous space is
much more limited. KNN has also been used in the scope
of Business Intelligence. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the authors investigate the
relationship among corporate strategies, environmental forces,
and the Balanced Scorecard (BSC) performance measures
using KNN. In that case, the authors always used the same
initial set of experience and did not try to adapt it with
the new experience generated during the game.
      </p>
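      <p>To illustrate how VQ can discretize a continuous state (or action) space, a codebook of prototype vectors is learned and each continuous vector is then replaced by the index of its nearest prototype. The Python sketch below uses a plain k-means pass as the codebook learner; it is a generic illustration under that assumption, not the implementation of the cited works, and the two-cluster sample data are hypothetical.</p>
      <preformat>
```python
import random

def quantize(vector, centroids):
    """Map a continuous vector to the index of its nearest prototype."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)),
               key=lambda i: sqdist(vector, centroids[i]))

def build_codebook(samples, k, iters=20, seed=0):
    """Learn k prototype vectors (a VQ codebook) with plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)           # random initial prototypes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in samples:                        # assignment step
            clusters[quantize(s, centroids)].append(s)
        for i, members in enumerate(clusters):   # update step
            if members:                          # keep old prototype if empty
                dim = len(members[0])
                centroids[i] = tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim))
    return centroids

# Hypothetical continuous two-dimensional states drawn from two regions.
rng = random.Random(1)
states = ([(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(50)] +
          [(rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)) for _ in range(50)])
codebook = build_codebook(states, k=2)
```
      </preformat>
      <p>Once the codebook is built, quantize turns every continuous state into one of k discrete indices, so a tabular learner can operate over the reduced space.</p>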
      <p>
        An important issue that makes SIMBA different from other
classical RL domains (like Keepaway [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]) is that it is not
defined a priori as a cooperative or a competitive domain. In
SIMBA, the number of adversaries is very high, as is the
number of variables involved in the state and action spaces.
In addition, in SIMBA software agents can play against
humans. It is hard to find all these features in other classical
domains, so the learning process in SIMBA represents a real
challenge.
      </p>
    </sec>
    <sec id="sec-5">
      <title>VII. CONCLUSION</title>
      <p>This paper introduces SIMBA, a business simulator whose
architecture enables different players, including both software
agents and human players, to manage companies in different
markets. The simulator generates a competitive environment,
where the different agents try to maximize their companies’
profits. SIMBA represents a complex domain with large state
and action spaces. Therefore, the learning approaches applied
to generate the virtual agents must cope with this difficulty. We
have demonstrated that the proposals presented, based on Lazy
Learning and RL, achieve the goal of being very competitive
when compared with previous hand-coded strategies.
Furthermore, we demonstrate that when competing with a human
expert, who follows classical management strategies, the
learning agents are able to outperform the behavior of the
human.</p>
      <p>In the case of RL, the choice of the generalization method
has a strong effect on the results obtained. For this
reason, the state and action space representation must be chosen with
great care, and we have proposed two new methods:
Extended VQQL and CMAC-VQQL. This is the first time that VQ
has been used to discretize the action space, and some preliminary
results have shown that it is also useful in other domains,
such as autonomous helicopter control. The competitive results
obtained by the learning approaches used to generate virtual agents
in SIMBA are promising for Autonomous Decision
Making.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENT</title>
      <p>This work has been partially supported by the Spanish
MICINN project TIN2008-06701-C03-03 and by the Spanish
TRACE project TRA2009-0080. The authors would like to
thank the people from Simuladores Empresariales S.L.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Albus</surname>
          </string-name>
          .
          <article-title>A new approach to manipulator control: The cerebellar model articulation controller (CMAC)</article-title>
          .
          <source>Journal of Dynamic Systems, Measurement, and Control</source>
          ,
          <volume>97</volume>
          (
          <issue>3</issue>
          ):
          <fpage>220</fpage>
          -
          <lpage>227</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Borrajo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bueno</surname>
          </string-name>
          , I. de Pablo,
          <string-name>
            <given-names>B.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>García</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sagredo</surname>
          </string-name>
          .
          <article-title>SIMBA: A simulator for business education and research</article-title>
          .
          <source>Decision Support Systems</source>
          ,
          <year>June 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Buro</surname>
          </string-name>
          .
          <article-title>Improving heuristic mini-max search by supervised learning</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>134</volume>
          :
          <fpage>85</fpage>
          -
          <lpage>99</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Faria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Wellington</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gold</surname>
          </string-name>
          .
          <article-title>Developments in business gaming: A review of the past 40 years</article-title>
          .
          <source>Simulation Gaming</source>
          ,
          <volume>40</volume>
          (
          <issue>4</issue>
          ):
          <fpage>464</fpage>
          -
          <lpage>487</lpage>
          ,
          <year>August 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Faria</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Wellington</surname>
          </string-name>
          .
          <article-title>A survey of simulation game users, former-users, and never-users</article-title>
          .
          <source>Simul. Gaming</source>
          ,
          <volume>35</volume>
          (
          <issue>2</issue>
          ):
          <fpage>178</fpage>
          -
          <lpage>207</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Borrajo</surname>
          </string-name>
          .
          <article-title>Two steps reinforcement learning</article-title>
          .
          <source>International Journal of Intelligent Systems</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>213</fpage>
          -
          <lpage>245</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.-H.</given-names>
            <surname>Hsu</surname>
          </string-name>
          .
          <article-title>Behind Deep Blue: Building the Computer that Defeated the World Chess Champion</article-title>
          . Princeton University Press, Princeton, NJ, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hunicke</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Chapman</surname>
          </string-name>
          .
          <article-title>AI for dynamic difficulty adjustment in games</article-title>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Jackson</surname>
          </string-name>
          .
          <article-title>Learning from experience in business decision games</article-title>
          .
          <source>California Management Review</source>
          ,
          <volume>1</volume>
          :
          <fpage>23</fpage>
          -
          <lpage>29</lpage>
          ,
          <year>1959</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Terano</surname>
          </string-name>
          .
          <article-title>Learning agents in a business simulator</article-title>
          .
          In <source>Proceedings of the 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Miles</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Snow</surname>
          </string-name>
          .
          <article-title>Organizational strategy, structure, and process</article-title>
          .
          McGraw-Hill, New York,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>Competitive Advantage: Creating and Sustaining Superior Performance</article-title>
          . Free Press, New York, 1st edition,
          <year>June 1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Ravulapati</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          .
          <article-title>A reinforcement learning approach to stochastic business games</article-title>
          .
          <source>IIE Transactions</source>
          ,
          <volume>36</volume>
          :
          <fpage>373</fpage>
          -
          <lpage>385</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Santamaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ram</surname>
          </string-name>
          .
          <article-title>Experiments with reinforcement learning in problems with continuous state and action spaces</article-title>
          .
          <source>Adaptive Behavior</source>
          ,
          <volume>6</volume>
          :
          <fpage>163</fpage>
          -
          <lpage>217</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schaeffer</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>van den Herik</surname>
          </string-name>
          .
          <article-title>Games, computers, and artificial intelligence</article-title>
          .
          <source>Artif. Intell.</source>
          ,
          <volume>134</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Sohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-L.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Corporate strategies, environmental forces, and performance measures: a weighting decision support system using the k-nearest neighbor technique</article-title>
          .
          <source>Expert Syst. Appl.</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kuhlmann</surname>
          </string-name>
          .
          <article-title>Reinforcement learning for RoboCup-soccer keepaway</article-title>
          .
          <source>Adaptive Behavior</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>165</fpage>
          -
          <lpage>188</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Veloso</surname>
          </string-name>
          .
          <article-title>Multiagent systems: A survey from a machine learning perspective</article-title>
          .
          <source>Autonomous Robots</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ):
          <fpage>345</fpage>
          -
          <lpage>383</lpage>
          ,
          <year>June 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Summers</surname>
          </string-name>
          .
          <article-title>Today's business simulation industry</article-title>
          .
          <source>Simul. Gaming</source>
          ,
          <volume>35</volume>
          (
          <issue>2</issue>
          ):
          <fpage>208</fpage>
          -
          <lpage>241</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          .
          <article-title>Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning)</article-title>
          . MIT Press, May
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Watson</surname>
          </string-name>
          . Computer Simulation in Business. John Wiley &amp; Sons, Inc., New York, NY, USA,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          .
          <source>Data Mining: Practical Machine Learning Tools and Techniques</source>
          . Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition,
          <year>June 2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-O.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Jung</surname>
          </string-name>
          .
          <article-title>Dynamic game level generation using on-line learning</article-title>
          .
          <source>In Edutainment'07: Proceedings of the 2nd international conference on Technologies for e-learning and digital entertainment</source>
          , pages
          <fpage>916</fpage>
          -
          <lpage>924</lpage>
          , Berlin, Heidelberg,
          <year>2007</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ören</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.-G.</given-names>
            <surname>Aghaee</surname>
          </string-name>
          .
          <article-title>Intelligent agents, simulation, and gaming</article-title>
          .
          <source>Simul. Gaming</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ):
          <fpage>339</fpage>
          -
          <lpage>349</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>